When to Optimize: Measure First, Premature Optimization, Readable-Then-Fast

Quick Overview

The single most valuable performance skill is knowing when not to optimize. The discipline is always the same: write the clearest correct version first, measure it against a real workload, and only then spend effort on the parts the data proves are slow. This page is about that decision process — why guessing is almost always wrong, why premature optimization quietly costs you more than it saves, and how Rust changes the calculus compared to a garbage-collected runtime like Node.js.

Note: This is the judgment page for the section. The mechanics live in its siblings: measure with Profiling, Flame Graphs, and Benchmarking; then apply the techniques in Optimization, Memory Layout, and Cache Efficiency. Read this one before any of those.

The Core Loop

Every credible optimization follows the same four steps, in order:

1. Write it readably and correctly. Idiomatic, boring code. Ship that first.
2. Measure with a representative workload. A benchmark, a profiler, or even a timed run, but real numbers, in a release build.
3. Find the proven hot spot. Optimize the thing that actually dominates, not the thing that looks expensive.
4. Re-measure to confirm the win, and that you didn’t break correctness. If the number didn’t move, revert and keep the simpler code.

Steps 1 and 4 are the ones people skip. The rest of this page is about why skipping them is a bad trade.

TypeScript/JavaScript Example

Here is a realistic aggregation: summing revenue per customer across a million orders. In TypeScript you would write the obvious version and move on — and you’d be right to, because the runtime hides allocation and copying behind the garbage collector.

interface Order {
  id: number;
  customer: string;
  totalCents: number;
}

function revenueByCustomer(orders: Order[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const o of orders) {
    totals.set(o.customer, (totals.get(o.customer) ?? 0) + o.totalCents);
  }
  return totals;
}

const orders: Order[] = [];
for (let i = 0; i < 1_000_000; i++) {
  orders.push({ id: i, customer: `customer-${i % 1000}`, totalCents: (i % 500) * 100 });
}

const t0 = performance.now();
const totals = revenueByCustomer(orders);
const t1 = performance.now();
console.log("distinct customers:", totals.size); // distinct customers: 1000
console.log("took ms:", (t1 - t0).toFixed(1));    // took ms: ~80 (machine-dependent)

Running this on Node v22 prints distinct customers: 1000. Note what you didn’t do: you didn’t think about whether o.customer is copied into the map, whether the Map resizes, or how ?? is compiled. The JIT and the GC absorb those decisions. That convenience is exactly why premature micro-optimization in JavaScript is usually pointless — you cannot see the costs, and the engine often rewrites your code anyway.

Rust Equivalent

The direct Rust translation looks almost identical, and that is the point: write the readable version first. Rust makes one cost visible that JavaScript hid — the owned String key — but you do not eliminate it preemptively. You ship this, then measure.

use std::collections::HashMap;
use std::time::Instant;

#[derive(Debug)]
struct Order {
    id: u64,
    customer: String,
    total_cents: u64,
}

/// Readable first: clear, obviously correct. Optimize later, if measurement says so.
fn revenue_by_customer(orders: &[Order]) -> HashMap<String, u64> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    for order in orders {
        *totals.entry(order.customer.clone()).or_insert(0) += order.total_cents;
    }
    totals
}

fn main() {
    let orders: Vec<Order> = (0..1_000_000u64)
        .map(|i| Order {
            id: i,
            customer: format!("customer-{}", i % 1000),
            total_cents: (i % 500) * 100,
        })
        .collect();

    let start = Instant::now();
    let totals = revenue_by_customer(&orders);
    let elapsed = start.elapsed();

    println!("distinct customers: {}", totals.len());
    println!("revenue_by_customer took {elapsed:?}");
}

Running this with cargo run --release prints:

distinct customers: 1000
revenue_by_customer took 41.557209ms

Tip: That number is a single-run illustration on one machine, not a benchmark — wall-clock timings vary run to run. For numbers you can trust and compare across changes, use criterion. The point here is the workflow: you measured the readable version before touching anything.

The order.customer.clone() allocates a fresh String for the lookup key on every iteration — even when that customer is already in the map. That is the one line a Rust developer’s eye is drawn to. But “looks expensive” is a hypothesis, not a verdict. The next sections show how to decide whether it is worth fixing.
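
If a profile later does implicate that clone, one possible shape for the fix is a borrowed lookup before inserting. The sketch below is illustrative only: revenue_by_customer_borrowed and sample_orders are hypothetical names that do not appear in the listing above, the Order struct is trimmed to the two fields the function reads, and whether the variant is actually faster on your workload is exactly what you would measure.

```rust
use std::collections::HashMap;

struct Order {
    customer: String,
    total_cents: u64,
}

/// Readable baseline from the text: clones the key on every iteration.
fn revenue_by_customer(orders: &[Order]) -> HashMap<String, u64> {
    let mut totals = HashMap::new();
    for order in orders {
        *totals.entry(order.customer.clone()).or_insert(0) += order.total_cents;
    }
    totals
}

/// Hypothetical variant: look up by &str, clone only when the customer is new.
fn revenue_by_customer_borrowed(orders: &[Order]) -> HashMap<String, u64> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    for order in orders {
        if let Some(sum) = totals.get_mut(order.customer.as_str()) {
            *sum += order.total_cents;
        } else {
            totals.insert(order.customer.clone(), order.total_cents);
        }
    }
    totals
}

/// Hypothetical generator with the same key and amount pattern as the text.
fn sample_orders(n: u64) -> Vec<Order> {
    (0..n)
        .map(|i| Order {
            customer: format!("customer-{}", i % 1000),
            total_cents: (i % 500) * 100,
        })
        .collect()
}

fn main() {
    let orders = sample_orders(10_000);
    // Correctness guard first; timing only matters once the two agree.
    assert_eq!(
        revenue_by_customer(&orders),
        revenue_by_customer_borrowed(&orders)
    );
    println!("both variants agree");
}
```

The variant is longer and has two code paths where the baseline has one, which is the readability cost you would be weighing against any measured gain.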

Detailed Explanation

Why “measure first” matters more in compiled languages

In JavaScript, your mental model of cost is approximate by necessity: a hidden-class deopt, an inline-cache miss, or a GC pause can dominate, and none of them are visible in the source. You learn to not guess because guessing is futile.

In Rust the costs are far more visible — clone() is a copy, Vec is contiguous, an Arc is a refcount — so it is tempting to think you can reason your way to the fast version from the source alone. You usually can’t. The optimizer (LLVM, via rustc) aggressively inlines, vectorizes, and constant-folds release builds. Code that looks expensive may compile to nothing; code that looks trivial may be the bottleneck. The only reliable signal is a measurement of the optimized binary.

The release-build trap

This is the most common first mistake, and it makes every other measurement meaningless. Consider a numeric kernel:

fn sum_of_squares(n: u64) -> u64 {
    // wrapping_* keeps the arithmetic well-defined in both debug and release.
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

fn main() {
    let n = 100_000_000u64;
    let start = std::time::Instant::now();
    let result = sum_of_squares(n);
    println!("result = {result}");
    println!("elapsed = {:?}", start.elapsed());
}

The same source, built two ways:

$ cargo run            # debug
result = 662921401752298880
elapsed = 735.919125ms

$ cargo run --release  # release
result = 662921401752298880
elapsed = 42.625µs

That is not a typo: the release build is roughly 17,000× faster here, because the optimizer recognizes the loop and collapses it into a handful of arithmetic operations, while the debug build executes 100 million iterations with overflow-check instrumentation. A timing taken in debug mode tells you nothing about production. Always benchmark --release. (Why the same numbers in both? wrapping_add/wrapping_mul are defined to wrap; if you used plain +/* instead, the debug build would panic with “attempt to add with overflow” while release would silently wrap — see Common Pitfalls.)
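
One caveat when hand-timing a kernel like this: if the optimizer can see a constant input, or can see that the result is never used, the timer may measure almost nothing. A common hedge is std::hint::black_box, which asks the compiler to treat a value as opaque. The sketch below is a minimal illustration, with time_opaque as a hypothetical helper name; black_box is a best-effort hint, and it does not stop the optimizer from replacing the loop with a closed-form computation, it only hides the specific input and keeps the result alive.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn sum_of_squares(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

/// Hypothetical helper: time one call while hiding the input value and the
/// result from the optimizer, so the call cannot be pre-computed or dropped.
fn time_opaque(n: u64) -> (u64, Duration) {
    let start = Instant::now();
    let result = black_box(sum_of_squares(black_box(n)));
    (result, start.elapsed())
}

fn main() {
    let (result, elapsed) = time_opaque(100_000_000);
    println!("result = {result}");
    println!("elapsed = {elapsed:?}");
}
```

Benchmark harnesses such as criterion apply the same idea for you, which is one more reason to prefer them for numbers you intend to act on.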

“Looks expensive” vs. “is expensive”: a worked decision

Suppose you have a reporting job that sorts events by score and renders each to a text line. Your instinct says the sort is the cost. Measure both candidates instead of trusting the instinct:

use std::fmt::Write as _;
use std::time::Instant;

#[derive(Clone)]
struct Event {
    user_id: u64,
    score: f64,
    label: String,
}

/// Tiny reusable timing helper: run `f`, print how long it took, return its result.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    eprintln!("{label}: {:?}", start.elapsed());
    out
}

fn build_events(n: usize) -> Vec<Event> {
    (0..n)
        .map(|i| Event {
            user_id: (i as u64).wrapping_mul(2654435761) % 1_000_000,
            score: ((i * 7 + 3) % 10_000) as f64 / 100.0,
            label: format!("event-{}", i % 64),
        })
        .collect()
}

/// Readable rendering: one fresh String per row.
fn render_naive(events: &[Event]) -> usize {
    let mut total = 0;
    for e in events {
        let line = format!("{}\t{:.2}\t{}", e.user_id, e.score, e.label);
        total += line.len();
    }
    total
}

/// Optimized rendering: one reused buffer, no per-row allocation.
fn render_buffered(events: &[Event]) -> usize {
    let mut buf = String::with_capacity(64);
    let mut total = 0;
    for e in events {
        buf.clear();
        write!(buf, "{}\t{:.2}\t{}", e.user_id, e.score, e.label).unwrap();
        total += buf.len();
    }
    total
}

fn main() {
    let events = build_events(500_000);

    // Candidate hot spots. MEASURE which dominates instead of guessing.
    let mut sortable = events.clone();
    timed("sort_by_score", || {
        sortable.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    });
    let naive_bytes = timed("render_naive", || render_naive(&events));
    let buffered_bytes = timed("render_buffered", || render_buffered(&events));

    // Correctness guard: the optimization must not change the output.
    assert_eq!(naive_bytes, buffered_bytes);
    println!(
        "rendered {naive_bytes} bytes either way; top score = {:.2}",
        sortable[0].score
    );
}

A representative --release run:

sort_by_score: 31.613542ms
render_naive: 87.121792ms
render_buffered: 44.394333ms
rendered 10316314 bytes either way; top score = 99.99

The measurement overturns the instinct. The sort is not the bottleneck (~32 ms); the per-row format!, which allocates and frees half a million tiny Strings, is, at ~87 ms. The targeted fix (reuse one buffer with write!) nearly halves rendering to ~44 ms, and the assert_eq! confirms that the total rendered byte count is unchanged (a cheap sanity check on the totals, not a byte-for-byte comparison of the lines). Had you “optimized” the sort first, you’d have spent effort on the smaller cost and possibly traded away the readable sort_by. The buffer technique itself belongs to Optimization; the decision to apply it here came from measuring.
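
The assert_eq! in the listing compares total byte counts only, so two renderers could in principle agree on length while differing in content. A stricter guard compares the rendered lines themselves. The following is a sketch under assumptions: render_naive_lines, render_buffered_lines, and sample_events are hypothetical helpers written for the comparison, and the buffered variant copies each line out of the buffer purely so the outputs can be compared, which you would not do in the timed path.

```rust
use std::fmt::Write as _;

struct Event {
    user_id: u64,
    score: f64,
    label: String,
}

/// Readable rendering, collected so the output can be compared.
fn render_naive_lines(events: &[Event]) -> Vec<String> {
    events
        .iter()
        .map(|e| format!("{}\t{:.2}\t{}", e.user_id, e.score, e.label))
        .collect()
}

/// Buffered rendering; each line is cloned out of the reused buffer
/// only for the sake of the comparison.
fn render_buffered_lines(events: &[Event]) -> Vec<String> {
    let mut buf = String::with_capacity(64);
    let mut lines = Vec::with_capacity(events.len());
    for e in events {
        buf.clear();
        write!(buf, "{}\t{:.2}\t{}", e.user_id, e.score, e.label).unwrap();
        lines.push(buf.clone());
    }
    lines
}

/// Hypothetical small data set, enough for a correctness check.
fn sample_events(n: usize) -> Vec<Event> {
    (0..n)
        .map(|i| Event {
            user_id: i as u64,
            score: ((i * 7 + 3) % 10_000) as f64 / 100.0,
            label: format!("event-{}", i % 64),
        })
        .collect()
}

fn main() {
    let events = sample_events(1_000);
    // Content-level guard: every rendered line must match, not just the totals.
    assert_eq!(render_naive_lines(&events), render_buffered_lines(&events));
    println!("rendered lines are identical");
}
```

Keeping the strict check in a test and the cheap byte-count check in the timing harness is one reasonable split, since the strict version allocates and would distort the measurement.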

Key Differences

Question	TypeScript / JavaScript	Rust
Can I reason about cost from the source?	Rarely — JIT/GC hide it	Better, but the optimizer still surprises you; measure
Does build mode change my timings?	Little: no debug/release split, though JIT warm-up skews short runs	Enormous: debug vs. release can differ by 1000×+
Cost of a “readable first” default	GC absorbs allocations for you	You see allocations; the optimizer sometimes removes them
What does premature optimization cost?	Wasted effort; engine may undo it	Wasted effort plus lost safety/readability (`unsafe`, hand-rolled loops)
Where is the real baseline win?	Algorithm + avoiding the event-loop stall	Often free: no GC pauses, contiguous data — before you tune anything
Tool for trustworthy numbers	`performance.now()`, `--prof`, clinic.js	criterion, samply/perf, flame graphs

The deeper point: Rust’s idiomatic, readable default is already fast relative to a GC’d runtime — no garbage collector, no boxed numbers, cache-friendly Vecs. You usually start from a much higher floor, which means there is even less reason to micro-optimize before measuring. The honest, full comparison lives in Performance vs. Node.js.

Note: “Premature optimization is the root of all evil” (Donald Knuth, 1974) is almost always quoted without its qualifier. The full sentence is: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” The skill is identifying the 3% — which requires measurement, not instinct.

Common Pitfalls

Pitfall 1: Benchmarking a debug build

cargo run and cargo test build without optimizations by default. A timing taken there is dominated by un-inlined function calls and overflow checks and bears no relation to production. Always measure with --release (or cargo bench, which is release by default). This is the same overflow-check behavior that makes a plain a + b panic in debug while wrapping in release:

fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|x| x * x).sum() // plain `*` and `sum`: overflow-checked in debug
}

fn main() {
    println!("{}", sum_of_squares(50_000_000));
}

In a debug build this fails with a real panic, proof you were never measuring steady-state behavior:

thread 'main' panicked at .../library/core/src/iter/traits/accum.rs:149:1:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
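
If you want overflow handled the same way in every build profile, you can make the choice explicit instead of relying on the profile default. A minimal sketch, assuming you would rather report overflow than wrap; sum_of_squares_checked is a hypothetical name, and it uses the standard checked_mul, checked_add, and try_fold methods:

```rust
/// Returns None as soon as any intermediate value overflows u64,
/// in debug and release alike.
fn sum_of_squares_checked(n: u64) -> Option<u64> {
    (0..n).try_fold(0u64, |acc, x| {
        x.checked_mul(x).and_then(|sq| acc.checked_add(sq))
    })
}

fn main() {
    match sum_of_squares_checked(50_000_000) {
        Some(total) => println!("{total}"),
        None => println!("overflowed u64: use a wider type, or wrapping_* on purpose"),
    }
}
```

The checks are not free, so this is itself a change to measure; the point is only that the behavior no longer differs between the build you time and the build you debug.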

Pitfall 2: Optimizing the thing that looks slow

As the worked example showed, the obvious-looking line (the sort) was not the hot spot. Profile before you touch anything: see Profiling and Flame Graphs. A change that doesn’t move a profiled number is not an optimization — it is just risk.

Pitfall 3: Reaching for cleverness that doesn’t even compile

A classic premature micro-optimization is mutating a collection “in place” while iterating it, to “avoid a second pass.” Rust’s borrow checker rejects it outright:

fn main() {
    let mut data = vec![1, 2, 3, 4, 5, 6];
    // does not compile (error[E0502]): "optimize" by removing while iterating
    for (i, &x) in data.iter().enumerate() {
        if x % 2 == 0 {
            data.remove(i);
        }
    }
    println!("{data:?}");
}

The real compiler error:

error[E0502]: cannot borrow `data` as mutable because it is also borrowed as immutable
 --> src/main.rs:6:13
  |
4 |     for (i, &x) in data.iter().enumerate() {
  |                    -----------------------
  |                    |
  |                    immutable borrow occurs here
  |                    immutable borrow later used here
5 |         if x % 2 == 0 {
6 |             data.remove(i);
  |             ^^^^^^^^^^^^^^ mutable borrow occurs here

The clean version is also the fast one — a single in-place pass via the standard library:

fn main() {
    let mut data = vec![1, 2, 3, 4, 5, 6];
    data.retain(|&x| x % 2 != 0); // readable AND fast
    println!("{data:?}"); // [1, 3, 5]
}

This is the recurring lesson: in Rust the idiomatic, readable form is frequently the fastest one too, so “clever” rewrites cost readability (and sometimes unsafe) for nothing.

Pitfall 4: Trusting one wall-clock number

A single Instant::now() reading is noisy — CPU frequency scaling, the OS scheduler, and cold caches all move it. Use it for rough direction only. For a number you can put in a PR description or compare across commits, use criterion, which warms up, takes many samples, and reports a confidence interval plus regression detection.
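
Short of adopting criterion, you can at least see the noise by taking several samples instead of one. The sketch below is a rough stand-in, not a substitute for a real harness: sample_durations and median are hypothetical helpers, there is no warm-up phase or outlier handling, and the workload in main is an arbitrary placeholder.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: run `f` `runs` times and record each wall-clock duration.
fn sample_durations<T>(runs: usize, mut f: impl FnMut() -> T) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            let out = f();
            let elapsed = start.elapsed();
            black_box(out); // keep the result "used" so the work is not discarded
            elapsed
        })
        .collect()
}

/// Median of the samples (upper median for even counts); None when empty.
fn median(samples: &[Duration]) -> Option<Duration> {
    let mut sorted = samples.to_vec();
    sorted.sort();
    sorted.get(sorted.len() / 2).copied()
}

fn main() {
    let samples = sample_durations(15, || {
        (0..1_000_000u64).filter(|x| x % 3 == 0).sum::<u64>()
    });
    let min = samples.iter().min().unwrap();
    let max = samples.iter().max().unwrap();
    println!(
        "min {min:?}, median {:?}, max {max:?}",
        median(&samples).unwrap()
    );
}
```

If the spread between min and max is comparable to the improvement you are claiming, the single-run number was never evidence.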

Pitfall 5: Optimizing before you have a representative workload

Tuning against a 10-element array tells you about constant factors that vanish at scale, and nothing about the algorithm. Measure with input that resembles production in size and shape (key distribution, string lengths, cardinality). The wrong workload produces confident, wrong conclusions.

Best Practices

Write the boring version first. Idiomatic iterators, String/Vec, clone() where it keeps the code clear. Ship it. (See Functions and Collections for what “idiomatic” looks like.)
Set a target. “Fast enough” needs a number: a p99 latency, a throughput floor, a memory ceiling. Without a target you will optimize forever and ship never.
Always measure in --release, and prefer criterion over hand-rolled timers for anything you’ll act on.
Profile to find the hot spot, then optimize only that. Profiling and flame graphs point at the 3% that matters.
Re-measure after every change, and keep a correctness check (an assert_eq!, a snapshot test) so a faster-but-wrong version can’t sneak through.
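Prefer algorithmic wins over micro-tuning. Going O(n²) → O(n log n) beats any amount of constant-factor fiddling once inputs are large; likewise, on a large collection a HashMap or HashSet lookup beats a hand-tuned linear scan (for a handful of elements the scan can still win, so measure).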
Treat allocation as the default suspect — but confirm it. Needless allocation is the most common real Rust hot spot, which is why Optimization leads with it. Even so, confirm with a profiler before rewriting.
Keep the simpler version if the win is marginal. If a change is within noise, revert it. Maintainability is a performance feature: code you can change quickly is code you can fix and speed up later.
Lean on the free baseline. No GC pauses and cache-friendly data give Rust a high starting floor; often “readable Rust” already meets the target and you optimize nothing. See Performance vs. Node.js.
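
To make “set a target” concrete, the check can be as small as a percentile over latency samples compared against an agreed number. This is a sketch under assumptions: percentile uses the nearest-rank definition, meets_target is a hypothetical gate, and the samples in main are made-up stand-ins for measurements you would collect from a benchmark or from production traces.

```rust
use std::time::Duration;

/// Nearest-rank percentile (p in 1..=100) of latency samples; None when the
/// input is empty or p is out of range.
fn percentile(samples: &[Duration], p: usize) -> Option<Duration> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest rank: ceil(p * n / 100), which is at least 1 and at most n here.
    let rank = (p * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Hypothetical gate: is the measured p99 within the agreed target?
fn meets_target(samples: &[Duration], p99_target: Duration) -> bool {
    percentile(samples, 99).is_some_and(|p99| p99 <= p99_target)
}

fn main() {
    // Stand-in samples of 1 ms through 100 ms.
    let samples: Vec<Duration> = (1..=100).map(Duration::from_millis).collect();
    println!("p99 = {:?}", percentile(&samples, 99).unwrap());
    println!(
        "meets 100 ms target: {}",
        meets_target(&samples, Duration::from_millis(100))
    );
}
```

Once the gate passes on a representative workload, the loop is finished: further tuning has no target left to hit.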

Real-World Example

A small, reusable measurement harness is worth more than any single optimization, because it turns “I think this is faster” into “this is 1.9× faster on our workload.” Here is a self-contained one that compares two implementations of a word-frequency counter against a realistic corpus and proves they agree before reporting timings — exactly the loop you’d run before deciding whether the “optimized” version is worth keeping.

use std::collections::HashMap;
use std::time::Instant;

/// Reusable timing helper: run `f`, print elapsed, return the result.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

/// Naive: allocate a fresh owned, lowercased key for EVERY word, even repeats.
fn count_naive(text: &str) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        let key = word.to_lowercase(); // always allocates
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

/// Fast: look up by borrowed &str first; allocate the owned key only on first insert.
fn count_fast(text: &str) -> HashMap<String, u64> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for word in text.split_whitespace() {
        if let Some(v) = counts.get_mut(word) {
            *v += 1;
        } else {
            counts.insert(word.to_string(), 1);
        }
    }
    counts
}

fn main() {
    // Representative workload: a large, already-lowercase corpus.
    let base = "the quick brown fox jumps over the lazy dog the fox runs ";
    let text = base.repeat(200_000);

    let a = timed("count_naive", || count_naive(&text));
    let b = timed("count_fast", || count_fast(&text));

    // Correctness guard BEFORE we trust the speedup.
    let mut ka: Vec<_> = a.iter().collect();
    let mut kb: Vec<_> = b.iter().collect();
    ka.sort();
    kb.sort();
    assert_eq!(ka, kb, "the two implementations must agree");

    println!("distinct words: {}, 'the' = {}", a.len(), a["the"]);
}

A representative --release run:

count_naive: 103.63775ms
count_fast: 56.74ms
distinct words: 9, 'the' = 600000

The measurement justifies the change: avoiding a per-word allocation when the word is already lowercase and already counted roughly halves the time (~104 ms → ~57 ms), and the assert_eq! confirms that both implementations produce identical counts on this corpus. (count_fast skips lowercasing, so the two agree only because the corpus is already lowercase.) This is the complete loop in miniature: readable baseline, representative workload, measured hot spot (the to_lowercase() allocation), targeted fix, confirmed win, preserved correctness. The borrowing technique that made it faster is covered in depth in Optimization; the decision to apply it is what this page is about.

Note: In production you would graduate from timed to criterion for statistically sound numbers, and from “two functions in main” to a profiler (Profiling) to find which function to look at in the first place. timed is the gateway drug, not the destination.

Exercises

Exercise 1: Build a Timing Harness and Compare Two Approaches

Difficulty: Beginner

Objective: Practice the measure-first loop with a reusable timed helper, and confirm both approaches produce the same answer.

Instructions: Write a timed<T>(label, f) helper that prints how long f took and returns its result. Use it to compare two ways of summing the multiples of 3 below n: one that collects into a Vec first, and one that sums lazily in a single pass. Run with cargo run --release on n = 20_000_000 and assert_eq! that both agree.

use std::time::Instant;

fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    // TODO: time `f`, print "{label}: {elapsed:?}", return its result
    todo!()
}

fn sum_collect(n: u64) -> u64 {
    /* ??? collect multiples of 3 into a Vec, then sum */
    todo!()
}

fn sum_lazy(n: u64) -> u64 {
    /* ??? sum multiples of 3 in one lazy pass */
    todo!()
}

fn main() {
    let n = 20_000_000u64;
    let a = timed("sum_collect", || sum_collect(n));
    let b = timed("sum_lazy", || sum_lazy(n));
    assert_eq!(a, b);
    println!("both agree: {a}");
}

Solution

use std::time::Instant;

fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

// Approach A: collect into a Vec, then sum (one intermediate allocation).
fn sum_collect(n: u64) -> u64 {
    let v: Vec<u64> = (0..n).filter(|x| x % 3 == 0).collect();
    v.iter().sum()
}

// Approach B: one lazy pass, no intermediate Vec.
fn sum_lazy(n: u64) -> u64 {
    (0..n).filter(|x| x % 3 == 0).sum()
}

fn main() {
    let n = 20_000_000u64;
    let a = timed("sum_collect", || sum_collect(n));
    let b = timed("sum_lazy", || sum_lazy(n));
    assert_eq!(a, b);
    println!("both agree: {a}");
}

A representative --release run:

sum_collect: 25.619667ms
sum_lazy: 13.340375ms
both agree: 66666663333333

The lazy version is roughly twice as fast because it never materializes the multiples into a Vec — there is no intermediate allocation to fill and walk. Both produce 66666663333333, so the faster version is also correct. (Numbers are illustrative and machine-dependent; the lesson is the workflow, and that “readable lazy iterators” won without any cleverness.)

Exercise 2: Find and Fix the Proven Hot Spot

Difficulty: Intermediate

Objective: Use measurement to locate a hot spot, apply a targeted fix, and prove correctness was preserved.

Instructions: You are given count_naive, which counts word frequencies but allocates a brand-new owned key for every word via to_lowercase(), even for repeats. The corpus is already lowercase. Write count_fast that looks up by borrowed &str first and only allocates the owned key on first insert. Time both with the timed helper from Exercise 1 against base.repeat(200_000), and assert_eq! their results (sort the entries first, since HashMap order is arbitrary).

use std::collections::HashMap;
use std::time::Instant;

fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

fn count_naive(text: &str) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        let key = word.to_lowercase();
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

fn count_fast(text: &str) -> HashMap<String, u64> {
    // TODO: look up by &str first; only allocate the owned key on first insert
    todo!()
}

fn main() {
    let base = "the quick brown fox jumps over the lazy dog the fox runs ";
    let text = base.repeat(200_000);
    let a = timed("count_naive", || count_naive(&text));
    let b = timed("count_fast", || count_fast(&text));
    let mut ka: Vec<_> = a.iter().collect();
    let mut kb: Vec<_> = b.iter().collect();
    ka.sort();
    kb.sort();
    assert_eq!(ka, kb);
    println!("distinct words: {}, 'the' = {}", a.len(), a["the"]);
}

Solution

use std::collections::HashMap;
use std::time::Instant;

fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

fn count_naive(text: &str) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        let key = word.to_lowercase(); // allocates on every word
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

fn count_fast(text: &str) -> HashMap<String, u64> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for word in text.split_whitespace() {
        // Borrowed lookup: no allocation when the word is already present.
        if let Some(v) = counts.get_mut(word) {
            *v += 1;
        } else {
            counts.insert(word.to_string(), 1); // allocate only on first insert
        }
    }
    counts
}

fn main() {
    let base = "the quick brown fox jumps over the lazy dog the fox runs ";
    let text = base.repeat(200_000);

    let a = timed("count_naive", || count_naive(&text));
    let b = timed("count_fast", || count_fast(&text));

    let mut ka: Vec<_> = a.iter().collect();
    let mut kb: Vec<_> = b.iter().collect();
    ka.sort();
    kb.sort();
    assert_eq!(ka, kb, "the two implementations must agree");

    println!("distinct words: {}, 'the' = {}", a.len(), a["the"]);
}

A representative --release run:

count_naive: 103.63775ms
count_fast: 56.74ms
distinct words: 9, 'the' = 600000

The naive version allocates a String for every one of the 2.4 million word occurrences; the fast version allocates only nine times (once per distinct word). Roughly halving the time, with assert_eq! confirming identical counts, is the textbook outcome of measure → fix the proven hot spot → re-measure. (If the corpus contained mixed case, you would need case-insensitive keys and the trade-off would be different — another reason to measure against a representative workload.)
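
For the mixed-case situation mentioned above, one possible middle ground is to lowercase only the words that need it and borrow the rest. The sketch below is illustrative and makes a loud assumption: it folds ASCII case only, so it is not equivalent to to_lowercase() on non-ASCII text. ascii_lower_key and count_mixed_case are hypothetical names, and whether this beats the naive version depends on how much of your corpus is actually mixed case.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Borrow the word when it contains no ASCII uppercase letters;
/// allocate a lowercased copy only when it does. ASCII-only by assumption.
fn ascii_lower_key(word: &str) -> Cow<'_, str> {
    if word.bytes().any(|b| b.is_ascii_uppercase()) {
        Cow::Owned(word.to_ascii_lowercase())
    } else {
        Cow::Borrowed(word)
    }
}

/// Case-insensitive (ASCII) word count that allocates for mixed-case words
/// and for first-time insertions, but not for lowercase repeats.
fn count_mixed_case(text: &str) -> HashMap<String, u64> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for word in text.split_whitespace() {
        let key = ascii_lower_key(word);
        if let Some(v) = counts.get_mut(&*key) {
            *v += 1;
        } else {
            counts.insert(key.into_owned(), 1);
        }
    }
    counts
}

fn main() {
    let counts = count_mixed_case("The quick fox and the lazy Fox");
    println!("the = {}, fox = {}", counts["the"], counts["fox"]);
}
```

As with count_fast, this only earns its place if a measurement on a representative corpus shows the allocation was the hot spot.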

Exercise 3: Decide With a Benchmark — and Honor the Result

Difficulty: Advanced

Objective: Use criterion to compare two correct implementations and practice the hardest part of the loop: keeping the simpler code when the “optimization” doesn’t clearly win.

Instructions: Add criterion as a dev-dependency (cargo add criterion --dev) and create benches/dedup.rs, registered in Cargo.toml with harness = false. Benchmark two ways to count distinct values in a &[u32] with many duplicates: distinct_sort (clone, sort_unstable, dedup, len) versus distinct_set (collect into a HashSet, take len). Wrap inputs in black_box. Run cargo bench and write down which you would ship, and why.

[dev-dependencies]
criterion = "0.8"

[[bench]]
name = "dedup"
harness = false

use std::collections::HashSet;
use std::hint::black_box; // the standard-library black_box; independent of the criterion version
use criterion::{criterion_group, criterion_main, Criterion};

fn distinct_sort(input: &[u32]) -> usize {
    // TODO: clone, sort_unstable, dedup, return len
    todo!()
}

fn distinct_set(input: &[u32]) -> usize {
    // TODO: collect into a HashSet, return len
    todo!()
}

fn bench(c: &mut Criterion) {
    // TODO: build a duplicate-heavy dataset and bench both, using black_box
    let _ = c;
}

criterion_group!(benches, bench);
criterion_main!(benches);

Solution

use std::collections::HashSet;
use std::hint::black_box; // the standard-library black_box; independent of the criterion version
use criterion::{criterion_group, criterion_main, Criterion};

// Approach A: sort then dedup (allocates one sorted copy).
fn distinct_sort(input: &[u32]) -> usize {
    let mut v = input.to_vec();
    v.sort_unstable();
    v.dedup();
    v.len()
}

// Approach B: HashSet.
fn distinct_set(input: &[u32]) -> usize {
    let set: HashSet<u32> = input.iter().copied().collect();
    set.len()
}

fn bench(c: &mut Criterion) {
    // Duplicate-heavy: 100k values drawn from at most 5k distinct keys.
    let data: Vec<u32> = (0..100_000u32).map(|i| (i.wrapping_mul(2654435761)) % 5000).collect();

    let mut group = c.benchmark_group("distinct");
    group.bench_function("sort", |b| b.iter(|| distinct_sort(black_box(&data))));
    group.bench_function("set", |b| b.iter(|| distinct_set(black_box(&data))));
    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);

A representative cargo bench run:

distinct/sort           time:   [879.07 µs 904.35 µs 946.40 µs]
distinct/set            time:   [956.96 µs 985.16 µs 1.0266 ms]

On this workload the two are within roughly 10% of each other — not a decisive win for either. That is the lesson: the benchmark didn’t crown a clear champion, so you ship whichever is clearer for your codebase (often the HashSet, which expresses intent directly and doesn’t mutate a copy), and you do not invent a third “clever” variant chasing a margin this thin. The honest answer to “which is faster?” is sometimes “it doesn’t matter — pick the readable one.” (The relationship would flip with a different distribution; for instance, very few duplicates makes the HashSet relatively worse — which is precisely why you benchmark your real data, per Benchmarking.)
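
A benchmark compares speed, not answers, so it is worth pairing it with the same kind of correctness guard used earlier on this page. The sketch below is a standalone check using only the standard library; it repeats the two functions from the solution and asserts they agree on the benchmark's dataset. Placing it in main is a choice made for brevity, and in a real project a unit test would be the more usual home.

```rust
use std::collections::HashSet;

// Approach A: sort a copy, then dedup.
fn distinct_sort(input: &[u32]) -> usize {
    let mut v = input.to_vec();
    v.sort_unstable();
    v.dedup();
    v.len()
}

// Approach B: collect into a HashSet.
fn distinct_set(input: &[u32]) -> usize {
    let set: HashSet<u32> = input.iter().copied().collect();
    set.len()
}

fn main() {
    // Same duplicate-heavy dataset as the benchmark.
    let data: Vec<u32> = (0..100_000u32)
        .map(|i| i.wrapping_mul(2654435761) % 5000)
        .collect();

    // Both implementations must report the same count before timings mean anything.
    let by_sort = distinct_sort(&data);
    let by_set = distinct_set(&data);
    assert_eq!(by_sort, by_set);
    assert!(by_set <= 5000); // values are reduced modulo 5000
    println!("both agree: {by_set} distinct values");
}
```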
