Parallel Iterators with Rayon

In JavaScript and TypeScript, array.map() and array.filter() are always single-threaded. Even on a machine with eight cores, a plain Array.prototype.filter runs on exactly one of them. Rust’s Rayon crate lets you turn a sequential iterator into a parallel one by changing a single method call — .iter() becomes .par_iter() — and the work fans out across every CPU core, with the borrow checker guaranteeing there are no data races.

Quick Overview

A parallel iterator processes the elements of a collection across multiple threads at once, then combines the results. Rayon provides drop-in parallel versions of the iterator adapters you already know (map, filter, sum, collect, reduce), so converting a sequential pipeline to a parallel one is usually a one-word change. The catch — and the focus of this chapter — is that parallelism only pays off when there is enough independent, CPU-bound work to overcome the cost of coordinating threads.

Note: This chapter uses Rust 1.96.0 on the latest stable edition (2024). The examples use rayon 1.12.0; run cargo add rayon in a fresh project to pull the current version.

TypeScript/JavaScript Example

In Node.js (here, v22), the built-in array methods are synchronous and single-threaded. Counting primes below two million keeps one core busy while the other seven sit idle:

// prime-count.ts: single-threaded, no matter how many cores you have
function isPrime(n: number): boolean {
  if (n < 2) return false;
  for (let d = 2; d * d <= n; d++) {
    if (n % d === 0) return false;
  }
  return true;
}

const numbers: number[] = Array.from({ length: 2_000_000 - 2 }, (_, i) => i + 2);

const start = performance.now();
const count = numbers.filter(isPrime).length;
const ms = (performance.now() - start).toFixed(1);

console.log(`primes: ${count} in ${ms} ms (single-threaded)`);

Running it with node --experimental-strip-types prime-count.ts on an 8-core machine:

primes: 148933 in 1254.7 ms (single-threaded)

To actually use the other cores in Node you reach for worker_threads: spawn workers, split the range yourself, send each chunk over a MessageChannel, run the computation in each worker, and merge the partial results back in the main thread. That is a lot of boilerplate — manual chunking, message serialization, lifecycle management — for what is conceptually still “filter this array.” There is no array.parallelFilter().

Note: A Worker runs on a real OS thread, but data sent across the channel is copied (structured clone) unless you transfer it or back it with a SharedArrayBuffer. JavaScript has no shared-memory data parallelism for ordinary objects, and that gap is what forces the boilerplate Rayon removes.

Rust Equivalent

In Rust, the parallel version is the sequential version with iter() swapped for par_iter() (after importing the Rayon prelude):

// Cargo.toml: run `cargo add rayon`
use rayon::prelude::*;
use std::time::Instant;

fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

fn main() {
    let numbers: Vec<u64> = (2..2_000_000).collect();
    println!("threads: {}", rayon::current_num_threads());

    // Sequential: one core.
    let t = Instant::now();
    let seq = numbers.iter().filter(|&&n| is_prime(n)).count();
    println!("sequential: {seq} primes in {:?}", t.elapsed());

    // Parallel: every core. The ONLY change is `iter` -> `par_iter`.
    let t = Instant::now();
    let par = numbers.par_iter().filter(|&&n| is_prime(n)).count();
    println!("parallel:   {par} primes in {:?}", t.elapsed());
}

Real output from cargo run --release on the same 8-core machine:

threads: 8
sequential: 148933 primes in 188.238375ms
parallel:   148933 primes in 40.672ms

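Two things stand out. First, Rust’s sequential version (188 ms) is already ~6.7x faster than Node’s (1255 ms) on this machine. Node is not purely interpreted (V8 JIT-compiles hot functions), so the gap most likely comes from Rust’s ahead-of-time, monomorphized build working on native u64 values with no JIT warm-up. Second, par_iter takes that 188 ms down to ~41 ms, roughly a 4.6x speedup on 8 cores, for the price of changing one word. The prime count, 148933, matches both the sequential Rust run and Node’s, which is a good sanity check that the parallel version computes the same answer.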

Note: Always benchmark parallel code with cargo run --release (or cargo bench). A debug build leaves the per-element work unoptimized, which inflates the apparent speedup and tells you nothing about production performance.

Detailed Explanation

The prelude brings the parallel methods into scope

use rayon::prelude::*;

This single import adds the par_iter, par_iter_mut, and into_par_iter methods to standard collections (Vec, slices, HashMap, BTreeMap, ranges, and more) through the IntoParallelRefIterator, IntoParallelRefMutIterator, and IntoParallelIterator traits respectively. Without it, par_iter simply does not exist as a method (see Common Pitfalls). This mirrors how use std::io::Write; is required before write! works on a file: the trait must be in scope.
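The mechanism behind this is an ordinary extension trait, and it can be reproduced without Rayon. The following is a minimal std-only sketch; the module `sketch`, the trait `DoubledExt`, and its method `doubled` are made-up names for illustration and are not part of Rayon or the standard library.

```rust
mod sketch {
    /// Hypothetical extension trait: adds a `doubled` method to `Vec<i32>`.
    pub trait DoubledExt {
        fn doubled(&self) -> Vec<i32>;
    }

    impl DoubledExt for Vec<i32> {
        fn doubled(&self) -> Vec<i32> {
            self.iter().map(|x| x * 2).collect()
        }
    }
}

// Without this import, `v.doubled()` below fails with error[E0599],
// in the same way `par_iter` does without `use rayon::prelude::*;`.
use sketch::DoubledExt;

fn main() {
    let v = vec![1, 2, 3];
    println!("{:?}", v.doubled()); // prints [2, 4, 6]
}
```

The method lives on the trait, not on `Vec`, so the compiler only considers it when the trait is imported. Rayon’s prelude is a convenient bundle of such imports.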

Three ways in: `par_iter`, `par_iter_mut`, `into_par_iter`

These map directly onto the three sequential forms, which you can read as “borrow shared / borrow mutable / take ownership”:

| Sequential | Parallel | Yields | Use when |
| --- | --- | --- | --- |
| `v.iter()` | `v.par_iter()` | `&T` | You only need to read each element |
| `v.iter_mut()` | `v.par_iter_mut()` | `&mut T` | You want to mutate each element in place |
| `v.into_iter()` | `v.into_par_iter()` | `T` | You can consume the collection |

par_iter_mut is the parallel sweet spot for in-place transforms: because each thread gets a disjoint, non-overlapping &mut T, there is no aliasing and no synchronization needed — the borrow checker proves the slices don’t overlap.
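The disjoint-slices idea can be sketched with the standard library alone. This is not how Rayon is implemented (Rayon uses a thread pool and chooses split points itself); `double_in_place` and its `workers` parameter are hypothetical names for this illustration.

```rust
use std::thread;

/// Doubles every element in place, handing each thread its own disjoint chunk.
/// `workers` is an assumed knob for this sketch only.
fn double_in_place(data: &mut [u64], workers: usize) {
    if data.is_empty() {
        return;
    }
    // At least 1, because `data` is non-empty and `workers` is clamped to >= 1.
    let chunk_len = data.len().div_ceil(workers.max(1));
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [u64]` slices, so each
        // spawned thread has exclusive access to its part of the data.
        for chunk in data.chunks_mut(chunk_len) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= 2;
                }
            });
        }
    });
}

fn main() {
    let mut v: Vec<u64> = (1..=8).collect();
    double_in_place(&mut v, 4);
    println!("{v:?}"); // prints [2, 4, 6, 8, 10, 12, 14, 16]
}
```

No `Mutex` or atomic appears anywhere: exclusivity is carried by the `&mut` slices themselves, which is the same property `par_iter_mut` relies on.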

What a `ParallelIterator` is (and is not)

A Rayon parallel iterator is lazy, just like a std::iter::Iterator. Nothing runs until a consuming operation — collect, sum, reduce, count, for_each, find_any — drives it. The adapters in between (map, filter, filter_map, flat_map) only build up a description of the work. This is the same laziness model as the standard library iterators covered in Section 02, not the eager evaluation of a JavaScript array method (which materializes a new array at every .map).

Internally, Rayon uses work stealing over a divide-and-conquer split: it recursively halves the index range, hands the halves to a global thread pool, and idle threads “steal” pending halves from busy ones. You don’t manage any of this. The pool and its join primitive are covered in thread-pools.md; this chapter stays at the iterator level.
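To give a feel for the divide-and-conquer half of that description, here is a std-only sketch that recursively halves a slice and sums the halves on separate threads. It deliberately leaves out the pool and the stealing: it spawns a fresh OS thread per split, which Rayon does not do. `split_sum` and `SEQUENTIAL_CUTOFF` are assumed names and values for illustration.

```rust
use std::thread;

/// Below this length the slice is summed sequentially (an assumed cutoff).
const SEQUENTIAL_CUTOFF: usize = 1_024;

/// Recursively halves `data`, summing the left half on a new scoped thread
/// and the right half on the current one.
fn split_sum(data: &[u64]) -> u64 {
    if data.len() <= SEQUENTIAL_CUTOFF {
        return data.iter().sum();
    }
    let (left, right) = data.split_at(data.len() / 2);
    thread::scope(|s| {
        let left_handle = s.spawn(|| split_sum(left));
        let right_total = split_sum(right);
        left_handle.join().expect("left half panicked") + right_total
    })
}

fn main() {
    let data: Vec<u64> = (1..=10_000).collect();
    println!("{}", split_sum(&data)); // prints 50005000
}
```

Rayon’s version of this shape hands the pending half to a fixed pool instead of a new thread, and lets idle workers steal it, which is what keeps the overhead low enough to use on fine-grained work.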

Reductions: `sum`, `reduce`, and order independence

use rayon::prelude::*;

fn main() {
    // map + collect: order is PRESERVED for indexed sources like ranges and Vec.
    let squares: Vec<u64> = (1..=8).into_par_iter().map(|n| n * n).collect();
    println!("squares: {squares:?}");

    // sum: a built-in parallel reduction.
    let total: u64 = (1..=1_000_000u64).into_par_iter().sum();
    println!("sum 1..=1_000_000: {total}");

    // reduce: explicit identity + associative combiner (here, factorial of 10).
    let product: u64 = (1..=10u64).into_par_iter().reduce(|| 1, |a, b| a * b);
    println!("10! = {product}");

    // find_any short-circuits across threads.
    let found = (1..1_000_000u64).into_par_iter().find_any(|&n| n * n == 1_000_000);
    println!("found: {found:?}");
}

Output:

squares: [1, 4, 9, 16, 25, 36, 49, 64]
sum 1..=1_000_000: 500000500000
10! = 3628800
found: Some(1000)

The crucial concept in any parallel reduction is associativity. Rayon splits the data into chunks, reduces each chunk on a separate thread, and then combines the per-chunk results; how the elements end up grouped is unspecified and can differ from run to run. For sum and product that is fine because (a + b) + c == a + (b + c). Your combiner must give the same answer for every grouping:

reduce(|| 1, |a, b| a * b) is safe: multiplication is associative, and 1 is its identity.
A combiner that subtracts is not safe: (a - b) - c != a - (b - c), so the result depends on where Rayon happened to split the work and can change between runs.

find_any in the example above has a related caveat: it returns some matching element, not necessarily the first by index.

If you need the first match by position, use find_first instead of find_any; it pays a small coordination cost to honor ordering.
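The associativity requirement can be checked without any threads at all. The sketch below is a sequential model, not Rayon’s algorithm: it reduces each chunk from the identity and then combines the partial results, and varying `chunk_len` stands in for the different ways a scheduler might split the data. `chunked_reduce`, `add`, and `sub` are hypothetical helpers.

```rust
fn add(a: i64, b: i64) -> i64 {
    a + b
}

fn sub(a: i64, b: i64) -> i64 {
    a - b
}

/// Reduces each chunk starting from `identity`, then combines the per-chunk
/// results left to right with the same `op`.
fn chunked_reduce(data: &[i64], chunk_len: usize, identity: i64, op: fn(i64, i64) -> i64) -> i64 {
    data.chunks(chunk_len.max(1)) // guard: `chunks(0)` would panic
        .map(|chunk| chunk.iter().copied().fold(identity, op))
        .reduce(op)
        .unwrap_or(identity)
}

fn main() {
    let data = [1, 2, 3, 4, 5, 6];
    for chunk_len in [2, 3, 6] {
        println!(
            "chunk_len {chunk_len}: add = {}, sub = {}",
            chunked_reduce(&data, chunk_len, 0, add),
            chunked_reduce(&data, chunk_len, 0, sub),
        );
    }
    // add is 21 for every chunk_len; sub is 15, 9, and -21 respectively.
}
```

Addition gives the same total however the data is grouped, while subtraction gives a different answer for each grouping. That grouping sensitivity is what shows up as run-to-run variation once real threads decide the split points.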

Order preservation in `collect`

Note that collect does preserve order when the source is indexed (a range, Vec, or slice): squares above comes back [1, 4, 9, ...], not shuffled. Rayon tracks each element’s position and reassembles the output Vec in source order, even though the work ran out of order. The ordering caveats in this chapter apply to non-associative reduce/fold combiners, to find_any, and to par_bridge (next section), not to collect.

Key Differences

`par_iter` vs Node `worker_threads`

| Aspect | Node.js `worker_threads` | Rayon `par_iter` |
| --- | --- | --- |
| Code change to parallelize | Spawn workers, chunk data, post/receive messages, merge | Change `iter()` to `par_iter()` |
| Memory model | Data copied across channel (or `SharedArrayBuffer` by hand) | Shared memory; threads borrow disjoint slices |
| Data-race safety | Your responsibility | Guaranteed by the borrow checker at compile time |
| Thread pool | You create and manage workers | Global pool created lazily, reused |
| Scheduling | Manual chunking | Automatic work-stealing, load-balanced |
| Result ordering | Whatever your merge logic does | `collect` preserves order; reductions need associativity |

`par_iter` vs `par_bridge`

Not every iterator can be split into halves cheaply. A Vec knows its length and can be indexed, so Rayon splits it directly. A sequential iterator like str::lines() or a File’s line reader can only be advanced one item at a time, so Rayon can’t jump to the middle. For those, par_bridge adapts any Iterator (provided the iterator and its items are Send) into a ParallelIterator by pulling items one at a time (under a lock) and feeding them to worker threads:

use rayon::prelude::*;
use std::collections::HashMap;

fn expensive_hash(s: &str) -> u64 {
    // Stand-in for genuinely CPU-heavy per-item work.
    let mut h = 0u64;
    for _ in 0..50_000 {
        h = 1469598103934665603;
        for b in s.bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(1099511628211);
        }
    }
    h
}

fn main() {
    let text = "alpha\nbeta\ngamma\ndelta\nepsilon\nzeta";

    // `lines()` is a sequential iterator: bridge it into a parallel one.
    let results: HashMap<String, u64> = text
        .lines()
        .par_bridge()
        .map(|line| (line.to_string(), expensive_hash(line)))
        .collect();

    let mut keys: Vec<_> = results.keys().cloned().collect();
    keys.sort();
    for k in keys {
        println!("{k} -> {}", results[&k]);
    }
}

Output (sorted for stable display):

alpha -> 6542418319912364133
beta -> 17583068548789615225
delta -> 14161400069455568611
epsilon -> 11109341111963135187
gamma -> 4439282355344678600
zeta -> 5298269982014079025

Two caveats with par_bridge:

It does not preserve order. Items are consumed sequentially but processed in whatever order threads finish. Collecting into a Vec would give you an unspecified order — collect into a HashMap (as above), sort afterward, or use par_iter on an indexed collection if order matters.
The producer is a bottleneck. Pulling items happens under a mutex, so if producing each item is itself slow (e.g. blocking I/O), par_bridge only parallelizes the processing, not the production. The win comes entirely from the per-item work being expensive relative to the cost of pulling it.

Tip: When you can, read all the data into a Vec first and use par_iter — it splits more efficiently than par_bridge and preserves order. Reach for par_bridge only when the source genuinely cannot be collected up front, or when each item is so expensive that the pull cost is negligible.
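Both caveats fall out of the “pull under a lock” design, which can be modelled with the standard library. The sketch below is a conceptual stand-in, not Rayon’s implementation: `bridge_map` and `workers` are hypothetical names, and it uses plain scoped threads instead of a pool.

```rust
use std::sync::Mutex;
use std::thread;

/// Pulls items from a sequential iterator under a lock and processes them on
/// `workers` threads. Results arrive in whatever order the threads finish.
fn bridge_map<I, T, R, F>(iter: I, workers: usize, f: F) -> Vec<R>
where
    I: Iterator<Item = T> + Send,
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    // `fuse` makes repeated `next()` calls after exhaustion safe.
    let source = Mutex::new(iter.fuse());
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers.max(1) {
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    // The lock is held only while pulling the next item...
                    let next = source.lock().unwrap().next();
                    match next {
                        // ...and the per-item work runs outside the lock.
                        Some(item) => local.push(f(item)),
                        None => break,
                    }
                }
                local
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

fn main() {
    let text = "alpha\nbeta\ngamma\ndelta";
    let mut lengths = bridge_map(text.lines(), 3, |line| (line.to_string(), line.len()));
    lengths.sort(); // arrival order is unspecified, so sort for stable display
    println!("{lengths:?}");
}
```

Every item passes through the single `Mutex`, so a slow producer serializes the whole pipeline, and nothing records an item’s original position, so order is lost unless you carry an index yourself.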

Rust is not parallel by default

A Vec::iter() chain runs on one thread. Parallelism in Rust is explicit and opt-in — you ask for it by writing par_iter. This is the same philosophy as the rest of the language: zero cost you didn’t request. Contrast this with the common misconception that “Rust is multi-threaded by default” — it is not. What Rust gives you is fearless concurrency: when you do opt in, the Send/Sync trait bounds and the borrow checker prevent data races at compile time. The standard threading model this builds on is covered in threads.md.

Common Pitfalls

Forgetting the prelude import

The single most common first error. Without use rayon::prelude::*;, the parallel methods are not in scope:

// does not compile (error[E0599]): missing `use rayon::prelude::*;`
fn main() {
    let v: Vec<i32> = (1..=10).collect();
    let sum: i32 = v.par_iter().sum();
    println!("{sum}");
}

The real compiler error:

error[E0599]: no method named `par_iter` found for struct `Vec<i32>` in the current scope
 --> src/main.rs:3:22
  |
3 |     let sum: i32 = v.par_iter().sum();
  |                      ^^^^^^^^
  |
help: there is a method `iter` with a similar name
  |
3 -     let sum: i32 = v.par_iter().sum();
3 +     let sum: i32 = v.iter().sum();
  |

The fix is the import; the compiler’s suggestion to use iter would silently make the code sequential, which is not what you want here.

Mutating shared state inside `for_each`

A TypeScript developer’s instinct is to push into an outer array from inside the loop. Rayon’s closures are Fn (callable from many threads at once), so they cannot capture an outer variable by mutable reference:

// does not compile (error[E0596]): cannot mutate captured `results` from a parallel closure
use rayon::prelude::*;

fn main() {
    let mut results = Vec::new();
    (0..100).into_par_iter().for_each(|n| {
        results.push(n * n); // many threads, one Vec -> data race, rejected at compile time
    });
    println!("{}", results.len());
}

The real compiler error:

error[E0596]: cannot borrow `results` as mutable, as it is a captured variable in a `Fn` closure
 --> src/main.rs:6:9
  |
6 |         results.push(n * n); // many threads, one Vec -> data race, rejected at compile time
  |         ^^^^^^^ cannot borrow as mutable

This is the borrow checker stopping a data race before it can exist. The idiomatic fix is not a lock — it’s to map and collect, letting Rayon assemble the result for you:

use rayon::prelude::*;

fn main() {
    // Idiomatic: no shared mutable Vec, no lock. collect() reassembles in order.
    let results: Vec<u64> = (0..10).into_par_iter().map(|n| n * n).collect();
    println!("squares: {results:?}");
}

Output:

squares: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

If you genuinely need to accumulate into shared state (rare), wrap it in a Mutex or, better, use a fold + reduce pair so each thread accumulates locally and you merge at the end (see Exercise 3). Atomics are an option for simple counters; see atomic-operations.md.
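For the simple-counter case, the atomic option looks roughly like the following std-only sketch. It uses scoped threads rather than Rayon so that it runs with no dependencies; `count_multiples_of_three` and `workers` are hypothetical names. Each thread counts locally and publishes once, which keeps traffic on the shared counter low.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Counts multiples of three across `workers` threads using one shared atomic.
fn count_multiples_of_three(data: &[u32], workers: usize) -> usize {
    let hits = AtomicUsize::new(0);
    let hits_ref = &hits; // shared reference that the `move` closures can copy
    let chunk_len = data.len().div_ceil(workers.max(1)).max(1);
    thread::scope(|s| {
        for chunk in data.chunks(chunk_len) {
            s.spawn(move || {
                // Count locally first, then touch the shared counter once.
                let local = chunk.iter().filter(|&&x| x % 3 == 0).count();
                hits_ref.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    // All threads have been joined by the end of the scope, so this read
    // observes every `fetch_add`.
    hits.load(Ordering::Relaxed)
}

fn main() {
    let data: Vec<u32> = (1..=30).collect();
    println!("{}", count_multiples_of_three(&data, 4)); // prints 10
}
```

An `AtomicUsize` is shareable through `&`, which is why it can be used from many threads where a captured `&mut Vec` cannot.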

Parallelizing cheap work on small inputs

Parallelism is not free: splitting, dispatching to the pool, and joining all cost time. When the per-element work is trivial and the collection is small, that overhead dwarfs the actual computation and parallel is slower:

use rayon::prelude::*;
use std::time::Instant;

fn main() {
    let data: Vec<u64> = (0..1_000).collect(); // small input, trivial work

    let _: u64 = data.par_iter().sum(); // warm up the pool

    let runs = 1000;
    let (mut seq_total, mut par_total) = (0u128, 0u128);
    for _ in 0..runs {
        let t = Instant::now();
        let s1: u64 = data.iter().map(|&x| x + 1).sum();
        seq_total += t.elapsed().as_nanos();
        std::hint::black_box(s1);

        let t = Instant::now();
        let s2: u64 = data.par_iter().map(|&x| x + 1).sum();
        par_total += t.elapsed().as_nanos();
        std::hint::black_box(s2);
    }
    println!("sequential avg: {} ns", seq_total / runs);
    println!("parallel avg:   {} ns", par_total / runs);
}

Output:

sequential avg: 44005 ns
parallel avg:   2216750 ns

Here the parallel version is ~50x slower. Adding 1 to a thousand numbers takes microseconds; the thread coordination takes milliseconds. The rule of thumb: parallelize when you have both a large number of elements and meaningful work per element. When in doubt, measure with criterion (benchmarking is covered in Section 21).

Result reordering surprises

par_bridge does not preserve input order, find_any returns whichever match a thread reports first, and a reduce/fold combiner that is not associative gives results that depend on how the work was split. If your code assumes the output is in the same order as the input, use par_iter on an indexed collection with collect (which does preserve order), or use find_first/collect_into_vec rather than find_any. Never rely on a particular grouping in a parallel reduction.

Best Practices

Prefer map + collect (or sum/reduce) over for_each with shared state. Expressing the computation as a pure transformation lets Rayon handle accumulation safely and lock-free.
Use fold + reduce for per-thread accumulation. When building a map or histogram, fold gives each thread a local accumulator and reduce merges them — far better than contending on a single Mutex.
Benchmark in --release, on representative input sizes. A speedup in debug mode is meaningless; an input that’s small in your test may be large in production (or vice versa).
Keep closures pure and side-effect-free. Parallel closures should compute from their inputs, not reach out to mutate the world. This is also what makes them trivially correct.
Reach for par_bridge only when you can’t collect up front. Prefer reading into a Vec and using par_iter, which splits efficiently and preserves order.
Tune granularity only if profiling demands it. Methods like .with_min_len(n) let you batch small items so each task does at least n of them, amortizing dispatch cost. Start without it; add it only if benchmarks show task overhead dominating.
Don’t parallelize I/O-bound work with Rayon. Rayon’s pool is sized for CPU cores. For waiting on the network or disk, use async (tokio) or dedicated threads, not par_iter. See channels.md for producer/consumer pipelines across threads.

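Warning: Rayon’s closures run on a shared global pool. If a closure blocks (sleeps, waits on I/O, or waits on a lock or channel that another pool task has to release), it ties up a pool thread and can starve other parallel work or even deadlock. Nested par_iter calls are not the problem, because a Rayon thread that is waiting on nested work keeps executing other Rayon tasks; blocking on something Rayon cannot see is. Keep parallel closures CPU-bound and non-blocking.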

Real-World Example

A common production task: aggregate word frequencies across a large corpus of documents. Each document is processed independently (embarrassingly parallel), then the per-document counts are merged. Here map turns each document into its own HashMap and reduce merges the partial maps, so there is no lock contention on a shared map:

// Cargo.toml: run `cargo add rayon`
use rayon::prelude::*;
use std::collections::HashMap;
use std::time::Instant;

/// Process one document: normalize and count words.
fn word_counts(doc: &str) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for word in doc.split_whitespace() {
        let normalized: String = word
            .chars()
            .filter(|c| c.is_alphanumeric())
            .flat_map(|c| c.to_lowercase())
            .collect();
        if !normalized.is_empty() {
            *counts.entry(normalized).or_insert(0) += 1;
        }
    }
    counts
}

/// Merge two partial frequency maps into one.
fn merge(mut a: HashMap<String, u32>, b: HashMap<String, u32>) -> HashMap<String, u32> {
    for (k, v) in b {
        *a.entry(k).or_insert(0) += v;
    }
    a
}

fn main() {
    // Synthetic corpus of 10,000 documents.
    let base = "the quick brown fox jumps over the lazy dog the fox runs fast";
    let docs: Vec<String> = (0..10_000).map(|i| format!("{base} doc{i}")).collect();

    let t = Instant::now();
    let totals: HashMap<String, u32> = docs
        .par_iter()
        .map(|doc| word_counts(doc))   // each doc -> its own map, in parallel
        .reduce(HashMap::new, merge);  // merge all the partial maps
    let elapsed = t.elapsed();

    let mut top: Vec<(&String, &u32)> = totals.iter().collect();
    top.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));

    println!("processed {} docs in {elapsed:?}", docs.len());
    println!("top 5 words:");
    for (word, count) in top.iter().take(5) {
        println!("  {word}: {count}");
    }
}

Real output from cargo run --release:

processed 10000 docs in 48.260709ms
top 5 words:
  the: 30000
  fox: 20000
  brown: 10000
  dog: 10000
  fast: 10000

The shape of this code is the same map-reduce a TypeScript developer would write — docs.map(wordCounts).reduce(merge) — but it runs across all cores with no manual chunking, no worker spawning, and no message passing. Because merge is associative (merge(merge(a, b), c) == merge(a, merge(b, c))), Rayon is free to combine the partial maps in any order, and the borrow checker guarantees no two threads ever touch the same map at once.

Tip: reduce(HashMap::new, merge) takes an identity constructor (HashMap::new, a function returning the empty value) and an associative combiner. This is the parallel analog of Array.prototype.reduce(merge, {}) — the key difference being that the identity is created per chunk, so it must be a fresh value each time, hence a function rather than a single shared object.

When the corpus is too large to hold in memory, combine this with file-system.md for directory walking, processing each file’s path with par_iter. For workloads where the security of processing untrusted input matters, see Section 27.

Exercises

Exercise 1: One-word parallelization

Difficulty: Beginner

Objective: Confirm that the iter() → par_iter() swap composes with a chain of adapters.

Instructions: Compute the sum of the squares of all even numbers from 1 to 1,000,000, using a parallel iterator. Start from this sequential stub and parallelize it:

fn main() {
    let total: u64 = (1..=1_000_000u64)
        .into_iter()
        .filter(|n| n % 2 == 0)
        .map(|n| n * n)
        .sum();
    println!("{total}");
}

Note: The bare range already implements both IntoIterator and (with Rayon imported) IntoParallelIterator, so the explicit .into_iter() here is redundant — clippy will flag it as useless_conversion. It is shown only to make the one-word swap below visually obvious: replace .into_iter() with .into_par_iter() and nothing else changes.

Solution

// Cargo.toml: run `cargo add rayon`
use rayon::prelude::*;

fn main() {
    let total: u64 = (1..=1_000_000u64)
        .into_par_iter()           // the only change
        .filter(|n| n % 2 == 0)
        .map(|n| n * n)
        .sum();
    println!("{total}");
}

Output:

166667166667000000

sum is an associative reduction over u64, so the parallel result is identical to the sequential one. The filter and map adapters compose with the parallel iterator exactly as they do with a sequential one.

Exercise 2: Parallel argmax

Difficulty: Intermediate

Objective: Use a parallel reduction that returns more than a single number.

Instructions: For every starting value n from 1 to 1,000,000, compute the number of steps the Collatz sequence takes to reach 1. Find the n (in that range) that takes the most steps, and print both the n and the step count. Do the search in parallel.

Solution

// Cargo.toml: run `cargo add rayon`
use rayon::prelude::*;

fn collatz_steps(mut n: u64) -> u32 {
    let mut steps = 0;
    while n != 1 {
        n = if n % 2 == 0 { n / 2 } else { 3 * n + 1 };
        steps += 1;
    }
    steps
}

fn main() {
    let (best_n, best_steps) = (1..=1_000_000u64)
        .into_par_iter()
        .map(|n| (n, collatz_steps(n)))
        .max_by_key(|&(_, steps)| steps)
        .unwrap();
    println!("{best_n} -> {best_steps} steps");
}

Output:

837799 -> 524 steps

max_by_key is a parallel reduction: each thread finds the max in its chunk, then the per-chunk maxima are combined. Because “maximum” is associative, the order in which chunks finish does not affect the answer. The .unwrap() is safe because the range is non-empty.

Exercise 3: Parallel histogram with fold + reduce

Difficulty: Advanced

Objective: Build a shared map from parallel work without a lock, using per-thread accumulation.

Instructions: Given a block of text, build a histogram of word lengths: a map from each word length to how many words have that length. Process the words in parallel. Each thread should accumulate into its own HashMap (with fold), and the per-thread maps should be merged at the end (with reduce). Print the lengths in ascending order.

Hint: Rayon provides par_split_whitespace() on &str, and fold takes an identity constructor plus a folding closure.

Solution

// Cargo.toml: run `cargo add rayon`
use rayon::prelude::*;
use std::collections::HashMap;

fn main() {
    let text = "the quick brown fox jumps over the lazy dog \
                a parallel iterator splits work across cores";

    let histogram: HashMap<usize, u32> = text
        .par_split_whitespace()
        .fold(HashMap::new, |mut acc, word| {
            *acc.entry(word.len()).or_insert(0) += 1;
            acc
        })
        .reduce(HashMap::new, |mut a, b| {
            for (k, v) in b {
                *a.entry(k).or_insert(0) += v;
            }
            a
        });

    let mut lengths: Vec<_> = histogram.into_iter().collect();
    lengths.sort();
    for (len, count) in lengths {
        println!("length {len}: {count} word(s)");
    }
}

Output:

length 1: 1 word(s)
length 3: 4 word(s)
length 4: 3 word(s)
length 5: 4 word(s)
length 6: 2 word(s)
length 8: 2 word(s)

The fold step gives each worker thread its own HashMap accumulator, so threads never contend for a shared lock. The reduce step merges those partial maps; merging maps by summing counts is associative, so the result is deterministic even though the work runs out of order. This fold + reduce pattern is the lock-free way to build shared aggregates in parallel — far better than wrapping a single HashMap in a Mutex.
