WebAssembly Performance: Bundle Size and the Boundary Cost
Compiling Rust to WebAssembly does not automatically make your app faster. Two costs decide whether the rewrite pays off: the bytes a user downloads before anything runs, and the price of every call across the JavaScript↔WASM boundary. This page measures both with real tools (wasm-opt, twiggy) and gives you a concrete decision rule for when WebAssembly actually wins over plain JavaScript.
A .wasm module is not free: the browser must download it, parse it, and instantiate it before your first function runs, and every value you pass into or out of it has to be marshalled across a boundary that natively speaks only numbers. For a TypeScript/JavaScript developer the mental model is “a second runtime you ship alongside your bundle, reached through a typed FFI.” The performance work is therefore two-pronged: shrink the binary (with wasm-opt and the twiggy size profiler) so the download is small, and design the API to cross the boundary rarely with large payloads rather than often with tiny ones. WebAssembly wins decisively on CPU-bound, self-contained work — numeric kernels, parsing, compression, image processing — and loses on chatty, DOM-heavy, or trivially small tasks where the boundary and load costs dominate.
Note: This file is about measuring and tuning performance. The build pipeline itself (crate-type, --target) lives in wasm-pack.md; the mechanics of what crosses the boundary and how it is encoded live in wasm-bindgen.md. Read those first if the terms cdylib or JsValue are new.
A real frontend bundle has a download budget. Teams track it with tools like source-map-explorer, gzip every asset, and obsess over keeping the initial payload small because every kilobyte delays time-to-interactive. Here is the kind of CPU-bound workload a team might consider moving to WebAssembly — a numeric normalization pass over a large array, plus a tight prime-counting loop — written in idiomatic TypeScript first:
// signal.ts: CPU-bound numeric work, the candidate for a WASM rewrite.

// Normalize an array so its largest value becomes 1.0.
In pure JavaScript that loop is fine — a function call inside V8 is cheap. The moment normalizeOne lives in WebAssembly, each call pays a boundary-crossing tax a million times over, and the “fast” Rust version can end up slower than the JavaScript it replaced. The rest of this page is about avoiding exactly that trap and shrinking what you ship.
Set up a standard cdylib library crate. The current stable toolchain is Rust 1.96.0 on the 2024 edition; cargo new selects it automatically.
Terminal window
cargo new --lib signal
cd signal
cargo add wasm-bindgen
Cargo.toml
[package]
name = "signal"
version = "0.1.0"
edition = "2024"

[lib]
crate-type = ["cdylib", "rlib"]

[dependencies]
wasm-bindgen = "0.2.122"
The crate exposes both the wrong and the right boundary shapes so we can contrast them. The key idea: pass the whole buffer across once and loop inside WASM, rather than calling across the boundary per element.
src/lib.rs
use wasm_bindgen::prelude::*;

/// BAD boundary design: one call per element. Cheap inside WASM, but every
/// call re-crosses the JS↔WASM boundary, so the marshalling cost is paid N times.
#[wasm_bindgen]
pub fn normalize_one(value: f64, max: f64) -> f64 {
    if max == 0.0 {
        0.0
    } else {
        value / max
    }
}

/// GOOD boundary design: hand the whole buffer across once. wasm-bindgen copies
/// the JS `Float64Array` into linear memory a single time; the loop then runs
/// entirely inside WASM with no further boundary crossings, and the result
Running it on the host prints the real, verified results:
count_primes(100) = 25
count_primes(1_000_000) = 78498
normalize_all([2,4,8]) = [0.25, 0.5, 1.0]
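The src/lib.rs listing above is cut off before the bodies of normalize_all and count_primes appear. As a minimal host-side sketch that is consistent with the printed results, the two functions could look like the following. The bodies are assumptions (a sieve of Eratosthenes, and a divide-by-maximum pass that assumes non-negative input), not the author's code, and the `#[wasm_bindgen]` attribute is left off so the block compiles as plain Rust.

```rust
/// Count the primes <= `limit` with a sieve of Eratosthenes.
/// Assumption: the original body is not shown on this page; any correct
/// prime counter produces the same totals.
pub fn count_primes(limit: u32) -> u32 {
    let limit = limit as usize;
    if limit < 2 {
        return 0;
    }
    // is_composite[i] == true means i has been crossed out.
    let mut is_composite = vec![false; limit + 1];
    let mut count = 0u32;
    for i in 2..=limit {
        if is_composite[i] {
            continue;
        }
        count += 1;
        // Start at i*i: smaller multiples were crossed out by smaller primes.
        // checked_mul avoids overflow where usize is only 32 bits wide.
        let mut multiple = match i.checked_mul(i) {
            Some(m) => m,
            None => continue,
        };
        while multiple <= limit {
            is_composite[multiple] = true;
            multiple += i;
        }
    }
    count
}

/// Divide every element by the largest one, in a single pass over the buffer.
/// Assumption: inputs are non-negative; an all-zero input maps to zeros,
/// mirroring the `max == 0.0` guard in normalize_one.
pub fn normalize_all(values: &[f64]) -> Vec<f64> {
    let max = values.iter().copied().fold(0.0_f64, f64::max);
    values
        .iter()
        .map(|&v| if max == 0.0 { 0.0 } else { v / max })
        .collect()
}

fn main() {
    println!("count_primes(100) = {}", count_primes(100));
    println!("count_primes(1_000_000) = {}", count_primes(1_000_000));
    println!("normalize_all([2,4,8]) = {:?}", normalize_all(&[2.0, 4.0, 8.0]));
}
```

Because neither function touches wasm-bindgen types, the same logic can be unit-tested on the host with a plain cargo test before any browser build.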
Now build for the browser and measure what actually ships. The wasm-pack build pipeline (compile → wasm-bindgen → wasm-opt) is covered in wasm-pack.md; here we care about the size at each stage, captured from a real build of the crate above:
after wasm-opt -Os (wasm-pack default)    17,737 bytes
─────────────────────────────────────────────────
gzip on the wire                           8,203 bytes
brotli on the wire                         7,017 bytes
Note: Exact byte counts vary by rustc/wasm-bindgen/wasm-opt patch version and platform, so a fresh reproduction will land near — not exactly on — these numbers. What is stable is the shape: a multi-megabyte debug build, a tens-of-KB release, a further drop through wasm-bindgen and wasm-opt, and roughly halving again on the wire.
Two lessons jump out immediately. First, never ship a debug build — at 2.5 MB it is ~150× larger than the optimized release. Second, the bytes a user actually downloads are the compressed size (8 KB gzip / 7 KB brotli), not the on-disk 17.7 KB; serving WASM with HTTP compression matters as much as wasm-opt.
The size table above is a pipeline, and each stage strips something different:
Debug → release (2.5 MB → 43 KB). The debug .wasm is dominated by debug info (DWARF) and zero optimization. cargo build --release (which wasm-pack runs by default) turns on opt-level = 3 and drops most of that. This single step is the biggest win and is automatic.
release → after wasm-bindgen (43 KB → 24.7 KB). The wasm-bindgen CLI rewrites the module to add the JS-glue-facing exports and runs the linker’s dead-code elimination, dropping Rust standard-library code your exports never reach.
wasm-bindgen → after wasm-opt (24.7 KB → 17.7 KB). wasm-opt (from the Binaryen toolkit) is a dedicated WASM-to-WASM optimizer. It does instruction-level shrinking, more aggressive dead-code elimination, and (with -Os/-Oz) size-focused rewrites. wasm-pack runs it for you with -Os.
wasm-opt → on the wire (17.7 KB → ~7-8 KB). Gzip/brotli on the HTTP layer roughly halves the binary again. WASM compresses well; this is “free” if your server is configured for it.
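The later stages can be checked with simple arithmetic on the exact byte counts quoted on this page (24,682 bytes out of wasm-bindgen, 17,737 after wasm-opt -Os, 8,203 gzipped, 7,017 with brotli). The helper name `percent_saved` below is hypothetical, introduced only for this sketch.

```rust
/// Percentage of bytes removed when going from `before` to `after`.
/// Hypothetical helper for illustration only.
fn percent_saved(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64 * 100.0
}

fn main() {
    // Byte counts quoted on this page for the example crate.
    let after_bindgen = 24_682;
    let after_wasm_opt = 17_737;
    let gzip = 8_203;
    let brotli = 7_017;

    println!(
        "wasm-opt -Os: {:.1}% smaller",
        percent_saved(after_bindgen, after_wasm_opt)
    );
    println!("gzip:   {:.1}% smaller", percent_saved(after_wasm_opt, gzip));
    println!("brotli: {:.1}% smaller", percent_saved(after_wasm_opt, brotli));
}
```

On these numbers wasm-opt removes roughly 28% of the wasm-bindgen output, gzip removes about 54% of what remains, and brotli about 60%, which is where the "roughly halves again" rule of thumb comes from.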
Note: Unlike a JavaScript bundle, where minification (terser/esbuild) and tree-shaking happen in your bundler, the WASM equivalents (wasm-bindgen’s DCE and wasm-opt) happen in the Rust build pipeline before the bundler ever sees the file. Your Vite/webpack config does not shrink the .wasm; wasm-opt does.
wasm-opt is to a .wasm file what terser is to a .js file. wasm-pack invokes it automatically, but you can run it by hand to compare optimization levels. On the wasm-bindgen output of the crate above (24,682 bytes in), the real results were:
For this module the three levels land within ~600 bytes of each other; -Os and -Oz are nearly identical, and the speed-focused -O3 is actually the largest because it inlines and unrolls. The practical guidance: -Os is the right default (and is what wasm-pack uses); reach for -Oz only when every kilobyte counts and you have benchmarked that the speed cost is acceptable, and -O3 only when you have proven a hot path benefits from it.
Note: Current Rust emits bulk-memory instructions (memory.copy, memory.fill), and wasm-bindgen embeds a target_features custom section in its output that declares them. A modern wasm-opt reads that section and auto-enables the matching features, so running wasm-opt -Os input.wasm -o output.wasm on real wasm-bindgen output just works — no extra flag, even with --mvp-features. You only hit a validation error like this real wasm-opt 129 message when the module lacks the target_features section (an older toolchain, or a hand-assembled module):
[wasm-validator error in function 3] unexpected false: memory.copy operations
require bulk memory operations [--enable-bulk-memory-opt], on
(memory.copy ...)
Fatal: error validating input
In that edge case the fix is to pass the feature flag explicitly: wasm-opt -Os --enable-bulk-memory-opt input.wasm -o output.wasm. With the documented toolchain (current Rust + wasm-bindgen) the section is present, so neither wasm-pack nor a manual wasm-opt needs the flag.
When 17 KB is somehow 170 KB, you need to know which Rust code is responsible. twiggy is a code-size profiler for .wasm (think of it as source-map-explorer for WebAssembly). Install it once with cargo install twiggy, then run twiggy top against the pre-wasm-opt module (which still carries the function-name section twiggy reads). Real output from the crate above:
This is enormously informative. The single biggest code item is dlmalloc::malloc — the default Rust allocator, baked into the binary at ~4.5 KB. The "function names" subsection (another ~4.4 KB) is the debug-name table that profiling needs but shipping does not. The string-handling helpers (do_count_chars, __rdl_realloc) come from wasm-bindgen’s string/array marshalling glue and the default allocator, not from your own logic. Knowing this, you can act: drop the name section for the shipped build, or reconsider whether a function that pulls in formatting/allocation is worth its weight.
twiggy garbage finds items that are present but unreachable — pure waste you can strip:
Tip: Profile the module with names, ship the module without them. After wasm-opt strips the name section, twiggy top falls back to opaque labels like code[0], code[17] — useful for sizing but not for attribution. Keep an un-stripped copy around for diagnosis.
WebAssembly functions natively accept and return only i32, i64, f32, and f64. Anything richer — a string, a typed array, a struct — must be encoded into those primitives, written into the module’s linear memory, and decoded on the other side. wasm-bindgen generates that marshalling glue (see wasm-bindgen.md), and it is fast, but it is not free:
Passing a number (count_primes(u32)) is essentially free — it is already a WASM-native type.
Passing a &str or &[f64] copies the bytes from the JS heap into WASM linear memory. The cost is proportional to the length, paid once per call.
Returning a String or Vec<T> copies the bytes back out of linear memory into a fresh JS value.
Touching the DOM or a Web API from Rust is an import call back into JavaScript for every operation, each one a boundary crossing in the other direction.
The disastrous pattern is many tiny crossings. Calling normalize_one(value, max) a million times pays the call-and-marshal overhead a million times; the per-call compute (one division) is far smaller than the per-call overhead, so JavaScript’s in-engine loop wins. Calling normalize_all(values) once pays the boundary cost twice total (array in, array out) and runs the million divisions inside WASM at native speed. Same math, opposite outcome.
Note: This is the inverse of the usual JavaScript intuition. In JavaScript, “extract a helper function” is a cost-free refactor. Across the WASM boundary, “call a helper a million times” is a performance bug. Batch the work.
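One way to reason about the chatty-versus-batched trade-off is a toy cost model. The sketch below is not a benchmark: the struct, the function names, and every nanosecond figure are hypothetical placeholders chosen only to show the shape of the arithmetic. Substitute numbers you have measured for your own workload.

```rust
/// Illustrative per-operation costs. All values are placeholders,
/// not measurements of any real engine.
struct Costs {
    crossing_ns: f64,      // fixed overhead of one JS<->WASM call
    copy_ns_per_elem: f64, // copying one element into or out of linear memory
    work_ns_per_elem: f64, // the actual compute (one division)
}

/// Chatty design: one boundary crossing per element, numbers only (no copy).
fn chatty_ns(n: u64, c: &Costs) -> f64 {
    n as f64 * (c.crossing_ns + c.work_ns_per_elem)
}

/// Batched design: two crossings in total, plus one copy in and one copy out.
fn batched_ns(n: u64, c: &Costs) -> f64 {
    2.0 * c.crossing_ns + n as f64 * (2.0 * c.copy_ns_per_elem + c.work_ns_per_elem)
}

fn main() {
    // Hypothetical figures, for illustration only.
    let costs = Costs {
        crossing_ns: 20.0,
        copy_ns_per_elem: 0.5,
        work_ns_per_elem: 1.0,
    };
    for n in [1u64, 1_000, 1_000_000] {
        println!(
            "n = {n}: chatty = {} ns, batched = {} ns",
            chatty_ns(n, &costs),
            batched_ns(n, &costs)
        );
    }
}
```

Under these made-up figures the batched call is slower for a single element (the two fixed crossings and the copies outweigh one division) and far cheaper for a million, which is the same pattern the prose describes: the fixed cost only pays off when it is amortized over a large payload.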
Put the two costs together and a clear rule emerges. WebAssembly pays off when compute per boundary crossing is high and the module is downloaded rarely (cached, reused across many operations). It loses when crossings are frequent and the work per crossing is trivial.
| Workload | WASM verdict | Why |
| --- | --- | --- |
| Image/video filters, codecs | Wins big | Megabytes processed per call; pure compute |
| Cryptography, hashing, compression | Wins big | CPU-bound, self-contained, one buffer in/out |
| Physics/game simulation, ray tracing | Wins big | Tight numeric loops, predictable memory |
| Parsing/validating large documents | Wins | One big string in, structured result out |
| Spreadsheet/formula engines | Wins | Heavy recompute, batched results |
| DOM manipulation (per element) | Usually loses | Every DOM op is a boundary crossing back to JS |
| Tiny per-event handlers (a click → one add) | Loses | Boundary + load cost dwarfs the work |
| String concatenation, JSON glue | Loses | V8 is already excellent; marshalling overhead added |
| First-paint-critical, tiny logic | Loses | The download/instantiate delay hurts more than it helps |
.wasm binary, shrunk by wasm-bindgen DCE + wasm-opt
| Minifier | terser / esbuild (in your bundler) | wasm-opt (in the Rust pipeline, before the bundler) |
| Size profiler | source-map-explorer, bundle analyzers | twiggy top / twiggy garbage |
| Startup cost | parse JS (incremental, lazy) | fetch + parse + instantiate the whole .wasm upfront |
| Function-call cost | cheap (in-engine) | cheap inside WASM; a boundary crossing has marshalling cost |
| Passing a 1M-element array | reference, no copy | copied into linear memory (once) |
| “Extract a helper” refactor | free | free inside Rust; a footgun across the boundary |
| Compute speed (tight loops) | JIT-compiled, fast but variable | AOT-compiled, fast and predictable |
The deepest shift for a JavaScript developer: in JavaScript the runtime is already present and values are shared by reference, so calls are nearly free. In WebAssembly you ship a second compiled artifact, and the line between the two worlds is a real, measurable wall. Performance is won by shipping few bytes and crossing the wall rarely with big payloads.
Warning: Do not assume “Rust is faster, so rewriting in WASM is faster.” A WASM rewrite of chatty, DOM-bound, or trivially small JavaScript frequently ends up slower after you add download, instantiate, and per-call boundary costs. Measure the specific workload before committing.
The default cargo build produces a .wasm with full debug info — 2.5 MB in our example versus 17.7 KB optimized. wasm-pack build uses the release profile by default, so this bites mostly when you wire up your own build with a raw cargo build and forget --release. Always ship release; use wasm-pack build --dev only for fast local iteration.
The single most common WASM performance mistake is exporting a fine-grained function and calling it in a JavaScript loop. Each call re-crosses the boundary. The fix is to export a batch function that takes the whole array/buffer and loops inside Rust — normalize_all(&[f64]) instead of normalize_one(f64, f64) called a million times.
Trying to pass a slice of non-numeric types by reference
Newcomers reach for &[String] or &[SomeStruct] to “pass a batch,” expecting it to work like &[f64]. It does not compile. The real error from the crate above:
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn join_all(words: &[String]) -> String { // does not compile (E0277)
    words.join("")
}

error[E0277]: the trait bound `[String]: RefFromWasmAbi` is not satisfied
 --> src/lib.rs:4:25
  |
4 | pub fn join_all(words: &[String]) -> String {
  |                         ^^^^^^^^ the trait `RefFromWasmAbi` is not implemented for `[String]`
  |
  = help: the following other types implement trait `RefFromWasmAbi`:
            [MaybeUninit<f32>]
            [MaybeUninit<f64>]
            [MaybeUninit<i16>]
            ...
Only slices of the numeric primitives marshal by reference (zero-allocation, straight into linear memory). For a batch of strings or structs you either accept a Vec<String> (which wasm-bindgen can take, at the cost of converting each element) or, for richer or nested data, serialize the whole batch once with serde-wasm-bindgen (covered in wasm-bindgen.md). Either way, cross once.
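A minimal sketch of two signatures that do cross in a single call is shown below. It is written as plain Rust so it compiles on the host; in the real crate each function would carry `#[wasm_bindgen]`. The second function, `join_packed`, is a hypothetical helper that assumes the JavaScript caller has already packed the batch into one newline-separated string (for example with `words.join("\n")`) and that no word contains a newline.

```rust
/// Option 1: take ownership of the batch. `Vec<String>` is accepted as a
/// parameter type where `&[String]` is rejected; each element is converted.
pub fn join_all(words: Vec<String>) -> String {
    words.join("")
}

/// Option 2 (hypothetical helper): the caller packs the batch into one
/// newline-separated string, so a single `&str` crosses the boundary.
/// Assumes no individual word contains '\n'.
pub fn join_packed(packed: &str) -> String {
    packed.split('\n').collect()
}

fn main() {
    let words = vec!["foo".to_string(), "bar".to_string()];
    println!("{}", join_all(words));
    println!("{}", join_packed("foo\nbar\nbaz"));
}
```

The packed-string variant trades a little encoding work on the JavaScript side for one copy of one buffer, which is the same trade serde-wasm-bindgen makes for structured data.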
Expecting a bulk memory operations error from wasm-opt (it usually does not happen)
A long-standing piece of folklore says you must pass --enable-bulk-memory-opt when running wasm-opt on Rust output, or it rejects the memory.copy/memory.fill instructions current Rust emits. With the documented toolchain that is no longer true: wasm-bindgen embeds a target_features section that wasm-opt reads to auto-enable the needed features, so wasm-opt -Os input.wasm -o output.wasm succeeds with no flag. You only see the memory.copy ... require bulk memory operations validation error on a module that lacks that section (an older toolchain, or a hand-assembled .wat); the fix there is to pass --enable-bulk-memory-opt.
If twiggy top shows only code[0], code[17], data[0] instead of demangled Rust names, you ran it against a wasm-opt’d binary whose name section was stripped. Profile the pre-wasm-opt wasm-bindgen output (or build a copy that keeps debug names) to get attributable results, then ship the stripped one.
Assuming panic = "abort" and LTO always shrink the binary
It is tempting to bolt on every size knob. For our small module the aggressive profile (opt-level = "z", lto = true, codegen-units = 1, panic = "abort", strip = true) produced a larger final binary (19,327 bytes) than the default release profile plus wasm-opt -Oz (17,728 bytes). The knobs interact, and on a small module the gains can invert. Measure, do not assume — twiggy and a byte count are the arbiters.
Ship release, never debug. wasm-pack build does this by default; if you script cargo build yourself, always pass --release.
Let wasm-pack run wasm-opt -Os for you as the sane default. It balances size and speed and sets the right Binaryen feature flags automatically.
Optimize the wire, not just the disk. Serve .wasm with gzip or brotli (our 17.7 KB binary dropped to ~7-8 KB compressed). Configure your CDN/server for it; see deployment.md.
Design coarse-grained boundary APIs. One call that processes a whole buffer beats N calls that process one element. Pass &[f64]/&[u8] for numeric batches; serialize once for structured batches.
Profile with twiggy before you guess. Use twiggy top to find the biggest items and twiggy garbage to find unreachable waste. Attack the largest contributors first (often the allocator and the standard-library formatting/panic machinery).
Reduce allocation and formatting in hot paths. dlmalloc and core::fmt repeatedly show up as the biggest non-trivial code items; fewer String/Vec allocations and less format! shrink both size and runtime.
Reach for aggressive profile knobs only with a measurement in hand. Try opt-level = "z" and lto = true, then check the byte count — they do not always help, especially on small modules.
Keep console_error_panic_hook to dev builds. It improves panic messages during development but adds code; gate it behind a feature so it does not bloat production.
Cache the .wasm aggressively. The instantiate cost is paid once; a content-hashed, far-future-cached binary means repeat visits skip the download entirely, tilting the cost/benefit toward WASM.
A production pattern: a Rust crate that does one CPU-heavy job (here, normalizing a large signal buffer), built and tuned for size. The crate uses a coarse-grained boundary API and a dev-only panic hook gated behind a Cargo feature so it never reaches production.
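The crate listing itself is not reproduced here, so the following is only a hedged sketch of the feature-gating idea. The feature name `dev-panic-hook` is hypothetical; it assumes Cargo.toml declares console_error_panic_hook as an optional dependency that this feature enables. The `#[wasm_bindgen]` attributes are left off so the block compiles as plain Rust with the feature disabled.

```rust
/// Call once at startup. In the real crate this would typically be exported
/// with #[wasm_bindgen(start)]; the attribute is omitted in this sketch.
pub fn init() {
    // This statement is compiled only when the (hypothetical) `dev-panic-hook`
    // feature is enabled, so a production build never links the hook crate.
    #[cfg(feature = "dev-panic-hook")]
    console_error_panic_hook::set_once();
}

/// True only in builds compiled with the feature enabled.
pub fn panic_hook_enabled() -> bool {
    cfg!(feature = "dev-panic-hook")
}

fn main() {
    init();
    println!("panic hook compiled in: {}", panic_hook_enabled());
}
```

With this layout a development build would opt in through the feature flag, while the default build leaves the hook out of the binary entirely rather than merely skipping it at runtime.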
The result count_primes(1_000_000) === 78498 is computed entirely inside WebAssembly, and normalize_all touches the boundary exactly twice regardless of array length. The shipped binary is ~17.7 KB on disk and about 7 KB over a brotli-compressed connection (about 8 KB with gzip), a payload small enough that the CPU savings on the million-element pass clearly justify it.
Objective: See the debug-vs-release-vs-wasm-opt size collapse with your own eyes.
Instructions:
Create a cdylib crate exporting #[wasm_bindgen] pub fn count_primes(limit: u32) -> u32 (use the body from this page).
Build it three ways and record the .wasm byte size after each: a debug cargo build --target wasm32-unknown-unknown, a cargo build --release --target wasm32-unknown-unknown, and a full wasm-pack build --target web.
Run gzip -9 -c pkg/<name>_bg.wasm | wc -c to see the compressed wire size. State which stage saved the most bytes.
Solution
Terminal window
cargo new --lib primes && cd primes
cargo add wasm-bindgen
# set [lib] crate-type = ["cdylib", "rlib"] in Cargo.toml
ls -l target/wasm32-unknown-unknown/release/primes.wasm   # tens of KB

wasm-pack build --target web
ls -l pkg/primes_bg.wasm                                  # smaller still (wasm-opt)
gzip -9 -c pkg/primes_bg.wasm | wc -c                     # ~half again
The debug → release step saves by far the most (dropping megabytes of debug info), followed by wasm-bindgen’s dead-code elimination and wasm-opt. Gzip then roughly halves the final binary on the wire. The headline lesson: the single most important thing is to ship a release build.
Objective: Use twiggy to find what is taking space, and confirm that profiling needs the un-stripped binary.
Instructions:
Take the crate from Exercise 1 (or add a reverse_words(&str) -> String function to pull in more standard-library code).
Install twiggy (cargo install twiggy). Run twiggy top -n 10 against the wasm-bindgen output (the pkg/<name>_bg.wasm before you re-run wasm-opt to strip names, or a fresh wasm-bindgen run with names kept).
Run twiggy top again against a wasm-opt’d copy and observe that the names become opaque (code[N]). Explain in one sentence why you profile the un-stripped binary but ship the stripped one.
Solution
Terminal window
cargo install twiggy   # one-time

# Build a names-bearing module (the wasm-bindgen output keeps the name section):
cargo build --release --target wasm32-unknown-unknown
wasm-bindgen --target web --out-dir names \
  target/wasm32-unknown-unknown/release/primes.wasm

twiggy top -n 10 names/primes_bg.wasm
Real output is dominated by the allocator and the name section:
twiggy top -n 5 shipped.wasm   # items show as code[0], code[17], data[0], ...
You profile the un-stripped binary because twiggy needs the function-name section to attribute bytes to real Rust symbols; you ship the stripped binary because the name section is pure download weight the user never needs. The biggest contributor here is the default allocator (dlmalloc::malloc) — a hint that allocation-heavy code is expensive in both size and speed.
Objective: Refactor a per-element boundary call into a single batched crossing and explain the performance difference.
Instructions:
Start from a #[wasm_bindgen] pub fn square_one(x: f64) -> f64 intended to be called in a JavaScript loop over a million-element array.
Replace it with a batched #[wasm_bindgen] pub fn square_all(xs: &[f64]) -> Vec<f64> that performs the whole pass inside WASM.
In prose, explain how many boundary crossings each design incurs for an N-element array, and why the batched version is the one that lets WASM beat plain JavaScript.
Solution
src/lib.rs
use wasm_bindgen::prelude::*;

// BEFORE (chatty): one crossing PER element.
#[wasm_bindgen]
pub fn square_one(x: f64) -> f64 {
    x * x
}

// AFTER (batched): the whole array crosses ONCE in, once out.
Calling square_one from a JavaScript loop over an N-element array performs N boundary crossings — each pays the call-and-marshal overhead, while the per-call work (one multiply) is tiny, so the overhead dominates and JavaScript’s in-engine loop is faster. Calling square_all(xs) performs exactly 2 crossings total regardless of N (the input slice copied into linear memory once, the result Vec<f64> copied out once); the million multiplies then run at native speed entirely inside WASM. The batched design amortizes the fixed boundary cost over the whole array, which is precisely the condition under which WebAssembly outperforms plain JavaScript.
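The solution listing above stops at the AFTER comment, so here is a minimal sketch of the batched function using the signature given in the exercise instructions. The body is an assumption (a straightforward map over the slice), and `#[wasm_bindgen]` is omitted so the block compiles and runs as plain Rust on the host.

```rust
/// Batched version: squares every element in a single pass.
/// In the real crate this function would carry #[wasm_bindgen].
pub fn square_all(xs: &[f64]) -> Vec<f64> {
    xs.iter().map(|x| x * x).collect()
}

fn main() {
    let squared = square_all(&[1.0, 2.0, 3.0]);
    println!("{squared:?}"); // prints [1.0, 4.0, 9.0]
}
```

From JavaScript the call site then shrinks from a million-iteration loop to a single call that passes a Float64Array and receives one back.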