Benchmarks
JMH numbers from benchmarks/, in three groups:
- Optic micro-benchmarks — per-call overhead over hand-written field access, with Monocle alongside as the other optics library on the operations both implement.
- Integration: edit without decoding — one realistic Order document edited through circe / avro / jsoniter, measured against the decode-modify-encode you'd write by hand (and Monocle, which pays the same round-trip).
- PowerSeries composition — composed multi-focus traversal chains, measured against the hand-written copy + map and the equivalent Monocle composition.
How these were measured
All figures are average time per operation in nanoseconds (ns/op); lower is better. Absolute numbers vary by hardware — the ratios are the durable signal.
Every table comes from the reproducible benchmarks.yml CI workflow (always
the same environment — ubuntu-22.04 / temurin@21, -f 3 -wi 3 -i 5, error
half-widths ≤ ~5 %, mostly ≤ 2 %), and anyone with Actions access can re-run it.
The figures are gathered across several runs of that workflow, though — each
table is refreshed when its bench changes, so a table is internally consistent
(one run) but two tables may come from different run instances. Since each run is
a fresh shared VM, treat cross-table absolute comparisons loosely; within a
table, and for the ratios everywhere, the signal holds.
JMH caveat. A shared CI runner still isn't a tuned, quiet desktop, so read
these as reproducible directional data — the shape and the ratios — not
sub-nanosecond truth. See Reproducing and
benchmarks/README.md.
Optic micro-benchmarks
For a single operation the hand-written baseline is direct field access
(order.id) or a copy: sub-nanosecond, and the floor on top of which any optic
adds overhead. The honest question these tables answer is how little an optic costs over
that floor. eo is the cats-eo method; Monocle is the same operation in the
other optics library, shown alongside as a reference point — the ratio column,
where present, is the two libraries head to head.
Lens (Tuple2 carrier) — shallow order.id and a depth-3 customer.address.street:
| Operation | eo | Monocle | ratio |
|---|---|---|---|
| get (id) | 1.15 | 1.30 | 1.13× |
| replace (id) | 5.16 | 5.13 | 0.99× |
| modify (id) | 5.42 | 5.68 | 1.05× |
| modify (deep street) | 36.68 | 31.70 | 0.86× |
The cost over hand-written field access is essentially nil. GetReplaceLens
stores get / enplace as plain fields and specialises its fused modify, so
the hot path is a two-function composition — no (X, A) tuple allocation — and
get lands within a few tenths of a ns of a bare order.id. Monocle sits in the
same place. Monocle is faster on the depth-3 street modify (~1.2×): both
rebuild the same three records through a fused composed optic (EO's inline
GetReplaceLens.andThen, Monocle's composed Lens), and EO's per-hop
get/enplace closure pair carries a touch more indirection than Monocle's
purpose-built case class Lens — a small constant that first shows at depth 3.
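As a rough illustration of the shape described above, the following Scala 3 sketch stores both directions of a lens as plain function fields and fuses modify into one read plus one write. Every name in it (SketchLens, replace, the simplified Order / Customer / Address records) is hypothetical and chosen for the example; it is not the GetReplaceLens source, and the real record layout in the benchmarks is wider.

```scala
// Hypothetical names throughout; this is not the cats-eo source.
final case class Address(street: String, city: String)
final case class Customer(name: String, address: Address)
final case class Order(id: Long, customer: Customer)

// A lens that keeps both directions as plain function fields.
final class SketchLens[S, A](val get: S => A, val replace: (S, A) => S) {
  // Fused modify: one read, one write, no intermediate (S, A) tuple.
  def modify(f: A => A): S => S = s => replace(s, f(get(s)))

  // Composition builds a new get/replace pair out of the two halves.
  def andThen[B](next: SketchLens[A, B]): SketchLens[S, B] =
    new SketchLens[S, B](
      s => next.get(get(s)),
      (s, b) => replace(s, next.replace(get(s), b))
    )
}

val customer = new SketchLens[Order, Customer](_.customer, (o, c) => o.copy(customer = c))
val address  = new SketchLens[Customer, Address](_.address, (c, a) => c.copy(address = a))
val street   = new SketchLens[Address, String](_.street, (a, s) => a.copy(street = s))

// Depth-3 focus, the same shape as the "modify (deep street)" row.
val deepStreet: SketchLens[Order, String] = customer.andThen(address).andThen(street)

// The hand-written baseline: three nested copies.
def upperStreetByHand(o: Order): Order =
  o.copy(customer = o.customer.copy(address =
    o.customer.address.copy(street = o.customer.address.street.toUpperCase)))
```

Calling deepStreet.modify(_.toUpperCase)(order) and upperStreetByHand(order) rebuilds the same three records; the composed lens only adds the closure hops between them.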
Prism (Either carrier) — Option[Int] plus an Either[String, Int]
Right-prism:
| Operation | eo | Monocle |
|---|---|---|
| getOption Some | 1.01 | 1.15 |
| getOption None | 1.01 | 1.15 |
| reverseGet | 2.64 | 2.69 |
| Right getOption (Right) | 2.80 | 2.87 |
| Right getOption (Left) | 1.16 | 1.30 |
| Right reverseGet | 2.63 | 2.68 |
Iso (Direct carrier) — Address ↔ (String, String, String, String):
| Operation | eo | Monocle |
|---|---|---|
| get | 4.89 | 4.93 |
| reverseGet | 4.29 | 4.40 |
BijectionIso stores both directions as plain fields — same shape and
direct-call hot path as Monocle's case class Iso. (Direct is now an opaque
type; the wrap / unwrap at the carrier boundary are transparent inline
identities, so the hot path is unchanged.)
Optional (Affine carrier) — leaf Option[String], composed through a
Nested0..6 Lens chain via cross-carrier .andThen (the Morph[Tuple2, Affine]
auto-lifts each hop):
| Operation | eo | Monocle |
|---|---|---|
| modify_0 (Some) | 26.94 | 23.01 |
| modify_0 (None) | 1.51 | 0.98 |
| replace_0 | 5.19 | 3.67 |
| modify_3 | 46.37 | 66.24 |
| modify_6 | 92.10 | 113.55 |
| loyaltyId (Some) | 31.38 | 20.56 |
| loyaltyId (None) | 1.67 | 1.07 |
At composition depth the per-hop cost stays low: modify_3 and modify_6 track
the work of a hand-written nested copy with an Option match at the leaf, since
the fused andThen composers are inline — each compose site splices distinct
lambdas, so a deep chain doesn't reuse one shared andThen$$anonfun$ bytecode and
trip C2's recursive-inline cap (before that, modify_6 was ~1.6× slower). Monocle
lands a little behind here (modify_3 ~1.4×, modify_6 ~1.2×). Monocle is
faster at the single-hit leaf (modify_0 / loyaltyId Some) — its
Option-specialised internals shave the per-hit cost EO's generic Affine leaf
carries to stay uniform across families. The loyaltyId rows are the canonical
customer.loyaltyId: Option[String] focus (in memory — Avro omits it as a union),
Some and None branches.
Getter / Modify (Direct / ModifyF) — depth-0/3/6 over Nested. Both
families compose through the fused inline andThen on their concrete
subclasses (Getter / Modify), so every row builds a composed
optic on both sides and dispatches through it once — apples-to-apples with
Monocle's composed Getter/Setter.
| Depth | Getter eo | Getter Monocle | Setter eo | Setter Monocle |
|---|---|---|---|---|
| _0 | 0.95 | 0.54 | 2.34 | 2.34 |
| _3 | 5.12 | 8.81 | 12.21 | 26.30 |
| _6 | 11.30 | 27.50 | 26.42 | 52.26 |
At composition depth both families stay close to the hand-written baseline — a
depth-N chain of .get calls, or a nested copy for the modifier. The lever is
inline on the same-carrier andThen: each compose site splices a distinct
lambda, so a depth-N chain becomes distinct synthetic methods per level. A plain
def reuses one shared andThen$$anonfun$ bytecode across the chain, which C2
reads as recursion and caps (MaxRecursiveInlineLevel), leaving the deep tail as
virtual Function1.apply; splicing distinct lambdas sidesteps that cap with no
JVM flag. Modify (benchmarked as SetterBench; Monocle's family is still Setter) additionally sheds its per-hop ModifyF allocation (the fused
Modify writes through modifyFn directly: depth-6 800→288 B/op). Monocle
gets the same un-capped inlining from a fresh anonymous class per compose, and
trails here (~1.7–2.4× at _3/_6). Monocle is faster at the scalar leaf
(_0, order.id) by a few tenths of a ns — the sub-nanosecond floor where its
specialised case classes shave the last field load and EO's generic carrier does
not.
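The difference between a plain def and an inline andThen can be sketched in a few lines of Scala 3. The class and method names below (SketchGetter, andThenShared, andThenSpliced, N1..N3) are hypothetical and exist only for this illustration; the sketch shows where the composing lambda ends up, not how the JIT then treats it.

```scala
// Scala 3 sketch with hypothetical names; not the cats-eo definition.
final class SketchGetter[S, A](val get: S => A):
  // Plain def: the composing lambda is compiled once, inside this method,
  // so every level of a deep chain runs the same synthetic method.
  def andThenShared[B](next: SketchGetter[A, B]): SketchGetter[S, B] =
    new SketchGetter[S, B](s => next.get(get(s)))

  // inline def: the body is expanded at each call site, so each compose
  // site gets its own lambda (and its own synthetic method).
  inline def andThenSpliced[B](next: SketchGetter[A, B]): SketchGetter[S, B] =
    new SketchGetter[S, B](s => next.get(get(s)))

final case class N1(value: Int)
final case class N2(n1: N1)
final case class N3(n2: N2)

val g3 = new SketchGetter[N3, N2](_.n2)
val g2 = new SketchGetter[N2, N1](_.n1)
val g1 = new SketchGetter[N1, Int](_.value)

// Same result either way; only the bytecode shape differs.
val shared  = g3.andThenShared(g2).andThenShared(g1)
val spliced = g3.andThenSpliced(g2).andThenSpliced(g1)
```

Both chains return the same value for the same input. Whether the spliced form actually avoids the recursive-inline cap on a given JVM is something to confirm with the benchmarks rather than assume from the sketch.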
Fold / Traversal — Fold foldMap(identity) over List[Int]; Traversal
each.modify over the canonical Order.lines (bump each line's qty), sweeping
size:
| Size | Fold eo | Fold Monocle | Traversal eo | Traversal Monocle | Traversal speedup |
|---|---|---|---|---|---|
| 8 | 20.1 | 20.4 | 122.4 | 364.0 | 3.0× |
| 64 | 325.0 | 307.1 | 954.1 | 2 373.4 | 2.5× |
| 512 | 4 535.0 | 4 494.7 | 8 153.8 | 38 221.2 | 4.7× |
The hand-written reference is a bare foldLeft for the fold and an xs.map +
copy for the traversal. Fold is on par with Monocle (0.98–1.06× across sizes)
— both collapse to the same cats.Foldable[List].foldMap, one Monoid.combine
per element, exactly what a hand-written foldMap does. EO's Fold.apply returns a
concrete ForgetFold whose eager foldMap member folds straight through the
captured Foldable[F] — the same stored-method shape as Monocle's
Fold.fromFoldable. Before that specialisation EO routed every fold through the
generic Optic.foldMap extension, and paid for it: ~5× at size 8 (109 → 20 ns),
~2.9× at 64, ~1.8× at 512. That overhead was a per-call constant (the
ForgetfulFold[Forget[F]] summon, an intermediate S => M closure, and a
box/unbox), not per-element — which is why it dominated small folds and faded
into the asymptote on large ones. It was also invisible to -prof gc (the two were
always allocation-identical at 14 080 B/op @512; escape analysis elides the
closure, so the cost was cycles, not bytes — only CI's low-noise timing resolves
it). The traversal is where the carriers genuinely diverge: EO's each (carrier
MultiFocus[PSVec]) collects element references into a flat focus vector and
rebuilds via Functor[PSVec].map, where Monocle wraps each element in
Applicative[Id] — so EO tracks the hand-written map more closely and
Monocle trails, widening to ~4.7× by 512 line items (each element pays a LineItem
copy that dwarfs the carrier difference).
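For orientation, the two hand-written references named above look roughly like this. The record shapes are simplified assumptions (the benchmark's Order and LineItem carry more fields), and the function names are made up for the example.

```scala
// Simplified, hypothetical record shapes; the benchmark's are wider.
final case class LineItem(name: String, qty: Int)
final case class Order(id: Long, lines: List[LineItem])

// Fold reference: a bare foldLeft, one addition per element.
def sumByHand(xs: List[Int]): Int = xs.foldLeft(0)(_ + _)

// Traversal reference: map over the lines and copy each one,
// then copy the parent once to hold the new list.
def bumpQtyByHand(o: Order): Order =
  o.copy(lines = o.lines.map(l => l.copy(qty = l.qty + 1)))
```

The traversal reference allocates one LineItem per element plus one Order, which is the per-element copy cost the paragraph above refers to.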
Integration: edit without decoding
The three integration benches share one realistic Order document — deep
(customer.address.street, depth 3), wide (≥5 fields per level), and arrayed
(lines) — so the JSON/Avro backends are directly comparable. Baselines per
metric: eo (the cats-eo optic, plus the default Ior-bearing surface where it
differs), naive (decode → copy → re-encode), monocle (decode → Monocle
optic → encode), and backend-specific honest comparators (circe's hcursor /
direct AST edits; jsoniter's hand-rolled partial-scan native).
The thesis, in one line: a pinpoint read/edit through cats-eo is flat in document size, while decode-modify-encode is linear — so the advantage is small on tiny payloads and enormous on large ones. The exception is whole-array work, where every approach must visit every element.
circe — JsonPrism / JsonTraversal over Json
Scalar deep edit, customer.address.street:
| size | eo (Unsafe) | eo (Ior) | hcursor | direct | naive | monocle | eo vs naive |
|---|---|---|---|---|---|---|---|
| 8 | 1 050 | 1 059 | 1 080 | 1 013 | 3 748 | 3 769 | 3.6× |
| 64 | 1 064 | 1 062 | 1 082 | 1 009 | 24 147 | 24 510 | 23× |
| 512 | 1 058 | 1 059 | 1 068 | 1 015 | 220 535 | 212 179 | 208× |
The edit is flat (~1.05 µs at any size); naive / monocle scale with the
whole payload. direct JsonObject surgery is the fastest hand form (what
JsonPrism mirrors); hcursor is competitive. On this scalar path the Ior
surface is within noise of *Unsafe (≤~10 ns); the per-element Ior cost shows
up only on the array traversal below.
Array write-traversal, lines[*].name:
| size | eo (Unsafe) | eo (Ior) | hcursor | direct | naive | monocle |
|---|---|---|---|---|---|---|
| 8 | 4 273 | 4 335 | 4 076 | 4 117 | 3 953 | 4 328 |
| 64 | 30 927 | 32 828 | 29 733 | 29 729 | 26 514 | 26 713 |
| 512 | 246 620 | 274 840 | 248 995 | 244 976 | 223 026 | 259 373 |
Honest result: a whole-array rewrite is O(elements) for everyone, so the cursor
walk has no structural edge — but EO's JsonTraversal now lands right on the
hand-written cursor / AST forms (eo ≈ hcursor ≈ direct) and, at size 512, beats
decode-modify-encode through Monocle (at 64 Monocle is still ahead), after the per-element path stopped
re-allocating its walk state (JsonWalk uses flat index loops, not per-element
Array→Vector/zip/foldRight). It still trails naive by ~1.1× at 512 —
a bulk decode-map-encode is the most cache-friendly way to rewrite every
element — so reach for JSON traversals for composition, diagnostics, and
pinpoint edits, not raw whole-array throughput. (Avro below is the opposite,
because its per-element decode is so costly.)
avro — AvroPrism / AvroTraversal over IndexedRecord
Scalar customer.address.street (loyaltyId is omitted — kindlings encodes
Option as a union, navigated via .union[Branch]):
| size | eo read | naive read | eo modify | naive modify | read speedup | modify speedup |
|---|---|---|---|---|---|---|
| 8 | 37.7 | 3 171 | 143 | 4 501 | 84× | 31× |
| 64 | 37.8 | 18 905 | 145 | 25 774 | 500× | 178× |
| 512 | 37.7 | 143 746 | 144 | 220 923 | 3 813× | 1 539× |
The flat-vs-linear story at its extreme: a field read is ~38 ns regardless of
record size, a ~3 800× gap by 512 line items. Array write lines[*].name:
| size | eo | naive | monocle | eo speedup |
|---|---|---|---|---|
| 8 | 669 | 4 676 | 5 020 | 7.0× |
| 64 | 5 149 | 27 417 | 29 271 | 5.3× |
| 512 | 40 177 | 228 882 | 265 722 | 5.7× |
EO is ~5–7× faster than the hand-written decode-modify-encode even on a full-array
write — Avro's per-element decode is costly enough that walking to the focused
leaf and rebuilding one parent beats decoding every line item. (monocle here is
that same round-trip, since Monocle has no IndexedRecord carrier.)
jsoniter — JsoniterPrism / JsoniterTraversal over Array[Byte]
native is a hand-rolled JsonReader that walks to the focus and skip()s
every sibling — the optimum a jsoniter expert writes, which eo automates.
| metric | size | eo | native | naive | monocle | eo vs naive |
|---|---|---|---|---|---|---|
| street read | 8 | 199 | 823 | 1 966 | 1 965 | 9.9× |
| street read | 64 | 199 | 4 037 | 12 431 | 12 467 | 62× |
| street read | 512 | 199 | 30 993 | 103 689 | 104 042 | 521× |
| street modify | 512 | 3 098 | — | 176 935 | 176 918 | 57× |
| lines[*].price fold | 512 | 83 125 | 72 193 | 109 393 | 473 147 | 1.3× |
The byte-walk read is flat at ~199 ns; even the hand-rolled native scan
scales (it tokenises everything it passes), so eo beats it ~4× at size 8 and
pulls further ahead with size. The modify splice is O(bytes) so it grows mildly
but stays ~57× under a re-encode. The fold must touch every element, so eo
lands near native and only ~1.3× under naive.
eo-jsoniter vs eo-circe (the JsoniterBench cross-EO comparison, on the
canonical Order swept over size 8/64/512, CI): eo-circe parses the whole
document each call (circeParse then drill), so it is O(size); the eo-jsoniter
byte-walk is flat. One-shot scalar read ($.id): jsoniter ~41 ns at every size
vs circe 4 502 → 221 163 ns, a gap that grows from ~110× (size 8) to ~5 300×
(512) as the parse cost climbs. .replace/.modify $.id: jsoniter 97 → 2 778
ns (O(bytes) splice) vs circe ~9 500 → 432 000 ns, ~95–155×. The array
lines[*].price fold narrows to ~4.7× — both must visit every element. (A
deliberately-absent $.customer.absent miss costs eo-jsoniter a flat ~182 ns;
there is no honest circe peer, since a typed codec can't drill an absent field.)
The design spike covers the carrier choice and splice mechanics.
PowerSeries traversal with downstream composition
Three composed-traversal chains. naive is the hand-written copy + map
equivalent — the baseline that matters. monocle is the same composition built
with Monocle (Lens.andThen(Traversal.fromTraverse).andThen(Lens),
Traversal.andThen(Prism), nested traversals) — these compositions are first-class
there too, so it's a fair peer. Traversal.each (carrier MultiFocus[PSVec]) is
EO's vehicle for all three; a @Setup guard asserts the three paths agree.
| Chain (bench) | N | eo | naive | monocle | eo ÷ naive |
|---|---|---|---|---|---|
| Lens → each → Lens (PowerSeriesBench) | 4 | 133 | 29 | 194 | 4.6× |
| | 16 | 293 | 111 | 541 | 2.6× |
| | 64 | 903 | 427 | 2 060 | 2.1× |
| | 256 | 3 498 | 1 687 | 21 532 | 2.1× |
| | 1024 | 15 900 | 5 811 | 58 058 | 2.7× |
| | 4096 | 62 386 | 23 131 | 185 078 | 2.7× |
| 5-hop tree (PowerSeriesNestedBench) | 4 | 728 | 145 | 1 088 | 5.0× |
| | 16 | 1 425 | 414 | 2 382 | 3.4× |
| | 64 | 4 437 | 1 479 | 8 889 | 3.0× |
| | 256 | 16 392 | 5 261 | 92 646 | 3.1× |
| | 1024 | 68 326 | 22 588 | 254 182 | 3.0× |
| each → Prism, 50/50 hit (PowerSeriesPrismBench) | 8 | 135 | 26 | 277 | 5.2× |
| | 32 | 447 | 84 | 951 | 5.3× |
| | 128 | 1 613 | 325 | 3 818 | 5.0× |
| | 512 | 7 254 | 1 294 | 32 455 | 5.6× |
| | 2048 | 29 246 | 5 590 | 97 415 | 5.2× |
Both eo and the hand-written naive track O(N). The test is per-element
cost (ns ÷ element, traversing N leaves — 4N for the nested tree): once the
fixed per-op setup amortises past the smallest N, it flattens to a constant.
| Chain | eo ns ÷ element (N small → large) | naive ns ÷ element | log-log slope (eo / naive) |
|---|---|---|---|
| Lens → each → Lens | 33 → 18 → 14 → 14 → 16 → 15 | 7.2 → 6.9 → 6.7 → 6.6 → 5.7 → 5.7 | 0.91 / 0.96 |
| 5-hop tree | 46 → 22 → 17 → 16 → 17 | 9.1 → 6.5 → 5.8 → 5.1 → 5.5 | 0.83 / 0.91 |
| each → Prism | 17 → 14 → 13 → 14 → 14 | 3.2 → 2.6 → 2.5 → 2.5 → 2.7 | 0.98 / 0.97 |
A least-squares fit on log(time) vs log(N) gives slopes of 0.83–0.98 for
every eo and naive series (a slope of 1 is exact linearity; the dip below 1 is the
small-N fixed cost, which makes the curve concave — i.e. it rules out
super-linear growth, not linearity). Error bars are ≤3.4% on all eo/naive
points, so the flattening is signal, not noise. The eo ÷ naive overhead settles
at ~2–3× on the dense Lens and nested chains and ~5× on the sparse Prism (whose
miss branch carries inherent per-element plumbing). monocle is shown for
reference; its scaling on the shared runner is noisier and not characterised here.
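The slope and per-element figures can be re-derived from the published timings. The sketch below is an ordinary least-squares fit written for this page, not the script the project uses, and its names (logLogSlope, treeN, treeEo) are hypothetical. The input is the eo column of the 5-hop tree rows from the table above.

```scala
// Least-squares slope of log(time) against log(N).
def logLogSlope(ns: Seq[Int], times: Seq[Double]): Double = {
  require(ns.length == times.length && ns.length >= 2)
  val xs  = ns.map(n => math.log(n.toDouble))
  val ys  = times.map(t => math.log(t))
  val mx  = xs.sum / xs.length
  val my  = ys.sum / ys.length
  val sxy = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sxx = xs.map(x => (x - mx) * (x - mx)).sum
  sxy / sxx
}

// eo column of the 5-hop tree rows, in ns/op.
val treeN  = Seq(4, 16, 64, 256, 1024)
val treeEo = Seq(728.0, 1425.0, 4437.0, 16392.0, 68326.0)

// Slope of the fit; a value of 1 would be exact linearity.
val treeSlope = logLogSlope(treeN, treeEo)

// Per-element cost: the nested tree traverses 4N leaves.
val treePerElement = treeN.zip(treeEo).map { case (n, t) => t / (4 * n) }
```

With these inputs treeSlope comes out at about 0.83, matching the table, and treePerElement falls from about 46 ns at N = 4 to about 17 ns at N = 1024.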
Under the hood the carrier pairs an existential leftover with a flat PSVec
focus vector (Array[AnyRef] + an (offset, length) window), and two internal
singleton markers (MultiFocusSingleton for always-hit Lens morphs,
MultiFocusPSMaybeHit for maybe-miss Prism/Optional morphs) let the hot path
write directly into pre-sized builders instead of allocating a per-element
wrapper. The full mechanics and the −59 %…−67 % optimisation history are in the
composition notes.
Plated recursion — read (universe) + write (transform) vs a hand visitor (and Monocle)
PlatedBench measures cats-eo's Plated against the hand-written recursion
("visitor") you'd write without optics — the baseline that matters — with
Monocle's monocle.function.Plated alongside for reference, over three
subjects: a normal-depth Expr tree, a degenerate deep Bin spine, and a
normal-depth circe Json tree (via the universal Plated[Json]). The Plated
carrier is PSVec-native, so neither path converts to a List and back.
| Op | Subject | eo | monocle | visitor | eo ÷ visitor |
|---|---|---|---|---|---|
| universe | Expr balanced | 113 953 | 1 702 000 | 55 248 | 2.1× |
| | Json balanced | 198 254 | 2 051 033 | 130 461 | 1.5× |
| | Bin deep spine | 121 068 | SO | 60 927 | 2.0× |
| transform | Expr balanced | 153 129 | 184 118 | 73 443 | 2.1× |
| | Bin deep spine | 132 753 | SO | 44 744 | 3.0× |
(ns/op at n=512, from the reproducible CI benchmarks workflow — 3-fork on the
shared runner, so absolute numbers run higher than a quiet desktop; the ratios are
the signal and sub-~1.5× differences aren't meaningful. SO = StackOverflowError.)
Three results:
- universe rivals the hand-written visitor. Reading children straight off the PSVec carrier (no List round-trip, explicit worklist) puts the JSON subject ~1.5× the bare visitor and Expr within ~2×: close to the recursion you'd write by hand, while staying composable through .andThen. (Monocle is ~10–15× slower here, its lazy #::: append going quadratic, and is not stack-safe: both universe and transform throw StackOverflowError on the degenerate spine at depth ≳2048. EO clears the spine at every size here and at 100k in the stack-safety test, so the deep rows compare EO against the visitor only.)
- transform sits within ~2–3× of the hand-written visitor and is stack-safe. It went from an Eval trampoline (was ~10× slower) to a hybrid: a direct call-stack recursion (≈ a hand-written rebuild, with no per-node heap Frame) while shallow, handing any subtree past a depth bound to the heap-stack machine so a degenerate spine still can't overflow. (rewrite keeps its own cats.Eval trampoline; it stays stack-safe on both the descent and a long re-fire chain, which a synchronous machine would put back on the call stack.)
- With childrenVec / rebuild (no to / from tuple per node) and leaves applied in place, allocation is ~80 B/node on Expr (visitor is 44) and ~56 B/node on the deep spine. Both paths sit within ~2–3× of the bare visitor.
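The explicit-worklist idea behind the universe result can be shown on a toy tree. This is a minimal sketch under stated assumptions: the Expr / Lit / Bin types, children, universe and spine below are hypothetical stand-ins, and children returns a List for brevity where the text says the real carrier is PSVec-native.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical toy tree; the benchmark's Expr is not reproduced here.
sealed trait Expr
final case class Lit(value: Int) extends Expr
final case class Bin(left: Expr, right: Expr) extends Expr

def children(e: Expr): List[Expr] = e match {
  case Lit(_)    => Nil
  case Bin(l, r) => List(l, r)
}

// Explicit worklist: pending nodes live on the heap, not the call stack,
// so the depth of the tree does not consume stack frames.
def universe(root: Expr): Vector[Expr] = {
  val out     = Vector.newBuilder[Expr]
  val pending = ArrayBuffer[Expr](root)
  while (pending.nonEmpty) {
    val node = pending.remove(pending.length - 1)
    out += node
    // Push in reverse so the left child is visited first (pre-order).
    children(node).reverseIterator.foreach(pending += _)
  }
  out.result()
}

// A degenerate left spine of the given depth, built iteratively.
def spine(depth: Int): Expr =
  (1 to depth).foldLeft(Lit(0): Expr)((acc, i) => Bin(acc, Lit(i)))
```

A naive recursive universe would recurse once per level of spine(100000); the worklist version walks it in a loop and returns all 200001 nodes.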
The residual gap to the visitor is the carrier materialisation: even on the
fast recursive path EO allocates a PSVec of children + an out array per
internal node, where a hand visitor fuses extract-and-rebuild into one
new Node(go(l), go(r)) and allocates neither — plus the heap stack the deep
fallback uses (so it never overflows; the visitor and Monocle both do on the
spine). Closing the last ~2–3× would mean fusing the recursion into the plate[S]
macro, but that emits a function, not an Optic, which would break the
.andThen composition that everything else relies on, so it's deliberately not done.
Reproducing
The integration tables are produced by the Benchmarks CI workflow
(.github/workflows/benchmarks.yml, manual workflow_dispatch) — it uploads a
jmh-results.json artifact and renders a summary table to the run page.
Locally, the bench / benchQuick sbt aliases bake in the standard config
(append a JMH filter to scope):
sbt bench # -f 3 -wi 3 -i 5, whole suite
sbt "bench .*OrderAvroBench.*" # one class
sbt benchQuick # -f 1 -wi 2 -i 3, fast + noisy
GC and stack profilers help when a number is surprising:
sbt "bench -prof gc .*LensBench.*"
sbt "bench -prof stack .*PowerSeries.*"