cats-eo

Benchmarks

JMH numbers from benchmarks/, in three groups:

  1. Optic micro-benchmarks — per-call overhead over hand-written field access, with Monocle alongside as the other optics library on the operations both implement.
  2. Integration: edit without decoding — one realistic Order document edited through circe / avro / jsoniter, measured against the decode-modify-encode you'd write by hand (and Monocle, which pays the same round-trip).
  3. PowerSeries composition — composed multi-focus traversal chains, measured against the hand-written copy + map and the equivalent Monocle composition.

How these were measured

All figures are average time per operation in nanoseconds (ns/op); lower is better. Absolute numbers vary by hardware — the ratios are the durable signal.

Every table comes from the reproducible benchmarks.yml CI workflow, which always runs in the same environment (ubuntu-22.04 / temurin@21, -f 3 -wi 3 -i 5, error half-widths ≤ ~5 %, mostly ≤ 2 %); anyone with Actions access can re-run it. The figures do not all come from a single run, though: each table is refreshed when its bench changes, so a table is internally consistent (one run) but two tables may come from different runs. Since each run lands on a fresh shared VM, treat cross-table absolute comparisons loosely; within a table, and for the ratios everywhere, the signal holds.

JMH caveat. A shared CI runner still isn't a tuned, quiet desktop, so read these as reproducible directional data — the shape and the ratios — not sub-nanosecond truth. See Reproducing and benchmarks/README.md.

Optic micro-benchmarks

For a single operation the hand-written baseline is direct field access (order.id) or a copy — sub-nanosecond, the floor any optic adds overhead over. The honest question these tables answer is how little an optic costs over that floor. eo is the cats-eo method; Monocle is the same operation in the other optics library, shown alongside as a reference point — the ratio column, where present, is the two libraries head to head.

Lens (Tuple2 carrier) — shallow order.id and a depth-3 customer.address.street:

| Operation | eo | Monocle | ratio (Monocle ÷ eo) |
|---|---|---|---|
| get (id) | 1.15 | 1.30 | 1.13× |
| replace (id) | 5.16 | 5.13 | 0.99× |
| modify (id) | 5.42 | 5.68 | 1.05× |
| modify (deep street) | 36.68 | 31.70 | 0.86× |

The cost over hand-written field access is essentially nil. GetReplaceLens stores get / enplace as plain fields and specialises its fused modify, so the hot path is a two-function composition — no (X, A) tuple allocation — and get lands within a few tenths of a ns of a bare order.id. Monocle sits in the same place. Monocle is faster on the depth-3 street modify (~1.2×): both rebuild the same three records through a fused composed optic (EO's inline GetReplaceLens.andThen, Monocle's composed Lens), and EO's per-hop get/enplace closure pair carries a touch more indirection than Monocle's purpose-built case class Lens — a small constant that first shows at depth 3.
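
As a rough illustration of the shape described above (not the library's actual source), a lens that keeps get and enplace as plain fields and fuses modify could look like the sketch below. SketchLens and the Order / Customer / Address case classes with their fields are hypothetical stand-ins; only the get / enplace naming and the depth-3 customer.address.street focus come from the text.

```scala
// Hypothetical stand-ins for the benchmark's Order document.
final case class Address(street: String, city: String)
final case class Customer(name: String, address: Address)
final case class Order(id: Long, customer: Customer)

// Illustrative lens (an assumption, not cats-eo's GetReplaceLens):
// both directions are plain fields, so modify is one get followed by
// one enplace, with no intermediate (X, A) tuple.
final class SketchLens[S, A](val get: S => A, val enplace: (S, A) => S) {
  def replace(a: A)(s: S): S = enplace(s, a)
  def modify(f: A => A)(s: S): S = enplace(s, f(get(s)))

  // Fused composition: read through both hops, write back through both.
  def andThen[B](that: SketchLens[A, B]): SketchLens[S, B] =
    new SketchLens[S, B](
      s => that.get(get(s)),
      (s, b) => enplace(s, that.enplace(get(s), b))
    )
}

val customer = new SketchLens[Order, Customer](_.customer, (o, c) => o.copy(customer = c))
val address  = new SketchLens[Customer, Address](_.address, (c, a) => c.copy(address = a))
val street   = new SketchLens[Address, String](_.street, (a, s) => a.copy(street = s))

// Depth-3 focus; modify rebuilds the same three records as a nested copy.
val deepStreet = customer.andThen(address).andThen(street)
```

Calling deepStreet.modify(_.toUpperCase)(order) rebuilds Address, Customer and Order once each, which is the work the depth-3 row measures on both sides.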

Prism (Either carrier) — Option[Int] plus an Either[String, Int] Right-prism:

| Operation | eo | Monocle |
|---|---|---|
| getOption Some | 1.01 | 1.15 |
| getOption None | 1.01 | 1.15 |
| reverseGet | 2.64 | 2.69 |
| Right getOption (Right) | 2.80 | 2.87 |
| Right getOption (Left) | 1.16 | 1.30 |
| Right reverseGet | 2.63 | 2.68 |

Iso (Direct carrier) — Address ↔ (String, String, String, String):

| Operation | eo | Monocle |
|---|---|---|
| get | 4.89 | 4.93 |
| reverseGet | 4.29 | 4.40 |

BijectionIso stores both directions as plain fields — same shape and direct-call hot path as Monocle's case class Iso. (Direct is now an opaque type; the wrap / unwrap at the carrier boundary are transparent inline identities, so the hot path is unchanged.)

Optional (Affine carrier) — leaf Option[String], composed through a Nested0..6 Lens chain via cross-carrier .andThen (the Morph[Tuple2, Affine] auto-lifts each hop):

| Operation | eo | Monocle |
|---|---|---|
| modify_0 (Some) | 26.94 | 23.01 |
| modify_0 (None) | 1.51 | 0.98 |
| replace_0 | 5.19 | 3.67 |
| modify_3 | 46.37 | 66.24 |
| modify_6 | 92.10 | 113.55 |
| loyaltyId (Some) | 31.38 | 20.56 |
| loyaltyId (None) | 1.67 | 1.07 |

At composition depth the per-hop cost stays low: modify_3 and modify_6 track the work of a hand-written nested copy with an Option match at the leaf, since the fused andThen composers are inline — each compose site splices distinct lambdas, so a deep chain doesn't reuse one shared andThen$$anonfun$ bytecode and trip C2's recursive-inline cap (before that, modify_6 was ~1.6× slower). Monocle lands a little behind here (modify_3 ~1.4×, modify_6 ~1.2×). Monocle is faster at the single-hit leaf (modify_0 / loyaltyId Some) — its Option-specialised internals shave the per-hit cost EO's generic Affine leaf carries to stay uniform across families. The loyaltyId rows are the canonical customer.loyaltyId: Option[String] focus (in memory — Avro omits it as a union), Some and None branches.
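
For orientation, the hand-written equivalent of the loyaltyId rows is a nested copy with an Option match at the leaf. The sketch below assumes a cut-down Order / Customer shape; every field other than customer.loyaltyId: Option[String] is hypothetical.

```scala
// Hypothetical, reduced shapes; only loyaltyId: Option[String] is from the text.
final case class Customer(name: String, loyaltyId: Option[String])
final case class Order(id: Long, customer: Customer)

// Hand-written Optional-style modify: match at the leaf, copy on the way out.
def modifyLoyaltyId(order: Order)(f: String => String): Order =
  order.customer.loyaltyId match {
    case Some(id) =>
      order.copy(customer = order.customer.copy(loyaltyId = Some(f(id))))
    case None =>
      order // miss: hand back the input untouched, nothing is rebuilt
  }
```

The None branch returns the original reference, which is why the miss rows sit near the sub-nanosecond floor while the Some rows pay for two record copies.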

Getter / Modify (Direct / ModifyF) — depth-0/3/6 over Nested. Both families compose through the fused inline andThen on their concrete subclasses (Getter / Modify), so every row builds a composed optic on both sides and dispatches through it once — apples-to-apples with Monocle's composed Getter/Setter.

| Depth | Getter eo | Getter Monocle | Setter eo | Setter Monocle |
|---|---|---|---|---|
| _0 | 0.95 | 0.54 | 2.34 | 2.34 |
| _3 | 5.12 | 8.81 | 12.21 | 26.30 |
| _6 | 11.30 | 27.50 | 26.42 | 52.26 |

At composition depth both families stay close to the hand-written baseline — a depth-N chain of .get calls, or a nested copy for the modifier. The lever is inline on the same-carrier andThen: each compose site splices a distinct lambda, so a depth-N chain becomes distinct synthetic methods per level. A plain def reuses one shared andThen$$anonfun$ bytecode across the chain, which C2 reads as recursion and caps (MaxRecursiveInlineLevel), leaving the deep tail as virtual Function1.apply; splicing distinct lambdas sidesteps that cap with no JVM flag. Modify (benchmarked as SetterBench; Monocle's family is still Setter) additionally sheds its per-hop ModifyF allocation (the fused Modify writes through modifyFn directly: depth-6 800→288 B/op). Monocle gets the same un-capped inlining from a fresh anonymous class per compose, and trails here (~1.7–2.4× at _3/_6). Monocle is faster at the scalar leaf (_0, order.id) by a few tenths of a ns — the sub-nanosecond floor where its specialised case classes shave the last field load and EO's generic carrier does not.
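
The inline lever can be sketched in a few lines of Scala 3. This is an illustration under assumptions, not the library's Getter: SketchGetter, Box, andThenShared and the sample chain are hypothetical names, and the snippet only shows where the lambdas are created, not how C2 treats them.

```scala
// Illustrative getter; an assumption, not cats-eo's Getter.
final class SketchGetter[S, A](val get: S => A) {
  // `inline` copies this body into every call site, so each composition
  // gets its own synthetic lambda rather than one shared across the chain.
  inline def andThen[B](that: SketchGetter[A, B]): SketchGetter[S, B] =
    new SketchGetter[S, B](s => that.get(get(s)))

  // Plain-def variant for contrast: every hop reuses the one lambda body
  // compiled inside this method.
  def andThenShared[B](that: SketchGetter[A, B]): SketchGetter[S, B] =
    new SketchGetter[S, B](s => that.get(get(s)))
}

// Hypothetical subject and hops.
final case class Box(n: Int)
val unbox = new SketchGetter[Box, Int](_.n)
val inc   = new SketchGetter[Int, Int](_ + 1)

// Three compose sites, three distinct spliced lambdas.
val chain       = unbox.andThen(inc).andThen(inc).andThen(inc)
val chainShared = unbox.andThenShared(inc).andThenShared(inc).andThenShared(inc)
```

Both chains compute the same result; the difference is only in how many distinct lambda bodies the JIT sees along the chain.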

Fold / Traversal — Fold foldMap(identity) over List[Int]; Traversal each.modify over the canonical Order.lines (bump each line's qty), sweeping size:

| Size | Fold eo | Fold Monocle | Traversal eo | Traversal Monocle | Traversal speedup (Monocle ÷ eo) |
|---|---|---|---|---|---|
| 8 | 20.1 | 20.4 | 122.4 | 364.0 | 3.0× |
| 64 | 325.0 | 307.1 | 954.1 | 2 373.4 | 2.5× |
| 512 | 4 535.0 | 4 494.7 | 8 153.8 | 38 221.2 | 4.7× |

The hand-written reference is a bare foldLeft for the fold and an xs.map + copy for the traversal.

Fold is on par with Monocle (0.98–1.06× across sizes): both collapse to the same cats.Foldable[List].foldMap, one Monoid.combine per element, exactly what a hand-written foldMap does. EO's Fold.apply returns a concrete ForgetFold whose eager foldMap member folds straight through the captured Foldable[F], the same stored-method shape as Monocle's Fold.fromFoldable. Before that specialisation EO routed every fold through the generic Optic.foldMap extension, and paid for it: ~5× at size 8 (109 → 20 ns), ~2.9× at 64, ~1.8× at 512. That overhead was a per-call constant (the ForgetfulFold[Forget[F]] summon, an intermediate S => M closure, and a box/unbox), not per-element, which is why it dominated small folds and faded into the asymptote on large ones. It was also invisible to -prof gc: the two were always allocation-identical at 14 080 B/op @512, and escape analysis elides the closure, so the cost was cycles, not bytes; only CI's low-noise timing resolves it.

The traversal is where the carriers genuinely diverge: EO's each (carrier MultiFocus[PSVec]) collects element references into a flat focus vector and rebuilds via Functor[PSVec].map, where Monocle wraps each element in Applicative[Id]. EO therefore tracks the hand-written map more closely and Monocle trails, widening to ~4.7× by 512 line items (each element pays a LineItem copy that dwarfs the carrier difference).
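
The two hand-written references named above are short enough to spell out. The sketch assumes a reduced Order / LineItem shape (only lines and qty come from the text; the other fields are hypothetical).

```scala
// Hypothetical, reduced shapes; the benchmark's Order has more fields.
final case class LineItem(name: String, qty: Int)
final case class Order(id: Long, lines: List[LineItem])

// Traversal reference: xs.map + copy, bumping each line's qty.
def bumpQty(order: Order): Order =
  order.copy(lines = order.lines.map(l => l.copy(qty = l.qty + 1)))

// Fold reference: a bare foldLeft over List[Int].
def sumAll(xs: List[Int]): Int =
  xs.foldLeft(0)(_ + _)
```

Each traversed element costs one LineItem copy here, which is the per-element work any optic-based traversal has to match.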

Integration: edit without decoding

The three integration benches share one realistic Order document — deep (customer.address.street, depth 3), wide (≥5 fields per level), and arrayed (lines) — so the JSON/Avro backends are directly comparable. Baselines per metric: eo (the cats-eo optic, plus the default Ior-bearing surface where it differs), naive (decode → copy → re-encode), monocle (decode → Monocle optic → encode), and backend-specific honest comparators (circe's hcursor / direct AST edits; jsoniter's hand-rolled partial-scan native).

The thesis, in one line: a pinpoint read/edit through cats-eo is flat in document size, while decode-modify-encode is linear — so the advantage is small on tiny payloads and enormous on large ones. The exception is whole-array work, where every approach must visit every element.

circe — JsonPrism / JsonTraversal over Json

Scalar deep edit, customer.address.street:

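| size | eo (Unsafe) | eo (Ior) | hcursor | direct | naive | monocle | eo vs naive |
|---|---|---|---|---|---|---|---|
| 8 | 1 050 | 1 059 | 1 080 | 1 013 | 3 748 | 3 769 | 3.6× |
| 64 | 1 064 | 1 062 | 1 082 | 1 009 | 24 147 | 24 510 | 23× |
| 512 | 1 058 | 1 059 | 1 068 | 1 015 | 220 535 | 212 179 | 208× |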

The edit is flat (~1.05 µs at any size); naive / monocle scale with the whole payload. direct JsonObject surgery is the fastest hand form (what JsonPrism mirrors); hcursor is competitive. On this scalar path the Ior surface is within noise of *Unsafe (≤~10 ns); the per-element Ior cost shows up only on the array traversal below.

Array write-traversal, lines[*].name:

| size | eo (Unsafe) | eo (Ior) | hcursor | direct | naive | monocle |
|---|---|---|---|---|---|---|
| 8 | 4 273 | 4 335 | 4 076 | 4 117 | 3 953 | 4 328 |
| 64 | 30 927 | 32 828 | 29 733 | 29 729 | 26 514 | 26 713 |
| 512 | 246 620 | 274 840 | 248 995 | 244 976 | 223 026 | 259 373 |

Honest result: a whole-array rewrite is O(elements) for everyone, so the cursor walk has no structural edge. EO's JsonTraversal now lands right on the hand-written cursor / AST forms (eo ≈ hcursor ≈ direct) and, at size 512, beats decode-modify-encode through Monocle, after the per-element path stopped re-allocating its walk state (JsonWalk uses flat index loops, not per-element Array→Vector/zip/foldRight). It still trails naive by ~1.1× at 512, because a bulk decode-map-encode is the most cache-friendly way to rewrite every element, so reach for JSON traversals for composition, diagnostics, and pinpoint edits, not raw whole-array throughput. (Avro below is the opposite, because its per-element decode is so costly.)

avro — AvroPrism / AvroTraversal over IndexedRecord

Scalar customer.address.street (loyaltyId is omitted — kindlings encodes Option as a union, navigated via .union[Branch]):

| size | eo read | naive read | eo modify | naive modify | read speedup | modify speedup |
|---|---|---|---|---|---|---|
| 8 | 37.7 | 3 171 | 143 | 4 501 | 84× | 31× |
| 64 | 37.8 | 18 905 | 145 | 25 774 | 500× | 178× |
| 512 | 37.7 | 143 746 | 144 | 220 923 | 3 813× | 1 539× |

The flat-vs-linear story at its extreme: a field read is ~38 ns regardless of record size, a ~3 800× gap by 512 line items. Array write lines[*].name:

| size | eo | naive | monocle | eo speedup |
|---|---|---|---|---|
| 8 | 669 | 4 676 | 5 020 | 7.0× |
| 64 | 5 149 | 27 417 | 29 271 | 5.3× |
| 512 | 40 177 | 228 882 | 265 722 | 5.7× |

EO is ~5–7× faster than the hand-written decode-modify-encode even on a full-array write — Avro's per-element decode is costly enough that walking to the focused leaf and rebuilding one parent beats decoding every line item. (monocle here is that same round-trip, since Monocle has no IndexedRecord carrier.)

jsoniter — JsoniterPrism / JsoniterTraversal over Array[Byte]

native is a hand-rolled JsonReader that walks to the focus and skip()s every sibling — the optimum a jsoniter expert writes, which eo automates.

| metric | size | eo | native | naive | monocle | eo vs naive |
|---|---|---|---|---|---|---|
| street read | 8 | 199 | 823 | 1 966 | 1 965 | 9.9× |
| street read | 64 | 199 | 4 037 | 12 431 | 12 467 | 62× |
| street read | 512 | 199 | 30 993 | 103 689 | 104 042 | 521× |
| street modify | 512 | 3 098 | | 176 935 | 176 918 | 57× |
| lines[*].price fold | 512 | 83 125 | 72 193 | 109 393 | 473 147 | 1.3× |

The byte-walk read is flat at ~199 ns; even the hand-rolled native scan scales (it tokenises everything it passes), so eo beats it ~4× at size 8 and pulls further ahead with size. The modify splice is O(bytes) so it grows mildly but stays ~57× under a re-encode. The fold must touch every element, so eo lands near native and only ~1.3× under naive.

eo-jsoniter vs eo-circe (the JsoniterBench cross-EO comparison, on the canonical Order swept over size 8/64/512, CI): eo-circe parses the whole document each call (circeParse then drill), so it is O(size); the eo-jsoniter byte-walk is flat. One-shot scalar read ($.id): jsoniter ~41 ns at every size vs circe 4 502 → 221 163 ns, a gap that grows from ~110× (size 8) to ~5 300× (512) as the parse cost climbs. .replace/.modify $.id: jsoniter 97 → 2 778 ns (O(bytes) splice) vs circe ~9 500 → 432 000 ns, ~95–155×. The array lines[*].price fold narrows to ~4.7× — both must visit every element. (A deliberately-absent $.customer.absent miss costs eo-jsoniter a flat ~182 ns; there is no honest circe peer, since a typed codec can't drill an absent field.) The design spike covers the carrier choice and splice mechanics.

PowerSeries traversal with downstream composition

Three composed-traversal chains. naive is the hand-written copy + map equivalent — the baseline that matters. monocle is the same composition built with Monocle (Lens.andThen(Traversal.fromTraverse).andThen(Lens), Traversal.andThen(Prism), nested traversals) — these compositions are first-class there too, so it's a fair peer. Traversal.each (carrier MultiFocus[PSVec]) is EO's vehicle for all three; a @Setup guard asserts the three paths agree.

| Chain (bench) | N | eo | naive | monocle | eo ÷ naive |
|---|---|---|---|---|---|
| Lens → each → Lens (PowerSeriesBench) | 4 | 133 | 29 | 194 | 4.6× |
| | 16 | 293 | 111 | 541 | 2.6× |
| | 64 | 903 | 427 | 2 060 | 2.1× |
| | 256 | 3 498 | 1 687 | 21 532 | 2.1× |
| | 1024 | 15 900 | 5 811 | 58 058 | 2.7× |
| | 4096 | 62 386 | 23 131 | 185 078 | 2.7× |
| 5-hop tree (PowerSeriesNestedBench) | 4 | 728 | 145 | 1 088 | 5.0× |
| | 16 | 1 425 | 414 | 2 382 | 3.4× |
| | 64 | 4 437 | 1 479 | 8 889 | 3.0× |
| | 256 | 16 392 | 5 261 | 92 646 | 3.1× |
| | 1024 | 68 326 | 22 588 | 254 182 | 3.0× |
| each → Prism, 50/50 hit (PowerSeriesPrismBench) | 8 | 135 | 26 | 277 | 5.2× |
| | 32 | 447 | 84 | 951 | 5.3× |
| | 128 | 1 613 | 325 | 3 818 | 5.0× |
| | 512 | 7 254 | 1 294 | 32 455 | 5.6× |
| | 2048 | 29 246 | 5 590 | 97 415 | 5.2× |

Both eo and the hand-written naive track O(N). The test is per-element cost (ns ÷ element, traversing N leaves — 4N for the nested tree): once the fixed per-op setup amortises past the smallest N, it flattens to a constant.

| Chain | eo ns ÷ element (N small → large) | naive ns ÷ element (N small → large) | log-log slope (eo / naive) |
|---|---|---|---|
| Lens → each → Lens | 33 → 18 → 14 → 14 → 16 → 15 | 7.2 → 6.9 → 6.7 → 6.6 → 5.7 → 5.7 | 0.91 / 0.96 |
| 5-hop tree | 46 → 22 → 17 → 16 → 17 | 9.1 → 6.5 → 5.8 → 5.1 → 5.5 | 0.83 / 0.91 |
| each → Prism | 17 → 14 → 13 → 14 → 14 | 3.2 → 2.6 → 2.5 → 2.5 → 2.7 | 0.98 / 0.97 |

A least-squares fit on log(time) vs log(N) gives slopes of 0.83–0.98 for every eo and naive series (a slope of 1 is exact linearity; the dip below 1 is the small-N fixed cost, which makes the curve concave, i.e. it rules out super-linear growth, not linearity). Error bars are ≤3.4% on all eo/naive points, so the flattening is signal, not noise. The eo ÷ naive overhead settles at ~2–3× on the dense Lens and nested chains and ~5× on the sparse Prism (whose miss branch carries inherent per-element plumbing). monocle is shown for reference; its scaling on the shared runner is noisier and not characterised here.
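
The per-element figures and the slope can be recomputed from the published rows. The sketch below does so for the Lens → each → Lens chain using an ordinary least-squares fit; logLogSlope and the value names are illustrative helpers, not part of the benchmark suite.

```scala
// Ordinary least-squares slope of log(time) against log(N).
def logLogSlope(ns: Seq[Double], times: Seq[Double]): Double = {
  val xs = ns.map(math.log)
  val ys = times.map(math.log)
  val mx = xs.sum / xs.size
  val my = ys.sum / ys.size
  val sxy = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sxx = xs.map(x => (x - mx) * (x - mx)).sum
  sxy / sxx
}

// Lens → each → Lens rows from the PowerSeries table (ns/op).
val sizes = Seq(4.0, 16.0, 64.0, 256.0, 1024.0, 4096.0)
val eo    = Seq(133.0, 293.0, 903.0, 3498.0, 15900.0, 62386.0)
val naive = Seq(29.0, 111.0, 427.0, 1687.0, 5811.0, 23131.0)

// ns per element: falls from about 33 at N = 4 to about 15 at N = 4096.
val eoPerElement = sizes.zip(eo).map { case (n, t) => t / n }

val eoSlope    = logLogSlope(sizes, eo)    // lands near the 0.91 in the table
val naiveSlope = logLogSlope(sizes, naive) // lands near the 0.96 in the table
```

A slope just under 1 together with a flattening per-element cost is the pattern of linear growth plus a fixed setup cost.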

Under the hood the carrier pairs an existential leftover with a flat PSVec focus vector (Array[AnyRef] + an (offset, length) window), and two internal singleton markers (MultiFocusSingleton for always-hit Lens morphs, MultiFocusPSMaybeHit for maybe-miss Prism/Optional morphs) let the hot path write directly into pre-sized builders instead of allocating a per-element wrapper. The full mechanics and the −59 %…−67 % optimisation history are in the composition notes.
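
To make the flat focus vector concrete, here is a minimal sketch of an Array[AnyRef] viewed through an (offset, length) window. WindowVec and its methods are hypothetical; the real PSVec, the existential leftover and the singleton markers are not modelled.

```scala
// Illustrative only: a flat vector as a window over a shared backing array.
final class WindowVec(arr: Array[AnyRef], val offset: Int, val length: Int) {
  def apply(i: Int): AnyRef = arr(offset + i)

  // Narrowing the window shares the backing array; nothing is copied.
  def slice(from: Int, until: Int): WindowVec =
    new WindowVec(arr, offset + from, until - from)

  // Map into a pre-sized output array: one write per element,
  // no per-element wrapper object.
  def map(f: AnyRef => AnyRef): WindowVec = {
    val out = new Array[AnyRef](length)
    var i = 0
    while (i < length) {
      out(i) = f(arr(offset + i))
      i += 1
    }
    new WindowVec(out, 0, length)
  }

  def toList: List[AnyRef] = List.tabulate(length)(i => apply(i))
}
```

Because slice only adjusts the window, a sub-range of foci can be handed downstream without allocating a second array.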

Plated recursion — read (universe) + write (transform) vs a hand visitor (and Monocle)

PlatedBench measures cats-eo's Plated against the hand-written recursion ("visitor") you'd write without optics — the baseline that matters — with Monocle's monocle.function.Plated alongside for reference, over three subjects: a normal-depth Expr tree, a degenerate deep Bin spine, and a normal-depth circe Json tree (via the universal Plated[Json]). The Plated carrier is PSVec-native, so neither path converts to a List and back.

| Op | Subject | eo | monocle | visitor | eo ÷ visitor |
|---|---|---|---|---|---|
| universe | Expr balanced | 113 953 | 1 702 000 | 55 248 | 2.1× |
| | Json balanced | 198 254 | 2 051 033 | 130 461 | 1.5× |
| | Bin deep spine | 121 068 | SO | 60 927 | 2.0× |
| transform | Expr balanced | 153 129 | 184 118 | 73 443 | 2.1× |
| | Bin deep spine | 132 753 | SO | 44 744 | 3.0× |

(ns/op at n=512, from the reproducible CI benchmarks workflow — 3-fork on the shared runner, so absolute numbers run higher than a quiet desktop; the ratios are the signal and sub-~1.5× differences aren't meaningful. SO = StackOverflowError.)

Three results:

The residual gap to the visitor is the carrier materialisation: even on the fast recursive path EO allocates a PSVec of children + an out array per internal node, where a hand visitor fuses extract-and-rebuild into one new Node(go(l), go(r)) and allocates neither. On top of that comes the heap stack the deep fallback uses (so it never overflows; the visitor and Monocle both do on the spine). Closing the last ~2–3× would mean fusing the recursion into the plate[S] macro, but that emits a function, not an Optic, which would break the .andThen composition that everything else relies on, so it's deliberately not done.
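
The two ends of that trade-off can be sketched without any optics. The Expr / Leaf / Node tree, transformVisitor and universe below are hypothetical illustrations of a fused hand visitor and of a heap-stack walk; they are not the PlatedBench subjects or EO's implementation.

```scala
// Hypothetical tree; the benchmark's Expr and Bin subjects differ.
sealed trait Expr
final case class Leaf(n: Int) extends Expr
final case class Node(l: Expr, r: Expr) extends Expr

// Hand visitor: extract and rebuild fused into one recursive call.
// It runs on the JVM call stack, so a deep enough spine overflows it.
def transformVisitor(e: Expr)(f: Int => Int): Expr = e match {
  case Leaf(n)    => Leaf(f(n))
  case Node(l, r) => Node(transformVisitor(l)(f), transformVisitor(r)(f))
}

// Heap-stack universe: pending nodes sit in an explicit list on the heap,
// so depth is bounded by memory rather than by the thread stack.
def universe(root: Expr): Vector[Expr] = {
  val out = Vector.newBuilder[Expr]
  var pending: List[Expr] = root :: Nil
  while (pending.nonEmpty) {
    val e = pending.head
    pending = pending.tail
    out += e
    e match {
      case Node(l, r) => pending = l :: r :: pending
      case Leaf(_)    => ()
    }
  }
  out.result()
}
```

The visitor allocates only the rebuilt nodes; the heap-stack walk pays for its pending list and output vector, but stays safe on a left-leaning spine of any depth that fits in memory.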

Reproducing

The integration tables are produced by the Benchmarks CI workflow (.github/workflows/benchmarks.yml, manual workflow_dispatch) — it uploads a jmh-results.json artifact and renders a summary table to the run page.

Locally, the bench / benchQuick sbt aliases bake in the standard config (append a JMH filter to scope):

sbt bench                          # -f 3 -wi 3 -i 5, whole suite
sbt "bench .*OrderAvroBench.*"     # one class
sbt benchQuick                     # -f 1 -wi 2 -i 3, fast + noisy

GC and stack profilers help when a number is surprising:

sbt "bench -prof gc .*LensBench.*"
sbt "bench -prof stack .*PowerSeries.*"