Attack of the Clones: Speeding Up Coroutine Compilation

$ clang++ -std=c++20 -g2 -c coroutine.cpp

Context

At Meta:

Large C++ codebase

Backend services, infrastructure, and more

Thousands of engineers writing, building, and debugging C++ daily

C++20 coroutines as the standard async pattern

Centralized developer infrastructure

Every change is felt globally

The Problem

To improve the dev workflow, we wanted to switch from -g1 to -g2 in dev builds — any binary debuggable out of the box.

This required work across the stack:

✓ Build-level optimizations

✓ Split DWARF rollout

✓ LLDB performance improvements

✗ Compile times for sources with coroutines

-g1: line tables (baseline)

-g2: full debug info (2–3x slower)

Simulations showed a significant end-to-end build time regression.

Profiling: Narrowing to a Pass

Profiling pointed to a single pass. A representative file:

-g1: CoroSplitPass with line tables, ~3 ms

-g2: CoroSplitPass with full debug info, ~306 ms (100x slower)

What is CoroSplitPass doing that is so much more expensive with -g2?

CoroSplit: One Coroutine Becomes Many

coro [initial]: function with coroutine intrinsics

  → transform: coro [ramp]. Entry point, runs until first suspend

  → clone + transform: coro.resume. Coroutine body (state machine)

  → clone + transform: coro.destroy. Destroy coroutine state

  → clone + transform: coro.cleanup. Cleanup coroutine resources

3 clones per coroutine* × 100s of coroutines = 1000s of clones via CloneFunctionInto

The slowdown was not in the coroutine transformation — it was in cloning.

* Switch ABI (C++). Other ABIs (e.g. Swift retcon) may produce more clones.

Function Cloning & Debug Info

Instructions, BBs: explicit ownership

Debug info metadata: "sea of nodes" with implicit ownership, discovered by traversal

To clone a function, traverse from…

DISubprogram (this fn)
  → DIType, DISubprogram: reachable from fn (e.g. inlined)
  → DICompileUnit (owning CU)
      → DIType, DISubprogram, DIGlobal: retained types, imports, enums, …

Cloning policy

Clone only the function's own metadata.
Keep everything else not ours as-is.

Non-trivial for metadata: ownership is implicit, so implementing this policy requires eagerly traversing the graph to discover what's not ours.
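
A minimal sketch of why implicit ownership forces a traversal. This is a toy model under stated assumptions: `Node`, `collectReachable`, `findNotOurs`, and `ToyUnit` are hypothetical names, not LLVM classes, and only the function's own subprogram node is treated as "ours", which is simpler than the real policy.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for a debug-info metadata node; not an LLVM class.
struct Node {
  std::vector<const Node *> Refs; // outgoing references; no owner is recorded
};

// Depth-first walk that records every node reachable from Root.
void collectReachable(const Node *Root, std::set<const Node *> &Seen) {
  assert(Root && "traversal needs a starting node");
  if (!Seen.insert(Root).second)
    return; // already visited
  for (const Node *Next : Root->Refs)
    collectReachable(Next, Seen);
}

// Everything reachable from the function's subprogram except the subprogram
// itself: the "not ours" set that the policy keeps as-is.
std::set<const Node *> findNotOurs(const Node *SP) {
  std::set<const Node *> Seen;
  collectReachable(SP, Seen);
  Seen.erase(SP);
  return Seen;
}

// Tiny graph: ThisFn -> {LocalType, CU}; CU -> {OtherFn, RetainedType}.
struct ToyUnit {
  Node LocalType, OtherFn, RetainedType, CU, ThisFn;
  ToyUnit() {
    CU.Refs = {&OtherFn, &RetainedType};
    ThisFn.Refs = {&LocalType, &CU};
  }
};
```

In this toy graph, asking what is "not ours" for ThisFn returns all four other nodes. OtherFn and RetainedType are never referenced by the function directly; they are pulled in only because the walk passes through the compile unit.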

The Root Cause: O(CompileUnit) Per Clone

How the cloning policy was implemented

// For each clone:

// 1. Traverse from subprogram — cascades into
//    processCompileUnit, pulling in everything
DIFinder.processSubprogram(SP);  // O(CU)

// 2. Identity-map "not ours" into VMap
for (auto *Ty : DIFinder.types())
  VMap[Ty] = Ty;
for (auto *S : DIFinder.subprograms())
  if (S != SP)
    VMap[S] = S;
// ...more categories

// 3. Clone the function
CloneFunctionInto(NewFn, OldFn, VMap, ...);

O(CU) per clone; should be O(Function)

× processSubprogram cascades into the entire CU
× Pointer chasing across scattered heap nodes
× Clone policy baked into mutable VMap — can't share

The Optimization Journey: Four Iterations

Baseline: eager VMap pre-population, O(CU) per clone. 306 ms

Step 1: Decouple. Separate IdentityMD set, lazy identity-mapping via ValueMapper. 221 ms (1.4x)

Step 2: Reuse. Build the set once, share across all clones of a coroutine. 68 ms (4.5x)

Detour: Cache (superseded). Module-level analysis pass; worked, but invalidation was unclear. 17 ms (18x)

✓ Step 3: Simplify. From set to predicate, no traversal. O(Function) restored. 3.8 ms (80x)

Step 1: Decouple Clone Policy From Remapping State

Before: identity nodes stuffed into VMap

// Insert identity nodes into VMap upfront
for (auto *Node : FindDebugInfoToIdentityMap(Fn))
  VMap[Node] = Node;  // expensive tracking
CloneFunctionInto(NewFn, Fn, VMap, ...);

After: separate immutable set, lazy mapping

// Build set separately (still O(CU))
auto IdentityMD = FindDebugInfoToIdentityMap(Fn);

// New parameter flows through to ValueMapper
CloneFunctionInto(NewFn, Fn, VMap,
    ..., &IdentityMD);
// ↓ forwarded internally to
ValueMapper(VMap, ..., &IdentityMD);
  // identity-maps matching metadata on first use

✓ IdentityMD is immutable — decoupled from VMap
✓ Can now be shared across clones (enables Step 2)
× Still traverses module to build the set per clone

306 ms → 221 ms (1.4x)
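
A minimal sketch of the decoupling idea, assuming a toy mapper rather than LLVM's ValueMapper. `MDNode`, `IdentitySet`, and `ToyMapper` are hypothetical names introduced for illustration. The per-clone map stays mutable, while the identity set is passed as a pointer to const and consulted only when a node is first requested.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct MDNode { int Id; }; // hypothetical metadata node, not an LLVM class

using IdentitySet = std::unordered_set<const MDNode *>;

class ToyMapper {
  std::unordered_map<const MDNode *, const MDNode *> VMap; // mutable, per clone
  const IdentitySet *IdentityMD;              // read-only policy; may be null
  std::vector<std::unique_ptr<MDNode>> Owned; // storage for fresh copies

public:
  explicit ToyMapper(const IdentitySet *IdentityMD) : IdentityMD(IdentityMD) {}

  const MDNode *map(const MDNode *N) {
    assert(N && "cannot map a null node");
    auto It = VMap.find(N);
    if (It != VMap.end())
      return It->second; // already mapped
    if (IdentityMD && IdentityMD->count(N))
      return VMap[N] = N; // lazily identity-map on first use
    Owned.push_back(std::make_unique<MDNode>(*N));
    return VMap[N] = Owned.back().get(); // otherwise make a copy
  }

  std::size_t mappedCount() const { return VMap.size(); }
};
```

Because the mapper never writes to the set, two mappers can point at the same `IdentitySet`, and a node that is never requested costs nothing in either map.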

Step 2: Reuse Policy Across Clones

Before: set rebuilt inside clone loop

for (auto Kind : {Resume, Destroy, Cleanup}) {
  // O(CU) traversal, once per clone!
  auto IdentityMD = CollectCommonDebugInfo(Fn);
  CloneFunctionInto(Clone, Fn, VMap,
      ..., &IdentityMD);
}

After: build once, share across clones

// O(CU) traversal, once per coroutine
auto CommonDI = CollectCommonDebugInfo(Fn);

for (auto Kind : {Resume, Destroy, Cleanup}) {
  CloneFunctionInto(Clone, Fn, VMap,
      ..., /*IdentityMD=*/&CommonDI);
}

Key observation: all clones of the same coroutine share the same module-level metadata. No reason to rebuild the set for each clone.

✓ Eliminates redundant traversals per coroutine
× Still O(CU) once per coroutine
× Hundreds of coroutines × O(CU) still adds up

306 ms → 68 ms (4.5x)
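
A rough cost model for the before/after above, counting node visits instead of milliseconds. `ToyCost` and both functions are hypothetical, and the model assumes each build of the set touches every node reachable through the compile unit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cost model: one build of the identity set costs CUNodes visits.
struct ToyCost {
  std::size_t CUNodes;       // metadata nodes reachable via the compile unit
  std::size_t ClonesPerCoro; // 3 for the switch ABI (resume, destroy, cleanup)
};

// "Before": the set is rebuilt inside the clone loop.
std::size_t visitsRebuiltPerClone(const ToyCost &C, std::size_t Coroutines) {
  std::size_t Visits = 0;
  for (std::size_t Coro = 0; Coro < Coroutines; ++Coro)
    for (std::size_t Clone = 0; Clone < C.ClonesPerCoro; ++Clone)
      Visits += C.CUNodes; // one full traversal per clone
  return Visits;
}

// "After": the set is built once per coroutine and shared by its clones.
std::size_t visitsBuiltOncePerCoro(const ToyCost &C, std::size_t Coroutines) {
  assert(C.ClonesPerCoro > 0 && "a coroutine has at least one clone");
  std::size_t Visits = 0;
  for (std::size_t Coro = 0; Coro < Coroutines; ++Coro)
    Visits += C.CUNodes; // one traversal, reused for every clone
  return Visits;
}
```

With three clones per coroutine the model predicts a 3x drop in traversal work, which is in the same range as the measured 221 ms to 68 ms improvement. The remaining cost still scales with coroutines × CU size.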

Detour: Cache It (superseded)

DebugInfoCacheAnalysis: primed DIFinder per compile unit, computed once at module level

  ↓ copy & reuse

coro_1, coro_2, coro_3, …, coro_N: each cloned into .resume, .destroy, .cleanup

306 ms → 17 ms (18x)

✓ Fast in practice
× No clean invalidation story

They say there are two hard problems in CS…

Thanks to @felipepiovezan for the invalidation question that prompted rethinking this approach
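
A toy illustration of the invalidation concern, not the DebugInfoCacheAnalysis implementation. `CompileUnit` and `ToyDebugInfoCache` are hypothetical: the cache snapshots a compile unit's contents on first use and has no way to notice later changes.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical compile unit: just the names of the nodes it reaches.
using CompileUnit = std::set<std::string>;

class ToyDebugInfoCache {
  std::map<const CompileUnit *, std::set<std::string>> Primed;

public:
  // Returns the snapshot for CU, taking it on first request.
  const std::set<std::string> &get(const CompileUnit &CU) {
    auto It = Primed.find(&CU);
    if (It == Primed.end())
      It = Primed.emplace(&CU, CU).first; // O(CU) copy, paid once
    return It->second;
  }
};
```

If a node is added to the compile unit after the first `get`, later calls still return the old snapshot. Something has to decide when to drop the entry, and that is the part with no clean answer here.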

Step 3: Express Policy as a Predicate

Before: const MetadataSetTy *IdentityMD
→
After: const MetadataPredicate *IdentityMD

× No eager CU traversal
× No Set
× No Cache
✓ O(Function)

306 ms → 3.8 ms (80x)

Conceptual sketch of the predicate

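// Handling of Changes (the existing CloneFunctionChangeType enum) is
// elided in this sketch.
auto createIdentityMDPredicate(const Function &F,
                               CloneFunctionChangeType Changes) {
  auto *SP = F.getSubprogram();
  return [=](const Metadata *MD) {
    // Compile units and types are never ours: keep as-is.
    if (isa<DICompileUnit>(MD) || isa<DIType>(MD))
      return true;
    // Other subprograms (e.g. inlined) are kept; only SP is cloned.
    if (auto *S = dyn_cast<DISubprogram>(MD))
      return S != SP;
    // Local scopes follow their enclosing subprogram.
    if (auto *LS = dyn_cast<DILocalScope>(MD))
      return LS->getSubprogram() != SP;

    return false;  // not identity-mapped; remapped as usual
  };
}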
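
The same decision logic can be tried out without LLVM headers. The sketch below is a standalone toy: `Kind`, `ToyMD`, and `makeIdentityPredicate` are hypothetical stand-ins for the metadata classes and the predicate factory, and each local scope is assumed to carry a direct pointer to its enclosing subprogram.

```cpp
#include <cassert>
#include <functional>

// Hypothetical node kinds mirroring the cases in the sketch above.
enum class Kind { CompileUnit, Type, Subprogram, LocalScope, Other };

struct ToyMD {
  Kind K;
  const ToyMD *Subprogram = nullptr; // for LocalScope: enclosing subprogram
};

using ToyPredicate = std::function<bool(const ToyMD *)>;

// Returns true for nodes that should be kept as-is (identity-mapped).
ToyPredicate makeIdentityPredicate(const ToyMD *SP) {
  return [SP](const ToyMD *MD) {
    assert(MD && "predicate expects a node");
    switch (MD->K) {
    case Kind::CompileUnit:
    case Kind::Type:
      return true; // never owned by a single function
    case Kind::Subprogram:
      return MD != SP; // other subprograms are kept
    case Kind::LocalScope:
      return MD->Subprogram != SP; // scopes follow their subprogram
    case Kind::Other:
      return false; // cloned with the function
    }
    return false;
  };
}
```

Each call inspects only the node it is given, so the cost is proportional to the metadata the clone actually touches. No set is built and nothing needs to be cached.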

Performance Results

CoroSplitPass time (sample file, full debug info)

Baseline (eager VMap): 306 ms

Step 1 (decouple policy): 221 ms

Step 2 (reuse policy): 68 ms

Detour (cache): 17 ms

Step 3 (predicate): 3.8 ms

-g1 (line tables only): ~3 ms

80x faster: full debug info, nearly as fast as -g1
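
The speedup factors follow directly from the timings. A small helper, with hypothetical names `speedup` and `roundTo1`, makes the arithmetic checkable:

```cpp
#include <cassert>
#include <cmath>

// Ratio of the baseline time to the new time.
double speedup(double BaselineMs, double NewMs) {
  assert(NewMs > 0.0 && "timing must be positive");
  return BaselineMs / NewMs;
}

// Round to one decimal place.
double roundTo1(double X) { return std::round(X * 10.0) / 10.0; }
```

For example, `speedup(306, 221)` rounds to 1.4 and `speedup(306, 3.8)` is a little over 80. Taking the approximate -g1 figure of 3 ms at face value, `speedup(3.8, 3.0)` suggests the pass is still about 1.3x slower with full debug info than with line tables.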

Not Just Coroutines

CoroSplit

Primary motivation — high clone volume made the O(CU) cost painfully visible.

Also benefits

FunctionSpecialization — IPSCCP-driven specialization in default O1+ pipelines
ThinLTO function aliases — imported aliases are materialized via cloning
MemProf cloning — context-specific clones when enabled

* The fix landed in CloneFunctionInto / ValueMapper — generic same-module cloning infrastructure, not a coroutine-specific fast path.

Takeaways

Cloning was cheap. Ownership recovery was expensive.

The O(CU) cost came from reconstructing "not ours" from debug-info graph shape.

Separate remapping state from clone policy.

Once policy stopped living inside VMap, it became immutable, shareable, and finally a predicate.

The sharp edges are still in the representation.

Metadata ownership is still implicit, so cloning still relies on node-kind heuristics and special cases.

Thank You!

May the source be with you

Artem Pianykh

Meta — Dev Infra

pianykh.com/blog/talks/eurollvm26.html

References

RFC discourse.llvm.org — "Amortizing debug info processing cost in CoroSplit"

PR #109032 — Umbrella PR

PR #118627, #129147, #129148 — Key upstream patches