Summary
SCCACHE_GHA_USE_PREPROCESSOR_CACHE_MODE=true is enabled in build-wheel.yml#L58, but the preprocessor cache has never served a hit in any CI run I've sampled (PR #2218 today, main today 00:23 UTC, main on 2026-05-27). Every compiled translation unit registers as a preprocessor cache miss. The downstream object cache still hits ~100% via GHAC, so wheel build correctness is fine, but the preprocessor stage is pure overhead.
Evidence
From job Build linux-64, CUDA 13.3.0 / py3.12 (PR #2218):
Metric                   cuda.bindings   cuda.core   cuda.core (prev CTK)
Preprocessor hits        0               0           0
Preprocessor misses      29              42          42
Cache hits               29              42          40
Cache hits rate          100.00 %        100.00 %    95.24 %
Avg preprocessor miss    0.094 s         17.993 s    14.148 s
Same pattern in main run 27516906885 / job 81327125521 and main run 26482898891 / job 77984135920 (3 weeks ago) — 0 hits in every case.
Root cause
pip's per-run build-isolation overlay lives at /tmp/pip-build-env-<RANDOM>/overlay/... and that random path lands in -I arguments of every c++ invocation, e.g.:
c++ ... -I/tmp/build-env-_283dvkj/include ... -c build/cython/cuda/bindings/runtime.cpp
rapidsai/sccache hashes the raw compiler arguments into the preprocessor cache key without any basedir stripping for flag values — see preprocessor_cache_entry_hash_key in preprocessor_cache.rs#L493-L497. Different random -I path each CI run → different preprocessor key → guaranteed miss. The object cache survives because its key is derived from the preprocessor output, which is byte-identical between runs (the overlay dir contributes no headers to the expansion).
SCCACHE_BASEDIRS would not help either: it strips paths from the input file only, not from flag values. (ccache's CCACHE_BASEDIR does both — sccache lacks the equivalent.)
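The mechanism above can be illustrated with a minimal sketch. It is not sccache's actual hashing code: `naive_key`, `normalized_key`, and the `ENV_DIR` pattern are hypothetical names introduced here, and the second argument list uses a made-up random suffix to stand in for a later CI run. It only shows that hashing raw flag values turns a per-run directory name into a per-run key, while normalizing that directory first would make the keys agree.

```python
import hashlib
import re


def naive_key(args):
    """Hash the raw argument list (stand-in for a key built from unnormalized flags)."""
    h = hashlib.sha256()
    for arg in args:
        h.update(arg.encode())
        h.update(b"\0")  # separator so argument boundaries affect the hash
    return h.hexdigest()


# Hypothetical pattern for the randomly named build-isolation directory.
ENV_DIR = re.compile(r"/tmp/(?:pip-)?build-env-[^/]+")


def normalized_key(args):
    """Same hash, but with the random directory replaced by a fixed placeholder."""
    return naive_key([ENV_DIR.sub("/tmp/build-env", arg) for arg in args])


# First command line is taken from the log excerpt above; the second uses an
# invented suffix to represent a different CI run.
run_a = ["c++", "-I/tmp/build-env-_283dvkj/include", "-c",
         "build/cython/cuda/bindings/runtime.cpp"]
run_b = ["c++", "-I/tmp/build-env-x9q1k2ab/include", "-c",
         "build/cython/cuda/bindings/runtime.cpp"]

print("raw keys equal:       ", naive_key(run_a) == naive_key(run_b))
print("normalized keys equal:", normalized_key(run_a) == normalized_key(run_b))
```

With the raw arguments the two runs produce different keys; once the directory is normalized they match. That normalization is the behavior the preprocessor key would need for flag values.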
Perf implication
Lookups aren't free. In the linked job:
cuda.core: 42 misses × ~18 s avg lookup ≈ 12.6 min wasted
cuda.core (prev CTK): 42 misses × ~14 s avg lookup ≈ 9.9 min wasted
So preprocessor cache mode currently costs roughly 20 min per Linux job (12.6 + 9.9 ≈ 22.5 min in this sample) without delivering a single hit. Multiplied across the build matrix (6 Python versions × 2 arches × 2 CTKs in the wheel jobs), this is a non-trivial CI cost.
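For anyone who wants to re-check the arithmetic, here is a small sketch using the numbers from the Evidence table. It assumes, as the estimate above does, that the "Avg preprocessor miss" figure is time spent per miss that would be saved with the mode disabled. The names `stats` and `wasted_minutes` are introduced here for illustration only.

```python
# (misses, average seconds per miss), copied from the Evidence table.
stats = {
    "cuda.bindings": (29, 0.094),
    "cuda.core": (42, 17.993),
    "cuda.core (prev CTK)": (42, 14.148),
}


def wasted_minutes(misses, avg_seconds):
    """Total time attributed to preprocessor cache misses, in minutes."""
    return misses * avg_seconds / 60


per_package = {name: wasted_minutes(m, s) for name, (m, s) in stats.items()}
total = sum(per_package.values())

for name, minutes in per_package.items():
    print(f"{name}: {minutes:.1f} min")
print(f"total for this job: {total:.1f} min")
```

The cuda.bindings contribution is negligible (a few seconds), so the per-job total is dominated by the two cuda.core builds.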
Proposed fix
Quick win: stop setting SCCACHE_GHA_USE_PREPROCESSOR_CACHE_MODE until the underlying issue is addressed. PR to follow.
Real fix (separate work): switch cibuildwheel to --no-build-isolation, so pip's overlay dir never appears in -I to begin with. That also unlocks reusing the scikit-build / Cython build dir across runs.