Benchmarks

fast-vollib ships benchmark and validation artifacts, but the exact throughput you observe depends heavily on hardware, backend choice, package versions, and whether JIT compilation has already been warmed up. This page focuses on what is reproducible from the repository and what should be cited from upstream baselines.
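
For example, a minimal timing harness that separates the compile-bearing first call from steady-state calls might look like the sketch below, where fn stands in for any fast-vollib entry point:

    import time

    def timed(fn, *args, repeats=5):
        # The first call may pay one-off JIT compilation cost
        # (Numba, torch.compile, or JAX tracing).
        t0 = time.perf_counter()
        fn(*args)
        warmup = time.perf_counter() - t0
        # Steady-state average over several repeated calls.
        t0 = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        steady = (time.perf_counter() - t0) / repeats
        return warmup, steady

Quoting only the steady-state number without saying so is one of the most common ways benchmark comparisons mislead.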


Shipped benchmark artifacts

The repository includes four reproducibility entry points:

  • notebooks/fast_vollib_comparison.ipynb: interactive notebook for pricing, IV, and Greeks comparisons.
  • scripts/benchmark.py: quick local timing script for pricing, IV, and all-Greeks runs.
  • scripts/compare_against_py_vollib_vectorized.py: numerical parity check against an installed py_vollib_vectorized.
  • scripts/wrds_benchmark.py: WRDS OptionMetrics validation script for local institutional datasets.

The notebook is the most complete entry point. The scripts are useful for quick CI-style or terminal-based checks.


Upstream baseline reference

py_vollib_vectorized publishes its own benchmarking page in its documentation.

Those published upstream timings are a useful historical CPU baseline for the Python ecosystem. The table below reproduces the values reported on that page.

Contracts  pandas.apply  for-loop     iterrows     list-comp    py_vollib_vectorized
10         0.037 s       0.023 s      0.008 s      0.023 s      0.004 s
100        0.069 s       0.226 s      0.078 s      0.225 s      0.002 s
1,000      0.652 s       2.322 s      0.797 s      2.291 s      0.003 s
10,000     6.618 s       23.350 s     8.186 s      23.146 s     0.011 s
100,000    >60 s (cap)   >60 s (cap)  >60 s (cap)  >60 s (cap)  0.095 s

One important implementation note: the current py_vollib_vectorized package uses py_lets_be_rational, an implementation of Peter Jäckel's "Let's Be Rational" algorithm, for implied volatility. It should not be described as a simple numpy.vectorize(brentq) wrapper.
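
For contrast, the naive wrapper that phrase describes would look roughly like this sketch: a scalar Brent root-finder lifted over arrays with numpy.vectorize, inverting one option at a time:

    import numpy as np
    from scipy.optimize import brentq
    from scipy.stats import norm

    def _bs_call(S, K, t, r, sigma):
        # Plain Black-Scholes call price.
        d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * t) / (sigma * np.sqrt(t))
        d2 = d1 - sigma * np.sqrt(t)
        return S * norm.cdf(d1) - K * np.exp(-r * t) * norm.cdf(d2)

    def _naive_iv(price, S, K, t, r):
        # Brent root-finding on a bracket that must straddle the true vol.
        return brentq(lambda sig: _bs_call(S, K, t, r, sig) - price, 1e-6, 5.0)

    naive_iv = np.vectorize(_naive_iv)

Let's Be Rational instead uses rational-approximation initial guesses and a small fixed number of high-order refinement steps, which is why it is both faster and accurate to machine precision.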


Running the local fast-vollib benchmarks

Quick timing pass

python scripts/benchmark.py

This script prints timing summaries for:

  • Black-Scholes pricing
  • implied volatility inversion
  • get_all_greeks

It also reports NumPy-only timings and a larger-batch timing pass.
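
The script defines its own fixture; a plausible stand-in for that kind of synthetic batch (the parameter ranges here are illustrative, not the script's actual protocol) is:

    import numpy as np

    def make_batch(n, seed=0):
        # Illustrative synthetic option grid for timing runs.
        rng = np.random.default_rng(seed)
        S = np.full(n, 100.0)               # spot
        K = rng.uniform(50.0, 150.0, n)     # strikes
        t = rng.uniform(0.01, 2.0, n)       # expiries in years
        r = np.full(n, 0.02)                # flat rate
        sigma = rng.uniform(0.05, 0.8, n)   # true vols
        return S, K, t, r, sigma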

Numerical parity against py_vollib_vectorized

python scripts/compare_against_py_vollib_vectorized.py

The expected output is the maximum absolute pricing and IV differences between the two libraries on a shared synthetic fixture.
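
A stripped-down version of that comparison might look like the sketch below. The py_vollib_vectorized call follows its documented signature; the fast_vollib call is left as a labelled placeholder, since the package's public entry point should be checked in the source:

    import numpy as np
    from py_vollib_vectorized import vectorized_implied_volatility

    price = np.array([12.5, 5.8, 2.1])
    S, K, t, r = 100.0, np.array([90.0, 100.0, 110.0]), 0.5, 0.02
    flag = ["c", "c", "c"]

    ref = vectorized_implied_volatility(price, S, K, t, r, flag,
                                        return_as="numpy")
    # iv = fast_vollib.implied_volatility(...)   # hypothetical placeholder:
    #                                            # use the actual public API
    # print("max abs IV diff:", np.max(np.abs(ref - iv)))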

WRDS validation

python scripts/wrds_benchmark.py --instrument-dir /path/to/wrds/export

This script requires local access to WRDS OptionMetrics data and is intended for institutional environments. It reports aggregate error metrics only.

Notebook workflow

Open:

notebooks/fast_vollib_comparison.ipynb

Run all cells after installing the optional extras for the backend you want to test.
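
With pip, that typically means something like the command below; the extras name here is assumed to match the backend name, so check the project metadata for the actual extras names:

    pip install "fast-vollib[torch]"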


Jäckel IV solver performance

The jackel/ module provides a machine-precision IV solver using Jäckel's "Let's Be Rational" algorithm. The tables below show the optimisation trajectory on the canonical benchmark grid (N = 100,000 options, H100 NVL GPU).

CPU chain (NumPy → Numba, 100k options)

Stage                              Time (ms)   Speedup   Max rel err
NumPy baseline (I-1)               106.5       1.0×      3.2 × 10⁻¹⁴
+ fused vega (I-3)                 88.9        1.2×      3.2 × 10⁻¹⁴
+ Numba Householder (I-4)          58.6        1.8×      3.2 × 10⁻¹⁴
+ Numba boundary kernel (I-4b)     37.0        2.9×      3.2 × 10⁻¹⁴
+ Hermite initial guess (I-4c/d)   15.5        6.9×      1.7 × 10⁻¹⁵
+ Numba Hermite kernel (I-4e)      11.6        9.2×      1.7 × 10⁻¹⁵
+ Numba preproc/postproc (I-4f)    8.5         12.5×     2.2 × 10⁻¹⁵
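
The "fused vega" stage refers to computing price and vega together so each solver iteration reuses shared intermediates instead of recomputing d1 and d2 twice. A simplified Newton-style sketch of that idea (the shipped solver uses higher-order Householder steps and a Hermite initial guess, not this loop):

    import numpy as np
    from scipy.stats import norm

    def bs_call_and_vega(S, K, t, r, sigma):
        # One pass yields both price and vega via shared d1/d2.
        sqrt_t = np.sqrt(t)
        d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * t) / (sigma * sqrt_t)
        d2 = d1 - sigma * sqrt_t
        price = S * norm.cdf(d1) - K * np.exp(-r * t) * norm.cdf(d2)
        vega = S * norm.pdf(d1) * sqrt_t
        return price, vega

    def newton_iv(target, S, K, t, r, sigma0=0.2, n_iter=8):
        # Vectorised Newton refinement; illustrative only.
        sigma = np.full_like(np.asarray(target, dtype=float), sigma0)
        for _ in range(n_iter):
            price, vega = bs_call_and_vega(S, K, t, r, sigma)
            sigma -= (price - target) / np.maximum(vega, 1e-12)
        return sigma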

GPU backends (100k options, CUDA events)

Backend                    Compute (ms)   Wall-clock (ms)   Max rel err
torch.compile (I-5)        2.7            4.8               2.8 × 10⁻¹⁵
JAX lax.fori_loop (I-6)    2.4            2.4               4.9 × 10⁻¹⁵
Triton single-pass (I-7)   0.056          2.1               9.3 × 10⁻¹⁴

The Triton kernel is 11× faster than the 0.636 ms Halley×8 target and 1905× faster than the CPU baseline.
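
The "Compute (ms)" column is measured with CUDA events rather than host wall-clock, which excludes launch and transfer overhead. A generic PyTorch pattern for that kind of measurement (not the repository's exact harness):

    import torch

    def cuda_time_ms(fn, *args, warmup=3, iters=10):
        # Warm-up calls absorb JIT compilation and caching effects.
        for _ in range(warmup):
            fn(*args)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(*args)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # milliseconds per call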

To reproduce:

uv run python scripts/jackel_triton_bench.py

Reporting benchmark results responsibly

When quoting fast-vollib timings, include all of the following; a sketch for capturing this context programmatically follows the list:

  • hardware model
  • backend (numpy, torch, or jax)
  • Python and package versions
  • whether the result is pre- or post-warmup
  • batch size and data generation protocol
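
A small helper can collect most of that context automatically; the sketch below is a hypothetical convenience, not part of the package, and uses only stock platform, NumPy, and torch calls:

    import platform
    import sys
    import numpy as np

    def report_environment(backend: str) -> dict:
        # Context that should accompany any quoted timing.
        info = {
            "machine": platform.machine(),
            "processor": platform.processor(),
            "python": sys.version.split()[0],
            "numpy": np.__version__,
            "backend": backend,
        }
        try:
            import torch
            info["torch"] = torch.__version__
            if torch.cuda.is_available():
                info["gpu"] = torch.cuda.get_device_name(0)
        except ImportError:
            pass
        return info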

The repository intentionally does not hard-code a single set of local fast-vollib throughput tables in the docs, because those numbers drift quickly with compiler, CUDA, and CPU/GPU changes.