Jäckel IV — Machine-Precision Implied Volatility¶
The fast_vollib.jackel module is a standalone, full-precision implied-volatility
solver based on Peter Jäckel's "Let's Be Rational" (2015) algorithm. It solves
the normalised Black call problem to machine precision (max relative error
~10⁻¹⁴) using a four-branch rational initial guess and three Householder(3)
iterations — and it does so in a single GPU kernel pass with Triton.
Why a separate solver?¶
The production backends/ IV solver uses eight Halley iterations with
a compiled bisection fallback. It is fast and robust for typical market data but
does not formally guarantee machine precision for all inputs.
The jackel/ module provides:
| Property | backends/ Halley×8 | jackel/ Householder(3)×3 |
|---|---|---|
| Algorithm | Halley + bisection | Jäckel LBR rational guess + HH(3) |
| Max relative error | ~10⁻⁸ (typical) | ~10⁻¹⁴ (guaranteed) |
| Convergence | 8 iterations + fallback | 3 Householder(3) steps |
| CPU (NumPy + Numba) | ~8 ms / 100k | ~8.5 ms / 100k |
| GPU — torch.compile | ~2.8 ms / 100k | ~2.7 ms / 100k |
| GPU — JAX jit | ~3.85 ms / 100k | ~2.4 ms / 100k |
| GPU — Triton (compute) | 0.636 ms / 100k | 0.056 ms / 100k |
Numbers measured on an NVIDIA H100 NVL, N = 100,000 options, canonical near-ATM grid (K = F·exp(U(−0.3, 0.3)), σ = U(0.10, 0.60)).
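The benchmark grid above can be reproduced in a few lines. A minimal sketch; the function name and seed are illustrative, not part of the library:

```python
import math
import random

def canonical_grid(n=100_000, F=100.0, seed=0):
    # Canonical near-ATM grid used for the timings above:
    # K = F * exp(U(-0.3, 0.3)), sigma = U(0.10, 0.60).
    # Hypothetical helper, not a fast_vollib API.
    rng = random.Random(seed)
    K = [F * math.exp(rng.uniform(-0.3, 0.3)) for _ in range(n)]
    sigma = [rng.uniform(0.10, 0.60) for _ in range(n)]
    return K, sigma
```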
Algorithm overview¶
The algorithm operates in the normalised Black domain to decouple the IV problem from forward price and time-to-expiry:
1. Normalisation: reduce any call/put/ITM/OTM input to an OTM call with normalised price β = price / √(FK) and reduced log-moneyness x ≤ 0.
2. Boundary values: compute the inflection point s_c = √(2|x|) and take one Newton step to the left and right to obtain (s_l, β_l) and (s_h, β_h).
3. Initial guess: cubic Hermite interpolation in (β, σ) space for the interior (zones 2/3); boundary values for the extremes (zones 1/4).
4. Householder(3) × 3: three iterations of the third-order rational Householder step with a three-branch objective function (log-space for near-zero β, complementary log-space for near-β_max, linear otherwise).
5. Denormalisation: σ = σ̂ / √T.
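The Householder(3) update in step 4 has a compact rational form. With ν = −f/f′ and derivative ratios h₂ = f″/f′, h₃ = f‴/f′, one step looks like this (a generic sketch; the library's branch-specific objective and its derivatives are omitted):

```python
def householder3_step(nu, h2, h3):
    # Rational third-order Householder update: returns ds so that
    # s_new = s + ds, where nu = -f/f', h2 = f''/f', h3 = f'''/f'.
    return nu * (1.0 + 0.5 * h2 * nu) / (1.0 + nu * (h2 + h3 * nu / 6.0))
```

Applied to the three-branch objective, three such steps from the rational initial guess are enough for machine precision.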
Quick start¶
import numpy as np
from fast_vollib.jackel.jackel_iv import jackel_iv_black
F = 100.0
K = np.array([95.0, 100.0, 105.0])
T = 1.0
price = np.array([10.5196, 7.9656, 5.9064])  # undiscounted Black-76 call prices at sigma = 0.20
sigma = jackel_iv_black(price, F, K, T, is_call=True)
# array([0.2000..., 0.2000..., 0.2000...])
All inputs are undiscounted (Black-76 convention). The jackel_iv_black
function handles put-call reduction internally — pass is_call=False for puts.
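The put-call reduction works in normalised space, where parity reads β_call = β_put + (e^(x/2) − e^(−x/2)). A scalar sketch of the reduction (hypothetical helper, not the library's internal API):

```python
import math

def normalise(price, F, K, is_call=True):
    # Map any call/put/ITM/OTM input to an OTM-call problem (beta, x).
    x = math.log(F / K)                # log-moneyness
    beta = price / math.sqrt(F * K)    # normalised price
    parity = math.exp(0.5 * x) - math.exp(-0.5 * x)
    if not is_call:
        beta += parity                 # put -> call via parity
    if x > 0:
        beta -= parity                 # ITM call -> OTM call at -x
    return beta, -abs(x)
```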
API reference¶
jackel_iv_black (NumPy / Numba)¶
from fast_vollib.jackel.jackel_iv import jackel_iv_black
jackel_iv_black(
price, # undiscounted option price — ndarray or scalar
F, # forward price — float or ndarray
K, # strike — ndarray or scalar
T, # time to expiry in years — float or ndarray
is_call=True # True = call, False = put — bool or bool ndarray
) -> np.ndarray # annualised IV; NaN for degenerate inputs
Uses the NumPy + Numba JIT chain (CPU). Six parallel Numba kernels cover preproc, boundary, Hermite guess, Householder loop, and postproc.
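The NaN behaviour for degenerate inputs corresponds to a simple domain check in normalised space: an OTM-call price is attainable only if 0 < β < β_max = e^(x/2), the infinite-volatility limit. A sketch, with the helper name being illustrative:

```python
import math

def is_attainable(beta, x):
    # OTM call (x <= 0): the normalised price must lie strictly between
    # intrinsic value (0) and the s -> infinity limit b_max = exp(x/2).
    return 0.0 < beta < math.exp(0.5 * x)
```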
jackel_iv_black_torch (PyTorch)¶
import torch
from fast_vollib.jackel.torch_backend import jackel_iv_black_torch
price_t = torch.tensor([10.5196, 7.9656, 5.9064], dtype=torch.float64, device="cuda")
K_t = torch.tensor([95.0, 100.0, 105.0], dtype=torch.float64, device="cuda")
sigma_t = jackel_iv_black_torch(price_t, F=100.0, K=K_t, T=1.0, is_call=True)
The Householder loop is wrapped with torch.compile(dynamic=True) so the
erfcx evaluations and branch dispatches fuse into a single CUDA kernel.
Performance: ~2.7 ms GPU compute / 100k options (CUDA events, H100 NVL).
jackel_iv_black_jax (JAX)¶
import jax.numpy as jnp
from fast_vollib.jackel.jax_backend import jackel_iv_black_jax
price_j = jnp.array([8.02, 7.97, 5.57])
K_j = jnp.array([95.0, 100.0, 105.0])
sigma_j = jackel_iv_black_jax(price_j, F=100.0, K=K_j, T=1.0, is_call=True)
Uses jax.lax.fori_loop inside a @jax.jit-compiled function so the full
Householder loop is XLA-fused into one GPU kernel. Requires 64-bit precision,
i.e. jax.config.update("jax_enable_x64", True) (enabled automatically at import time).
Performance: ~2.4 ms / 100k options (H100 NVL), modestly faster than the PyTorch backend (~2.7 ms) thanks to more aggressive XLA loop fusion.
jackel_iv_triton (Triton — fastest)¶
import torch
from fast_vollib.jackel.triton_kernels import jackel_iv_triton
price_t = torch.tensor([10.5196, 7.9656, 5.9064], dtype=torch.float64, device="cuda")
K_t = torch.tensor([95.0, 100.0, 105.0], dtype=torch.float64, device="cuda")
sigma_t = jackel_iv_triton(price_t, F=100.0, K=K_t, T=1.0, is_call=True)
A single @triton.jit kernel fuses the entire pipeline — preproc,
boundary, Hermite guess, and three Householder(3) iterations — in one pass
over HBM. All intermediate values live in registers; there are no
intermediate writes.
Performance: ~0.056 ms GPU compute / 100k options (CUDA events, H100 NVL) — 11× faster than the 0.636 ms Halley×8 Triton target.
Triton vs. erfcx
Triton 3.x does not expose erfcx. The kernel uses the erf-based
normalised Black formula exp(x/2)·N(h+t) − exp(−x/2)·N(h−t) instead.
This is numerically stable for all standard option ranges and achieves
max relative error ~10⁻¹³.
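The erf-based formula from the note can be written down directly with the standard library (a reference sketch; the Triton kernel evaluates the same expression elementwise on the GPU):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def normalised_black(x, s):
    # b(x, s) = exp(x/2)*N(h + t) - exp(-x/2)*N(h - t),
    # with h = x/s and t = s/2.
    h, t = x / s, 0.5 * s
    return math.exp(0.5 * x) * norm_cdf(h + t) - math.exp(-0.5 * x) * norm_cdf(h - t)
```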
Backend selection¶
The Jäckel module does not go through the backends/ routing layer — each
function is called directly. Choose based on your environment:
| Situation | Recommended call |
|---|---|
| CPU only, max precision | jackel_iv_black (NumPy + Numba) |
| GPU, existing torch pipeline | jackel_iv_black_torch |
| GPU, existing JAX pipeline | jackel_iv_black_jax |
| GPU, maximum throughput | jackel_iv_triton |
Benchmarking¶
Run the dedicated benchmark script:
This script:

- Generates the canonical 100k near-ATM grid
- Computes reference prices via py_lets_be_rational (the Jäckel C oracle)
- Runs a correctness check (max_rel_err < 1e-8)
- Times the Triton kernel with CUDA events (20 timed rounds after 5 warm-up rounds)
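The timing protocol (5 warm-up runs, then 20 timed rounds) generalises to any callable. A wall-clock sketch; the real script times GPU compute with CUDA events instead:

```python
import time

def bench(fn, warmup=5, rounds=20):
    # Warm-up runs absorb JIT/compile cost; then report the best of
    # `rounds` timed executions.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)
```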
Correctness¶
Parity tests live in tests/test_jackel/test_parity.py and run automatically
with the test suite (requires the dev dependency group):
The oracle is py_lets_be_rational (Peter Jäckel's reference C implementation
wrapped for Python). It has no dependency on fast-vollib.
References¶
- P. Jäckel, "Let's Be Rational", Wilmott Magazine, January 2015. jaeckel.org
- Source: jaeckel.org/LetsBeRational.7z