AlphaFold 3 on NVIDIA GB10 Blackwell GPU - Technical Report

Date: November 14-15, 2025 Hardware: NVIDIA DGX Spark with GB10 Blackwell GPU (Compute Capability sm_121a/sm_121) Goal: Get AlphaFold 3 running on Blackwell architecture

Executive Summary

Successfully configured AlphaFold 3 to run on NVIDIA's GB10 Blackwell GPU after resolving multiple compatibility issues across the Triton/JAX/CUDA stack. The main challenges were:

Triton compiler lacking sm_121 support
API incompatibilities between Triton versions and jax-triton
LLVM backend version mismatches
PTXAS assembler not recognizing sm_121a architecture suffix

Problem 1: Initial Triton Version Incompatibility

Issue

AlphaFold 3's pyproject.toml specified Triton 3.3.1, which uses LLVM 15. This LLVM version predates Blackwell GPU support.

Error Encountered

'sm_121a' is not a recognized processor for this target

Root Cause

Triton 3.3.x built with LLVM 15 (released before Blackwell)
LLVM 15 lacks Blackwell architecture definitions and NVVM intrinsics
Blackwell support added in LLVM ~18.1.8+ with full support in LLVM 20+

Investigation Path

Checked GitHub issue #394 which recommended CUDA 12.8+ and Triton 3.3.1
Realized issue was for GH200 (Hopper), not GB10 (Blackwell)
Found Triton PR #8498 (merged Oct 24, 2025) adding sm_120/sm_121 support to main branch
Discovered Triton main uses LLVM from Sept 2025 with Blackwell support

Solution

Changed Dockerfile to build Triton from main branch instead of release/3.3.x:

# Line 77 in docker/Dockerfile
git checkout main  # Changed from: git checkout release/3.3.x

Key Learning: Hardware architecture support requires matching LLVM version, not just CUDA version.

Problem 2: The 'a' Suffix Mystery (sm_121a vs sm_121)

Issue

Even after upgrading to Triton main, still got architecture recognition errors.

Error Encountered

LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.cp.async.commit.group

Root Cause

Triton was generating sm_121a but should generate sm_121. The 'a' suffix logic was:

suffix = "a" if capability >= 90 else ""

This applied the 'a' suffix to ALL compute capabilities ≥90, including sm_121 (Blackwell). However, the 'a' suffix is Hopper-specific (sm_90a) and should NOT be applied to Blackwell.

Investigation Path

Found GitHub issue triton-lang/triton#8543 discussing the suffix problem
Learned 'a' suffix indicates "architecture with TMA (Tensor Memory Accelerator)"
Discovered Hopper (sm_90) needs 'a', but Blackwell (sm_121) does not

Solution

Patched Triton compiler to restrict 'a' suffix to sm_90 only:

# In triton/backends/nvidia/compiler.py
suffix = "a" if capability == 90 else ""  # Changed from: capability >= 90

Applied via sed in Dockerfile:

sed -i 's/suffix = "a" if capability >= 90 else ""/suffix = "a" if capability == 90 else ""/' \
    third_party/nvidia/backend/compiler.py

Key Learning: Architecture suffixes are generation-specific, not monotonic across compute capabilities.

Problem 3: Triton API Changes (_builder vs _semantic)

Issue

AlphaFold 3's triton_utils.py used _builder parameter API, incompatible with Triton main.

Error Encountered

ValueError: Did you forget to add @triton.jit ? (_semantic argument must be provided outside of JIT functions.)

Root Cause

AlphaFold 3 commit ec4254a upgraded to _semantic API for Triton 3.5+
Commit f2edd59 reverted to _builder for Triton 3.3.x compatibility
Triton main (3.5+) requires _semantic parameter instead of _builder

Investigation Path

Found GitHub issue alphafold#486 discussing the revert
Traced commit history: ec4254a (upgrade) → f2edd59 (revert) → latest (still reverted)
Realized AlphaFold 3 stayed on _builder to maintain Triton 3.3.x compatibility

Solution

Modified AlphaFold 3's triton_utils.py to use _semantic API:

# src/alphafold3/jax/common/triton_utils.py
def _dot_fn(
    a: tl.core.tensor,
    b: tl.core.tensor,
    *,
    trans_a: bool = False,
    trans_b: bool = False,
    _semantic,  # Changed from _builder
):
    # ... all tl.static_assert and tl operations now use _semantic=_semantic

Key Learning: Major version upgrades often require API migrations; check for reverted commits in dependencies.

Problem 4: jax-triton Version Incompatibility

Issue

jax-triton 0.3.0 incompatible with Triton main's API changes.

Error Encountered

TypeError: CUDABackend.make_ttir() missing 1 required positional argument: 'capability'

Root Cause

jax-triton 0.3.0 calls CUDABackend.make_ttir() without capability argument
Triton main changed function signature to require capability parameter
jax-triton main branch has compatibility issues with native_specialize_impl

Investigation Path

Got error about missing capability argument
Found GitHub issue jax-ml/jax-triton#365 discussing version mismatches
Discovered commit 0c6c888 (one commit before main) works with Triton 3.5.x

Solution

Updated pyproject.toml and dev-requirements.txt to use specific working commit:

# pyproject.toml
"jax-triton @ git+https://github.com/jax-ml/jax-triton.git@0c6c888"

# dev-requirements.txt
git+https://github.com/jax-ml/jax-triton.git@0c6c888

Key Learning: When upgrading major dependencies, intermediate library versions may be incompatible; pinning to specific commits can bridge compatibility gaps.

Problem 5: PTXAS Version Mismatch

Issue

Triton's bundled PTXAS assembler doesn't recognize sm_121a architecture.

Error Encountered

ptxas fatal: PTX with .target 'sm_121a' cannot be compiled for architecture 'sm_121'

Root Cause

Triton bundles an older PTXAS binary at triton/backends/nvidia/bin/ptxas
This bundled version predates Blackwell support
System CUDA 12.9+ includes updated PTXAS that supports sm_121a
nvidia-cuda-nvcc-cu12 package also provides compatible PTXAS

Investigation Path

Initially thought environment variable TRITON_PTXAS_PATH would solve it
Realized the problem was dual: (a) Triton generating sm_121a AND (b) PTXAS not accepting it
Found GitHub issue triton-lang/triton#8539 confirming PTXAS version issue
PR #8543 merged to fix this permanently

Solution

Set environment variable to use newer PTXAS:

export TRITON_PTXAS_PATH=/alphafold3_venv/lib/python3.12/site-packages/nvidia/cuda_nvcc/bin/ptxas

Added to Dockerfile:

ENV TRITON_PTXAS_PATH="/alphafold3_venv/lib/python3.12/site-packages/nvidia/cuda_nvcc/bin/ptxas"

Key Learning: Compiler toolchains include multiple binaries (compiler + assembler); version mismatches can occur at any stage.

Final Working Configuration

Dockerfile Changes

Triton: Build from main branch with sm_121 support
Suffix Patch: Change capability >= 90 to capability == 90
PTXAS Path: Set environment variable to use nvidia-cuda-nvcc PTXAS

# Line 77: Use Triton main instead of 3.3.x
git checkout main

# Line 79: Apply suffix patch
sed -i 's/suffix = "a" if capability >= 90 else ""/suffix = "a" if capability == 90 else ""/' \
    third_party/nvidia/backend/compiler.py

# Line 108: Set PTXAS path
ENV TRITON_PTXAS_PATH="/alphafold3_venv/lib/python3.12/site-packages/nvidia/cuda_nvcc/bin/ptxas"

pyproject.toml Changes

Remove triton: Let Triton be built from source (line 28 removed)
Update jax-triton: Use compatible commit

dependencies = [
    "absl-py",
    "chex",
    "dm-haiku==0.0.14",
    "dm-tree",
    "jax==0.6.0",
    "jax[cuda12]==0.6.0",
    "jax-triton @ git+https://github.com/jax-ml/jax-triton.git@0c6c888",  # Updated
    # triton==3.3.1 removed - built from source
    "jaxtyping==0.3.2",
    "numpy",
    "rdkit==2024.3.5",
    "tqdm",
    "typeguard==2.13.3",
    "zstandard",
]

AlphaFold 3 Source Changes

Modified src/alphafold3/jax/common/triton_utils.py:

Changed all _builder parameters to _semantic
Updated all function calls to use _semantic=_semantic

Lines 68-82: Updated dot function and all tl.* operations

Technical Stack Summary

Working Versions

CUDA: 12.9 (minimum 12.8 for Blackwell)
Triton: main branch (commit from late Oct 2025+, includes PR #8498)
jax-triton: commit 0c6c888 (compatible with Triton 3.5.x)
JAX: 0.6.0
Python: 3.12
LLVM: ~19/20 (bundled with Triton main, from Sept 2025)

Architecture Details

GPU: NVIDIA GB10 Blackwell
Compute Capability: 12.1a (reported as sm_121 after patch)
Architecture Features: No TMA suffix (unlike Hopper's sm_90a)

Key Rabbit Holes & Dead Ends

Rabbit Hole 1: Following GH200 Instructions

What we tried: Following GitHub issue #394 instructions for Triton 3.3.1 Why it failed: Issue was for GH200 (Hopper/sm_90), not GB10 (Blackwell/sm_121) Lesson: Always verify hardware architecture matches the compatibility guide

Rabbit Hole 2: Assuming Suffix Logic Was Correct

What we tried: Initially focused on LLVM intrinsic errors Why it failed: Didn't realize 'a' suffix was being incorrectly applied Lesson: Architecture naming conventions have specific meanings; investigate suffix logic

Rabbit Hole 3: Trying to Use jax-triton main

What we tried: Upgraded to jax-triton from main branch Why it failed: Main branch uses native_specialize_impl that doesn't exist in Triton 3.5.x Lesson: Library main branches may target unreleased dependency versions

Rabbit Hole 4: Only Setting TRITON_PTXAS_PATH

What we tried: Set PTXAS environment variable without patching suffix logic Why it failed: Triton was still generating sm_121a which even newer PTXAS couldn't handle correctly Lesson: Environment variables alone can't fix code generation issues

Testing Results

Test Command

docker run --name af3_test_gb10 --rm \
  --volume /home/ruh/research/PhD/related_work/alphafold3/af_input:/root/af_input \
  --volume /home/ruh/research/PhD/related_work/alphafold3/af_output:/root/af_output \
  --volume /home/ruh/data/af3_model:/root/models \
  --volume /home/ruh/data/af_public_databases:/root/public_databases \
  --volume /home/ruh/data/af3_jax_cache:/root/jax_cache \
  --gpus all --memory=64g \
  -e XLA_PYTHON_CLIENT_PREALLOCATE=false \
  -e TF_FORCE_UNIFIED_MEMORY=true \
  -e XLA_CLIENT_MEM_FRACTION=3.2 \
  alphafold3 python run_alphafold_test.py

Expected Behavior

✅ Triton kernels compile for sm_121
✅ PTXAS accepts PTX with sm_121 target
✅ JAX/XLA executes on GPU
✅ AlphaFold 3 inference completes
⚠️ Test assertions may fail (numerical precision differences), but inference runs

References

GitHub Issues

triton-lang/triton#8498 - TMA gather4 support for sm_120/sm_121
triton-lang/triton#8539 - sm_121a PTXAS error
triton-lang/triton#8543 - Fix sm_121a ptxas error for Blackwell
jax-ml/jax-triton#365 - Version compatibility issues
google-deepmind/alphafold3#394 - GPU Blackwell compatibility
google-deepmind/alphafold3#486 - Triton API revert discussion
llvm/llvm-project#139543 - LLVM Blackwell support

Key Commits

AlphaFold 3 ec4254a: Upgraded to Triton _semantic API
AlphaFold 3 f2edd59: Reverted to _builder for 3.3.x compatibility
jax-triton 0c6c888: Compatible with Triton 3.5.x (one commit before main)

Recommendations for Future Work

For AlphaFold 3 Team

Update official compatibility guide to distinguish Hopper (GH200) vs Blackwell (GB10)
Consider maintaining separate branches for different Triton versions
Add CI testing for Blackwell architecture

For Users with Blackwell GPUs

Always build Triton from source (main branch) until official 3.6+ release
Apply suffix patch as standard practice
Use CUDA 12.9+ or 13.0+ for best compatibility
Pin jax-triton to commit 0c6c888 until official release supports Triton 3.5+

For Docker Deployments

Set TRITON_PTXAS_PATH environment variable in container
Build Triton during image creation, not at runtime
Cache JAX compilation artifacts to avoid recompilation

Conclusion

Getting AlphaFold 3 running on GB10 Blackwell required navigating a complex dependency chain:

CUDA 12.9 → Triton main (with suffix patch) → jax-triton 0c6c888 → AlphaFold 3 (with _semantic API)

The main challenges were:

Architecture support: Required bleeding-edge Triton/LLVM for sm_121
Suffix logic: Architectural detail that broke compilation
API evolution: Triton 3.3 → 3.5 API changes
Toolchain versions: PTXAS assembler version mismatch

Total time investment: ~2-3 hours of debugging across multiple GitHub issues and commit histories.

The experience highlights the challenges of running scientific software on cutting-edge hardware where the software ecosystem is still catching up to hardware releases.

Problem 6: CUDA Runtime Kernel Launch Error (ONGOING)

Issue

After fixing compilation issues, getting runtime CUDA error during kernel execution.

Error Encountered

CUDA_ERROR_INVALID_VALUE
operation gpuLaunchKernel(...) failed
Fatal Python error: Segmentation fault

Root Cause (Suspected)

Triton kernels may be using launch parameters (block dimensions, grid dimensions, or shared memory) that are incompatible with GB10 Blackwell architecture limits or have bugs specific to sm_121.

Investigation Path

Fixed compilation (sm_121a → sm_121) ✓
Kernel compiles and loads successfully ✓
Kernel launch fails with INVALID_VALUE
Likely issues:
- Block dimensions exceeding Blackwell limits
- Invalid grid dimensions for sm_121
- Shared memory allocation issues
- Triton autotune configs not tested on sm_121

Potential Solutions

Check Triton version: Ensure using latest main with all Blackwell fixes
Disable Triton kernels: Test if falling back to XLA kernels works
Debug kernel parameters: Add logging to see actual launch parameters
Report to Triton: This may be a bug in Triton's Blackwell kernel generation
Try different Triton commit: Use a more recent commit from main branch

Status

⚠️ BLOCKED ON UPSTREAM - Compilation works, runtime blocked by incomplete JAX Blackwell support

Root Cause Identified

The kernel launch failure is due to incomplete Blackwell support in JAX/jaxlib, not Triton:

JAX Issue #31399: "llvm21 + blackwell family support" - Still OPEN as of Nov 14, 2025
JAX 0.6.0 (required by AlphaFold 3) released April 2024, before Blackwell support
JAX 0.8.0 (latest, Oct 15, 2025) includes DGX Spark PR #32653 but Blackwell support incomplete
LLVM 21 support needed for full Blackwell compatibility, but JAX doesn't support LLVM 21 yet

Attempted Solutions

✅ Triton suffix patch applied - sm_121a → sm_121 fixed
✅ PTXAS path configured - Using CUDA 12.9 PTXAS
✅ jax-triton compatibility tested - Tried 0c6c888, 6b9682a, and main
❌ All jax-triton versions fail:
- 0c6c888: CUDA_ERROR_INVALID_VALUE kernel launch failure
- 6b9682a & main: native_specialize_impl missing (requires newer Triton API)

Status: ⚠️ BLOCKED - Waiting for upstream JAX Blackwell support Date: November 15, 2025 Blocking Issues:

JAX #31399 - Blackwell family support (OPEN)
JAX needs LLVM 21 support for full Blackwell compatibility
jaxlib 0.6.0 doesn't have complete sm_121 runtime support

What Works:

✅ Triton compiles kernels for sm_121
✅ PTXAS assembles PTX code successfully
✅ Kernels load into GPU

What Doesn't Work:

❌ Kernel launch fails with CUDA_ERROR_INVALID_VALUE
❌ JAX/jaxlib doesn't fully support sm_121 runtime operations

Next Steps (When Unblocked):

Wait for JAX team to complete Blackwell support in issue #31399
Upgrade to JAX 0.9.0+ when released with full Blackwell support
Rebuild Docker image with updated JAX/jaxlib
Test AlphaFold 3 with full Blackwell stack

Estimated Timeline:

Monitor JAX releases quarterly (check Jan 2026, Apr 2026)
Blackwell support likely complete in JAX 0.9.0 or 1.0.0
Until then, recommend using Hopper (H100/GH200) hardware for AlphaFold 3