Fraud detection · IBM TabFormer benchmark

One cortex embedding beats raw features 5×

cortex turns a raw stream of card transactions into one 128-dimensional embedding per transaction. Feed that vector to a standard XGBoost alongside the 13 raw columns and fraud-detection AUPRC reaches 0.955 on the full 2.41-million-transaction test split (0.896 from the embedding alone). Same dataset, same classifier, same protocol; only the features change.

854× over a no-skill classifier
5.4× lift over raw features
2.41M transactions · full test

It's the representation doing the work

cortex turns each cardholder's transaction sequence into a per-transaction hidden state — encoding the behavioral context (recency, merchant patterns, spend rhythm) that fraud teams normally chase with bespoke feature pipelines. The classifier and protocol are identical across lanes; only the features differ.

Two lanes — raw columns vs cortex embedding — into the same XGBoost
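The lanes can be sketched roughly as below. This is an illustration under stated assumptions, not the notebooks' code: `build_lanes` and `fit_lane` are hypothetical helper names, the inputs are assumed to be NumPy matrices with one row per transaction and the raw columns already numerically encoded, and the XGBoost hyperparameters are placeholders.

```python
import numpy as np


def build_lanes(raw, emb):
    """Return one feature matrix per lane; rows are transactions."""
    if raw.shape[0] != emb.shape[0]:
        raise ValueError("raw and emb must describe the same transactions")
    return {
        "raw": raw,                      # 13 raw columns
        "embedding": emb,                # 128-d cortex hidden state
        "fused": np.hstack([raw, emb]),  # raw + hidden, side by side
    }


def fit_lane(X_train, y_train, X_val, y_val, seed=0):
    """Fit the same classifier on whichever lane is passed in."""
    import xgboost as xgb  # imported lazily so build_lanes works without it

    model = xgb.XGBClassifier(
        n_estimators=500,          # placeholder, not the published setting
        eval_metric="aucpr",       # early stopping tracks validation AUPRC
        early_stopping_rounds=50,  # placeholder, not the published setting
        random_state=seed,
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return model
```

Because `fit_lane` never inspects which lane it receives, any difference in the resulting scores comes from the features alone.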

Results

Per-transaction, on the entire held-out split (2,412,326 transactions, 2,698 fraud, 0.112%) — not a convenience sample. Training is balanced (all fraud + downsampled non-fraud), early-stopped on the full validation split. × random is the multiple over a no-skill classifier (whose AUPRC equals the fraud rate). Mean ± std over 4 seeds.
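As a rough sketch of the balanced training set described above: keep every fraud row and draw a random subset of non-fraud rows. The helper name `balanced_indices` and the 1:1 ratio (`neg_per_pos=1`) are assumptions for illustration; the exact ratio used in the notebooks is not stated here.

```python
import random


def balanced_indices(labels, neg_per_pos=1, seed=0):
    """Indices of all positives plus a random sample of negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than exist.
    n_neg = min(len(neg), neg_per_pos * len(pos))
    idx = pos + rng.sample(neg, n_neg)
    rng.shuffle(idx)  # avoid all fraud rows sitting at the front
    return idx


# Toy labels: 5 fraud rows among 1,000 transactions.
labels = [1] * 5 + [0] * 995
train_idx = balanced_indices(labels)
print(len(train_idx), sum(labels[i] for i in train_idx))
```

Only the training split is resampled this way; validation and test keep their natural fraud rate, which is what makes the reported AUPRC meaningful.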

AUPRC and F1: raw features vs cortex finetune embeddings
Model | Features | AUPRC | × random | F1
Raw features | 13 raw columns | 0.176 ± 0.034 | 157× | 0.301 ± 0.041
cortex · pretrain | self-supervised · 128-d | 0.052 ± 0.010 | 46× | 0.070 ± 0.026
cortex · pretrain + raw | raw + hidden | 0.187 ± 0.036 | 167× | 0.320 ± 0.033
cortex · finetune | supervised · 128-d | 0.896 ± 0.034 | 801× | 0.870 ± 0.010
cortex · finetune + raw | raw + hidden | 0.955 ± 0.007 | 854× | 0.906 ± 0.012
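The × random column and the headline lifts follow directly from the test-split counts and the mean AUPRC values in the table above. A short check of that arithmetic (the variable names are illustrative only):

```python
N_TEST = 2_412_326   # transactions in the held-out split
N_FRAUD = 2_698      # fraud transactions in that split

# A no-skill classifier's AUPRC equals the positive rate.
prevalence = N_FRAUD / N_TEST

mean_auprc = {
    "raw": 0.176,
    "pretrain": 0.052,
    "pretrain + raw": 0.187,
    "finetune": 0.896,
    "finetune + raw": 0.955,
}

times_random = {k: round(v / prevalence) for k, v in mean_auprc.items()}
lift_over_raw = {k: round(v / mean_auprc["raw"], 1) for k, v in mean_auprc.items()}

print(f"prevalence = {prevalence:.3%}")  # 0.112%
for name in mean_auprc:
    print(f"{name:>15}: {times_random[name]}x random, {lift_over_raw[name]}x raw")
```

This reproduces 157×, 46×, 167×, 801× and 854× over random, and the 5.1× (finetune) and 5.4× (finetune + raw) lifts over the raw baseline.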

AUROC is omitted: at this prevalence it saturates near the ceiling for almost any model, so it can't separate good from great. AUPRC is the honest metric.
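A toy ranking makes the saturation concrete. The example below is synthetic and unrelated to the benchmark data: 10,000 transactions with 10 fraud cases, scored by two hypothetical models. The weaker one places a fraud case at every 10th rank from the top, the stronger one at every 2nd. The helper names are illustrative, and the metrics are implemented directly for a tie-free ranking rather than taken from a library.

```python
def ranked_labels(n_total, n_fraud, every):
    """Labels sorted by descending score, with fraud at every `every`-th rank."""
    labels = [0] * n_total
    for i in range(1, n_fraud + 1):
        labels[every * i - 1] = 1
    return labels


def auroc(labels):
    """Fraction of (fraud, non-fraud) pairs ranked in the right order."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    neg_seen = misordered = 0
    for y in labels:
        if y:
            misordered += neg_seen  # negatives ranked above this fraud case
        else:
            neg_seen += 1
    return 1 - misordered / (n_pos * n_neg)


def average_precision(labels):
    """Mean of precision@rank taken at each fraud case (AUPRC estimate)."""
    hits, total = 0, 0.0
    for rank, y in enumerate(labels, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / hits


weak = ranked_labels(10_000, 10, every=10)
strong = ranked_labels(10_000, 10, every=2)
print(f"weak:   AUROC={auroc(weak):.4f}  AP={average_precision(weak):.3f}")
print(f"strong: AUROC={auroc(strong):.4f}  AP={average_precision(strong):.3f}")
# weak:   AUROC=0.9950  AP=0.100
# strong: AUROC=0.9994  AP=0.500
```

The two models differ by under half a point of AUROC but by a factor of five in average precision, which is the gap a fraud team would actually feel in alert quality.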

Self-supervised (pretrain)

With no fraud labels, the embedding alone (0.052) sits below the raw baseline: a decoder-style representation trades fine-grained tabular signal for sequential context. Fused with raw features it edges ahead (0.187 vs 0.176), although that gap is within one standard deviation across seeds, so it is a hint rather than proof that the embedding adds behavioral signal the raw columns miss.

Supervised (finetune)

Let cortex see is_fraud during finetuning and the embedding becomes a standalone fraud detector — 801× random on its own, 854× with raw. The representation, not the classifier, is doing the work.

How this compares to NVIDIA's Transaction Foundation Model

NeoLDM grew up alongside NVIDIA's Transaction Foundation Model blueprint — same dataset, same idea (pretrain on raw transactions, use the embedding for downstream fraud). It's the natural reference point, so here's the honest picture.

Aspect | NVIDIA TFM | NeoLDM · cortex
Backbone | Llama decoder (~29M), causal LM | cortex transaction FM
Embedding | 512-d last-token → 64-d PCA | 128-d hidden state
Downstream | XGBoost | XGBoost
Test set | 100K stratified subset | full 2.41M transactions
Published absolute metrics | none (notebook outputs not shipped) | full table above
Embedding alone vs raw | underperforms raw features (their notebook) | 0.896, 5.1× the raw baseline (finetune)

This is not a same-protocol scoreboard: NVIDIA evaluates on a 100K stratified subset and publishes no absolute numbers, whereas every cortex figure here is on the full true test split. The blueprint's own conclusion is that its decoder embeddings lose to raw features alone. cortex reproduces that result in the self-supervised setting; once finetuned, the embedding stands on its own and beats the raw baseline by 5.1×.

Run it yourself

Both notebooks render straight from the committed results — no GPU, dataset download, or checkpoint required just to see the numbers. View the rendered output, or download the .ipynb to run locally.

Notebook 01

Raw-feature baseline

13 raw columns fed to XGBoost — the traditional approach (AUPRC 0.176, 157× random).

Notebook 03

cortex embeddings

cortex hidden states fed to the same XGBoost, reaching AUPRC 0.955 (854× random) when fused with the raw columns, plus the lift over raw.

Dataset & embeddings

IBM TabFormer

The public credit-card-fraud dataset and conceptual ancestor of this line of work (arXiv:2011.01843, IBM/TabFormer). The raw CSV is fetched from IBM under IBM's terms — not redistributed here.

cortex embeddings

Per-transaction 128-d hidden states, published on Hugging Face: luizcoroo/cortex-ibm-tabformer-embeddings.