NeoLDM · 01 — Raw-feature baseline
13 hand-crafted features → XGBoost. Evaluated on the full test set, averaged over seeds.
Setup
Training is balanced (all fraud plus downsampled non-fraud). Evaluation uses the full test set (2.41M transactions, ~2,700 fraud, 0.11%), with metrics averaged over seeds. AUPRC is the primary metric, reported as × random: the multiple over the 0.11% no-skill rate, which is the AUPRC a random ranker would be expected to score. Per-transaction F1, with the threshold tuned on validation, is also shown. AUROC is omitted because it saturates near the ceiling at this prevalence.
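The setup above can be sketched in plain Python. This is a minimal illustration, not the notebook's pipeline: the 1:1 class ratio, the function names, and the toy validation scores are all assumptions, and AUPRC is approximated as average precision without special handling of tied scores.

```python
import random

def balanced_downsample(labels, seed):
    """Indices of all positives plus an equal-sized random sample of negatives.
    The 1:1 ratio is an assumption; the source only says 'balanced'."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(pos), len(neg)))
    return sorted(pos + keep_neg)

def average_precision(labels, scores):
    """Mean of precision@rank taken at each positive, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def f1_at(labels, scores, thr):
    """Per-transaction F1 when flagging every score >= thr."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < thr)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(labels, scores):
    """Pick the threshold with the highest F1 among the observed scores."""
    return max(sorted(set(scores)), key=lambda thr: f1_at(labels, scores, thr))

# Toy validation split (hypothetical labels and scores).
val_y = [0, 0, 1, 0, 1, 1, 0, 0]
val_s = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.05, 0.3]

thr = tune_threshold(val_y, val_s)          # chosen on validation only
ap = average_precision(val_y, val_s)
prevalence = sum(val_y) / len(val_y)        # no-skill AUPRC
print(f"threshold {thr}, AP {ap:.3f}, {ap / prevalence:.1f}x random")
```

The threshold is selected on validation and then held fixed for the test set, so the reported F1 is not tuned on the data it is measured on. Dividing AUPRC by prevalence makes results comparable across test sets with different fraud rates.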
In [1]:
import json

# Load the full-test results; the raw_13d arm is shared across runs.
with open('results/fulltest_finetune.json') as f:
    m = json.load(f)
r = m['arms']['raw_13d']  # raw 13-feature XGBoost metrics
t = m['test']             # test-set summary: rows, fraud count, fraud rate
print(f"raw 13-feature XGBoost — full test ({t['rows']:,} rows, {t['fraud']:,} fraud, {t['rate']:.3%})")
print(f" AUPRC {r['auprc_mean']:.3f} ± {r['auprc_std']:.3f} ({r['auprc_mean']/t['rate']:.0f}x random, mean over {m['seeds']} seeds) F1 {r['f1_mean']:.3f} ± {r['f1_std']:.3f}")
raw 13-feature XGBoost — full test (2,412,326 rows, 2,698 fraud, 0.112%)
 AUPRC 0.176 ± 0.034 (157x random, mean over 4 seeds) F1 0.301 ± 0.041
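As a quick sanity check, the lift can be recomputed from the printed numbers alone. This sketch uses the rounded AUPRC from the output and recomputes the fraud rate from the row counts, so it only approximates what the notebook computes from the unrounded values in the JSON file.

```python
# Values copied from the printed output above.
rows, fraud = 2_412_326, 2_698
auprc_mean = 0.176  # rounded to three decimals in the output

rate = fraud / rows        # no-skill AUPRC equals the fraud prevalence
lift = auprc_mean / rate   # multiple over a random ranker
print(f"rate {rate:.3%}, lift {lift:.0f}x random")
```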
