Summary cards: Orig+Pat fit time, Pat Only fit time, AUC comparison, Mem Δ comparison.
Figure: Fit time vs n (log–log, 4 threads). Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
Figure: Test AUC vs n. Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
n-Scaling
Figure: Fit time vs n (log–log). Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
Figure: Test AUC vs n. Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
Figure: Patterns mined vs n.
| n | Orig+Pat fit | Pat Only fit | XGBoost fit | LightGBM fit | Orig+Pat AUC | Pat Only AUC | XGBoost AUC | LightGBM AUC |
|---|---|---|---|---|---|---|---|---|
p-Scaling
Figure: Fit time vs p. Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
Figure: Fit time ratio vs XGBoost (<1.0 = faster than XGBoost). Series: Orig+Pat / XGBoost, Pat Only / XGBoost, LightGBM / XGBoost.
Figure: Test AUC vs p. Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
| p | n | Orig+Pat fit | Pat Only fit | XGBoost fit | LightGBM fit | Orig+Pat AUC | Pat Only AUC | XGBoost AUC | LightGBM AUC |
|---|---|---|---|---|---|---|---|---|---|
Memory
Fit Δ memory: peak RSS during fit minus RSS after data load. Reflects additional working memory allocated during training only. All runs at 4 threads.
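The definition above can be sketched as a small sampler that records RSS once before the fit and then polls it while the fit runs. This is a minimal illustration, not the harness used for these benchmarks: the class name `FitMemorySampler`, the polling interval, the Linux-only `/proc/self/statm` reader, and the choice of MB = 2^20 bytes are all assumptions.

```python
import os
import threading


def read_rss_linux() -> int:
    """Current resident set size in bytes (Linux only: reads /proc/self/statm)."""
    with open("/proc/self/statm") as fh:
        resident_pages = int(fh.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


class FitMemorySampler:
    """Context manager that tracks peak RSS while its body runs.

    `read_rss` is any zero-argument callable returning bytes. It is
    injectable so the sampler can be exercised without real allocations.
    """

    def __init__(self, read_rss=read_rss_linux, interval_s=0.01):
        self._read_rss = read_rss
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = None
        self.baseline = 0
        self.peak = 0

    def _poll(self):
        # Event.wait returns False on timeout and True once stop is set.
        while not self._stop.wait(self._interval_s):
            self.peak = max(self.peak, self._read_rss())

    def __enter__(self):
        self.baseline = self._read_rss()  # RSS after data load
        self.peak = self.baseline
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        # Take one final sample so the peak is never older than the fit's end.
        self.peak = max(self.peak, self._read_rss())
        return False

    @property
    def delta_mb(self) -> float:
        """Peak RSS during the body minus the baseline, in MB (2**20 bytes)."""
        return (self.peak - self.baseline) / 2**20
```

Usage would look like `with FitMemorySampler() as mem: model.fit(X_train, y_train)`, followed by reading `mem.delta_mb`. Polling can miss allocation spikes shorter than the sampling interval, so a figure obtained this way is a lower bound on the true peak.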
Figure: Fit Δ memory (MB) vs n. Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
Figure: Fit Δ memory (MB) vs p. Series: Orig+Pat, Pat Only, XGBoost, LightGBM.
| n | Orig+Pat (MB) | Pat Only (MB) | XGBoost (MB) | LightGBM (MB) |
|---|---|---|---|---|
| p | n | Orig+Pat (MB) | Pat Only (MB) | XGBoost (MB) | LightGBM (MB) | Orig+Pat / XGBoost | Pat Only / XGBoost |
|---|---|---|---|---|---|---|---|
Parameter Sweep
Sensitivity of fit time, AUC, and pattern count to individual HUGIML hyperparameters. Orig+Pat mode unless noted. Baseline signal dataset, 4 threads.
Methodology
Benchmark setup, dataset descriptions, and model configurations.
Environment
| Component | Setting |
|---|---|
| HUGIML | v1.1.9 (C++ pybind11) |
| Python | 3.13.5 |
| XGBoost | 50 trees · max_depth=4 |
| LightGBM | 50 trees · max_depth=4 |
| Threads | 4 (n_jobs=4) |
| Train/test split | 75% / 25%, stratified, seed=42 |
| Metric | Test-set ROC AUC |
| Memory | Peak RSS − RSS after data load |
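The split and metric rows can be illustrated with a dependency-free sketch: a stratified 75/25 split with seed 42, and ROC AUC computed as the Mann-Whitney rank statistic. The source does not say which library performed either step; in practice scikit-learn's `train_test_split(..., stratify=y, random_state=42)` and `roc_auc_score` would be the usual choice, and the indices produced below will not match theirs. The function names are illustrative.

```python
import random


def stratified_split(y, test_frac=0.25, seed=42):
    """Return (train_idx, test_idx) with each class split at `test_frac`."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def roc_auc(y_true, scores):
    """ROC AUC via average ranks; tied scores count as half a win."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Stratifying per class keeps the positive rate the same in train and test, which matters for AUC comparisons across models evaluated on the same held-out set.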
Datasets
baseline_signal: float32 features, nonlinear signal in 8 of p features (rest noise). p=20 fixed for n-scaling; p varied 20–3000 for p-scaling at moderate n. n up to 3,000,000.
threshold_grid: Uniform(−1,1) features, up to 96 threshold terms and interaction terms, labels median-binarised. p=200 fixed for n-scaling; p varied 20–3000 for p-scaling. n up to 500,000.
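As a rough illustration of how a dataset like threshold_grid could be generated, the sketch below draws Uniform(−1, 1) features, sums weighted threshold and pairwise interaction indicators into a score, and binarises the score at its median. The source gives only the feature distribution, an upper bound of 96 terms, and the median binarisation; the 48/48 split between term types, the weight and threshold ranges, and the function name are assumptions.

```python
import random
import statistics


def make_threshold_grid(n, p, n_threshold_terms=48, n_interaction_terms=48, seed=0):
    """Toy generator; the term structure is assumed, not taken from the benchmark."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n)]

    # Threshold term: weight * 1[x_j > t]
    singles = [
        (rng.randrange(p), rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        for _ in range(n_threshold_terms)
    ]
    # Interaction term: weight * 1[x_j > t_j] * 1[x_k > t_k]
    pairs = [
        (rng.randrange(p), rng.uniform(-1.0, 1.0),
         rng.randrange(p), rng.uniform(-1.0, 1.0),
         rng.uniform(-1.0, 1.0))
        for _ in range(n_interaction_terms)
    ]

    def score(row):
        s = sum(w for j, t, w in singles if row[j] > t)
        s += sum(w for j, tj, k, tk, w in pairs if row[j] > tj and row[k] > tk)
        return s

    scores = [score(row) for row in X]
    cut = statistics.median(scores)
    # Median binarisation: at most half of the labels are positive.
    y = [1 if s > cut else 0 for s in scores]
    return X, y
```

Binarising at the median keeps the classes close to balanced regardless of how the term weights fall, which makes AUC values comparable across different n and p.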
HUGIML feature modes
■ original_plus_patterns (strict):
feature_mode='original_plus_patterns', topk_budget_strict=True. Downstream matrix = n × (original features + mined patterns), total capped at topK=50 columns. Higher accuracy, higher memory and fit cost due to wider downstream matrix.
■ patterns_only:
feature_mode='patterns_only', topk_budget_strict=False. Downstream matrix = n × mined patterns only. Lower accuracy, but 3–5× faster fit and substantially lower memory at large p, since the matrix width equals the (small) number of patterns rather than p + patterns.
Shared config:
adaptive_binning=True, b_candidates=[3,5,7,10], L=1, G=0.01, topK=50, n_jobs=4, use_hotpath=True, augmented_pair_transforms=False
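The two modes differ only in `feature_mode` and `topk_budget_strict`, so the configurations can be written as one shared dictionary plus per-mode overrides. The parameter names and values below are copied from the text above; the helper `config_for` is illustrative, and how these keyword arguments are passed to HUGIML (class or function name) is not shown in the source, so that step is left out.

```python
# Settings common to both feature modes, as listed under "Shared config".
SHARED_CONFIG = {
    "adaptive_binning": True,
    "b_candidates": [3, 5, 7, 10],
    "L": 1,
    "G": 0.01,
    "topK": 50,
    "n_jobs": 4,
    "use_hotpath": True,
    "augmented_pair_transforms": False,
}

# Per-mode overrides.
MODE_OVERRIDES = {
    "original_plus_patterns": {
        "feature_mode": "original_plus_patterns",
        "topk_budget_strict": True,
    },
    "patterns_only": {
        "feature_mode": "patterns_only",
        "topk_budget_strict": False,
    },
}


def config_for(mode: str) -> dict:
    """Merge the shared settings with the overrides for one feature mode."""
    return {**SHARED_CONFIG, **MODE_OVERRIDES[mode]}
```

Keeping the shared settings in one place makes it harder for the two modes to drift apart between benchmark runs.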