---
license: mit
tags:
  - tabular-regression
  - sklearn
  - xgboost
  - random-forest
  - motorsport
  - lap-time-prediction
datasets:
  - Haxxsh/gdgc-datathon-data
language:
  - en
pipeline_tag: tabular-regression
---

# GDGC Datathon 2025 - Lap Time Prediction Models

Trained models from the GDGC Datathon 2025 competition for predicting Formula racing lap times.

## Model Description

This repository contains ensemble models trained to predict `Lap_Time_Seconds` for Formula racing events. The models use a combination of Random Forest and XGBoost regressors with cross-validation.

### Models Included

| File | Description | Size |
|------|-------------|------|
| `rf_final.pkl` | Final Random Forest model | 158 MB |
| `xgb_final.pkl` | Final XGBoost model | 2.6 MB |
| `rf_cv_models.pkl` | Random Forest CV fold models | 13.4 GB |
| `xgb_cv_models.pkl` | XGBoost CV fold models | 103 MB |
| `rf_model.pkl` | Base Random Forest model | 95 MB |
| `xgb_model.pkl` | Base XGBoost model | 2 MB |
| `feature_engineer.pkl` | Feature preprocessing pipeline | 6 KB |
| `best_params.json` | Optimal hyperparameters | 1 KB |
| `cv_results.json` | Cross-validation results | 1 KB |

## Training Data

The models were trained on the [GDGC Datathon 2025 dataset](https://huggingface.co/datasets/Haxxsh/gdgc-datathon-data):

- **Training samples:** 734,002
- **Target variable:** `Lap_Time_Seconds` (continuous)
- **Target range:** 70.001s - 109.999s
- **Target distribution:** Nearly symmetric (mean ≈ 90s, std ≈ 11.5s)
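
To reproduce these summary statistics, the dataset can be pulled from the Hub; this is a minimal sketch assuming the repo's CSVs resolve to a standard `train` split via the `datasets` library:

```python
from datasets import load_dataset

# Assumption: the dataset repo auto-resolves to a "train" split of CSV files
ds = load_dataset("Haxxsh/gdgc-datathon-data", split="train")
df = ds.to_pandas()

print(len(df))                             # expected: 734,002 rows
print(df["Lap_Time_Seconds"].describe())   # mean ~90 s, std ~11.5 s
```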

### Features

The dataset includes features such as:
- Circuit characteristics (length, corners, laps)
- Weather conditions (temperature, humidity, track condition)
- Rider/driver information (championship points, position, history)
- Tire compounds and degradation factors
- Pit stop durations

## Usage

### Loading the Models

```python
import pickle

# Load the final models
with open("rf_final.pkl", "rb") as f:
    rf_model = pickle.load(f)

with open("xgb_final.pkl", "rb") as f:
    xgb_model = pickle.load(f)

# Load feature engineering pipeline
with open("feature_engineer.pkl", "rb") as f:
    feature_engineer = pickle.load(f)
```
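
The CV fold files load the same way. As a sketch (assuming each `*_cv_models.pkl` stores one fitted estimator per fold, which this card does not spell out):

```python
import pickle

# Assumption: each *_cv_models.pkl holds a list of per-fold fitted regressors
with open("xgb_cv_models.pkl", "rb") as f:
    xgb_fold_models = pickle.load(f)

# A common pattern is to average per-fold predictions at inference time:
# fold_preds = sum(m.predict(X_test) for m in xgb_fold_models) / len(xgb_fold_models)
```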

### Making Predictions

```python
import pandas as pd

# Load test data
test_df = pd.read_csv("test.csv")

# Apply feature engineering
X_test = feature_engineer.transform(test_df)

# Predict with ensemble (average of RF and XGB)
rf_preds = rf_model.predict(X_test)
xgb_preds = xgb_model.predict(X_test)
ensemble_preds = (rf_preds + xgb_preds) / 2
```
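
Since the training target lies in roughly 70-110 s, clipping the ensemble output to that range is a cheap optional safeguard against out-of-range predictions:

```python
import numpy as np

# Clip to the observed target range (70.001 s - 109.999 s in the training data)
ensemble_preds = np.clip(ensemble_preds, 70.0, 110.0)
```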

### Download from Hugging Face

```python
import pickle

from huggingface_hub import hf_hub_download

# Download a specific model file
model_path = hf_hub_download(
    repo_id="Haxxsh/gdgc-datathon-models",
    filename="xgb_final.pkl"
)

# Load it
with open(model_path, "rb") as f:
    model = pickle.load(f)
```
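
To fetch several files in one call (note that `rf_cv_models.pkl` alone is about 13.4 GB), `snapshot_download` with an allow-list can skip the heavy artifacts:

```python
from huggingface_hub import snapshot_download

# Download only the small artifacts; drop allow_patterns to fetch everything
local_dir = snapshot_download(
    repo_id="Haxxsh/gdgc-datathon-models",
    allow_patterns=["*.json", "xgb_final.pkl", "feature_engineer.pkl"],
)
```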

## Hyperparameters

Best parameters found via cross-validation (see `best_params.json`):

```json
{
  "random_forest": {
    "n_estimators": 100,
    "max_depth": null,
    "min_samples_split": 2,
    "min_samples_leaf": 1
  },
  "xgboost": {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 6
  }
}
```
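
These keys map one-to-one onto the scikit-learn and XGBoost constructor arguments, so fresh estimators can be rebuilt directly from the file (the `random_state` below is an illustrative choice, not taken from the training code):

```python
import json

from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

with open("best_params.json") as f:
    params = json.load(f)

# JSON null becomes Python None, which RandomForestRegressor accepts for max_depth
rf = RandomForestRegressor(**params["random_forest"], n_jobs=-1, random_state=42)
xgb = XGBRegressor(**params["xgboost"], random_state=42)
```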

## Evaluation

Cross-validation results are stored in `cv_results.json`. Primary metric: **RMSE** (Root Mean Squared Error).
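
For reference, the metric can be computed like this (taking the square root with `numpy` keeps the snippet compatible across scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root Mean Squared Error, the primary metric for this task."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))
```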

## Training Code

The training code is available on GitHub: [ezylopx5/DATATHON](https://github.com/ezylopx5/DATATHON)

Key files:
- `train.py` - Main training script
- `features.py` - Feature engineering
- `predict.py` - Inference script

## Framework Versions

- Python 3.8+
- scikit-learn
- XGBoost
- pandas
- numpy
- huggingface_hub (for downloading the model files)

## License

MIT License

## Citation

```bibtex
@misc{gdgc-datathon-2025,
  author = {Haxxsh},
  title = {GDGC Datathon 2025 Lap Time Prediction Models},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Haxxsh/gdgc-datathon-models}
}
```