---
license: mit
tags:
- tabular-regression
- sklearn
- xgboost
- random-forest
- motorsport
- lap-time-prediction
datasets:
- Haxxsh/gdgc-datathon-data
language:
- en
pipeline_tag: tabular-regression
---
# GDGC Datathon 2025 - Lap Time Prediction Models
Trained models for predicting Formula racing lap times from the GDGC Datathon 2025 competition.
## Model Description
This repository contains ensemble models trained to predict `Lap_Time_Seconds` for Formula racing events. The models use a combination of Random Forest and XGBoost regressors with cross-validation.
### Models Included
| File | Description | Size |
|------|-------------|------|
| `rf_final.pkl` | Final Random Forest model | 158 MB |
| `xgb_final.pkl` | Final XGBoost model | 2.6 MB |
| `rf_cv_models.pkl` | Random Forest CV fold models | 13.4 GB |
| `xgb_cv_models.pkl` | XGBoost CV fold models | 103 MB |
| `rf_model.pkl` | Base Random Forest model | 95 MB |
| `xgb_model.pkl` | Base XGBoost model | 2 MB |
| `feature_engineer.pkl` | Feature preprocessing pipeline | 6 KB |
| `best_params.json` | Optimal hyperparameters | 1 KB |
| `cv_results.json` | Cross-validation results | 1 KB |
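The two JSON files are small enough to inspect before downloading any model weights:

```python
import json

# Inspect tuned hyperparameters and CV scores without loading any model
with open("best_params.json") as f:
    best_params = json.load(f)
with open("cv_results.json") as f:
    cv_results = json.load(f)

print(best_params["random_forest"])  # matches the Hyperparameters section below
print(cv_results)  # exact structure depends on how training logged the folds
```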
## Training Data
The models were trained on the [GDGC Datathon 2025 dataset](https://huggingface.co/datasets/Haxxsh/gdgc-datathon-data):
- **Training samples:** 734,002
- **Target variable:** `Lap_Time_Seconds` (continuous)
- **Target range:** 70.001s - 109.999s
- **Target distribution:** Nearly symmetric (mean ≈ 90s, std ≈ 11.5s)
### Features
The dataset includes features such as the following (the exact column names expected by the models can be recovered from the saved pipeline, as sketched after this list):
- Circuit characteristics (length, corners, laps)
- Weather conditions (temperature, humidity, track condition)
- Rider/driver information (championship points, position, history)
- Tire compounds and degradation factors
- Pit stop durations
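A minimal sketch for recovering the exact feature names, assuming `feature_engineer.pkl` holds a fitted scikit-learn transformer (an assumption; the actual class is defined by the training code):

```python
import pickle

with open("feature_engineer.pkl", "rb") as f:
    feature_engineer = pickle.load(f)

# These attributes exist on fitted scikit-learn transformers; a custom
# pipeline class from the training code may expose something different.
print(getattr(feature_engineer, "feature_names_in_", "not available"))
if hasattr(feature_engineer, "get_feature_names_out"):
    print(feature_engineer.get_feature_names_out())
```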
## Usage
### Loading the Models
```python
import pickle

# Load the final models
with open("rf_final.pkl", "rb") as f:
    rf_model = pickle.load(f)
with open("xgb_final.pkl", "rb") as f:
    xgb_model = pickle.load(f)

# Load the feature engineering pipeline
with open("feature_engineer.pkl", "rb") as f:
    feature_engineer = pickle.load(f)
```
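For the larger Random Forest pickles, `joblib.load` is a common alternative loader for scikit-learn objects and in practice also reads files written with plain `pickle.dump`; the `pickle.load` calls above remain the safe default:

```python
import joblib

# Drop-in alternative for the 158 MB final Random Forest model
rf_model = joblib.load("rf_final.pkl")
```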
### Making Predictions
```python
import pandas as pd
# Load test data
test_df = pd.read_csv("test.csv")
# Apply feature engineering
X_test = feature_engineer.transform(test_df)
# Predict with ensemble (average of RF and XGB)
rf_preds = rf_model.predict(X_test)
xgb_preds = xgb_model.predict(X_test)
ensemble_preds = (rf_preds + xgb_preds) / 2
```
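The CV fold models can be ensembled the same way. A minimal sketch, assuming `xgb_cv_models.pkl` unpickles to an iterable of fitted fold estimators (the actual container type is defined by the training code):

```python
import pickle
import numpy as np

# The XGBoost fold models (103 MB) are far lighter than the 13.4 GB RF ones
with open("xgb_cv_models.pkl", "rb") as f:
    xgb_cv_models = pickle.load(f)  # assumed: an iterable of fitted regressors

# Average predictions across the CV folds
fold_preds = np.column_stack([m.predict(X_test) for m in xgb_cv_models])
cv_ensemble_preds = fold_preds.mean(axis=1)
```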
### Download from Hugging Face
```python
import pickle

from huggingface_hub import hf_hub_download

# Download a specific model file
model_path = hf_hub_download(
    repo_id="Haxxsh/gdgc-datathon-models",
    filename="xgb_final.pkl",
)

# Load it
with open(model_path, "rb") as f:
    model = pickle.load(f)
```
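To mirror several files at once, `snapshot_download` fetches the repository in one call; since `rf_cv_models.pkl` alone is 13.4 GB, `allow_patterns` can restrict the download to the smaller artifacts:

```python
from huggingface_hub import snapshot_download

# Download only the final models and metadata, skipping the 13.4 GB CV pickle
local_dir = snapshot_download(
    repo_id="Haxxsh/gdgc-datathon-models",
    allow_patterns=["*_final.pkl", "feature_engineer.pkl", "*.json"],
)
print(local_dir)
```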
## Hyperparameters
Best parameters found via cross-validation (see `best_params.json`):
```json
{
  "random_forest": {
    "n_estimators": 100,
    "max_depth": null,
    "min_samples_split": 2,
    "min_samples_leaf": 1
  },
  "xgboost": {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 6
  }
}
```
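These map directly onto the scikit-learn and XGBoost constructors for retraining from scratch. A minimal sketch (arguments such as `n_jobs` are assumptions, not recorded in `best_params.json`):

```python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=None,   # JSON null maps to Python None (unlimited depth)
    min_samples_split=2,
    min_samples_leaf=1,
    n_jobs=-1,        # assumption: parallelism is not recorded in best_params.json
)
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6)
```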
## Evaluation
Cross-validation results are stored in `cv_results.json`. Primary metric: **RMSE** (Root Mean Squared Error).
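To reproduce the metric on your own held-out split (a sketch; `X_val` and `y_val` are placeholders for validation data you provide):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# X_val / y_val: a held-out validation split (placeholders, not shipped here)
val_preds = (rf_model.predict(X_val) + xgb_model.predict(X_val)) / 2
rmse = np.sqrt(mean_squared_error(y_val, val_preds))  # same metric as cv_results.json
print(f"Validation RMSE: {rmse:.3f} s")
```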
## Training Code
The training code is available on GitHub: [ezylopx5/DATATHON](https://github.com/ezylopx5/DATATHON)
Key files:
- `train.py` - Main training script
- `features.py` - Feature engineering
- `predict.py` - Inference script
## Framework Versions
- Python 3.8+
- scikit-learn
- XGBoost
- pandas
- numpy
## License
MIT License
## Citation
```bibtex
@misc{gdgc-datathon-2025,
  author    = {Haxxsh},
  title     = {GDGC Datathon 2025 Lap Time Prediction Models},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Haxxsh/gdgc-datathon-models}
}
```