Credit Risk Forecasting

1. Project Overview

This repository contains a full pipeline for the CS5344 (Track 2) finance task. It ingests tabular and temporal credit data, engineers domain-specific features, benchmarks baseline models, and trains optimized ensembles for default risk prediction. The repository covers data preparation, feature engineering, baseline experiments, Bayesian hyperparameter search, and final models ready for evaluation.


2. Workflow & Components

  • Data & docs: data/, plus column and feature documentation: 列说明文档.md (column documentation), 特征说明文档.md (feature documentation), and Feature_Documentation_EN.md (feature documentation in English).
  • Feature engineering: feature_engineering/ scripts plus feature_tests/ for validation.
  • Baselines: baseline.py with results stored under baseline_models/results/.
  • Optimization: Bayesian Optimization.py for tuning key hyperparameters.
  • Final models: packaged in final_models/ with model.py for inference.
  • Utilities: requirements.txt for environment setup; PDF problem statement (CS5344_Formal_Problem_Formulation.pdf).

3. Models & Techniques

  • Gradient boosting and tree ensembles as core predictors.
  • Bayesian optimization to search over learning rates, tree depths, and regularization strength.
  • Feature importance checks to confirm that the learned signals agree with financial intuition.
  • Train/validation splits and a hold-out test set to check generalization.
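
As an illustration of the train/validation/hold-out idea above, here is a minimal standard-library sketch of a stratified three-way split. It is not taken from the repository: the function name stratified_three_way_split, the val_frac and test_frac parameters, and the synthetic labels are all assumptions made for the example. Stratifying by label is one reasonable choice for default prediction, where defaults are usually the minority class, but the project's scripts may split differently.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Return (train_idx, val_idx, test_idx), keeping the class ratio in each part.

    Hypothetical helper for illustration; not part of the repository.
    """
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be strictly between 0 and 1")

    rng = random.Random(seed)

    # Group row indices by class label so each class is split separately.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val_idx.extend(indices[:n_val])
        test_idx.extend(indices[n_val:n_val + n_test])
        train_idx.extend(indices[n_val + n_test:])

    return sorted(train_idx), sorted(val_idx), sorted(test_idx)


# Synthetic, imbalanced labels for illustration only: 1 = default, 0 = no default.
labels = [1] * 100 + [0] * 900
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
```

The hold-out indices would be set aside until the final evaluation, while the validation indices can be used for model selection and tuning.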

4. How to Run

  1. pip install -r requirements.txt
  2. Prepare data under data/ following the provided column spec.
  3. Run baselines: python baseline.py (results in baseline_models/results/).
  4. Tune: python "Bayesian Optimization.py" to sweep hyperparameters.
  5. Train/evaluate final model: python model.py.
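
The steps above can also be chained from a single driver script. The sketch below is an assumption, not a file in the repository: PIPELINE_STEPS and run_pipeline are hypothetical names, and the script paths are copied verbatim from the steps above, so adjust them if the scripts live in subdirectories. Step 2 (data preparation) is manual and is not included.

```python
import shlex
import subprocess
import sys

# Commands mirror steps 1, 3, 4 and 5 above. Paths are taken as written in
# the steps and assume the script is launched from the repository root.
PIPELINE_STEPS = [
    ("install", [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]),
    ("baselines", [sys.executable, "baseline.py"]),
    ("tuning", [sys.executable, "Bayesian Optimization.py"]),
    ("final", [sys.executable, "model.py"]),
]


def run_pipeline(steps=PIPELINE_STEPS, dry_run=True):
    """Run each step in order and return the names of the steps processed.

    With dry_run=True the commands are only printed, never executed.
    With dry_run=False a failing step raises CalledProcessError and stops the run.
    """
    processed = []
    for name, cmd in steps:
        print(f"[{name}] {shlex.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)
        processed.append(name)
    return processed


if __name__ == "__main__":
    # Print the plan only; pass dry_run=False to actually execute the steps.
    run_pipeline(dry_run=True)
```

Passing each command as a list rather than a shell string avoids quoting problems with the space in "Bayesian Optimization.py".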

5. Highlights

  • Clear documentation for columns/features eases reproducibility.
  • Modular scripts split by stage (baseline → feature engineering → tuning → final).
  • Results and artifacts are versioned so that runs can be compared.

Repository: https://github.com/lijinxuan1101/cs-5344-track-2-finance