Li Jinxuan

Email: lijinxuan1101_work@outlook.com
Tel(SG): +65 80398859
Tel(CN): +86 17792789882

National University of Singapore

Master of Computing, Information Systems

Sep 2025 – Jun 2027 · Singapore

Central South University

Bachelor of Management, Information Management and Information System

Sep 2021 – Jun 2025 · Changsha, China

Internships

Intern: Data Science Intern at 6ESTATES SG

Introduction

IDP/OCR system development: Built OCR pipelines for bank statements and automated extraction of financial fields and transaction tables across 10+ banks, delivering end-to-end visualization and structured outputs on the production platform.

Prompt engineering for credit risk: Designed prompts to extract delinquency and credit-risk signals from SLIK reports; fixed model misclassification issues and automated credit-risk extraction with GPT-4.1, improving the accuracy and robustness of the SLIK parsing workflow.

Intern: Fixed Income Intern at CIB Co., Ltd.

Introduction

Prepared daily, weekly, and monthly reports on the bond market.

Daily: Recorded and analyzed bond market transactions, government bond trends, and money and stock market activity and sentiment.

Weekly: Compiled interest rate briefs and drafted reports, including market liquidity reviews, central bank operations reviews, and market outlook analysis.

Monthly: Independently prepared macroeconomic data briefs and reports covering interbank certificate of deposit trends, custody data analysis, institutional behavior reports, and commercial paper reports.

Compiled and analyzed economic and financial data for monthly reports, including PMI, social financing, CPI and PPI, retail sales, industrial production, and import/export data, producing statistical analyses and commentary reports.

Intern: Assurance Data Analysis Intern at ERNST & YOUNG

Introduction

Managed and processed financial data exported from Kingdee and UFIDA systems, handling 10,000+ data fields daily.

Automated real estate data auditing with Openpyxl and Pandas, achieving faster queries, reduced latency, and streamlined procedures.

Intern: RA IBond Data Analyst Intern at Deloitte

Introduction

Implemented automated ETL pipelines using the WIND Excel plugin and templates to query and preprocess 80,000+ records daily for the Deloitte IBond Smart Bond and CITIC Construction Investment teams.

Optimized large-scale data retrieval and analysis pipelines with Python + SQL, improving processing speed and stability; proficient in WIND (Excel & Python integration).

Projects

Project: Email Classifier Multi-agent System

Email Classifier Multi-agent System

1. Project Overview

A course final project delivering an end-to-end email classification system. It expands a HuggingFace base dataset (jason23322/high-accuracy-email-classifier) with synthetic emails generated via the OpenAI API, trains and evaluates models in notebooks, and ships a Streamlit app for interactive use.

Deployed demo: https://email-manager.streamlit.app/

2. Repository Structure

Data: Combined dataset with synthetic augmentation.
Notebooks: EDA, preprocessing, training, pipeline export, API call simulation (Classification.ipynb, Final project.ipynb).
Email pipeline: Production script (email_pipeline.py) and joblib checker (check_joblib.py).
Deployment: Streamlit app config (.streamlit/) and Vercel deployment files.
Reports & Slides: LaTeX/Overleaf reports and presentation.
CI: GitHub workflows for lint/test.

3. Model & Pipeline

Text cleaning, tokenization, and vectorization for email bodies.
Supervised classifiers (documented in notebooks) with joblib-exported pipelines.
Evaluation tracked in notebooks and reports; artifacts stored for reuse.

4. How to Run (Local)

pip install -r requirements.txt
Explore/train in notebooks (Classification.ipynb / Final project.ipynb).
Serve app: streamlit run email_pipeline.py (or follow deployment/ README).
Verify artifacts: python check_joblib.py

5. Highlights

Dataset augmentation via LLM to improve coverage.
Full transparency: notebooks document each step from data to deployment.
Deployed Streamlit demo plus reproducible local scripts.

Project Link: https://github.com/naufalad/IS5126-Final-Project

...
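The text cleaning, tokenization, and vectorization stage of the email classifier can be illustrated with a minimal standard-library sketch. This is not the project's actual code: the function names (clean_text, tokenize, build_vocabulary, vectorize) and the simple bag-of-words counting are assumptions chosen for illustration, and the real notebooks may use a different vectorizer.

```python
import re
from collections import Counter


def clean_text(body: str) -> str:
    """Lowercase the email body, drop URLs and email addresses, keep letters and digits."""
    text = body.lower()
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)  # remove URLs and addresses
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # remove punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()           # collapse repeated whitespace


def tokenize(text: str) -> list[str]:
    """Split cleaned text on whitespace."""
    return text.split()


def build_vocabulary(docs: list[list[str]]) -> dict[str, int]:
    """Map every distinct token to a column index (sorted for determinism)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


# Hypothetical sample emails, for illustration only.
emails = [
    "Win a FREE prize now!!! Visit http://spam.example",
    "Meeting moved to 3pm, see agenda",
]
token_lists = [tokenize(clean_text(e)) for e in emails]
vocabulary = build_vocabulary(token_lists)
vectors = [vectorize(t, vocabulary) for t in token_lists]
```

The resulting count vectors are the kind of numeric input a supervised classifier would be trained on before the fitted pipeline is exported.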

Project: Credit Risk Forecasting

Credit Risk Forecasting

1. Project Overview

A full pipeline for the CS5344 (Track 2) finance task: ingest tabular/temporal credit data, engineer domain-specific features, benchmark baseline models, and train optimized ensembles for default risk prediction. The repository structure covers data prep, feature engineering, baseline experiments, Bayesian hyperparameter search, and final models ready for evaluation.

2. Workflow & Components

Data & docs: data/, column and feature explanations (列说明文档.md, 特征说明文档.md, Feature_Documentation_EN.md).
Feature engineering: feature_engineering/ scripts plus feature_tests/ for validation.
Baselines: baseline.py with results stored under baseline_models/results/.
Optimization: Bayesian Optimization.py for tuning key hyperparameters.
Final models: packaged in final_models/ with model.py for inference.
Utilities: requirements.txt for environment setup; PDF problem statement (CS5344_Formal_Problem_Formulation.pdf).

3. Models & Techniques

Gradient boosting and tree ensembles as core predictors.
Bayesian optimization to search learning rates, depths, and regularization.
Feature importance checks to align signals with financial intuition.
Train/validation splits and hold-out tests for generalization.

4. How to Run

pip install -r requirements.txt
Prepare data under data/ following the provided column spec.
Run baselines: python baseline.py (results in baseline_models/results/).
Tune: python "Bayesian Optimization.py" to sweep hyperparameters.
Train/evaluate final model: python model.py

5. Highlights

Clear documentation for columns/features eases reproducibility.
Modular scripts split by stage (baseline → feature eng → tuning → final).
Results and artifacts versioned for comparison.

Project Link: https://github.com/lijinxuan1101/cs-5344-track-2-finance

...

Project: Corrective RAG Adaptive QA System

Corrective RAG Adaptive QA System

1. Project Overview

An adaptive Retrieval-Augmented Generation (RAG) system that adds a self-evaluation step: if local retrieval is weak, it automatically triggers a Tavily web search to fetch better context, reducing hallucination and “forced answers.” Built with LangGraph to model the end-to-end QA state machine.

Key stack: Python, LangGraph, OpenAI API, ChromaDB, Tavily Search, FastAPI, Streamlit.

2. Architecture

Flow:
Query → Retrieve from ChromaDB → Grade relevance
Relevant? → Generate answer
Not relevant or uncertain? → Tavily web search → Rerank → Generate answer

...
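The retrieve, grade, fall back, and generate flow can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the project's LangGraph implementation: the keyword-overlap grader stands in for an LLM-based relevance grader, and retrieve, web_search, and generate are hypothetical stubs in place of ChromaDB, Tavily, and the OpenAI API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QAState:
    """State carried through the QA flow (field names are illustrative)."""
    query: str
    documents: list[str] = field(default_factory=list)
    used_web_search: bool = False
    answer: str = ""


def keyword_overlap(query: str, doc: str) -> float:
    """Toy relevance grader: fraction of query words that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def run_corrective_rag(
    query: str,
    retrieve: Callable[[str], list[str]],
    web_search: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    threshold: float = 0.5,
) -> QAState:
    state = QAState(query=query)
    # Step 1: local retrieval, then grade each document for relevance.
    retrieved = retrieve(query)
    relevant = [d for d in retrieved if keyword_overlap(query, d) >= threshold]
    # Step 2: if nothing passes the grader, fall back to web search and rerank.
    if not relevant:
        state.used_web_search = True
        relevant = sorted(
            web_search(query),
            key=lambda d: keyword_overlap(query, d),
            reverse=True,
        )
    # Step 3: generate the answer from whichever context survived.
    state.documents = relevant
    state.answer = generate(query, relevant)
    return state


# Hypothetical stubs standing in for ChromaDB, Tavily, and the LLM call.
def retrieve(query: str) -> list[str]:
    return ["LangGraph models workflows as state graphs"]


def web_search(query: str) -> list[str]:
    return ["unrelated page", "Tavily is a web search API for LLM agents"]


def generate(query: str, docs: list[str]) -> str:
    context = docs[0] if docs else "no context"
    return f"Answer to '{query}' using: {context}"
```

Calling run_corrective_rag("what is tavily", retrieve, web_search, generate) takes the web-search branch because the local document shares no words with the query, while a query such as "langgraph state graphs" is answered from local retrieval alone.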

Project: CoinPilot Premium User Conversion Prediction System

CoinPilot Premium User Conversion Prediction System

Executive Summary

The CoinPilot Premium User Conversion Prediction System is a comprehensive machine learning solution designed to predict user conversion to premium services in a fintech application. The system leverages ensemble learning techniques to analyze user behavior patterns, financial profiles, and engagement metrics to provide accurate conversion probability predictions. The project encompasses data analysis, model development, and deployment through a modern web-based architecture using FastAPI and Streamlit.

...

Project: A specialized Wikipedia Research Assistant

A specialized Wikipedia Research Assistant

1. Project Overview

This project is an AI-powered, fully automated research system designed to scrape information from unstructured sources like Wikipedia, perform intelligent information extraction and structuring using Large Language Models (LLMs), and provide an interactive query and analysis platform accessible via natural language. The system constitutes an end-to-end data intelligence pipeline, from data acquisition and processing to analysis and visualization, demonstrating how modern AI technology can transform complex web information into actionable knowledge.

...

Project: O2O Coupon Usage Prediction

O2O Coupon Usage Prediction

Project Summary

In this project, I developed a machine learning model to predict whether customers would redeem coupons on an O2O (Online-to-Offline) platform. By engineering over 30 features from user and merchant behavior and employing a LightGBM classifier, the model provides valuable insights into the key drivers of coupon redemption, enabling more targeted and effective marketing strategies.

The Business Challenge

O2O platforms frequently issue coupons to attract customers and drive sales. However, untargeted coupon distribution can be costly and inefficient. The core challenge is to accurately predict which users are most likely to use a given coupon, enabling more effective marketing campaigns and maximizing return on investment.

...

Competitions

Competition: 2024 Mathematical Contest in Modeling

An Analysis of Sustainable Strategies for Property Insurance

Introduction

In this paper, I developed an LSTM model with an accuracy of 85% to predict future natural disasters using natural disaster data from Florida and California over the past 30 years. Applied neural network, linear regression, and deep learning models to assess the relationship between natural disasters and property damage.

Summary

In recent years, homeowners and insurance companies have faced significant crises, necessitating the development of comprehensive solutions to meet the needs of all stakeholders involved in the insurance industry. This paper presents an innovative approach to property insurance by introducing an insurance company’s property allocation model based on deep learning and LSTM and a market investment model utilizing regression analysis, as well as a community conservation building model employing grey correlation analysis. These models provide valuable insights and correlation analyses for the property-casualty insurance sector, promoting a more sustainable industry.

...

Competition: 2023 Huashu Cup Model Construction Competition

A Machine Learning and Evaluation Framework for Analyzing the Impact of Maternal Health on Infant Development

Summary

This study establishes machine learning models to analyze correlations between infant behavior characteristics, maternal physical and mental health indicators, and infant sleep quality, and proposes treatment strategies.

Problem 1

Preprocessed infant behavior features and maternal health indicators.
Designed hierarchical statistics for multiple variables: ANOVA for continuous variables and logistic regression for categorical variables.
Conducted correlation analysis and multifactor ANOVA, finding significant relationships:
Maternal age ↔ infant sleep patterns & behavior features
Maternal gestation period ↔ infant wake-up frequency
Maternal HADS score ↔ infant wake-up frequency, total sleep time, behavior features
Maternal EPDS score ↔ infant wake-up frequency, total sleep time
No significant effects were found for other indicators.

Problem 2

Trained models using logistic regression, Random Forest, Neural Networks, and XGBoost.
Selected XGBoost as the best-performing model (highest accuracy).
Optimized model parameters using loss function minimization and cross-validation.
Predicted the behavior types for the last 20 infant samples using the trained XGBoost model.

Problem 3

Combined genetic algorithms with the XGBoost model from Problem 2 to generate treatment plans.
Final treatment costs:
Moderate type: 695 CNY
Quiet type: 10,448 CNY

Problem 4

Evaluated infant sleep quality using the CRITIC method.
Established a comprehensive sleep quality ranking system using rank-sum ratio evaluation, classifying sleep quality as excellent, good, medium, or poor.
Determined indicator weights with the CRITIC method.
Trained a Random Forest model to associate comprehensive infant sleep quality with maternal health indicators, predicting sleep quality for the last 20 infant samples.

Problem 5

Based on the evaluation and association models from Problem 4, calculated the initial sleep quality of infant #238.
Applied the same approach as Problem 3, updating maternal indicators in the association model to generate a new treatment plan:
Moderate type (sleep quality: excellent), minimum cost: 8,699 CNY

Keywords: XGBoost, Genetic Algorithm, CRITIC Method, Rank-Sum Ratio Evaluation, Random Forest, Association Model

...
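The CRITIC weighting used in Problem 4 can be illustrated with a short standard-library sketch. It follows the usual textbook formulation: each indicator's weight is proportional to its contrast (standard deviation after min-max normalization) multiplied by its conflict with the other indicators (sum of one minus the Pearson correlation). The sample values are made up, all indicators are assumed to be benefit-type (higher is better), and the population standard deviation is an assumption; the paper's actual data and preprocessing are not reproduced here.

```python
from statistics import mean, pstdev


def min_max_normalize(column: list[float]) -> list[float]:
    """Rescale a benefit-type indicator to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def critic_weights(columns: list[list[float]]) -> list[float]:
    """CRITIC weights for a list of indicator columns (one list per indicator)."""
    normalized = [min_max_normalize(c) for c in columns]
    information = []
    for j, col_j in enumerate(normalized):
        contrast = pstdev(col_j)  # variability of indicator j
        conflict = sum(           # disagreement with every other indicator
            1 - pearson(col_j, col_k)
            for k, col_k in enumerate(normalized)
            if k != j
        )
        information.append(contrast * conflict)
    total = sum(information)
    if total == 0:
        return [1 / len(columns)] * len(columns)  # fall back to equal weights
    return [c / total for c in information]


# Made-up indicator values for four samples, for illustration only.
indicators = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
]
weights = critic_weights(indicators)
```

In this toy data the second indicator receives the largest weight because it is negatively correlated with both of the others, so it carries the most non-redundant information.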