Enterprise-grade telecom churn prediction pipeline — from raw CSV to production API. XGBoost + GPU acceleration, 250-trial Optuna tuning, Great Expectations validation, MLflow experiment tracking, FastAPI serving, and 13-job CI/CD to AWS ECR.
The FastAPI inference service is configured and containerised, but the ECS task is kept stopped to avoid idle cloud charges.
The screenshots below show the live API in action — spin it up locally with docker run from the GitHub repo.
A full-stack ML engineering project solving telecom customer churn — built to enterprise production standards. The system ingests raw customer data, enforces data contracts with Great Expectations, engineers 11 domain-specific features, trains an XGBoost model on GPU, and serves predictions via a FastAPI microservice deployed to AWS.
Every layer is observable: MLflow tracks all experiments, GitHub Actions runs 13 CI/CD jobs on every push, and the prediction threshold is fully configurable at serving time — giving business teams control over precision vs. recall tradeoffs.
Four organizational roles, one shared intelligence layer.
Every false negative is a customer who churned without intervention. At an average telecom CLV of $300–$800, a 3% miss rate on 51,500 samples represents $460K–$1.2M in recoverable revenue annually.
The threshold (0.35 default) is tunable at inference time. Lower threshold → higher recall, more outreach cost. Higher threshold → higher precision, fewer false alarms. Business owns the tradeoff — not the data scientist.
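The serving-time decision is just a comparison against the configurable threshold. A minimal sketch (the function name and example probabilities are illustrative, not the repo's API):

```python
import numpy as np

def apply_threshold(proba: np.ndarray, threshold: float = 0.35) -> np.ndarray:
    # Convert churn probabilities into labels at a business-chosen operating point.
    return (proba >= threshold).astype(int)

scores = np.array([0.20, 0.36, 0.80])
print(apply_threshold(scores))        # default 0.35 -> [0 1 1]
print(apply_threshold(scores, 0.5))   # stricter threshold -> [0 0 1]
```

Raising the threshold flips borderline customers like the 0.36 score from "churn" to "retain", which is exactly the precision/recall lever the business controls.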
Engineered features like CLV, Risk Score, Engagement Score, and Contract Stability Index let marketing teams build retention cohorts directly from model outputs — without a separate analysis step.
13 CI/CD jobs, 140+ tests, health checks, and data validation gates mean the system degrades gracefully — not silently. Ops teams get observable, auditable, reproducible behaviour at every deploy.
Based on a 51,500-customer portfolio at industry-average telecom CLV
51,500 customers · avg monthly charge $65 · avg CLV $780 · retention offer cost $45/customer · baseline churn rate 17.1%
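The headline revenue-at-risk figures follow from back-of-envelope arithmetic on the assumptions above:

```python
# Revenue at risk from missed churners (false negatives).
customers = 51_500
miss_rate = 0.03              # 3% of samples missed, per the figure above
clv_low, clv_high = 300, 800  # industry-average CLV band

missed = int(customers * miss_rate)          # 1,545 customers
print(missed * clv_low, missed * clv_high)   # 463500 1236000
```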
8 sequential stages from raw data to production model
Load train/test/holdout CSVs. Validate file existence, column presence, and basic schema before any processing begins.
Great Expectations suite: schema checks, business rule validation (gender, contract type, charge ranges), and statistical drift detection.
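The kinds of business-rule checks the suite enforces can be sketched in plain Python; the column names, allowed domains, and charge range below are assumptions, not the repo's actual expectation suite:

```python
def validate_row(row: dict) -> list[str]:
    # Return a list of data-contract violations for one customer record.
    failures = []
    if row.get("gender") not in {"Male", "Female"}:
        failures.append("gender out of domain")
    if row.get("Contract") not in {"Month-to-month", "One year", "Two year"}:
        failures.append("contract type out of domain")
    charge = row.get("MonthlyCharges")
    if not isinstance(charge, (int, float)) or not (0 < charge < 500):
        failures.append("MonthlyCharges outside expected range")
    return failures

print(validate_row({"gender": "Male", "Contract": "One year", "MonthlyCharges": 65.0}))  # []
```

In the actual pipeline, Great Expectations runs these as declarative expectations and fails the stage on any violation, rather than returning a list.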
Column normalisation, null handling strategies per feature, categorical encoding, and TotalCharges numeric conversion.
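The TotalCharges conversion is a classic telco-dataset quirk: the column arrives as strings, with blanks for zero-tenure customers. A sketch of one possible handling strategy (the zero-fill is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"], "tenure": [1, 0, 34]})

# Blank strings fail numeric parsing and become NaN, then get a domain-informed fill.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(0.0)
print(df["TotalCharges"].tolist())  # [29.85, 0.0, 1889.5]
```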
11 domain features: CLV, Risk Score, Engagement Score, Contract Stability Index, Service Complexity, Tenure Groups, and more.
Optuna Bayesian optimisation over 250 trials. Recall-focused objective with XGBoost GPU backend and custom threshold search.
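The custom threshold search inside a trial's objective might look like the sketch below: scan a grid of thresholds and keep the one maximising recall subject to a precision floor. The grid and the floor value are assumptions for illustration:

```python
import numpy as np

def best_threshold(proba, y_true, min_precision=0.30):
    # Maximise recall over a threshold grid, subject to a minimum precision.
    best_t, best_recall = 0.5, -1.0
    for t in np.arange(0.05, 0.95, 0.05):
        pred = proba >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision >= min_precision and recall > best_recall:
            best_t, best_recall = t, recall
    return best_t, best_recall

proba = np.array([0.10, 0.40, 0.60, 0.90])
labels = np.array([0, 1, 1, 1])
t, r = best_threshold(proba, labels)
print(round(float(t), 2), r)
```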
XGBoost with tree_method=gpu_hist. Best params from Optuna applied. Early stopping on validation recall. Full MLflow logging.
Holdout evaluation on 51,500 samples. Confusion matrix, classification report, AUC-ROC. Recall target ≥ 0.95 enforced by CI.
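The CI-enforced recall target reduces to a hard gate on the holdout metric. A sketch with toy labels standing in for the real holdout predictions:

```python
from sklearn.metrics import recall_score

RECALL_FLOOR = 0.95  # build fails below this

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]  # stand-in for model predictions on the holdout set

recall = recall_score(y_true, y_pred)
if recall < RECALL_FLOOR:
    raise SystemExit(f"recall {recall:.3f} below floor {RECALL_FLOOR}")
print(f"recall gate passed: {recall:.3f}")
```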
MLflow model registry + AWS S3 versioned artefact storage. Production version tagged, previous versions retained for rollback.
Four layers designed for observability, reproducibility and zero-downtime deployments
11 domain-specific features constructed from raw telecom signals
| Feature | Formula / Logic | Business Meaning |
|---|---|---|
| CLV | MonthlyCharges × tenure | Estimated lifetime revenue per customer |
| risk_score | Weighted contract + monthly charge + tenure signal | Composite churn propensity index |
| engagement_score | service_count × tenure_norm | How embedded the customer is in the product |
| contract_stability | contract_type_encoded × tenure | Contractual lock-in strength |
| service_complexity | Count of active add-on services | Switching friction from multiple services |
| tenure_group | Binned tenure: new / mid / loyal | Customer lifecycle stage |
| charge_per_service | MonthlyCharges / (service_count + 1) | Perceived value-for-money signal |
| paperless_auto_pay | PaperlessBilling AND AutoPay flag | Digital engagement indicator |
| senior_no_support | Senior citizen AND no tech support | High-vulnerability segment flag |
| high_value_churn_risk | CLV > median AND contract = monthly | Priority intervention flag for CRM |
| charge_increase_risk | TotalCharges / (tenure + 1) deviation | Detects unexpectedly rising cost burden |
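Two of the tabled features, sketched directly from their formulas (the input values are illustrative; note the first row reproduces the $780 average CLV from the economics assumptions, since 65 × 12 = 780):

```python
import pandas as pd

df = pd.DataFrame({
    "MonthlyCharges": [65.0, 90.0],
    "tenure": [12, 2],
    "service_count": [3, 6],
})

# Formulas follow the feature table above.
df["CLV"] = df["MonthlyCharges"] * df["tenure"]
df["charge_per_service"] = df["MonthlyCharges"] / (df["service_count"] + 1)
print(df[["CLV", "charge_per_service"]].round(2))
```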
13 jobs: black, isort, flake8, mypy, bandit, safety, pytest (unit/integration/e2e), coverage gate ≥ 80%, Docker build, ECR push.
Model artefacts versioned in S3. Container images built and pushed to ECR in CI. ECS deployment manifests included.
Every training run logs params, metrics, confusion matrix artifact, and model binary. Promotes best run to registry automatically.
Builder stage installs deps, runtime stage copies only artefacts. Non-root user, health checks, and PYTHONDONTWRITEBYTECODE optimisations.
Each stage is an isolated, testable unit. Stages compose into a DAG — easy to swap, extend, or run in parallel.
Prediction threshold passed as request parameter — no redeploy needed to change business operating point.
Great Expectations gates run before any model code. Bad data raises immediately — no silent model degradation.
Feature engineering, model training, serving, and validation each live in isolated modules — enabling independent testing and deployment.
The full infrastructure code, CI/CD pipeline, Docker configuration, and API implementation are available on GitHub.