Cloud Analytics ML Pipeline

Reproducible Spark ML pipeline (ingest → features → train → evaluate → dashboard artifacts) with local ↔ GCP parity.

Role: Data + platform engineering demo
Timeframe: Prototype build
Stack: Python • PySpark • Spark MLlib • GCS • Makefile • Streamlit
Spark • PySpark • MLlib • GCP Dataproc • Reproducibility • Data Engineering • Observability
At a glance
  • Problem
    Reproducible Spark ML pipeline (ingest → features → train → evaluate → dashboard artifacts) with local ↔ GCP parity.
  • Role
    Data + platform engineering demo
  • Timeframe
    Prototype build
  • Stack
    Python • PySpark • Spark MLlib • GCS
  • Focus
    Spark • PySpark • MLlib
  • Results
    A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.

Problem

Product analytics pipelines often drift between laptop and cloud runs: paths change, configs diverge, and feature logic becomes inconsistent. The result is untrusted dashboards, slow iteration, and models that can’t be audited.

Why this matters: consistent outputs across environments, faster reruns, and traceable model decisions that can be explained to stakeholders.

Executive summary
  • Config-driven Spark pipeline: ingest → session features → train/evaluate → dashboard exports
  • Local ↔ GCP Dataproc Serverless parity via Makefile + config.yaml / config.gcp.yaml
  • Deterministic artifacts in reports/ (metrics, figures, parquet aggregates)
  • Data is externalized (public GCS or BYO) to keep the repo clean and reproducible

Context

Architecture at a glance
[Architecture diagram: cloud analytics ML pipeline. Reproducible Spark flow with Ingest (GCS dataset) → Features (transforms) → Train + Eval (MLlib) → Exports (dashboard); execution is local ↔ Dataproc with local/cloud parity and reproducible runs.]

Architecture at a glance — Raw clickstream CSVs are ingested into parquet (schema enforced + partitioned), session features are built, a baseline logistic regression model is trained/evaluated, and dashboard-ready aggregates are exported for Streamlit.

How to read the architecture — Ingestion produces immutable parquet partitions, feature jobs aggregate session-level intent, training consumes versioned feature sets, evaluation writes metrics/plots, and the dashboard reads only exported aggregates (no live Spark).

Real-world framing — This models product analytics: funnel conversion and propensity scoring. Sessionization captures intent, parquet partitioning keeps jobs repeatable, leakage filtering protects metrics, and time-based splits reflect deployment drift.

Cloud migration path — Same pipeline runs locally or on Dataproc Serverless by switching configs: local paths → GCS URIs, Spark master → serverless, and storage helpers route reads/writes without code changes.
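
As an illustration, here is a minimal sketch of how such a storage helper might route paths. It assumes a config dict with a hypothetical storage.base_uri key; the project's actual config layout and helper names may differ.

    from urllib.parse import urlparse

    def resolve_path(cfg: dict, *parts: str) -> str:
        """Join a dataset-relative path onto the configured base URI.

        With config.yaml the base might be a local directory ("data"); with
        config.gcp.yaml it might be a bucket ("gs://my-bucket/analytics").
        Spark readers/writers accept either form, so callers never branch.
        """
        base = cfg["storage"]["base_uri"].rstrip("/")
        return "/".join([base, *parts])

    def is_gcs(path: str) -> bool:
        """True when a path points at Google Cloud Storage."""
        return urlparse(path).scheme == "gs"

    # The same call works in both environments.
    local_cfg = {"storage": {"base_uri": "data"}}
    cloud_cfg = {"storage": {"base_uri": "gs://example-bucket/analytics"}}
    print(resolve_path(local_cfg, "processed", "events"))  # data/processed/events
    print(resolve_path(cloud_cfg, "processed", "events"))  # gs://example-bucket/analytics/processed/events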

The dataset is externalized (public GCS or BYO), enabling clean local ↔ cloud runs with the same configuration model.

How to verify
  • Quickstart (local)
    make setup
    ./scripts/download_data.sh
    make rerun_all_force
    make dashboard
  • Run on Dataproc Serverless
    bash scripts/gcp_run_all.sh
    # or submit a single job
    bash scripts/gcp_submit.sh

Spark ML pipeline on GCP Dataproc Serverless with local parity

The same pipeline runs locally or on Dataproc Serverless by switching configs, keeping results consistent across environments.

This reduces deployment drift and makes results reproducible for stakeholders.

Data engineering workflow: ingest → features → train → evaluate → dashboard

Ingestion normalizes raw clickstream data into parquet, feature jobs build session-level signals, and evaluation exports metrics/plots.
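
A minimal sketch of what the ingest step might look like in PySpark, assuming hypothetical column names and local paths rather than the project's exact schema:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("ingest").getOrCreate()

    # Enforce the schema instead of inferring it, so malformed rows fail loudly.
    schema = StructType([
        StructField("user_id", StringType(), False),
        StructField("session_id", StringType(), False),
        StructField("event_type", StringType(), False),
        StructField("page_url", StringType(), True),
        StructField("event_time", TimestampType(), False),
    ])

    raw = (spark.read
           .option("header", "true")
           .option("mode", "FAILFAST")
           .schema(schema)
           .csv("data/raw/clickstream/*.csv"))

    # Partition by event date so each rerun rewrites whole, immutable partitions.
    (raw.withColumn("event_date", F.to_date("event_time"))
        .write.mode("overwrite")
        .partitionBy("event_date")
        .parquet("data/processed/events"))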

Dashboards read precomputed aggregates for faster, cheaper review.
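
A minimal sketch of a dashboard page that reads only exported aggregates, assuming a hypothetical reports/dashboard/daily_conversion.parquet file; the actual aggregate names may differ.

    import pandas as pd
    import streamlit as st

    st.title("Funnel conversion (precomputed)")

    # The dashboard never starts a Spark session: it only reads the small
    # aggregates exported by the pipeline into reports/.
    daily = pd.read_parquet("reports/dashboard/daily_conversion.parquet")

    st.line_chart(daily.set_index("event_date")["conversion_rate"])
    st.dataframe(daily.tail(14))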

Architecture

  1. Pipeline flow (ingest → model → dashboard)
    • Ingest: schema-enforced parquet partitioning for reliable downstream processing.
    • Feature engineering: session-level aggregates + labels with leakage-aware filtering (see the feature sketch after this list).
    • Modeling: Spark MLlib logistic regression with time-based split when available (random fallback otherwise); see the training sketch after this list.
    • Exports: metrics, plots, and dashboard aggregates saved to reports/.
  2. Architecture map (data movement)
    • Raw → processed: CSVs are normalized into partitioned parquet for stable downstream reads.
    • Processed → features: session-level joins/aggregations create the modeling table.
    • Features → model: MLlib pipeline persists model artifacts to models/.
    • Model → reports: metrics.json/csv + figures for audit and QA.
    • Reports → dashboard: Streamlit reads only reports/dashboard aggregates.
  3. Reproducibility + environment parity
    • Makefile orchestrates the same steps locally or on Dataproc Serverless.
    • config.yaml / config.gcp.yaml keep paths and Spark settings explicit.
    • Storage utilities abstract local filesystem vs GCS without code changes.
  4. Cloud migration (local → Dataproc Serverless)
    • Switch configs: local paths → gs:// buckets, Spark master → serverless batch.
    • Package source once and submit batches per pipeline step.
    • Artifacts land in GCS reports/models buckets; dashboard can read GCS directly.
  5. Dashboard artifacts
    • Streamlit app reads exported aggregates so the dashboard is decoupled from training.
    • Reports contain figures + metrics for quick review and QA checks.
  6. Outputs and how to interpret them
    • reports/ingestion_summary.json: row counts + time range for raw ingest sanity checks.
    • reports/features_summary.json: feature row count + positive rate.
    • reports/metrics.json: AUC plus threshold metrics for evaluation at a chosen cutoff.
    • reports/figures/metrics.png: visual sanity check for model performance.
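
A minimal sketch of the leakage-aware feature step referenced in the list above, using hypothetical event and column names (a "purchase" event type, page_url, etc.) rather than the project's exact schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("features").getOrCreate()
    events = spark.read.parquet("data/processed/events")

    # One conversion timestamp per session (sessions without one stay unlabeled).
    conversions = (events.filter(F.col("event_type") == "purchase")
                   .groupBy("session_id")
                   .agg(F.min("event_time").alias("conversion_ts")))

    # Leakage-aware filtering: only events strictly before the conversion
    # contribute to features, so the label never leaks into its own inputs.
    pre_conversion = (events.join(conversions, "session_id", "left")
                      .filter(F.col("conversion_ts").isNull()
                              | (F.col("event_time") < F.col("conversion_ts"))))

    sessions = (pre_conversion.groupBy("session_id")
                .agg(F.count("*").alias("n_events"),
                     F.countDistinct("page_url").alias("n_pages"),
                     (F.max("event_time").cast("long")
                      - F.min("event_time").cast("long")).alias("session_seconds"),
                     F.min("event_time").alias("session_start_ts"))
                .join(conversions.select("session_id", F.lit(1.0).alias("converted")),
                      "session_id", "left")
                .fillna({"converted": 0.0}))

    sessions.write.mode("overwrite").parquet("data/features/sessions")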
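
And a minimal sketch of the time-split training and evaluation step, assuming the same hypothetical feature columns (n_events, n_pages, session_seconds, session_start_ts, converted):

    import json
    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("train").getOrCreate()
    feats = (spark.read.parquet("data/features/sessions")
             .withColumn("ts", F.col("session_start_ts").cast("long")))

    # Time-based split: train on the earliest ~80% of sessions and evaluate on
    # the rest, mirroring how the model would see data after deployment.
    cutoff = feats.approxQuantile("ts", [0.8], 0.01)[0]
    train = feats.filter(F.col("ts") <= cutoff)
    test = feats.filter(F.col("ts") > cutoff)

    assembler = VectorAssembler(
        inputCols=["n_events", "n_pages", "session_seconds"],
        outputCol="features_vec")
    lr = LogisticRegression(featuresCol="features_vec", labelCol="converted")
    model = Pipeline(stages=[assembler, lr]).fit(train)

    # Held-out AUC on the later time window, written where QA and the
    # dashboard expect metrics.
    auc = BinaryClassificationEvaluator(labelCol="converted",
                                        metricName="areaUnderROC"
                                        ).evaluate(model.transform(test))
    with open("reports/metrics.json", "w") as f:
        json.dump({"auc": auc}, f, indent=2)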

Security / Threat Model

  • Config-driven storage paths keep raw/processed/outputs deterministic across local and GCS.
  • No credentials are committed; GCP auth is handled by gcloud/gsutil locally.
  • Dataset is public and externalized to avoid bundling large data in the repo.
  • Dashboard reads precomputed aggregates only (reduces data exposure and compute costs).

Tradeoffs & Lessons

Key decisions — Spark over pandas for scale, Makefile as a single orchestrator, and config-only portability.

Trade-offs — serverless batches reduce ops but add cold-start overhead; precomputed dashboards trade freshness for speed.

Limitations — baseline model only; no experiment tracking server or model registry yet.

Next steps — MLflow tracking, CI validation of outputs, scheduled Dataproc runs, and data quality checks.

Results

A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.

Stack

Python • PySpark • Spark MLlib • GCS • Makefile • Streamlit

FAQ

Why Spark instead of pandas?

Spark handles larger datasets and keeps the same code path for local and cloud execution, which improves reproducibility.

How do you keep local and cloud runs consistent?

The pipeline is config-driven: switching config.yaml to config.gcp.yaml updates paths and Spark settings without code changes.
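
A minimal sketch of how that config selection might be wired, assuming PyYAML, a hypothetical CONFIG environment variable, and illustrative keys; the real Makefile and config files may pass this differently.

    import os
    import yaml
    from pyspark.sql import SparkSession

    # e.g. CONFIG=config.gcp.yaml for cloud runs (illustrative).
    config_path = os.environ.get("CONFIG", "config.yaml")
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    builder = SparkSession.builder.appName(cfg.get("app_name", "pipeline"))
    for key, value in (cfg.get("spark") or {}).items():
        builder = builder.config(key, str(value))  # e.g. spark.sql.shuffle.partitions
    spark = builder.getOrCreate()

    base_uri = cfg["storage"]["base_uri"]  # a local directory or a gs:// URI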

What outputs should I review first?

Start with reports/metrics.json and reports/figures/metrics.png, then open the Streamlit dashboard aggregates.
