Project Overview
This platform ingests high-velocity data from multiple sources (web, IoT, logs), applies streaming transformations using Apache Spark, and exposes live KPIs via user-friendly dashboards. Airflow handles pipeline orchestration, monitoring, and conditional workflow logic.
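For a sense of how the orchestration layer is wired, here is a minimal Airflow DAG sketch with a conditional branch. The DAG ID, task names, schedule, and freshness check are hypothetical placeholders, not the production definitions (Airflow 2.4+ style assumed).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def _check_source_freshness(**context):
    # Hypothetical conditional logic: continue with the normal health
    # checks if the source is fresh, otherwise route to a backfill task.
    source_is_fresh = True  # placeholder for a real freshness probe
    return "verify_stream_health" if source_is_fresh else "trigger_backfill"


with DAG(
    dag_id="kpi_pipeline_monitor",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="*/5 * * * *",  # illustrative 5-minute polling cadence
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=1)},
) as dag:
    branch = BranchPythonOperator(
        task_id="check_source_freshness",
        python_callable=_check_source_freshness,
    )
    verify = EmptyOperator(task_id="verify_stream_health")
    backfill = EmptyOperator(task_id="trigger_backfill")

    # Only one downstream task runs per DAG run, chosen by the branch.
    branch >> [verify, backfill]
```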
Key Outcomes
- Real-time reporting with route-based event aggregation and business metrics refreshed every minute (a streaming sketch follows this list).
- Modular pipeline templates make onboarding new data sources and analytics use cases straightforward.
- Reduced time-to-insight from 12+ hours (legacy batch) to under 2 minutes.
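A minimal sketch of the per-minute, route-based aggregation using Spark Structured Streaming, assuming a Kafka source carrying JSON events; the topic, broker address, and column names (`route`, `ts`) are illustrative placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("route_kpis").getOrCreate()

# Assumed source: a Kafka topic of JSON events; "events", "route", and
# "ts" are hypothetical stand-ins for the real topic and fields.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(
        F.from_json(
            F.col("value").cast("string"), "route STRING, ts TIMESTamp"
        ).alias("e")
    )
    .select("e.*")
)

# One-minute tumbling windows per route; the watermark bounds state so
# late events older than two minutes are dropped rather than buffered.
kpis = (
    events.withWatermark("ts", "2 minutes")
    .groupBy(F.window("ts", "1 minute"), "route")
    .agg(F.count("*").alias("event_count"))
)

# The console sink is for illustration only; a real deployment would
# write to whatever store backs the dashboards.
query = (
    kpis.writeStream.outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```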
What I Did
- Architected and coded end-to-end pipeline modules, tuning Spark for high-throughput workloads (representative settings are sketched after this list).
- Wrote comprehensive documentation and pipeline onboarding guides.
- Led training and knowledge transfer for the data engineering team.
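For illustration, a sketch of the kind of throughput-oriented Spark settings involved in this tuning work; the specific values and option choices here are assumptions, not the production configuration, and would be tuned per workload.

```python
from pyspark.sql import SparkSession

# Representative knobs for a high-throughput streaming job (Spark 3.2+
# assumed for the RocksDB state store); values are illustrative only.
spark = (
    SparkSession.builder.appName("high_throughput_pipeline")
    .config("spark.sql.shuffle.partitions", "400")  # match cluster parallelism
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )  # keep large aggregation state off the JVM heap
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.adaptive.enabled", "true")  # adaptive query execution
    .getOrCreate()
)
```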
Additional Resources