Apache Spark Project
For your Spark programming project, you will explore a large, real-world public dataset using PySpark and turn it into actionable insights presented as professional visualizations. The goal is to simulate a practical, industry-style data analytics workflow: you will acquire and preprocess the data at scale using Spark, define and compute three significant business or analytical insights, and present each insight through well-crafted static plots. The project asks you both to use distributed computing for big data processing and to communicate the findings clearly, preparing your analyses as visual reports for BI or decision-making audiences. Along the way you will practice data wrangling, scalable analysis, and storytelling with data, so that your results are accessible and persuasive to both technical and non-technical stakeholders.
Objective
Write a PySpark program that analyzes one large public dataset, generates at least three different business insights, and produces results as static plots using matplotlib or any other library capable of creating static (non-interactive) visualizations.
Steps
- Select Your Dataset
  Choose one of:
  - NYC Yellow Taxi Trip Data (TLC Trip Record Data)
  - MovieLens Movie Ratings (MovieLens 20M Dataset)
  - COVID-19 Open Data (Our World in Data/Covid-19)
- Define Three Insights
  Pick and clearly describe three insights. Each should be significant and suitable for visual representation (e.g., trends, comparisons, distributions).
- Data Processing and Analysis
  - Ingest and process the dataset using PySpark DataFrames and/or SQL.
  - Perform cleaning, type conversions, and any necessary preprocessing.
  - Carefully compute your chosen insights.
- Visualization Requirement
  - For each insight, generate at least one static plot using matplotlib, seaborn, or any comparable static visualization package.
  - Each plot must be meaningful. Examples include time series charts, bar plots, histograms, heatmaps, scatter plots, or maps.
  - You may convert Spark DataFrames to Pandas (using .toPandas()) before plotting if needed, but only after the Spark-based processing is complete.
  - Plots should be saved as PNG, JPG, or PDF images and displayed within the notebook/script.
- Output for BI Visualization
  - Clearly display and save each generated plot.
  - For each insight, include:
    - A title and caption/legend explaining what is being shown
    - A markdown cell with:
      - The objective of the insight
      - The business or analytical value
- Deliverables
- Well-commented PySpark notebook or script (.ipynb or .py) that produces and saves the required plots.
- Short README (markdown) that covers:
  - Data source and description
  - The three insights and why they were chosen
  - How to run the program (Spark and plotting requirements)
- Resulting plot files (if submitting outside notebook format), or embedded plots if using Jupyter.
- (Optional) Extension
  - You may go further by adding additional advanced Spark features (MLlib, GraphFrames, etc.), as desired.
Additional Guidelines
- Use Spark for heavy data processing; use Pandas only for small DataFrames and plotting.
- Keep visualizations clear and professional.
- Label axes, include legends where necessary, and provide descriptive titles for all plots.
- Save plot images in a designated output folder or display inline within the notebook.
This ensures students not only practice Spark data analysis but also learn to communicate results visually using static plots appropriate for BI reporting.
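The plotting guidelines above could look something like this in practice. It is a sketch under stated assumptions: the small pandas DataFrame stands in for the result of calling `.toPandas()` on an already aggregated Spark DataFrame, and the folder name `output`, the file name, and the values are illustrative only.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so a plain script can run headless
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for `aggregated_spark_df.toPandas()`; the numbers are made up.
pdf = pd.DataFrame({"pickup_hour": [8, 12, 17], "trips": [120, 95, 180]})

# Designated output folder for all plot files (hypothetical name).
out_dir = "output"
os.makedirs(out_dir, exist_ok=True)

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.bar(pdf["pickup_hour"].astype(str), pdf["trips"], label="Trips")

# Descriptive title, labeled axes, and a legend.
ax.set_title("Trips by pickup hour (illustrative data)")
ax.set_xlabel("Pickup hour")
ax.set_ylabel("Number of trips")
ax.legend()
fig.tight_layout()

# Save as a static image; PNG, JPG, and PDF are all accepted formats.
out_path = os.path.join(out_dir, "insight_1_trips_by_hour.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

In a Jupyter notebook, the `matplotlib.use("Agg")` line would normally be dropped and `plt.show()` called after `fig.savefig(...)` so the figure is also displayed inline.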