Course Outline

PySpark & Machine Learning

Module 1: Big Data & Spark Foundations

  • Overview of the Big Data ecosystem and the role of Spark in modern data platforms
  • Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, Directed Acyclic Graph (DAG), and execution planning
  • Differences between Resilient Distributed Datasets (RDD) and DataFrame APIs and when to use each approach
  • Creating and configuring SparkSession and understanding application configuration fundamentals

Module 2: PySpark DataFrames

  • Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
  • Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
  • Implementing advanced operations such as window functions, handling timestamps, and working with nested data
  • Applying data quality checks and writing reusable, maintainable PySpark code

Module 3: Processing Large Datasets Efficiently

  • Understanding performance fundamentals: partitioning strategies, shuffle behavior, caching, and persistence
  • Using optimization techniques including broadcast joins and execution plan analysis
  • Efficient processing of large datasets and best practices for scalable data workflows
  • Understanding schema evolution and modern storage formats used in enterprise environments

Module 4: Feature Engineering at Scale

  • Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables, and feature scaling
  • Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
  • Introduction to feature selection and handling imbalanced datasets

Module 5: Machine Learning with Spark MLlib

  • Understanding MLlib architecture and the Estimator/Transformer pattern
  • Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
  • Comparing models and interpreting results in distributed Machine Learning workflows

Module 6: End-to-End ML Pipelines

  • Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering, and modeling
  • Applying train/validation/test split strategies
  • Performing cross-validation and hyperparameter tuning using grid search and random search
  • Structuring reproducible Machine Learning experiments

Module 7: Model Evaluation & Practical ML Decision Making

  • Applying appropriate evaluation metrics for regression and classification problems
  • Identifying overfitting and underfitting and making practical model selection decisions
  • Interpreting feature importance and understanding model behavior

Module 8: Production & Enterprise Practices

  • Persisting and loading models in Spark
  • Implementing batch inference workflows on large datasets
  • Understanding the Machine Learning lifecycle in enterprise environments
  • Introduction to versioning, experiment tracking concepts, and basic testing strategies

 

Practical Outcome

  • Ability to work autonomously with PySpark
  • Ability to process large datasets efficiently
  • Ability to perform feature engineering at scale
  • Ability to build scalable Machine Learning pipelines

Requirements

This section is intentionally left blank.

Duration

21 Hours
