Course Outline
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- Overview of the Big Data ecosystem and the role of Spark in modern data platforms
- Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, Directed Acyclic Graph (DAG), and execution planning
- Differences between Resilient Distributed Datasets (RDD) and DataFrame APIs and when to use each approach
- Creating and configuring a SparkSession and understanding application configuration fundamentals
Module 2: PySpark DataFrames
- Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
- Implementing advanced operations such as window functions, handling timestamps, and working with nested data
- Applying data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Understanding performance fundamentals: partitioning strategies, shuffle behavior, caching, and persistence
- Using optimization techniques including broadcast joins and execution plan analysis
- Efficient processing of large datasets and best practices for scalable data workflows
- Understanding schema evolution and modern storage formats used in enterprise environments
Module 4: Feature Engineering at Scale
- Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables, and feature scaling
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introduction to feature selection and handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results in distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering, and modeling
- Applying train/validation/test split strategies
- Performing cross-validation and hyperparameter tuning using grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying appropriate evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making practical model selection decisions
- Interpreting feature importance and understanding model behavior
Module 8: Production & Enterprise Practices
- Persisting and loading models in Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle in enterprise environments
- Introduction to versioning, experiment tracking concepts, and basic testing strategies
Practical Outcome
- Ability to work autonomously with PySpark
- Ability to process large datasets efficiently
- Ability to perform feature engineering at scale
- Ability to build scalable Machine Learning pipelines
Requirements
This section is intentionally left blank.
21 Hours