Course Outline

Introduction, Objectives, and Migration Strategy

  • Course goals, participant profiles and expectations, and success criteria
  • High-level migration approaches and associated risk considerations
  • Setting up workspaces, repositories, and laboratory datasets

Day 1 — Migration Fundamentals and Architecture

  • Lakehouse concepts, Delta Lake overview, and Databricks platform architecture
  • Differences between symmetric (SMP) and massively parallel (MPP) processing architectures and their implications for migration
  • Design of the Medallion data pipeline (Bronze→Silver→Gold) and an overview of Unity Catalog

Day 1 Lab — Translating a Stored Procedure

  • Hands-on migration of a sample stored procedure to a Databricks notebook
  • Mapping temporary tables and cursors to DataFrame transformations (see the sketch after this list)
  • Validation and comparison with the original output
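
The core idea of this lab can be previewed with a minimal sketch: a row-by-row cursor that accumulates per-customer totals becomes a single declarative aggregation. The table and column names (orders, customer_id, amount) are hypothetical placeholders, not part of the course materials.

    # Replaces a T-SQL WHILE/FETCH cursor with one DataFrame aggregation.
    # Table and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    orders = spark.table("orders")          # the cursor's former row source

    totals = (orders
        .groupBy("customer_id")             # one pass, executed in parallel
        .agg(F.sum("amount").alias("total_amount"),
             F.count("*").alias("order_count")))

    totals.write.mode("overwrite").saveAsTable("customer_totals")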

Day 2 — Advanced Delta Lake & Incremental Loading

  • ACID transactions, the transaction log, versioning, and time travel in Delta Lake
  • Auto Loader, MERGE INTO patterns, upserts, and schema evolution techniques (MERGE sketched below)
  • Optimization strategies including OPTIMIZE, VACUUM, Z-ORDER, partitioning, and storage tuning
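
For reference, a minimal PySpark sketch of the Delta Lake upsert pattern named above; the table name (silver.customers), view name, and source path are assumptions for illustration only.

    # Minimal Delta Lake upsert via MERGE INTO (names and paths hypothetical)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    updates = spark.read.format("delta").load("/mnt/raw/customer_updates")
    updates.createOrReplaceTempView("updates_view")

    spark.sql("""
        MERGE INTO silver.customers AS t
        USING updates_view AS s
        ON t.customer_id = s.customer_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)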

Day 2 Lab — Incremental Ingestion & Optimization

  • Implementing Auto Loader ingestion and MERGE workflows (ingestion and maintenance sketched below)
  • Applying OPTIMIZE, Z-ORDER, and VACUUM; validating results
  • Measuring improvements in read/write performance
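
A minimal sketch of the lab flow, assuming hypothetical paths and a bronze.events table: Auto Loader picks up only new files between runs, then the table is compacted and old files are cleaned up.

    # Auto Loader incremental ingestion, then table maintenance.
    # All paths and table names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    query = (spark.readStream
        .format("cloudFiles")                        # Auto Loader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/chk/events_schema")
        .load("/mnt/landing/events")
        .writeStream
        .option("checkpointLocation", "/mnt/chk/events")
        .trigger(availableNow=True)                  # drain backlog, then stop
        .toTable("bronze.events"))
    query.awaitTermination()

    spark.sql("OPTIMIZE bronze.events ZORDER BY (event_date)")  # compact, co-locate
    spark.sql("VACUUM bronze.events RETAIN 168 HOURS")          # 7-day retention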

Day 3 — SQL in Databricks, Performance & Debugging

  • Analytical SQL features: window functions, higher-order functions, and JSON/array handling
  • Reading the Spark UI, understanding DAGs, shuffles, stages, tasks, and diagnosing bottlenecks
  • Query tuning patterns including broadcast joins, hints, caching, and reducing spills (see the sketch below)
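
For orientation, the sketch below combines a ranking window function with an explicit broadcast hint; the sales and dim_region tables and their columns are hypothetical.

    # Window function plus broadcast join hint (table names hypothetical)
    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    sales = spark.table("sales")                    # large fact table
    dim_region = spark.table("dim_region")          # small dimension

    # Rank each sale within its region by amount
    w = Window.partitionBy("region_id").orderBy(F.desc("amount"))
    ranked = sales.withColumn("rank_in_region", F.row_number().over(w))

    # Broadcasting the small side avoids shuffling the large fact table
    enriched = ranked.join(F.broadcast(dim_region), "region_id")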

Day 3 Lab — SQL Refactoring & Performance Tuning

  • Refactoring a heavy SQL process into optimized Spark SQL
  • Using Spark UI evidence to identify and fix skew and shuffle issues (one remedy sketched below)
  • Benchmarking before and after, documenting tuning steps
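
One skew remedy covered here can be previewed as pure configuration: Spark 3.x Adaptive Query Execution can split skewed shuffle partitions automatically. The threshold values below are illustrative and should be tuned against Spark UI evidence.

    # Enable AQE skew-join handling (Spark 3.x); thresholds are illustrative
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")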

Day 4 — Tactical PySpark: Replacing Procedural Logic

  • Spark execution model: driver, executors, lazy evaluation, and partitioning strategies
  • Transforming loops and cursors into vectorized DataFrame operations
  • Modularization techniques, UDFs/pandas UDFs, widgets, and reusable libraries (pandas UDF sketched below)
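
A minimal sketch of the loop-to-vectorized idea, assuming a hypothetical claims table and an invented scoring formula: a pandas UDF processes whole Arrow batches instead of invoking Python once per row.

    # Vectorized pandas UDF replacing a per-row loop (names/formula invented)
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf(DoubleType())
    def risk_score(amount: pd.Series, age_days: pd.Series) -> pd.Series:
        # Operates on whole batches, not one row at a time
        return amount * 0.01 + age_days.clip(upper=365) * 0.5

    scored = (spark.table("claims")
        .withColumn("score", risk_score(F.col("amount"), F.col("age_days"))))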

Day 4 Lab — Refactoring Procedural Scripts

  • Refactoring a procedural ETL script into modular PySpark notebooks
  • Introducing parametrization, unit-style tests, and reusable functions (sketched below)
  • Conducting code reviews and applying best-practice checklists
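
The parametrization idea can be sketched as follows, assuming a Databricks notebook where spark and dbutils are predefined; the widget name, table, and filter are placeholders.

    # Widget-driven parameter plus a pure, unit-testable function.
    # Assumes a Databricks notebook (spark/dbutils predefined); names invented.
    from pyspark.sql import DataFrame, functions as F

    def filter_active(df: DataFrame) -> DataFrame:
        # Pure transformation: easy to test with a small in-memory frame
        return df.where(F.col("status") == "active")

    dbutils.widgets.text("source_table", "bronze.customers")
    source = dbutils.widgets.get("source_table")
    result = filter_active(spark.table(source))

    # Unit-style check against a tiny in-memory DataFrame
    sample = spark.createDataFrame([("a", "active"), ("b", "closed")], ["id", "status"])
    assert filter_active(sample).count() == 1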

Day 5 — Orchestration, End-to-End Pipeline & Best Practices

  • Databricks Workflows: job design, task dependencies, triggers, and error handling
  • Designing incremental Medallion pipelines with quality rules and schema validation (quality gate sketched below)
  • Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic
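
A minimal quality-gate sketch, assuming a hypothetical silver.orders table and rule: raising an exception inside a task makes the Workflows run fail, which is what triggers its retry and alerting policies.

    # Simple quality gate between Medallion layers (table and rule hypothetical)
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    null_keys = spark.table("silver.orders").where(F.col("order_id").isNull()).count()
    if null_keys > 0:
        # Failing the task lets Workflows retry/alert per its error-handling setup
        raise ValueError(f"Quality gate failed: {null_keys} null order_id rows")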

Day 5 Lab — Build a Complete End-to-End Pipeline

  • Assembling a Bronze→Silver→Gold pipeline orchestrated with Workflows (skeleton sketched below)
  • Implementing logging, auditing, retries, and automated validations
  • Running the full pipeline, validating outputs, and preparing deployment notes
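
The lab's overall shape can be previewed as three batch steps, each a candidate Workflows task; every path, table, and column below is a hypothetical placeholder.

    # Bronze -> Silver -> Gold skeleton (all names hypothetical)
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: land raw files as-is, stamped with ingestion metadata
    (spark.read.json("/mnt/landing/orders")
        .withColumn("_ingested_at", F.current_timestamp())
        .write.mode("append").saveAsTable("bronze.orders"))

    # Silver: deduplicate and enforce types
    (spark.table("bronze.orders")
        .dropDuplicates(["order_id"])
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
        .write.mode("overwrite").saveAsTable("silver.orders"))

    # Gold: business-level aggregate for consumption
    (spark.table("silver.orders")
        .groupBy("region")
        .agg(F.sum("amount").alias("total_amount"))
        .write.mode("overwrite").saveAsTable("gold.orders_by_region"))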

Operationalization, Governance, and Production Readiness

  • Best practices for Unity Catalog governance, lineage, and access controls
  • Cost management, cluster sizing, autoscaling, and job concurrency patterns
  • Deployment checklists, rollback strategies, and runbook creation

Final Review, Knowledge Transfer, and Next Steps

  • Participant presentations of migration work and lessons learned
  • Gap analysis, recommended follow-up activities, and handoff of training materials
  • References, further learning paths, and support options

Requirements

  • A solid understanding of data engineering principles
  • Practical experience with SQL and stored procedures (Synapse / SQL Server)
  • Knowledge of ETL orchestration concepts (Azure Data Factory or similar)

Audience

  • Technology managers with a background in data engineering
  • Data engineers transitioning from procedural OLAP logic to Lakehouse patterns
  • Platform engineers responsible for the adoption of Databricks, particularly for government projects

Duration

  • 35 hours
