Course Outline

Introduction, Objectives, and Migration Strategy

  • Course goals, alignment with participant profiles, and success criteria
  • High-level migration strategies and risk considerations for government
  • Setting up workspaces, repositories, and lab datasets

Day 1 — Migration Fundamentals and Architecture

  • Lakehouse concepts, Delta Lake overview, and Databricks architecture for public sector platforms
  • SMP vs MPP architectures and their implications for migration
  • Medallion (Bronze→Silver→Gold) design and Unity Catalog overview (see the sketch after this list)
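
A minimal Medallion sketch in PySpark to anchor the Day 1 concepts, assuming a Databricks notebook where spark is predefined; the table names, paths, and columns (bronze.permits, issued_date, and so on) are hypothetical placeholders, not part of the lab materials.

    from pyspark.sql import functions as F

    # Bronze: land the raw data as-is in Delta
    bronze = spark.read.format("json").load("/mnt/raw/permits/")
    bronze.write.format("delta").mode("append").saveAsTable("bronze.permits")

    # Silver: deduplicate, type, and filter into a conformed table
    silver = (
        spark.table("bronze.permits")
        .dropDuplicates(["permit_id"])
        .withColumn("issued_date", F.to_date("issued_date"))
        .filter(F.col("permit_id").isNotNull())
    )
    silver.write.format("delta").mode("overwrite").saveAsTable("silver.permits")

    # Gold: aggregate for reporting consumers
    gold = silver.groupBy("agency", "issued_date").agg(
        F.count("*").alias("permits_issued")
    )
    gold.write.format("delta").mode("overwrite").saveAsTable("gold.permit_counts")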

Day 1 Lab — Translating a Stored Procedure

  • Hands-on migration of a sample stored procedure to a Databricks notebook
  • Mapping temp tables and cursors to set-based DataFrame transformations (see the sketch after this list)
  • Validating results against the original procedure's output to confirm data integrity
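
A minimal sketch of the cursor-to-DataFrame pattern practiced in this lab; the table and column names (staging.claims, amount, status) are hypothetical placeholders, and spark is the session predefined in Databricks notebooks.

    from pyspark.sql import functions as F

    # The T-SQL procedure walks a cursor over staging.claims and updates
    # each row's status based on its amount, one row at a time. In PySpark
    # the same per-row logic becomes one set-based transformation that
    # Spark parallelizes across the cluster.
    claims = spark.table("staging.claims")

    updated = claims.withColumn(
        "status",
        F.when(F.col("amount") > 10000, "needs_review").otherwise("approved"),
    )

    # A temp table in the procedure maps to a temp view (or just a DataFrame)
    updated.createOrReplaceTempView("claims_updated")

    # Validate against the original output, e.g. by comparing row counts
    assert updated.count() == claims.count()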

Day 2 — Advanced Delta Lake & Incremental Loading

  • ACID transactions, the Delta transaction log, versioning, and time travel
  • Auto Loader, MERGE INTO patterns, upserts, and schema evolution (see the sketch after this list)
  • OPTIMIZE, VACUUM, Z-ORDER, partitioning, and storage tuning
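
A minimal Auto Loader plus MERGE INTO sketch, assuming a Databricks notebook and an existing Delta target table; the paths and names (silver.citizens, citizen_id, the checkpoint locations) are hypothetical placeholders.

    from delta.tables import DeltaTable

    # Incrementally pick up new files as they land
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/citizens_schema")
        .load("/mnt/landing/citizens/")
    )

    def upsert_batch(batch_df, batch_id):
        # Upsert each micro-batch into the target Delta table
        target = DeltaTable.forName(spark, "silver.citizens")
        (
            target.alias("t")
            .merge(batch_df.alias("s"), "t.citizen_id = s.citizen_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    (
        stream.writeStream.foreachBatch(upsert_batch)
        .option("checkpointLocation", "/mnt/checkpoints/citizens")
        .trigger(availableNow=True)
        .start()
    )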

Day 2 Lab — Incremental Ingestion & Optimization

  • Implementing Auto Loader ingestion and MERGE workflows against the lab dataset
  • Applying OPTIMIZE, Z-ORDER, and VACUUM, then validating the results (see the sketch after this list)
  • Measuring read/write performance before and after optimization
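
A minimal maintenance-and-measurement sketch for this lab; the table name is a hypothetical placeholder, and VACUUM retention should follow your organization's data-retention policy.

    import time

    # Compact small files and co-locate rows by a frequently filtered column
    spark.sql("OPTIMIZE silver.citizens ZORDER BY (citizen_id)")

    # Remove files no longer referenced by the table (default retention: 7 days)
    spark.sql("VACUUM silver.citizens")

    # Crude read benchmark: run once before OPTIMIZE and once after
    start = time.time()
    spark.table("silver.citizens").filter("citizen_id = 12345").count()
    print(f"Filtered read took {time.time() - start:.2f}s")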

Day 3 — SQL in Databricks, Performance & Debugging

  • Analytical SQL features: window functions, higher-order functions, and JSON/array handling
  • Reading the Spark UI: DAGs, shuffles, stages, tasks, and bottleneck diagnosis
  • Query tuning patterns: broadcast joins, hints, caching, and spill reduction (see the sketch after this list)
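
A minimal sketch combining a window function with a broadcast join hint, two of the patterns above; the tables and columns (gold.payments, gold.agencies, agency_id, amount) are hypothetical placeholders.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    payments = spark.table("gold.payments")
    agencies = spark.table("gold.agencies")  # small dimension table

    # Broadcasting the small side avoids a shuffle join
    joined = payments.join(F.broadcast(agencies), "agency_id")

    # Rank each agency's payments by amount with a window function
    w = Window.partitionBy("agency_id").orderBy(F.col("amount").desc())
    top_payments = joined.withColumn("rank", F.row_number().over(w)).filter("rank <= 3")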

Day 3 Lab — SQL Refactoring & Performance Tuning

  • Refactor a heavy SQL process into optimized Spark SQL
  • Use Spark UI traces to identify and fix skew and shuffle issues (see the salting sketch after this list)
  • Benchmark before and after, and document the tuning steps for auditability
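
A minimal key-salting sketch for the skew issues diagnosed in this lab; the tables, the join key event_type, and the salt factor are hypothetical placeholders.

    from pyspark.sql import functions as F

    SALT_BUCKETS = 16

    facts = spark.table("bronze.events")       # large, skewed on event_type
    dims = spark.table("bronze.event_types")   # small lookup table

    # Spread the hot join key across N salted sub-keys
    salted_facts = facts.withColumn(
        "salted_key",
        F.concat_ws("_", F.col("event_type"), (F.rand() * SALT_BUCKETS).cast("int")),
    )

    # Replicate the lookup table once per bucket so every sub-key finds a match
    salted_dims = (
        dims.crossJoin(spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt"))
        .withColumn("salted_key", F.concat_ws("_", F.col("event_type"), F.col("salt")))
        .drop("event_type", "salt")
    )

    result = salted_facts.join(salted_dims, "salted_key")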

Day 4 — Tactical PySpark: Replacing Procedural Logic

  • Spark execution model: driver, executors, lazy evaluation, and partitioning strategies
  • Transforming loops and cursors into vectorized DataFrame operations
  • Modularization, UDFs/pandas UDFs, widgets, and reusable libraries (see the pandas UDF sketch after this list)
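
A minimal pandas UDF sketch for the modularization topics above; the normalization logic and column names are hypothetical placeholders for the course's reusable-library examples.

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    @F.pandas_udf(StringType())
    def normalize_agency(names: pd.Series) -> pd.Series:
        # Vectorized: processes whole batches instead of one row at a time
        return names.str.strip().str.upper()

    df = spark.table("silver.permits").withColumn(
        "agency_norm", normalize_agency(F.col("agency"))
    )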

Day 4 Lab — Refactoring Procedural Scripts

  • Refactor a procedural ETL script into modular PySpark notebooks
  • Introduce parametrization, unit-style tests, and reusable functions (see the sketch after this list)
  • Apply a code review and best-practice checklist aligned with agency standards
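
A minimal parametrization-and-test sketch, assuming a Databricks notebook where dbutils is available; the widget name, function, and columns are hypothetical placeholders.

    from pyspark.sql import DataFrame, functions as F

    dbutils.widgets.text("load_date", "2024-01-01")
    load_date = dbutils.widgets.get("load_date")

    def filter_by_date(df: DataFrame, date: str) -> DataFrame:
        """Reusable, testable transformation extracted from the script."""
        return df.filter(F.col("issued_date") == F.lit(date).cast("date"))

    # Unit-style test against a small in-memory DataFrame
    sample = spark.createDataFrame(
        [("2024-01-01",), ("2024-02-01",)], ["issued_date"]
    ).withColumn("issued_date", F.col("issued_date").cast("date"))
    assert filter_by_date(sample, "2024-01-01").count() == 1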

Day 5 — Orchestration, End-to-End Pipeline & Best Practices

  • Databricks Workflows: job design, task dependencies, triggers, and error handling
  • Designing incremental Medallion pipelines with quality rules and schema validation (see the sketch after this list)
  • Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic
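
A minimal quality-rule and schema-validation sketch; the rules and table names (silver.permits, quarantine.permits) are hypothetical placeholders.

    from pyspark.sql import functions as F

    df = spark.table("silver.permits")

    # Schema check: fail fast if expected columns are missing
    expected = {"permit_id", "agency", "issued_date"}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"Schema validation failed; missing columns: {missing}")

    # Quality rule: quarantine violating rows instead of dropping them silently
    rule = F.col("permit_id").isNotNull() & F.col("issued_date").isNotNull()
    valid, invalid = df.filter(rule), df.filter(~rule)
    invalid.write.format("delta").mode("append").saveAsTable("quarantine.permits")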

Day 5 Lab — Build a Complete End-to-End Pipeline

  • Assemble a Bronze→Silver→Gold pipeline orchestrated with Databricks Workflows
  • Implement logging, auditing, retries, and automated validations (see the sketch after this list)
  • Run the full pipeline, validate the outputs, and prepare deployment notes
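
A minimal logging-and-retry sketch for the pipeline tasks in this lab; the audit table audit.pipeline_runs (assumed to hold task name, timestamp, and status) and the retry budget are hypothetical placeholders.

    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline")

    def run_with_retries(task, name, retries=3, backoff=30):
        for attempt in range(1, retries + 1):
            try:
                log.info("Starting %s (attempt %d)", name, attempt)
                task()
                # Append an audit record on success
                spark.sql(
                    f"INSERT INTO audit.pipeline_runs "
                    f"VALUES ('{name}', current_timestamp(), 'success')"
                )
                return
            except Exception:
                log.exception("%s failed on attempt %d", name, attempt)
                if attempt == retries:
                    raise
                time.sleep(backoff)

    # Example: wrap one pipeline stage
    run_with_retries(lambda: spark.sql("REFRESH TABLE silver.permits"), "refresh_silver")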

Operationalization, Governance, and Production Readiness

  • Unity Catalog governance, lineage, and access-control best practices (see the grants sketch after this list)
  • Cost management, cluster sizing, autoscaling, and job concurrency patterns
  • Deployment checklists, rollback strategies, and runbook creation for the move to production
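
A minimal Unity Catalog access-control sketch; the catalog, schema, and group names are hypothetical placeholders for an agency's setup.

    # Grant read access on the gold schema to an analyst group
    spark.sql("GRANT USE CATALOG ON CATALOG agency_prod TO `data_analysts`")
    spark.sql("GRANT USE SCHEMA ON SCHEMA agency_prod.gold TO `data_analysts`")
    spark.sql("GRANT SELECT ON SCHEMA agency_prod.gold TO `data_analysts`")

    # Review current grants during an access audit
    spark.sql("SHOW GRANTS ON SCHEMA agency_prod.gold").show()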

Final Review, Knowledge Transfer, and Next Steps

  • Participant presentations of migration work and lessons learned
  • Gap analysis, recommended follow-up activities, and handoff of training materials
  • References, further learning paths, and support options for ongoing improvement

Requirements

  • A comprehensive understanding of data engineering principles
  • Practical experience with SQL and stored procedures (Synapse / SQL Server)
  • Knowledge of ETL orchestration methodologies (Azure Data Factory or similar)

Audience

  • Technology managers with a background in data engineering
  • Data engineers transitioning from procedural OLAP logic to Lakehouse patterns
  • Platform engineers overseeing the adoption of Databricks within their organizations

Duration

  • 35 hours
