Course Outline
Introduction:
- Overview of Apache Spark within the Hadoop Ecosystem for government applications
- Brief introduction to Python and Scala for data processing in government contexts
Basics (Theory):
- Architectural Overview of Apache Spark
- Resilient Distributed Datasets (RDDs) and their role in scalable data processing
- Transformations and Actions: Core operations for data manipulation (see the sketch after this list)
- Stages, Tasks, and Dependencies in the Spark execution model
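To make the distinction between lazy transformations and eager actions concrete, here is a minimal PySpark sketch, assuming a local Spark installation; the numbers are purely illustrative:

```python
from pyspark.sql import SparkSession

# Local session for experimentation; in Databricks a session already exists.
spark = SparkSession.builder.appName("rdd-basics").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: they only record lineage, nothing executes yet.
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions trigger execution of the whole lineage.
print(squares.collect())  # [4, 16, 36, 64, 100]
print(squares.count())    # 5

spark.stop()
```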
Using Databricks Environment to Understand the Basics (Hands-On Workshop):
- Exercises with the RDD API: Applying transformations and actions to government datasets
- Basic action and transformation functions: Practical examples
- PairRDD: Working with key-value pairs
- Join Operations: Combining datasets efficiently (see the PairRDD sketch after this list)
- Caching Strategies: Optimizing performance for repeated computations
- Exercises with the DataFrame API: Advanced data manipulation techniques
- SparkSQL: Querying large datasets with SQL syntax
- DataFrame Operations: Select, filter, group, and sort data (see the DataFrame sketch after this list)
- User-Defined Functions (UDFs): Customizing data processing for specific needs
- Introduction to the Dataset API: Enhanced type safety (Scala/Java)
- Streaming Data Processing: Real-time analytics with Structured Streaming (see the streaming sketch after this list)
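A minimal sketch of PairRDD operations, a join, and caching in the same local setup; the department records are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pairrdd-joins").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Hypothetical (key, value) records: permits issued per department.
permits = sc.parallelize([("dept-a", 3), ("dept-b", 5), ("dept-a", 2)])
names = sc.parallelize([("dept-a", "Transport"), ("dept-b", "Housing")])

# reduceByKey aggregates values that share a key.
totals = permits.reduceByKey(lambda a, b: a + b)

# join matches records on their key, yielding (key, (left, right)).
joined = totals.join(names)

# cache keeps the joined RDD in memory so repeated actions skip recomputation.
joined.cache()
print(sorted(joined.collect()))  # [('dept-a', (5, 'Transport')), ('dept-b', (5, 'Housing'))]
print(joined.count())            # 2

spark.stop()
```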
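A corresponding DataFrame sketch covering filter/group/sort, a SparkSQL query over a temporary view, and a simple UDF; the column names and rows are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dataframe-basics").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("alice", "transport", 120), ("bob", "housing", 80), ("carol", "transport", 95)],
    ["name", "department", "cases"],
)

# Filter, group, aggregate, and sort with the DataFrame API.
(df.filter(F.col("cases") > 90)
   .groupBy("department")
   .agg(F.sum("cases").alias("total_cases"))
   .orderBy(F.desc("total_cases"))
   .show())

# The same data queried with SQL after registering a temporary view.
df.createOrReplaceTempView("workload")
spark.sql("SELECT department, COUNT(*) AS staff FROM workload GROUP BY department").show()

# A simple UDF; prefer built-in functions where possible, since UDFs
# are opaque to Spark's optimizer.
initial = F.udf(lambda s: s[0].upper(), StringType())
df.withColumn("initial", initial("name")).show()

spark.stop()
```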
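A minimal Structured Streaming sketch using Spark's built-in rate source, which needs no external infrastructure; the window, trigger, and runtime settings are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-demo").master("local[*]").getOrCreate()

# The rate source emits (timestamp, value) rows at a fixed pace.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events per 10-second event-time window.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (counts.writeStream
               .outputMode("complete")   # re-emit the full aggregate each trigger
               .format("console")
               .trigger(processingTime="10 seconds")
               .start())

query.awaitTermination(30)  # let it run for ~30 seconds
query.stop()
spark.stop()
```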
Using AWS Environment to Understand Deployment (Hands-On Workshop):
- Basics of AWS Glue: Serverless ETL for data integration
- Comparing AWS EMR and AWS Glue: Evaluating options for running Spark workloads
- Example Jobs in Both Environments: Practical implementations (see the Glue job sketch after this list)
- Understanding Pros and Cons: Making informed infrastructure decisions
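A skeleton of what a Glue job from the workshop might look like; the catalog database, table name, and S3 path are placeholders, and the script only runs inside the Glue runtime, which provides the awsglue library:

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue boilerplate: resolve job arguments and initialize the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (hypothetical names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="example_table",
)

# Drop to a Spark DataFrame for ordinary transformations, then convert back.
df = source.toDF().dropDuplicates()
cleaned = DynamicFrame.fromDF(df, glue_context, "cleaned")

# Write the result to S3 as Parquet (placeholder bucket).
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet",
)

job.commit()
```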
Extra:
- Introduction to Apache Airflow Orchestration: Workflow management for complex data pipelines (see the DAG sketch below)
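A minimal Airflow DAG sketch showing how such a pipeline could be orchestrated; the DAG id, schedule, and spark-submit command are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# "schedule" is the Airflow 2.4+ parameter; older versions use schedule_interval.
with DAG(
    dag_id="nightly_spark_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull source data'",  # placeholder for a real extract step
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit /opt/jobs/transform.py",  # placeholder path
    )
    extract >> transform  # extract must finish before transform starts
```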
Requirements
Programming skills, preferably in Python or Scala, are essential for the hands-on data processing exercises. Proficiency in basic SQL is also needed for managing and querying databases in government applications.
21 Hours
Testimonials (3)
Having hands-on sessions / assignments
Poornima Chenthamarakshan - Intelligent Medical Objects
Course - Apache Spark in the Cloud
1. The right balance between high-level concepts and technical details. 2. Andras is very knowledgeable about what he teaches. 3. The exercises.
Steven Wu - Intelligent Medical Objects
Course - Apache Spark in the Cloud
Got to learn Spark Streaming, Databricks and AWS Redshift