Course Outline

Section 1: Introduction to Hadoop for Government

  • History and foundational concepts of Hadoop
  • Hadoop ecosystem overview
  • Different distributions of Hadoop available for government use
  • High-level architecture of Hadoop
  • Common myths about Hadoop in the public sector
  • Challenges faced by government agencies when implementing Hadoop
  • Hardware and software requirements for Hadoop deployment
  • Laboratory exercise: Initial exploration of Hadoop for government applications

Section 2: HDFS for Government

  • Design and architecture principles of HDFS
  • Key concepts including horizontal scaling, replication, data locality, and rack awareness
  • Daemons in HDFS: NameNode, Secondary NameNode, DataNode
  • Communication mechanisms and heartbeat processes
  • Data integrity measures in HDFS for government data protection
  • Read and write paths in HDFS
  • NameNode High Availability (HA) and Federation configurations for enhanced reliability
  • Laboratory exercise: Interacting with HDFS for government data management
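
As a rough illustration of how block storage and replication interact, the sketch below estimates how many blocks a file occupies and how much raw cluster storage it consumes. The 128 MiB block size and replication factor of 3 are the stock defaults in recent Hadoop releases and are assumed here; an actual cluster may be configured differently. The class and method names are hypothetical and no Hadoop classes are used.

```java
// Plain-Java sketch of HDFS block and replication arithmetic (no Hadoop dependency).
final class HdfsBlockMathSketch {

    // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw storage consumed across the cluster: every byte is stored once per replica.
    // A partially filled final block only occupies its actual size.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long blockBytes = 128L * 1024 * 1024; // assumed default block size (128 MiB)
        int replication = 3;                  // assumed default replication factor
        long fileBytes = 1024L * 1024 * 1024; // a 1 GiB example file

        System.out.println("blocks = " + blockCount(fileBytes, blockBytes));
        System.out.println("raw bytes = " + rawBytes(fileBytes, replication));
    }
}
```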

Section 3: MapReduce for Government

  • Concepts and architecture of MapReduce
  • Daemons in MapReduce Version 1 (MRv1): JobTracker and TaskTracker
  • Phases of MapReduce: Driver, Mapper, Shuffle/Sort, Reducer
  • Comparison between MapReduce Version 1 and Version 2 (YARN)
  • Internal mechanisms of the MapReduce process
  • Introduction to writing Java MapReduce programs for government applications
  • Laboratory exercise: Running a sample MapReduce program for government data analysis
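
The phases listed above (Driver, Mapper, Shuffle/Sort, Reducer) can be previewed with a small word-count simulation in plain Java. This is only a single-process sketch of the data flow, not the Hadoop API: the class and method names are hypothetical, and a real job would run mappers and reducers as distributed tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of the MapReduce data flow; no Hadoop classes are used.
final class MapReducePhasesSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle/sort phase: group all values by key, keeping keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for a single key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: wires the phases together and returns the final counts.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {issued=1, permit=2, renewed=1}
        System.out.println(run(List.of("permit issued", "Permit renewed")));
    }
}
```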

Section 4: Pig for Government

  • Comparison between Pig and Java MapReduce for government use cases
  • Pig job flow and execution process
  • Pig Latin language syntax and features
  • Extract, Transform, Load (ETL) processes using Pig
  • Data transformations and joins in Pig
  • User-defined functions (UDFs) for custom data processing
  • Laboratory exercise: Writing Pig scripts to analyze government data

Section 5: Hive for Government

  • Architecture and design principles of Hive
  • Data types supported by Hive for government datasets
  • SQL support in Hive for querying large datasets
  • Creating and managing Hive tables for government data storage
  • Data partitioning strategies for optimized query performance
  • Joins and complex queries in Hive for government data analysis
  • Text processing capabilities in Hive for unstructured data
  • Laboratory exercises: Processing government data with Hive

Section 6: HBase for Government

  • Concepts and architecture of HBase for government applications
  • Comparison between HBase, relational database management systems (RDBMS), and Cassandra
  • HBase Java API for developing custom applications for government use
  • Handling time series data in HBase for government operations
  • Schema design considerations for efficient data storage and retrieval in HBase
  • Laboratory exercise: Interacting with HBase using the shell, programming in the HBase Java API, and schema design exercises for government datasets
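
One common pattern for time series data in HBase is a row key made of a source identifier followed by a reversed timestamp, so that the newest reading for each source sorts first. The sketch below builds such a key in plain Java and does not use the HBase client API. The class name, the `rowKey` helper, and the fixed-length source identifiers are assumptions made for illustration; the course's own schema exercises may take a different approach.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of a "source id + reversed timestamp" row key (no HBase dependency).
final class TimeSeriesRowKeySketch {

    // Assumes sourceId values share one fixed length and timestamps are non-negative.
    static byte[] rowKey(String sourceId, long timestampMillis) {
        byte[] id = sourceId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                // Reversing the timestamp makes newer readings sort before older ones.
                .putLong(Long.MAX_VALUE - timestampMillis)
                .array();
    }

    public static void main(String[] args) {
        byte[] older = rowKey("station01", 1_000L);
        byte[] newer = rowKey("station01", 2_000L);

        // HBase orders rows by unsigned lexicographic byte comparison;
        // a negative result means the newer reading comes first.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0);
    }
}
```

With keys of this shape, a scan that starts at a source's prefix returns its most recent readings first.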

Requirements

  • Proficiency with the Java programming language (most programming exercises are in Java)
  • Comfort working in a Linux environment (ability to navigate the Linux command line and edit files using vi or nano)

Lab Environment

Zero Installation: There is no requirement for students to install Hadoop software on their personal devices. A fully operational Hadoop cluster will be provided for government use.

Students will need the following:

  • An SSH client (Linux and Mac systems come with built-in SSH clients; for Windows, PuTTY is recommended)
  • A web browser to access the cluster (Firefox is recommended)

Duration: 28 hours
