Course Outline
Section 1: Introduction to Hadoop for Government
- History and foundational concepts of Hadoop
- Hadoop ecosystem overview
- Different distributions of Hadoop available for government use
- High-level architecture of Hadoop
- Common myths about Hadoop in the public sector
- Challenges faced by government agencies when implementing Hadoop
- Hardware and software requirements for Hadoop deployment
- Laboratory exercise: Initial exploration of Hadoop for government applications
Section 2: HDFS for Government
- Design and architecture principles of HDFS
- Key concepts including horizontal scaling, replication, data locality, and rack awareness
- Daemons in HDFS: NameNode, Secondary NameNode, DataNode
- Communication mechanisms and heartbeat processes
- Data integrity measures in HDFS for government data protection
- Read and write paths in HDFS
- NameNode High Availability (HA) and Federation configurations for enhanced reliability
- Laboratory exercise: Interacting with HDFS for government data management
Section 3: MapReduce for Government
- Concepts and architecture of MapReduce
- Daemons in MapReduce Version 1 (MRv1): JobTracker and TaskTracker
- Phases of MapReduce: Driver, Mapper, Shuffle/Sort, Reducer
- Comparison between MapReduce Version 1 and Version 2 (YARN)
- Internal mechanisms of the MapReduce process
- Introduction to writing Java MapReduce programs for government applications
- Laboratory exercise: Running a sample MapReduce program for government data analysis
Section 4: Pig for Government
- Comparison between Pig and Java MapReduce for government use cases
- Pig job flow and execution process
- Pig Latin language syntax and features
- Extract, Transform, Load (ETL) processes using Pig
- Data transformations and joins in Pig
- User-defined functions (UDFs) for custom data processing
- Laboratory exercise: Writing Pig scripts to analyze government data
Section 5: Hive for Government
- Architecture and design principles of Hive
- Data types supported by Hive for government datasets
- SQL support in Hive for querying large datasets
- Creating and managing Hive tables for government data storage
- Data partitioning strategies for optimized query performance
- Joins and complex queries in Hive for government data analysis
- Text processing capabilities in Hive for unstructured data
- Laboratory exercise: Processing government data with Hive
Section 6: HBase for Government
- Concepts and architecture of HBase for government applications
- Comparison between HBase, relational database management systems (RDBMS), and Cassandra
- HBase Java API for developing custom applications for government use
- Handling time series data in HBase for government operations
- Schema design considerations for efficient data storage and retrieval in HBase
- Laboratory exercise: Interacting with HBase using the shell, programming in the HBase Java API, and schema design exercises for government datasets
Requirements
- Proficiency with the Java programming language (most programming exercises are in Java)
- Familiarity with a Linux environment (ability to navigate the Linux command line and edit files using vi or nano)
Lab Environment
Zero Installation: Students do not need to install Hadoop software on their personal devices. A fully operational Hadoop cluster will be provided.
Students will need the following:
- An SSH client (Linux and Mac systems come with built-in SSH clients; for Windows, PuTTY is recommended)
- A web browser to access the cluster (Firefox is recommended)