Course Outline

  • Introduction
    • Overview of Hadoop's history and foundational concepts for government
    • Ecosystem components and their roles for government
    • Available distributions and their features for government use
    • High-level architecture and its implications for government operations
    • Common Hadoop myths, debunked for government audiences
    • Challenges in hardware and software integration for government deployments
    • Labs: Discussion of Big Data projects and problem-solving for government
  • Planning and Installation
    • Selection criteria for software and Hadoop distributions for government needs
    • Cluster sizing considerations and planning for future growth for government operations
    • Hardware and network selection guidelines for government deployments
    • Rack topology configuration for optimal performance in government settings
    • Installation procedures and best practices for government environments
    • Multi-tenancy configurations to support diverse government operations
    • Directory structure management and log handling for government compliance
    • Benchmarking techniques to ensure performance standards for government use
    • Labs: Cluster installation and performance benchmarking for government
  • HDFS Operations
    • Key concepts including horizontal scaling, replication, data locality, and rack awareness for government data management
    • Roles of nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode) in government HDFS environments
    • Health monitoring strategies to ensure reliable government data storage
    • Command-line and browser-based administration tools for efficient government management
    • Procedures for adding storage and replacing defective drives in government clusters
    • Labs: Getting familiar with the HDFS command-line tools for government users
  • Data Ingestion
    • Using Flume for log and data ingestion into HDFS in government settings
    • Sqoop for importing from SQL databases to HDFS and exporting back to SQL for government applications
    • Data warehousing with Hive for enhanced government analytics
    • Techniques for copying data between clusters using distcp in government environments
    • Utilizing S3 as a complementary storage solution to HDFS for government data management
    • Best practices and architectures for data ingestion in government contexts
    • Labs: Setting up and using Flume and Sqoop for government projects
  • MapReduce Operations and Administration
    • Parallel computing before MapReduce: comparing HPC and Hadoop administration in government contexts
    • Managing MapReduce cluster loads for efficient government processing
    • Roles of nodes and daemons (JobTracker, TaskTracker) in government MapReduce clusters
    • Walkthrough of the MapReduce user interface for government users
    • Configuration settings for optimal MapReduce performance in government environments
    • Job configuration parameters and their impact on government workflows
    • Strategies for optimizing MapReduce operations in government settings
    • Guidance for programmers to ensure robust MapReduce implementations in government projects
    • Labs: Running MapReduce examples for government applications
  • YARN: New Architecture and Capabilities
    • Design goals and implementation architecture of YARN for enhanced government data processing
    • Introduction to the new YARN components (ResourceManager, NodeManager, ApplicationMaster) for government users
    • Installation procedures for YARN in government clusters
    • Job scheduling techniques under YARN for efficient government resource management
    • Labs: Investigating job scheduling with YARN for government applications
  • Advanced Topics
    • Hardware monitoring strategies to ensure reliability in government Hadoop clusters
    • Cluster monitoring techniques for continuous performance assessment in government environments
    • Procedures for adding and removing servers, and upgrading Hadoop in government settings
    • Backup, recovery, and business continuity planning for government data integrity
    • Oozie job workflows for automating complex tasks in government operations
    • High availability (HA) configurations to ensure continuous operation in government Hadoop clusters
    • Hadoop Federation to support large-scale government data management
    • Securing government Hadoop clusters with Kerberos authentication
    • Labs: Setting up monitoring systems for government Hadoop clusters
  • Optional Tracks
    • Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and usage within the Cloudera distribution environment (CDH5) for government
    • Ambari for cluster administration, monitoring, and routine tasks; installation and usage with the Hortonworks Data Platform (HDP 2.0) for government

Requirements

  • Familiarity with basic Linux system administration
  • Basic scripting skills

Prior knowledge of Hadoop and distributed computing is not required; both topics are introduced and explained in the course.

Lab Environment

Zero Installation: There is no need to install Hadoop software on students’ machines. A working Hadoop cluster will be provided for students to use.

Students will need the following:

  • An SSH client (Linux and Mac systems already include one; for Windows, PuTTY is recommended)
  • A web browser to access the cluster. We recommend Firefox with the FoxyProxy extension installed.

Duration: 21 Hours
