Course Outline

  • Introduction
    • Hadoop history and concepts
    • Ecosystem
    • Distributions
    • High-level architecture
    • Hadoop myths
    • Hadoop challenges (hardware / software)
    • Labs: discuss your Big Data projects and problems
  • Planning and Installation
    • Selecting software and Hadoop distributions
    • Sizing the cluster, planning for growth
    • Selecting hardware and network
    • Rack topology
    • Installation
    • Multi-tenancy
    • Directory structure, logs
    • Benchmarking
    • Labs: install a cluster and run performance benchmarks
  • HDFS Operations
    • Concepts (horizontal scaling, replication, data locality, rack awareness)
    • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring
    • Command-line and browser-based administration
    • Adding storage, replacing defective drives
    • Labs: getting familiar with the HDFS command line
  • Data Ingestion
    • Flume for logs and other data ingestion into HDFS
    • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
    • Hadoop data warehousing with Hive
    • Copying data between clusters (distcp)
    • Using S3 as a complement to HDFS
    • Data ingestion best practices and architectures
    • Labs: setting up and using Flume and Sqoop
  • MapReduce Operations and Administration
    • Parallel computing before MapReduce: comparing HPC and Hadoop administration
    • MapReduce cluster loads
    • Nodes and daemons (JobTracker, TaskTracker)
    • MapReduce UI walkthrough
    • MapReduce configuration
    • Job config
    • Optimizing MapReduce
    • Fool-proofing MapReduce: what to tell your programmers
    • Labs: running MapReduce examples
  • YARN: New Architecture and Capabilities
    • YARN design goals and implementation architecture
    • New actors: ResourceManager, NodeManager, Application Master
    • Installing YARN
    • Job scheduling under YARN
    • Labs: investigate job scheduling
  • Advanced Topics
    • Hardware monitoring
    • Cluster monitoring
    • Adding and removing servers, upgrading Hadoop
    • Backup, recovery, and business continuity planning
    • Oozie job workflows
    • Hadoop high availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: set up monitoring
  • Optional Tracks
    • Cloudera Manager for cluster administration, monitoring, and routine tasks: installation and use. In this track, all exercises and labs are performed within the Cloudera distribution environment (CDH5).
    • Ambari for cluster administration, monitoring, and routine tasks: installation and use. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).

Requirements

  • Comfortable with basic Linux system administration
  • Basic scripting skills

Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained in the course.

Lab Environment

Zero Installation: There is no need to install Hadoop software on students’ machines. A fully functional Hadoop cluster will be provided for the participants to use.

Students will need the following:

  • An SSH client (Linux and Mac systems already include SSH clients; for Windows, PuTTY is recommended)
  • A browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.

Duration: 21 hours
