
Hadoop System Administration

Duration: 3 Days

Course Background

This three-day Apache Hadoop training is aimed at administrators who will be involved in the deployment and management of Hadoop clusters. The course covers installation, provisioning and ongoing resource management within a cluster, as well as the monitoring and optimisation of running clusters.

The topics covered in the course include:

  • Sizing and deployment of a Hadoop cluster
  • First time deployment and ongoing maintenance of the nodes in a cluster
  • Cluster balancing and performance tuning
  • How to integrate status and health checks into existing monitoring tools
  • Recovery from NameNode and DataNode failures
  • Designing and configuring for high availability
  • Hadoop Security

Course Prerequisites and Target Audience

This course is for IT administrators and operators with a basic knowledge of Linux, its tools and utilities, and Bash scripting. No prior knowledge of Hadoop is required.

Course Outline

  • Hadoop - History and Architecture
  • The Rationale for Apache Hadoop
  • From Requirements Analysis to Planning a Hadoop Cluster
  • Installation and Initial Configuration of Hadoop
  • Installing Hadoop daemons
  • Installing and Configuring Hive, Impala, and Pig
  • Benchmarking Hadoop (see the sketch below)
  • Creating a multi-user environment in Hadoop
  • Overview of the Hadoop directory structure
  • Logging and log files in Hadoop
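
As an illustration of the benchmarking topic above, the following commands run the TeraGen/TeraSort benchmark that ships with Hadoop. This is a minimal sketch: the jar path and row count are illustrative and vary by installation.

    # Generate ~1 GB of input (10 million 100-byte rows), then sort it
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        teragen 10000000 /bench/terainput
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        terasort /bench/terainput /bench/teraoutput
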
  • HDFS
    • Setting basic HDFS configuration parameters (see the sketch after this list)
    • Configuring block allocation, redundancy and replication
    • Loading Data into HDFS
    • HDFS Command Reference
    • DFSAdmin Command Reference
    • HDFS Permissions and Security
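
The HDFS items above map onto a small set of configuration properties and shell commands. A minimal sketch, with illustrative paths and the common defaults of a 128 MB block size and three-way replication:

    <!-- hdfs-site.xml: block size and replication factor -->
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # Create a user directory, load a local file, and lock down permissions
    hdfs dfs -mkdir -p /user/alice
    hdfs dfs -put sales.csv /user/alice/
    hdfs dfs -chmod 700 /user/alice
    # Summarise capacity, live DataNodes and replication state
    hdfs dfsadmin -report
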
  • MapReduce
    • Conceptual overview of MapReduce
    • Installing and setting up a MapReduce environment
    • Delivering redundant load balancing via Rack Awareness
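
Rack awareness works by pointing core-site.xml at a topology script that maps node addresses to rack IDs; HDFS then spreads replicas across racks so that a single rack failure cannot lose every copy of a block. A sketch follows; the subnets and rack names are assumptions.

    <!-- core-site.xml -->
    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

    #!/bin/bash
    # topology.sh: print one rack ID per host/IP argument (mapping is illustrative)
    for node in "$@"; do
      case "$node" in
        10.1.1.*) echo "/dc1/rack1" ;;
        10.1.2.*) echo "/dc1/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done
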
  • Maintenance Related HDFS Tasks
    • Rebalancing Blocks (see the command sketch after this list)
    • Copying Large Sets of Files
    • Commissioning and Decommissioning Nodes
    • Verifying File System Health
    • NameNode Backup and Recovery
    • Rack Awareness
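
Each maintenance task above corresponds to a built-in tool. A minimal sketch (hostnames and paths are illustrative):

    # Rebalance until no DataNode deviates more than 10% from mean utilisation
    hdfs balancer -threshold 10
    # Copy a large dataset between clusters with DistCp
    hadoop distcp hdfs://nn-prod:8020/data hdfs://nn-dr:8020/backup/data
    # Re-read the include/exclude files after commissioning or decommissioning nodes
    hdfs dfsadmin -refreshNodes
    # Verify file system health, reporting blocks and their locations
    hdfs fsck / -files -blocks -locations
    # Download the latest fsimage from the NameNode as an off-cluster backup
    hdfs dfsadmin -fetchImage /backup/namenode/
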
  • Maximising HDFS Robustness
    • Creating a fault-tolerant file system
    • Isolating single points of failure
    • Maintaining High Availability
    • Triggering manual failover (see the sketch after this list)
    • Automating failover with Zookeeper
    • Effective use of NameNode Federation
    • Extending HDFS resources
    • Managing the namespace volumes
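
For the high-availability items above, hdfs haadmin drives manual failover between the two NameNodes, while the ZooKeeper Failover Controller (ZKFC) automates it. A sketch, assuming logical NameNode IDs nn1 and nn2 defined in hdfs-site.xml:

    # Check which NameNode is currently active
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2
    # Trigger a graceful manual failover from nn1 to nn2
    hdfs haadmin -failover nn1 nn2
    # One-time ZooKeeper initialisation, then start a ZKFC beside each NameNode
    hdfs zkfc -formatZK
    hdfs --daemon start zkfc
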
  • YARN and the YARN architecture
  • Cluster Health and Monitoring
    • Understanding Hadoop configuration files
    • Setting quotas to constrain HDFS utilisation (illustrated below)
    • Monitoring a cluster with Nagios and Ganglia
    • Using the dfsadmin and mradmin tools
    • Prioritising access to MapReduce using schedulers
    • Starting and stopping Hadoop daemons
    • Administering MapReduce
    • Managing MapReduce jobs
    • Monitoring and Troubleshooting
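
A sketch of the quota and daemon-control commands referenced above (directory, limits and application ID are illustrative; the --daemon syntax is Hadoop 3):

    # Cap /user/alice at one million names and 1 TB of raw disk space
    hdfs dfsadmin -setQuota 1000000 /user/alice
    hdfs dfsadmin -setSpaceQuota 1t /user/alice
    hdfs dfs -count -q /user/alice        # report quota usage
    # Start and stop individual daemons
    hdfs --daemon start datanode
    yarn --daemon stop nodemanager
    # Inspect and, if necessary, kill running YARN applications
    yarn application -list
    yarn application -kill application_1700000000000_0001
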
  • Hadoop Extensions
    • SQL-like querying with Hive
    • MapReduce jobs with Pig
    • Imposing a tabular view on HDFS with HBase
    • Realising Data Ingress and Egress
    • Moving bulk data into and out of Hadoop
    • Transmitting HDFS data over HTTP with WebHDFS
    • Collecting multi-sourced log files with Flume
    • Importing and exporting relational information with Sqoop (sketched after this list)
    • Configuring Oozie to schedule workflows
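
Two of the ingress/egress paths above in sketch form. The NameNode host, database coordinates and table name are assumptions; 9870 is the default WebHDFS port in Hadoop 3.

    # List a directory over HTTP via WebHDFS
    curl -i "http://namenode.example.com:9870/webhdfs/v1/user/alice?op=LISTSTATUS"
    # Import a relational table into HDFS with Sqoop
    sqoop import \
      --connect jdbc:mysql://dbhost.example.com/sales \
      --username etl --password-file /user/etl/.dbpass \
      --table orders \
      --target-dir /user/alice/orders
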
  • More Administration
    • Planning for Backup, Recovery and Security
    • Dealing with hardware failures
    • Securing a Hadoop cluster (see the sketch after this outline)
    • Oozie Administration
    • HCatalog and Hive Administration
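
As a closing illustration of the security topic, enabling Kerberos authentication starts with two switches in core-site.xml, after which users need a valid ticket before HDFS commands succeed. A minimal sketch (the principal is illustrative; a production rollout also involves keytabs and per-daemon principals):

    <!-- core-site.xml -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>

    # Obtain a Kerberos ticket, then use HDFS as usual
    kinit alice@EXAMPLE.COM
    hdfs dfs -ls /user/alice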