
Big Data / Data Science

Big Data Hadoop Development


Hadoop is an open-source, Java-based programming framework that enables the storage and processing of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of hardware nodes and thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating in case of a node failure. This approach lowers the risk of catastrophic system failure and unexpected data loss.
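
The exercises in this course interact with HDFS through its Java API. As a minimal, hedged sketch (assuming a reachable cluster whose address is configured in core-site.xml on the classpath; the path below is purely illustrative), writing a file and checking its block size and replication looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; the path is illustrative only
            Path path = new Path("/user/training/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Inspect the block size and replication factor the NameNode assigned
            System.out.println("Block size:  " + fs.getFileStatus(path).getBlockSize());
            System.out.println("Replication: " + fs.getFileStatus(path).getReplication());
        }
    }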

 

TARGET AUDIENCE:

Apache Hadoop

Hadoop Installation & Setup

  • Hadoop 2.x Cluster Architecture
  • Federation and High Availability
  • Typical Production Cluster setup
  • Hadoop Cluster Modes
  • Common Hadoop Shell Commands
  • Hadoop 2.x Configuration Files
  • Cloudera Single node cluster
  • Hive
  • Pig
  • Sqoop
  • Flume
  • Scala

Big Data Hadoop Training: Understanding HDFS & MapReduce

  • Introduction to Big Data & Hadoop
  • What is Big Data and where does Hadoop fit in
  • Two important Hadoop ecosystem components, namely MapReduce and HDFS
  • In-depth Hadoop Distributed File System – Replication, Block Size, Secondary NameNode
  • High Availability in Hadoop
  • In-depth YARN – Resource Manager, Node Manager
  • Hands-on Exercise
  • Working with HDFS
  • Replicating the data
  • Determining block size
  • Familiarizing with the NameNode and DataNode
  • Deep Dive in MapReduce
  • Detailed understanding of the working of MapReduce
  • The mapping and reducing process
  • The working of Driver
  • Combiners
  • Partitioners
  • Input Formats
  • Output Formats
  • Shuffle and Sort
  • Hands-on Exercise
    • The detailed methodology for writing the Word Count Program in MapReduce (see the sketch after this list)
    • Writing custom partitioner
  • MapReduce with Combiner
  • Local Job Runner Mode
  • Unit Test, ToolRunner
  • Map Side Join
  • Reduce Side Join
  • Using Counters
  • Joining two datasets using Map-Side Join vs. Reduce-Side Join
  • Hadoop Administration – Multi-Node Cluster Setup using Amazon EC2
  • Create a four-node Hadoop cluster setup
  • Running the MapReduce Jobs on the Hadoop cluster
  • Successfully running the MapReduce code
  • Working with the Cloudera Manager setup.
  • Hands-on Exercise
  • The method to build a multi-node Hadoop cluster using an Amazon EC2 instance
  • Working with the Cloudera Manager.
  • Hadoop Administration – Cluster Configuration
  • Hadoop configuration
  • Importance of Hadoop configuration file
  • Various parameters and values of configuration
  • HDFS parameters and MapReduce parameters
  • Setting up the Hadoop environment
  • Include and Exclude configuration files
  • Administration and maintenance of Name node
  • Data node directory structures and files
  • File system image and Edit log
  • Hands-on Exercise
  • The method to do performance tuning of a MapReduce program
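
As referenced in the Word Count item above, here is a hedged sketch of the mapper and reducer pair using the standard Hadoop MapReduce API (class names and the whitespace tokenization are illustrative choices, not prescribed by the course):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every token in the input line
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer (also usable as a Combiner): sums the counts for each word
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }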

Hadoop Administration – Maintenance, Monitoring and Troubleshooting

  • Introduction to the Checkpoint Procedure
  • Namenode failure and how to ensure the recovery procedure
  • Safe Mode
  • Metadata and Data backup
  • Various potential problems and solutions
  • What to look for
  • How to add and remove nodes.
  • Hands-on Exercise
    • How to ensure MapReduce file system recovery in various scenarios
    • JMX monitoring of the Hadoop cluster
    • How to use the logs and stack traces for monitoring and troubleshooting
    • Using the Job Scheduler for scheduling jobs in the same cluster
    • Getting the MapReduce job submission flow
    • FIFO Scheduler
  • Getting to know the Fair Scheduler and its configuration.
  • ETL Connectivity with Hadoop Ecosystem
  • How ETL tools work in the Big Data industry
  • Introduction to ETL and Data Warehousing
  • Working with prominent use cases of Big Data in the ETL industry
  • End-to-end ETL PoC showing Big Data integration with an ETL tool
  • Hands-on Exercise
    • Connecting to HDFS from the ETL tool and moving data from the local system to HDFS
    • Moving data from a DBMS to HDFS
    • Working with Hive with the ETL tool
    • Creating a MapReduce job in the ETL tool

 

The following topics are available only in self-paced mode.

Why Is Testing Important?

  • Hadoop Application Testing
  • Why testing is important
  • Unit testing & Integration testing
  • Performance testing and diagnostics
  • Nightly QA tests
  • Benchmark and end-to-end tests
  • Functional testing
  • Release certification testing
  • Security testing
  • Scalability testing
  • Commissioning and decommissioning of DataNodes testing
  • Reliability testing
  • Release testing


 

Roles and Responsibilities of Hadoop Testing Professional

  • Understanding the requirement
  • Preparation of the Testing Estimation
  • Test Cases
  • Test Data
  • Testbed creation
  • Test Execution
  • Defect Reporting
  • Defect Retest
  • Daily Status report delivery
  • Test completion
  • ETL testing at every stage (HDFS, Hive, HBase) while loading the input (logs/files/records etc)
  • Using Sqoop/Flume, which includes but is not limited to data verification
  • Reconciliation
  • User Authorization and Authentication testing (Groups, Users, Privileges etc)
  • Reporting defects to the development team or manager and driving them to closure
  • Consolidating all the defects and creating defect reports
  • Validating new features and issues in core Hadoop

 

MRUnit: A Framework for Testing MapReduce Programs

  • Reporting defects to the development team or manager and driving them to closure
  • Consolidating all the defects and creating defect reports
  • Setting up a testing framework with MRUnit for testing MapReduce programs
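
As a hedged sketch of what an MRUnit test might look like (assuming the illustrative TokenizerMapper from the Word Count sketch earlier, plus the MRUnit and JUnit libraries on the test classpath):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Before;
    import org.junit.Test;

    public class TokenizerMapperTest {
        private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

        @Before
        public void setUp() {
            // MapDriver runs the mapper in isolation, without a cluster
            mapDriver = MapDriver.newMapDriver(new TokenizerMapper());
        }

        @Test
        public void emitsOneCountPerToken() throws Exception {
            mapDriver
                .withInput(new LongWritable(0), new Text("big data"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .runTest();  // fails if the actual output differs from the expected output
        }
    }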

 

Unit Testing

  • Automation testing using Oozie
  • Data validation using the QuerySurge tool

 

Test Execution

  • The test plan for HDFS upgrade
  • Test automation and results

Test Plan Strategy and Writing Test Cases for Testing a Hadoop Application

  • How to test the installation and configuration

 

Apache HBase

HBase is an open-source, non-relational, distributed database modeled after Google’s Bigtable and written in Java. It provides a fault-tolerant way of storing large quantities of sparse data. HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop and may be accessed through the Java API as well as through REST, Avro, or Thrift gateway APIs. HBase is a column-oriented key-value data store that has been widely adopted because of its lineage with Hadoop and HDFS. HBase runs on top of HDFS and is well suited for fast read and write operations on large datasets with high throughput and low input/output latency.
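
A hedged sketch of the Java client API covered in the sections below (the table, column family, and row key are illustrative assumptions, and an hbase-site.xml pointing at a running cluster is assumed to be on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseQuickstart {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user_profiles"))) {

                // Put: write one cell under row key "u100", column family "info"
                Put put = new Put(Bytes.toBytes("u100"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Bangalore"));
                table.put(put);

                // Get: read the same row back
                Result result = table.get(new Get(Bytes.toBytes("u100")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }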

HBase Overview

  • Getting started with HBase
  • Core concepts of HBase
  • Understanding HBase with an Example

 Architecture of NoSQL

  • Why HBase?
  • Where to use HBase?
  • What is NoSQL?

HBase Data Modeling

  • HDFS vs. HBase
  • HBase Use Cases
  • Data Modeling HBase

 HBase Cluster Components

  • HBase Architecture
  • Main components of HBase Cluster

 HBase API and Advanced Operations

  • HBase Shell
  • HBase API
  • Primary Operations
  • Advanced Operations

Integration of Hive with HBase

  • Create a Table and Insert Data into it
  • Integration of Hive with HBase
  • Load Utility

File Loading with Both Load Utilities

  • Putting a folder into the VM
  • File loading with both load utilities

Apache Pig

  • Apache Pig introduction and its various features
  • Various data types and schemas in Pig
  • Available functions in Pig
  • Pig Bags
  • Tuples

 

Apache Hive

Apache Hive is data warehouse software built on top of Apache Hadoop that provides data summarization, query, and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.
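
A hedged sketch of issuing HiveQL from Java over the HiveServer2 JDBC driver (the host name, port, credentials, and table are illustrative assumptions; the Hive JDBC driver is assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 JDBC endpoint; host and port are placeholders for your cluster
            String url = "jdbc:hive2://hiveserver-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Plain HiveQL: create a partitioned table, then run a grouped aggregation
                stmt.execute("CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING) "
                           + "PARTITIONED BY (view_date STRING) STORED AS ORC");

                try (ResultSet rs = stmt.executeQuery(
                        "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("url") + " -> " + rs.getLong("hits"));
                    }
                }
            }
        }
    }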

Introduction to Hive

  • Introducing Hadoop Hive
  • The detailed architecture of Hive
  • Comparing Hive with Pig and RDBMS
  • Working with Hive Query Language
  • Creation of database, table, group by and other clauses, the various types of Hive tables
  • HCatalog
  • Storing the Hive Results
  • Hive partitioning and Buckets.
  • The indexing in Hive
  • The Map side Join in Hive
  • Working with complex data types
  • Hive User-defined Functions
  • Apache Impala
  • Introduction to Impala
  • Comparing Hive with Impala
  • The detailed architecture of Impala
  • Apache Flume
  • Introduction to Flume and its Architecture

Apache Kafka

Apache Kafka is an open-source stream-processing software platform written in Scala and Java. It aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log,” making it highly valuable for enterprise infrastructures that process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream-processing library.
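
A hedged sketch of a minimal producer using the standard Kafka Java client (the broker address, topic name, key, and value are illustrative assumptions):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");           // broker address (placeholder)
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("acks", "all");                                    // wait for full replication

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Messages with the same key land on the same partition
                producer.send(new ProducerRecord<>("clickstream", "user-42", "page=/home"));
                producer.flush();
            }
        }
    }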

What is Kafka – An Introduction

  • Understanding what is Apache Kafka
  • The various components and use cases of Kafka
  • Implementing Kafka on a single node.

Multi-Broker Kafka Implementation

  • Learning about the Kafka terminology
  • Deploying single node Kafka with independent Zookeeper
  • Adding replication in Kafka
  • Working with Partitioning and Brokers
  • Understanding Kafka consumers
  • Kafka Writes terminology
  • Various failure handling scenarios in Kafka.
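
The replication and partitioning topics above can also be exercised programmatically; here is a hedged sketch using the Kafka AdminClient (the broker list, topic name, and partition/replica counts are assumptions for a three-broker cluster):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateReplicatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // placeholder brokers

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions spread across the brokers, each kept on 3 replicas
                NewTopic topic = new NewTopic("call-records", 6, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get(); // blocks until created
            }
        }
    }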

Multi-Node Cluster Setup

  • Introduction to multi-node cluster setup in Kafka
  • Various administration commands
  • Leadership balancing and partition rebalancing
  • Graceful shutdown of Kafka Brokers and tasks
  • Working with the Partition Reassignment Tool
  • Cluster expansion
  • Assigning Custom Partitions
  • Removing a Broker and increasing the Replication Factor of Partitions

Integrate Flume with Kafka

  • Understanding the need for Kafka Integration
  • Successfully integrating it with Apache Flume
  • Steps in the integration of Flume with Kafka as a Source.

Kafka API

  • Detailed understanding of the Kafka and Flume Integration
  • Deploying Kafka as a Sink and as a Channel
  • Introduction to PyKafka API
  • Setting up the PyKafka Environment.

Producers & Consumers

  • Connecting Kafka using PyKafka
  • Writing your own Kafka Producers and Consumers
  • Writing a random JSON Producer
  • Writing a Consumer to read the messages on a topic
  • Writing and working with a File Reader Producer
  • Writing a Consumer to store topics data into a file.
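
The exercises above use PyKafka; as a language-neutral illustration of the same consumer loop, here is a hedged sketch with the standard Kafka Java client (the topic, consumer group, broker address, and bounded loop are assumptions for a short demo):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");            // placeholder broker
            props.put("group.id", "training-consumers");                 // consumer group
            props.put("auto.offset.reset", "earliest");                  // start from the beginning
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("clickstream"));
                for (int i = 0; i < 10; i++) {                            // bounded loop for the demo
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s @ partition %d, offset %d%n",
                                record.value(), record.partition(), record.offset());
                    }
                }
            }
        }
    }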

Apache Sqoop

  • Introduction to Apache Sqoop
    • Sqoop overview
    • Basic imports and exports
    • How to improve Sqoop performance
    • The limitations of Sqoop

Apache Storm

Apache Storm is a distributed stream-processing computation framework written predominantly in the Clojure programming language. It uses custom-created “spouts” and “bolts” to define information sources and manipulations, allowing batch, distributed processing of streaming data. A Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG), with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline.
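
A hedged sketch of how spouts, bolts, and a topology fit together in local mode, assuming the org.apache.storm packages of Storm 1.x or later (the spout, bolt, and word list are illustrative; a production topology would be submitted with StormSubmitter instead of LocalCluster):

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class HelloStormTopology {

        // Spout: the data source; emits one word per tuple
        public static class WordSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] words = {"big", "data", "hadoop", "storm"};
            private int i = 0;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                Utils.sleep(100);
                collector.emit(new Values(words[i++ % words.length]));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        // Bolt: the processing step; here it simply prints what it receives
        public static class PrinterBolt extends BaseRichBolt {
            private OutputCollector collector;

            @Override
            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void execute(Tuple tuple) {
                System.out.println("received: " + tuple.getStringByField("word"));
                collector.ack(tuple);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // emits nothing downstream
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new WordSpout(), 1);
            builder.setBolt("printer", new PrinterBolt(), 2)
                   .fieldsGrouping("words", new Fields("word")); // same word -> same bolt task

            LocalCluster cluster = new LocalCluster();           // local mode for development
            cluster.submitTopology("hello-storm", new Config(), builder.createTopology());
            Utils.sleep(10_000);
            cluster.shutdown();
        }
    }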

Understanding Architecture of Storm

  • Understanding Hadoop distributed computing
  • The Bayesian Law
  • Deploying Storm for real-time analytics
  • The Apache Storm features
  • Comparing Storm with Hadoop
  • Storm execution
  • Learning about Tuple
  • Spout
  • Bolt

Installation of Apache Storm

  • Installing Apache Storm
  • Various types of run modes of Storm.

Introduction to Apache Storm

  • Understanding Apache Storm and the data model.

Apache Kafka Installation

  • Installation of Apache Kafka and its configuration.

Apache Storm Advanced

  • Understanding of advanced Storm topics like Spouts
  • Bolts
  • Stream Groupings
  • Topology and its Lifecycle
  • Learning about Guaranteed Message Processing.

Storm Topology

  • Various Grouping types in Storm
  • Reliable and unreliable messages
  • Bolt structure and life cycle
  • Understanding Trident topology for failure handling and processing
  • Call Log Analysis Topology for analyzing call logs for calls made from one number to another.

Overview of Trident

  • Understanding of Trident Spouts and its different types
  • The various Trident Spout interface and components
  • Familiarizing with Trident Filter
  • Aggregator and Functions
  • A practical and hands-on use case for solving call log problem using Storm Trident.

Storm Components & classes

  • Various components
  • Classes and interfaces in Storm, such as the BaseRichBolt class, the IRichBolt interface and the IRichSpout interface
  • The BaseRichSpout class and the various methodologies of working with them

Cassandra Introduction

  • Understanding Cassandra, its core concepts, its strengths, and deployment.

Bootstrapping

  • Twitter Bootstrapping
  • Detailed understanding of Bootstrapping
  • Concepts of Storm
  • Storm Development Environment

Apache Splunk

Splunk is software used for searching, monitoring, and analyzing machine-generated big data via a web-style interface. Splunk captures, indexes, and correlates real-time data in a searchable repository, from which it can generate graphs, reports, alerts, dashboards, and visualizations.

Splunk Development concepts

  • Introduction to Splunk
  • Splunk developer roles and responsibilities

Basic Searching

  • Writing Splunk query for search
  • Using autocomplete to build a search, setting the time range and refining a search
  • Work with events
  • Identify the contents of the search
  • Control a search job
  • Hands-on Exercise
  • Write a basic search query

Using Fields in Searches

  • Understand Fields
  • Use Fields in Search
  • Use Fields Sidebar
  • Regex field extraction
  • Using Field Extractor (FX)
  • Delimited field Extraction using FX
  • Hands-on Exercise
    • Use Fields in Search
    • Use Fields Sidebar
    • Use Field Extractor (FX)
    • Delimited field Extraction using FX

Saving and Scheduling Searches

  • Writing Splunk query for search, sharing, saving, scheduling and exporting search results
  • Hands-on Exercise
    • Schedule a search, Save a search result, Share and export a search result

Creating Alerts

  • Creation of alert
  • Explaining alerts
  • Viewing fired alerts
  • Hands-on Exercise
    • Create an alert, view fired alerts

Scheduled Reports

  • Describe and Configure Scheduled Reports

Tags and Event Types

  • Introduction to Tags in Splunk
  • Deploying Tags for Splunk search
  • Understanding event types and utility
  • Generating and implementing event types in Search
  • Hands-on Exercise
    • Deploy tags for Splunk search
    • Generate and implement event types in Search

Creating and Using Macros

Define Macros, Arguments and Variables in a Macro

  • Hands-on Exercise
    • Define a Macro with arguments and use variables in it

Workflow

  • GET, POST, and Search workflow actions
  • Hands-on Exercise
    • Create GET, POST, and Search workflow

Splunk Search Commands

  • Search Command study
  • Search practices in general
  • Search pipeline
  • Specify indexes in search
  • Syntax highlighting
  • Autocomplete
  • Search commands like table, fields, sort, multikv, rename, rex & erex
  • Hands-on Exercise
    • Create search pipeline
    • Specify indexes in search
    • Highlight syntax
    • Use autocomplete feature
    • Use search commands like table, fields, sort, multikv, rename, rex & erex

 Transforming Commands

  • Using Top, Rare, Stats Commands
  • Hands-on Exercise
    • Use Top, Rare, Stats Commands

Reporting Commands

  • Using the following commands and their functions:
    • addcoltotals
    • addtotals
    • top
    • rare
    • stats
  • Hands-on Exercise
    • Create reports using the following commands and their functions: addcoltotals, addtotals

Mapping and Single Value Commands

  • iplocation
  • geostats
  • geom
  • addtotals commands
  • Hands-on Exercise
    • Track IP using iplocation
    • Get geo data using geostats

Splunk Reports & visualizations

  • Explore the available visualizations
  • Create charts and time charts
  • Omit null values
  • Format results
  • Hands-on Exercise
    • Create time charts
    • Omit null values
    • Format results

Analyzing, Calculating and Formatting Results

  • Calculating and analyzing results
  • Value conversion
  • Round off and format values
  • Using eval command
  • Conditional statements
  • Filtering calculated search results
  • Hands-on Exercise
    • Calculate and analyze results
    • Perform conversion on a data value
    • Round off numbers
    • Use eval command
    • Write conditional statements
    • Apply filters on calculated search results

Correlating Events

  • Search with Transactions
  • Report on Transactions
  • Group events using fields and time
  • Transaction vs Stats
  • Hands-on Exercise
    • Generate Report on Transactions
    • Group events using fields and time

Enriching Data with Lookups

  • Learn about data lookups, examples and lookup tables
  • Defining and configuring automatic lookup
  • Deploying lookup in reports and searches
  • Hands-on Exercise
  • Define and configure automatic lookup,
  • Deploy lookup in reports and searches

Creating Reports and Dashboards

  • Creating search charts, reports and dashboards
  • Editing reports and Dashboard
  • Adding reports to dashboard
  • Hands-on Exercise
    • Create search charts, reports and dashboards,
    • Edit reports and Dashboard,
    • Add reports to dashboard

Getting started with Parsing

  • Working with raw data for data extraction
  • Transformation
  • Parsing and preview
  • Hands-on Exercise –
    • Extract useful data from raw data,
    • perform transformation,
    • parse different values and preview

Using Pivot

  • Describe Pivot
  • Relationship between data model and pivot
  • Select a data model object
  • Create a pivot report
  • Instant pivot from a search
  • Add a pivot report to dashboard
  • Hands-on Exercise
    • Select a data model object
    • Create a pivot report
    • Create instant pivot from a search
    • Add a pivot report to dashboard

Common Information Model (CIM) Add-On

  • What is Splunk CIM
  • Using the CIM Add-On to normalize data
  • Hands-on Exercise
    • Use the CIM Add-On to normalize data

Splunk Administration

Overview of Splunk

  • Introduction to the Splunk 3-tier architecture
  • Understanding the Server settings
  • Control, preferences and licensing
  • The most important components of Splunk tool
  • The hardware requirements
  • Conditions for installation of Splunk.

Splunk Installation

  • Understanding how to install and configure Splunk
  • Index creation
  • Input configuration in the standalone server
  • The search preferences
  • Installing Splunk in the Linux environment.

Splunk Installation in Linux

  • Installing Splunk in the Linux environment
  • The various prerequisites
  • A configuration of Splunk in Linux.

Distributed Management Console

  • Introduction to the Splunk Distributed Management Console
  • Index clustering
  • Forwarder management and distributed search in Splunk environment
  • Providing the right authentication to users, access control.

Introduction to Splunk App

  • Introducing the Splunk app
  • Managing the Splunk app
  • The various add-ons in Splunk app
  • Deleting and installing apps from SplunkBase
  • Deploying the various app permissions
  • Deploying the Splunk app
  • Apps on forwarder.

Splunk indexes and users

  • Understanding the index time configuration file and search time configuration file.

Splunk configuration files

  • Learning about the index time and search time configuration files in Splunk
  • Installing the forwarders
  • Configuring outputs.conf and inputs.conf
  • Managing the Universal Forwarders.

Splunk Deployment Management

  • Deploying the Splunk tool
  • The Splunk Deployment Server
  • Setting up the Splunk deployment environment
  • Deploying the clients grouping in Splunk.

Splunk Indexes

  • Understanding the Splunk Indexes
  • The default Splunk Indexes
  • Segregating the Splunk Indexes
  • Learning about Splunk Buckets and Bucket Classification
  • Estimating index storage
  • Creating a new index.

User roles and authentication

  • Understanding the concept of role inheritance
  • Splunk authentications
  • Native authentications
  • LDAP authentications.

Splunk Administration Environment

  • Splunk installation
  • Configuration
  • Data inputs
  • App management
  • Splunk important concepts
  • Parsing machine-generated data
  • Search indexer and forwarder.

Basic Production Environment

  • Introduction to Splunk Configuration Files
  • Universal Forwarder
  • Forwarder Management
  • Data management
  • Troubleshooting and monitoring.

Splunk Search Engine

  • Converting machine-generated data into operational intelligence
  • Setting up Dashboard, Reports and Charts
  • Integrating Search Head Clustering & Indexer Clustering.

Various Splunk Input Methods

  • Understanding the input methods
  • Deploying scripted, Windows, network and agentless input types and fine-tuning them all

Splunk User & Index Management

  • Splunk User authentication and Job Role assignment
  • Learning to manage, monitor and optimize Splunk Indexes.

Machine Data Parsing

  • Understanding parsing of machine-generated data
  • Manipulation of raw data
  • Previewing and parsing
  • Data field extraction.

Search Scaling and Monitoring

  • Distributed search concepts
  • Improving search performance
  • Large-scale deployment and overcoming execution hurdles
  • Working with Splunk Distributed Management Console for monitoring the entire operation.

Apache Solr

Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.
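
A hedged sketch of indexing and querying with the SolrJ client covered later in this module (the Solr URL, core name, document id, and fields are illustrative assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrJQuickstart {
        public static void main(String[] args) throws Exception {
            // Points at a core named "books" on a local Solr instance (placeholder URL)
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build()) {

                // Index one document
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "bk-101");
                doc.addField("title", "Hadoop in Practice");
                solr.add(doc);
                solr.commit();

                // Run a simple full-text query against the title field
                SolrQuery query = new SolrQuery("title:hadoop");
                query.setRows(10);
                QueryResponse response = solr.query(query);
                for (SolrDocument d : response.getResults()) {
                    System.out.println(d.getFieldValue("id") + " : " + d.getFieldValue("title"));
                }
            }
        }
    }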

Fundamentals of Search Engine and Apache Lucene

  • Introduction to the search engine
  • The Apache Lucene
  • Understanding the inverted index, documents and fields

Analyzers in Lucene

  • Introduction to the various query types available in Lucene and clear understanding of these.

Exploring Apache Lucene

  • Understanding the prerequisites for using Apache Lucene
  • Learning about the querying process, analyzers, scoring, boosting, faceting, grouping and highlighting
  • The various types of geographical and spatial searches
  • Introduction to Apache Tika.
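
A hedged sketch of the Lucene indexing and querying workflow described above, using an in-memory index (the analyzer choice, field name, and sample text are illustrative; ByteBuffersDirectory assumes a recent Lucene version):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneQuickstart {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();          // in-memory index
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one document with an analyzed "content" field
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("content", "Hadoop stores data in HDFS blocks", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Parse a query against the same field and search the inverted index
            Query query = new QueryParser("content", analyzer).parse("hdfs");
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("content") + "  score=" + hit.score);
                }
            }
        }
    }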

Apache Lucene Demonstration

  • Demonstration of how Apache Lucene works

Apache Lucene advanced

  • Understanding the Analyzer
  • Query Parser in Apache Lucene
  • Query Object

Advance topics of Apache Lucene (practical)

  • Understanding the various aspects of Apache Lucene like Scoring, Boosting, Highlighting, Faceting and Grouping

Apache Solr

  • Introduction to Apache Solr
  • The advantages of Apache Solr over Apache Lucene
  • The basic system requirements for using Apache Solr
  • Introduction to Cores in Apache Solr.

Apache Solr Indexing

  • Introduction to the Apache Solr indexing
  • Index using built-in data import handler and post tool
  • Understanding the Solrj Client
  • A configuration of Solrj Client.

Solr Indexing continued

  • Demonstrating the Book Store use cases with Solr Indexing with practical examples
  • Learning to build Schema
  • The field, field types, CopyField and Dynamic Field
  • Understanding how to add, explore, update, and delete using Solrj.

Apache Solr Searching

  • The various aspects of Apache Solr search like sorting, pagination
  • An overview of the request parameters, faceting and highlighting.

Deep dive into Apache Solr

  • Understanding the Request Handlers
  • Defining and mapping to search components
  • Highlighting and faceting
  • Updating managed schemas
  • Request parameters hardwiring
  • Adding fields to a default search
  • The various types of Analyzers, Parsers, Tokenizers.

Apache Solr continued

  • Grouping of results in Apache Solr
  • Parse queries functions
  • A fuzzy query in Apache Solr.

Extended Features

  • The extended features in Apache Solr
  • Learning about Pseudo-fields, Pseudo-Joins, Spell Check, suggestions, Geospatial Search, multi-language search, stop words and synonyms.

Multicore

  • Understanding the concept of Multicore in Solr
  • The creation of Multicore in Solr
  • The need of Multicore
  • Joining of data
  • Replication and Ping Handler.

Administration & SolrCloud

  • Understanding the SolrCloud
  • The concept of Sharding
  • Indexing, and replication in Apache SolrCloud
  • The working of Apache SolrCloud
  • Distributed requests
  • Read- and write-side fault tolerance
  • Cluster coordination using Apache ZooKeeper.

“Big Data Hadoop Training” by Real-Time IT Corporate Trainers with 100% Placement Assistance.

 

 


