Big Data Course Content

May 10, 2016

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.

From Wikipedia

Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big.

From Cloudera

 

Apache Spark™ is a fast and general engine for large-scale data processing.

Day 1: Big Data Landscape

Why Big Data: the 3 Vs.

Hadoop ecosystem.

Introduction to Apache Spark.

Features of Apache Spark.

Apache Spark Stack.

Introduction to RDDs.

RDD transformations.

What is good and bad in MapReduce?

Why use Apache Spark?

Day 2: Installation

Single-node, pseudo-distributed, and multi-node clusters.

Installing Hadoop.

Installing Apache Spark.

Installing Hive.

Installing Sqoop.

Installing Hue.

Day 3: Deep Dive in HDFS

HDFS Design.

Fundamentals of HDFS (blocks, NameNode, DataNode, Secondary NameNode).

Rack Awareness.

Reading from and writing to HDFS.

HDFS Federation and High Availability (Hadoop 2.x.x).

HDFS Command Line Interface.

Day 4: Spark Shell Hands-On Using HDFS

Spark Shell Introduction.

Create a file using Hue.

Extract the file from HDFS using the Spark shell.

Create an RDD from an HDFS file (sketched below).
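A minimal Spark-shell sketch of this exercise, assuming Spark 1.x with the shell's built-in SparkContext sc; the HDFS URI and path are placeholders:

    // The Spark shell provides sc (SparkContext) automatically.
    // Adjust the HDFS host, port, and path to your cluster.
    val lines = sc.textFile("hdfs://localhost:8020/user/training/input.txt")
    println(lines.count())          // number of lines in the file
    lines.take(5).foreach(println)  // preview the first five lines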

Day 5: Programming with RDDs, Part 1

Creating a new RDD.

Transformations on RDD.

Lineage Graph.

Actions on RDD.

RDD persistence and caching concepts.

Lazy evaluation of RDDs (sketched below).
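A short sketch of laziness and caching, assuming the Spark shell; the numbers are arbitrary:

    // Transformations are lazy: nothing runs until an action is called.
    val nums    = sc.parallelize(1 to 1000000)
    val squares = nums.map(n => n.toLong * n)  // no job triggered yet
    squares.cache()                            // mark for in-memory reuse

    println(squares.count())  // first action: computes and caches the RDD
    println(squares.sum())    // second action: served from the cache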

Day 6: Scala/Spark Functional Programming

Using Function Literals.

Anonymous Functions.

Defining a function that accepts another function (sketched below).
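A minimal Scala sketch of these three ideas; applyTwice is a made-up name for illustration:

    // A function literal bound to a name.
    val double: Int => Int = x => x * 2

    // An anonymous function passed straight to a higher-order method.
    val doubled = List(1, 2, 3).map(_ * 2)     // List(2, 4, 6)

    // A function that accepts another function as a parameter.
    def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

    println(applyTwice(double, 5))             // 20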

Day 7: RDD Transformation Programming in Depth

Hands on and core concepts of map() transformation.

Hands on and core concepts of filter() transformation.

Hands on and core concepts of flatMap() transformation.

Comparing the map and flatMap transformations (sketched below).
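A sketch contrasting the three transformations, assuming the Spark shell:

    val lines = sc.parallelize(Seq("hello world", "hi"))

    // map: exactly one output element per input element.
    val lengths = lines.map(_.length)           // 11, 2

    // filter: keep only the elements matching a predicate.
    val longer = lines.filter(_.length > 5)     // "hello world"

    // flatMap: zero or more output elements per input element.
    val words = lines.flatMap(_.split(" "))     // "hello", "world", "hi"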

Day 8: Apache Spark in Action

Hands on and core concepts of reduce() action.

Hands on and core concepts of fold() action.

Hands on and core concepts of aggregate() action.

Basics of accumulators.

Hands on and core concepts of collect() action.

Hands on and core concepts of take() action.

Ordered access of an RDD (sketched below).
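A sketch of the actions above, assuming the Spark shell; the sample numbers are arbitrary:

    val nums = sc.parallelize(Seq(3, 1, 4, 1, 5))

    val total  = nums.reduce(_ + _)   // 14
    val folded = nums.fold(0)(_ + _)  // 14; the zero value seeds each partition

    // aggregate: build a result of a different type (here sum and count).
    val (sum, count) = nums.aggregate((0, 0))(
      (acc, n) => (acc._1 + n, acc._2 + 1),    // within a partition
      (a, b)   => (a._1 + b._1, a._2 + b._2))  // across partitions

    nums.collect()       // all elements to the driver: use with care
    nums.take(2)         // the first two elements
    nums.takeOrdered(3)  // ordered access: the three smallest elements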

Day 9: Apache Spark Execution Model

How Spark executes a program.

Concepts of RDD partitioning.

RDD data shuffling and performance issues (sketched below).
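A sketch of partition counts and shuffling, assuming the Spark shell:

    val pairs = sc.parallelize(1 to 100).map(n => (n % 10, n))
    println(pairs.partitions.length)   // current number of partitions

    // repartition() always shuffles; coalesce() can avoid a shuffle
    // when only reducing the partition count.
    val wide   = pairs.repartition(20)
    val narrow = wide.coalesce(5)

    // Pre-partitioning by key can cut shuffling in later joins.
    import org.apache.spark.HashPartitioner
    val byKey = pairs.partitionBy(new HashPartitioner(4))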

Day 10: Apache Spark PairRDD

Core concepts of PairRDD.

Creation of PairRDD.

Aggregation in PairRDD.

Understanding aggregation functions in depth.

How reduceByKey() works conceptually.

How foldByKey() works conceptually.

How combineByKey() works conceptually (sketched below).
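A sketch of the three aggregations, assuming the Spark shell; the sales data is made up:

    val sales = sc.parallelize(Seq(("a", 2), ("b", 3), ("a", 5)))

    // reduceByKey: merge values per key with a single function.
    sales.reduceByKey(_ + _)    // ("a", 7), ("b", 3)

    // foldByKey: like reduceByKey, but seeded with a zero value.
    sales.foldByKey(0)(_ + _)

    // combineByKey: full control; here, a per-key average.
    val avg = sales.combineByKey(
      (v: Int) => (v, 1),                                          // create combiner
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // merge a value
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // merge combiners
    ).mapValues { case (sum, n) => sum.toDouble / n }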

Day 11: Spark PairRDD Hands-On Lab

reduceByKey.

foldByKey.

combineByKey.

groupByKey.

Day 12: Spark PairRDD Joining, Zipping, and Cogrouping

reduceByKey versus groupByKey performance issues.

cogroup.

zip.

Joins (left, right, inner, etc.), sketched below.
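A sketch of joining, cogrouping, and zipping, assuming the Spark shell:

    val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
    val right = sc.parallelize(Seq((2, "x"), (3, "y")))

    left.join(right)            // inner: (2, ("b", "x"))
    left.leftOuterJoin(right)   // (1, ("a", None)), (2, ("b", Some("x")))
    left.rightOuterJoin(right)  // (2, (Some("b"), "x")), (3, (None, "y"))
    left.cogroup(right)         // per key: (left values, right values) iterables

    // zip pairs elements by position; both RDDs need the same number
    // of partitions and of elements per partition.
    val ids   = sc.parallelize(Seq(1, 2, 3), 2)
    val names = sc.parallelize(Seq("a", "b", "c"), 2)
    ids.zip(names)              // (1, "a"), (2, "b"), (3, "c")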

Day 13-A: Understanding Hadoop SequenceFile

Day 13-B: Creating SequenceFiles and Processing Them Using Spark

Creating a SequenceFile from a TSV file.

Loading data into Apache Hive.

Processing a SequenceFile as an RDD (sketched below).
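A sketch of the round trip, assuming the Spark shell; the paths are placeholders:

    // Build (key, value) pairs from a TSV file and save as a SequenceFile.
    val pairs = sc.textFile("hdfs:///data/input.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))
    pairs.saveAsSequenceFile("hdfs:///data/seq")

    // Load the SequenceFile back as an RDD; convert the Writables to
    // Strings before reusing them off the executors.
    import org.apache.hadoop.io.Text
    val back = sc.sequenceFile("hdfs:///data/seq", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }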

Day 14: Spark Shared Variables

Shared variables: broadcast variables.

Shared variables: accumulators.

Day 15: Spark Accumulators

Word count and character count.

Counting bad records in a file (sketched below).
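A sketch of counting bad records with an accumulator, assuming the Spark 1.x shell; the path and the three-field record format are assumptions:

    // Count malformed lines without a separate pass over the data.
    val badRecords = sc.accumulator(0, "bad records")

    val rows = sc.textFile("hdfs:///data/input.csv").flatMap { line =>
      val fields = line.split(",")
      if (fields.length == 3) Some(fields)
      else { badRecords += 1; None }   // tally the bad line, then drop it
    }

    rows.count()               // an action must run before the tally is final
    println(badRecords.value)  // read the count on the driver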

Day 16: Spark Broadcast Variables

Joining two CSV files, with one broadcast as a lookup table (sketched below).
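A sketch of the broadcast-lookup join, assuming the Spark shell; both file layouts are assumptions:

    // Small lookup table (code,name) collected and broadcast to executors.
    val countryNames = sc.textFile("hdfs:///data/countries.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1)))
      .collectAsMap()
    val lookup = sc.broadcast(countryNames)

    // Large file (id,countryCode) joined against the broadcast map,
    // avoiding a shuffle-based join entirely.
    val users = sc.textFile("hdfs:///data/users.csv")
      .map(_.split(","))
      .map(f => (f(0), lookup.value.getOrElse(f(1), "unknown")))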

Day 17: Spark API

Broadcast variables, filter functions, and saving files.

Day 18: Spark API

Spark join, groupBy, and swap functions.

Day 19: Spark API

Removing the header from a CSV file and mapping each column to row data.

Day 20: Spark SQL

HiveContext.

SchemaRDD replaced by the DataFrame API.

History of Spark SQL.

Catalyst Optimizer.

Day 21: Spark SQL Hands-On Sessions

Hive Configuration.

Create a Hive table using Spark.

Load data into a Hive table using Spark.

Create another table using a DataFrame (sketched below).
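A minimal sketch of the session, assuming Spark 1.x built with Hive support; the table name, schema, and path are placeholders:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Create a Hive table and load data into it from Spark.
    hiveContext.sql("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) " +
      "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
    hiveContext.sql("LOAD DATA INPATH '/user/training/employees.csv' " +
      "INTO TABLE employees")

    // Create another table from a DataFrame.
    val df = hiveContext.sql("SELECT * FROM employees")
    df.write.saveAsTable("employees_copy")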

Day 22: Implementing Business Logic Using Spark SQL

Loading a CSV file.

Spark case classes (to create a schema for the CSV file).

Converting an RDD to a DataFrame using the DataFrame API to query data.

Using SQL queries on a DataFrame (sketched below).
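A sketch of the full flow, assuming Spark 1.x; the Person class and CSV layout are assumptions:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The case class supplies the schema for the CSV rows.
    case class Person(name: String, age: Int)

    val people = sc.textFile("hdfs:///data/people.csv")
      .map(_.split(","))
      .map(f => Person(f(0), f(1).trim.toInt))
      .toDF()

    // Register the DataFrame and query it with SQL (Spark 1.x API).
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()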

Day 23: Spark Loading and Saving Your Data

Text files.

CSV and TSV files.

JSON files (sketched below).
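A sketch of the three formats, assuming the Spark 1.x shell with a sqlContext as above; all paths are placeholders:

    // Text files.
    val text = sc.textFile("hdfs:///data/input.txt")
    text.saveAsTextFile("hdfs:///data/output-text")

    // CSV/TSV: split manually (Spark 1.x core has no built-in CSV reader).
    val tsv = sc.textFile("hdfs:///data/input.tsv").map(_.split("\t"))

    // JSON via the DataFrame API (one JSON object per line).
    val df = sqlContext.read.json("hdfs:///data/input.json")
    df.write.json("hdfs:///data/output-json")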

Day 24: Spark Loading and Saving Your Data: SQL and NoSQL

JDBC (MySQL), sketched below.

HBase (NoSQL).
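A sketch of the JDBC side, assuming Spark 1.4+ with the MySQL connector JAR on the classpath; the URL, table, and credentials are placeholders:

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:mysql://localhost:3306/shop",
      "dbtable"  -> "orders",
      "user"     -> "training",
      "password" -> "training"
    )).load()

    jdbcDF.show()  // the table is now an ordinary DataFrame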

Day 25: Writing Spark Applications

Spark Applications vs. Spark Shell.

Creating the SparkContext.

Configuring Spark Properties.

Building and Running a Spark Application (sketched below).

Logging.
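A minimal standalone application, assuming Spark 1.x; the class name and paths are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountApp {
      def main(args: Array[String]): Unit = {
        // Unlike the shell, an application creates its own context.
        val conf = new SparkConf().setAppName("WordCountApp")
        val sc   = new SparkContext(conf)

        sc.textFile(args(0))
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile(args(1))

        sc.stop()
      }
    }

Packaged with sbt or Maven, then launched with, for example:

    spark-submit --class WordCountApp --master yarn wordcount.jar in out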

Day 26: Spark Streaming in Depth, Part 1

Spark Streaming overview.

Example: streaming word count (sketched below).
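A sketch of the streaming word count, assuming the Spark 1.x shell (sc available); the host and port are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Five-second micro-batches over a socket text source.
    val ssc   = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()   // print each batch's counts

    ssc.start()
    ssc.awaitTermination()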

Day 27: Spark Streaming in Depth, Part 2

Other Streaming Operations.

Sliding window operations (sketched below).

Developing Spark Streaming Applications.
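A sketch of a sliding-window count, continuing the streaming example above; the window and slide lengths are arbitrary:

    // Word counts over a 30-second window, recomputed every 10 seconds.
    import org.apache.spark.streaming.Seconds
    val pairs    = lines.flatMap(_.split(" ")).map((_, 1))
    val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
    windowed.print()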

Day 28: Spark Algorithms, Part 1

Iterative algorithms (sketched below).

Graph Analysis.

Machine Learning.
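A sketch of an iterative algorithm, a miniature PageRank over a made-up link graph, which is where caching pays off:

    // Tiny PageRank-style iteration; the link data is invented.
    val links = sc.parallelize(Seq(
      ("a", Seq("b")), ("b", Seq("a", "c")), ("c", Seq("a"))
    )).cache()                           // reused every iteration

    var ranks = links.mapValues(_ => 1.0)

    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (urls, rank) => urls.map((_, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.collect().foreach(println)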

Day 29: Case Studies

Day 1

Introduction to Big Data.

Characteristics.

The why, how, and what of Big Data.

Existing OLTP, ETL, DWH, and OLAP systems.

Day 2

Introduction to Hadoop Ecosystem.

Architecture: HDFS.

MapReduce (MRv1).

Hadoop v1 and v2; Hadoop data federation.

Day 3

Prerequisites for installation.

VM with Linux (Ubuntu/CentOS), JDK, SSH, and Eclipse.

Installation and configuration of Hadoop: HDFS daemons and YARN daemons.

High Availability.

Automatic and manual failover.

Writing Data to HDFS.

Reading data from HDFS.

Day 4

Replica placement Strategy.

Failure Handling.

NameNode.

DataNode.

Blocks and safe mode.

Rebalancing and load optimization.

Troubleshooting and error rectification.

Hadoop fs shell commands: Unix and Java basics.

Day 5

Introduction to MapReduce.

Architecture of MapReduce.

Executing MapReduce in YARN.

ApplicationMaster, ResourceManager, and NodeManager.

InputFormat and key-value pairs.

Mapper.

Reducer.

Partitioner: custom and default.

Shuffle and Sort.

Combiner.

Scheduler.

ApplicationMaster/Manager.

Containers and NodeManager.

Day 6

MapReduce hands-on.

Word count program / log analytics.

Hadoop streaming in R and Python.

Data processing Transformations.

Map-only jobs and uber jobs.

Inverted index and searches.

Day 7

MapReduce Programs, Part 2:

Structured and Unstructured Data handling.

Combiner.

Partitioner: single and multiple columns.

Inverted Index.

XML (semi-structured).

Map-side joins.

Reduce-side joins.

Day 8

Introduction to the Hive data warehouse.

Installation.

Configuring the metastore to use MySQL.

HiveQL commands.

Day 9

Manipulation and analytical functions in Hive.

Managed and external tables.

Partitioning and Bucketing.

Complex data types and Unstructured data.

Advanced HiveQL commands.

UDFs and UDAFs.

Integration with HBase.

SerDes / regular expressions.

Day 10

Introduction to Pig.

Installation.

Bags and collections.

Commands and Scripts.

Pig UDF.

Day 11

Introduction to NoSQL.

ACID / CAP / BASE.

Key-value stores.

MapReduce.

Column-family stores: HBase.

Document stores: MongoDB.

Graph databases: Neo4j.

Day 12

Introduction to HBase and installation.

The HBase Data Model.

The HBase Shell.

HBase Architecture.

Schema Design.

The HBase API.

HBase Configuration and Tuning.

Day 13

Introduction to Sqoop and installation.

Bulk loading.

Hadoop Streaming.

Day 14

Flume Architecture.

Agents, sources, sinks, and channels.

Ingest log file.

Collecting data from Twitter for sentiment analysis.

Day 15

Integrating with ETL: Talend Open Studio for Big Data.

Day 16

Big data Analytics.

Visualization, dimensional modeling, and Tableau.

Day 17

Spark.

Spark shell hands-on using HDFS.

Create an RDD from an HDFS file.

Creating a new RDD.

Transformations on RDDs.

Lineage Graph.

Actions on RDD.

RDD persistence and caching concepts.

Lazy evaluation of RDDs.

Hands on and core concepts of map() transformation.

Hands on and core concepts of filter() transformation.

Hands on and core concepts of flatMap() transformation.

Comparing the map and flatMap transformations.

Hands on and core concepts of reduce() action.

Hands on and core concepts of fold() action.

Hands on and core concepts of aggregate() action.

Basics of Accumulator.

Hands on and core concepts of collect() action.

Hands on and core concepts of take() action.

Apache Spark Execution Model.

How Spark executes a program.

Concepts of RDD partitioning.

RDD data shuffling and performance issues.

Day 18

Spark SQL

Day 19

spark-submit and Spark applications.

Day 20

Kafka: publisher/subscriber model.

Consumers and producers (sketched below).
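A minimal producer sketch using the kafka-clients Java API from Scala; the broker address and topic name are placeholders, and the consumer side mirrors it with KafkaConsumer:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("events", "key1", "hello"))
    producer.close()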

Day 22

Cloudera Manager and VM; Hue.

Day 23

Oozie: workflows and coordinators.

Day 24

Introduction to data science.

Machine learning.

Statistical analysis.

Sentiment analysis.

Cloudera / Hortonworks / Greenplum.

Day 25

Multi-node cluster setup.

High availability.

Hadoop data federation.

Commissioning and decommissioning.

Automatic and manual failover.

ZooKeeper failover controller.

Use cases, case studies, and proofs of concept.

Working on different distributions.

Day 26:(Optional)

Discussion of Cloudera and Hortonworks certification questions.

What is Big Data?

Big Data Facts

The Three V’s of Big Data

Understanding Hadoop

What is Hadoop? Why learn Hadoop?

Relational Databases vs. Hadoop

Motivation for Hadoop

6 Key Hadoop Data Types

The Hadoop Distributed File System (HDFS)

What is HDFS?

HDFS components

Understanding Block storage

The NameNode

The DataNodes

DataNode Failures

HDFS Commands

HDFS File Permissions

The MapReduce Framework

Overview of MapReduce

Understanding MapReduce

The Map Phase

The Reduce Phase

WordCount in MapReduce

Running a MapReduce Job (sketched below)
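A compact WordCount sketch over the Hadoop MapReduce Java API, written here in Scala to match the other examples; the class names are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Map phase: emit (word, 1) for every token in a line.
    class TokenMapper extends Mapper[Object, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: Object, value: Text,
                       ctx: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w); ctx.write(word, one)
        }
    }

    // Reduce phase: sum the counts for each word.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    object WordCountJob {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(getClass)
        job.setMapperClass(classOf[TokenMapper])
        job.setCombinerClass(classOf[SumReducer])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }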

Planning Your Hadoop Cluster

Single Node Cluster Configuration

Multi-Node Cluster Configuration

Checking HDFS Status

Breaking the cluster

Copying Data Between Clusters

Adding and Removing Cluster Nodes

Rebalancing the cluster

NameNode Metadata Backup

Cluster Upgrading

Installing and Managing Hadoop Ecosystem Projects

Sqoop

Flume

Hive

Pig

HBase

Oozie

Managing and Scheduling Jobs

Managing Jobs

The FIFO Scheduler

The Fair Scheduler

How to stop and start jobs running on the cluster

Cluster Monitoring, Troubleshooting, and Optimizing

General System conditions to Monitor

NameNode and JobTracker Web UIs

View and Manage Hadoop’s Log files

Ganglia Monitoring Tool

Common cluster issues and their resolutions

Benchmark your cluster’s performance

Populating HDFS from External Sources

How to use Sqoop to import data from RDBMSs to HDFS

How to gather logs from multiple systems using Flume

Features of Hive, HBase, and Pig

How to populate HDFS from external Sources