Apache Spark Training Chennai

Module 1

Big Data Landscape

Why Big Data

The 3 Vs of Big Data

Hadoop Ecosystem

Introduction to Apache Spark

Features of Apache Spark

Apache Spark Stack

Introduction to RDDs

RDD Transformations

What is good and bad in MapReduce?

Why use Apache Spark

Module 2


Single node

Include Hadoop

Include Apache Spark

Include Hive

Include Sqoop

Include Hue

Module 3

Deep Dive in HDFS

HDFS Design

Fundamentals of HDFS

Rack Awareness

Read/Write from HDFS

HDFS Federation and High Availability (Hadoop 2.x)

HDFS Command Line Interface

Module 4

Spark Shell Hands On Using HDFS

Spark Shell Introduction

Create file using Hue

Spark Shell extracting file from HDFS

Create RDD from HDFS file

Module 5

Programming with RDD Part-1

Creating new RDD

Transformations on RDD

Lineage Graph

Actions on RDD

RDD Concepts on Persist and Cache

Lazy evaluation of RDD

Module 6

Scala/Spark Functional Programming

Using Function Literals

Anonymous Functions

Define a function which accepts another function

Module 7

RDD Transformation Programming in Depth

Hands on and core concepts of map() transformation

Hands on and core concepts of filter() transformation

Hands on and core concepts of flatMap() transformation

Compare map and flatMap transformation
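
As a rough illustration of the map versus flatMap comparison, the sketch below models both transformations on a plain Python list. It is not the Spark API: `rdd_map` and `rdd_flat_map` are hypothetical helper names, and the list stands in for an RDD.

```python
from itertools import chain


def rdd_map(records, f):
    """Model of map(): exactly one output element per input element."""
    return [f(r) for r in records]


def rdd_flat_map(records, f):
    """Model of flatMap(): f returns a sequence per element, and the
    sequences are flattened into a single collection."""
    return list(chain.from_iterable(f(r) for r in records))


# A plain list standing in for an RDD of text lines (assumption).
lines = ["spark is fast", "rdd basics"]

mapped = rdd_map(lines, str.split)            # one list of words per line
flat_mapped = rdd_flat_map(lines, str.split)  # one flat list of words

print(mapped)
print(flat_mapped)
```

With map the result keeps one entry per input line (here, a list of words each), while flatMap yields the individual words, which is why flatMap is the usual first step in a word count.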

Module 8

Apache Spark in Action

Hands on and core concepts of reduce() action

Hands on and core concepts of fold() action

Hands on and core concepts of aggregate() action

Basics of Accumulator

Hands on and core concepts of collect() action

Hands on and core concepts of take() action

Ordered access of RDD
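
The fold() and aggregate() actions in this module can be pictured with a small pure-Python model. This is only a sketch under assumptions: nested lists stand in for RDD partitions, and `rdd_fold` and `rdd_aggregate` are hypothetical helpers, not Spark functions.

```python
from functools import reduce


def rdd_aggregate(partitions, zero, seq_op, comb_op):
    """Model of aggregate(): seq_op folds the elements inside each
    partition, then comb_op merges the per-partition results."""
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)


def rdd_fold(partitions, zero, op):
    """Model of fold(): the same function is used within and across
    partitions, so the result type equals the element type."""
    return rdd_aggregate(partitions, zero, op, op)


# Two lists standing in for two partitions of one RDD (assumption).
partitions = [[1, 2, 3], [4, 5]]

folded = rdd_fold(partitions, 0, lambda a, b: a + b)

# aggregate() can return a different type: here a (sum, count) pair.
total, count = rdd_aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = total / count
print(folded, total, count, mean)
```

The zero value is applied once per partition and once more in the final merge, so it should be neutral for the operation (0 for addition, for example).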

Module 9

Apache Spark Execution Model

How Spark executes a program

Concepts of RDD partitioning

RDD data shuffling and performance issues

Module 10

Apache Spark PairRDD

Core concepts of PairRDD

Creation of PairRDD

Aggregation in PairRDD

Aggregation functions understanding in depth

How does reduceByKey() work conceptually?

How does foldByKey() work conceptually?

How does combineByKey() work conceptually?
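
One way to picture the three aggregation functions above is the pure-Python model below. It is a conceptual sketch, not Spark code: lists of (key, value) tuples stand in for PairRDD partitions, and `combine_by_key`, `reduce_by_key`, and `fold_by_key` are hypothetical helpers written for illustration.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Model of combineByKey(): build one combiner per key inside each
    partition, then merge the combiners of the same key across partitions."""
    partials = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)
        partials.append(combiners)

    # Stand-in for the shuffle: only per-partition combiners move, not raw values.
    merged = {}
    for combiners in partials:
        for key, comb in combiners.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged


def reduce_by_key(partitions, func):
    """reduceByKey() modeled as combineByKey() with an identity combiner."""
    return combine_by_key(partitions, lambda v: v, func, func)


def fold_by_key(partitions, zero, func):
    """foldByKey() modeled as reduceByKey() seeded with a zero value."""
    return combine_by_key(partitions, lambda v: func(zero, v), func, func)


# Two lists standing in for two partitions of one PairRDD (assumption).
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 5), ("b", 4)]]

sums = reduce_by_key(partitions, lambda x, y: x + y)
folded = fold_by_key(partitions, 0, lambda x, y: x + y)

# Per-key average: the combiner is a (sum, count) pair, a different type
# from the values, which reduceByKey() cannot express directly.
sum_counts = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda c, v: (c[0] + v, c[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {k: s / n for k, (s, n) in sum_counts.items()}
print(sums, folded, averages)
```

Because values are combined inside each partition before anything is merged across partitions, this model also hints at why reduceByKey tends to move less data than groupByKey.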

Module 11

Spark PairRDD Hands-On Lab





Module 12

Spark PairRDD Joining, Zipping and

reduceByKey versus groupByKey performance issue



Joining (left, right, inner, etc.)

Module 13

Understanding Hadoop SequenceFile

Creating a SequenceFile and Processing It Using Spark

Creating SequenceFile using TSV file

Loading Data in Apache Hive

Processing SequenceFile as an RDD

Module 14

Spark Shared Variables

Shared Variables: Broadcast Variables

Shared Variables: Accumulators

Module 15

Spark Accumulator

Word count and Character Count

Counting Bad records in a file

Module 16

Spark Broadcast Variable

Joining two CSV files, one as a broadcast lookup table

Module 17

Spark API

Broadcast Variable, Filter Functions and Saving File

Module 18

Spark API

Spark Join, GroupBy and Swap function

Module 19

Spark API

Remove Header from CSV file and Map Each column to Row Data

Module 20

Spark SQL


SchemaRDD replaced by DataFrame API

History of SparkSQL

Catalyst Optimizer

Module 21

SparkSQL Hands-On Sessions

Hive Configuration

Create Hive table using Spark

Load Data in Hive table using Spark

Create another table using DataFrame

Module 22

Implementing Business Logic using SparkSQL

Loading CSV file

Spark Case classes (to create schema for CSV file)

Convert RDD to DataFrame using DataFrame API to query data

Using SQL query on DataFrame

Module 23

Spark Loading and Saving Your Data


CSV and TSV files

JSON Files

Module 24

Spark Loading and Saving Your Data: SQL and NoSQL


HBase (NoSQL)

Module 25

Writing Spark Applications

Spark Applications vs Spark Shell

Creating the SparkContext

Configuring Spark Properties

Building and Running a Spark Application


Module 26

Spark Streaming in Depth Part-1

Spark Streaming Overview

Example: Streaming Word Count

Module 27

Spark Streaming in Depth Part-2

Other Streaming Operations

Sliding Window Operation

Developing Spark Streaming Applications

Module 28

Spark Algorithms Part-1

Iterative Algorithm

Graph Analysis

Machine Learning

Module 29

Case studies