ETL TOOLS Course Content

May 10, 2016

Overview of Extraction, Transformation, and Loading

You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.

The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems form the IT backbone of any enterprise. Data has to be shared between applications or systems in order to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what we now call ETL.

Data warehouse environments face the same challenge, with the additional burden that they not only have to exchange data but also to integrate, rearrange, and consolidate it across many systems, thereby providing a new, unified information base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.

What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often it is not possible to identify the specific subset of interest, so more data than necessary has to be extracted and the relevant data is identified at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the interval may range from days or hours down to minutes or near real time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period of time.
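As a minimal sketch of an extraction step, the query below pulls only recently changed rows from a source system over a database link; the link name source_db, the orders table, and its columns are assumptions made for illustration only:

    -- Extract rows changed since midnight yesterday; narrowing down to the
    -- exact subset of interest can be deferred to a later transformation step.
    SELECT order_id, customer_id, order_total, last_updated
      FROM orders@source_db
     WHERE last_updated >= TRUNC(SYSDATE) - 1;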

After extracting data, it has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen method of transportation, some transformations can be done during this process, too. For example, a SQL statement that directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.
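A transport step of this kind might look like the following sketch, in which the staging table stg_customers, the remote customers table, and the gateway link name are hypothetical:

    -- Move data into a local staging table while concatenating two source
    -- columns as part of the SELECT issued over the gateway/database link.
    INSERT INTO stg_customers (customer_id, full_name)
    SELECT customer_id, first_name || ' ' || last_name
      FROM customers@source_gateway;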

The emphasis in many of the examples in this section is scalability. Many long-time users of Oracle are experts in programming complex data transformation logic using PL/SQL. Here we suggest alternatives for many such data manipulation operations, with a particular emphasis on implementations that take advantage of Oracle's SQL functionality for ETL and the parallel query infrastructure.
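As one example of such a set-based alternative, a single MERGE statement (which can also be run with parallel DML) can replace a row-by-row PL/SQL loop for an upsert-style load; the dimension and staging tables below are assumed names used purely for illustration:

    -- Upsert staged customer rows into a warehouse dimension in one
    -- set-based statement instead of procedural row-at-a-time logic.
    MERGE INTO dim_customer d
    USING stg_customers s
       ON (d.customer_id = s.customer_id)
     WHEN MATCHED THEN
       UPDATE SET d.full_name = s.full_name
     WHEN NOT MATCHED THEN
       INSERT (d.customer_id, d.full_name)
       VALUES (s.customer_id, s.full_name);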

Chapter 1 Introduction and Architecture

Introduction to Informatica.

Architecture of Informatica PowerCenter.

Installation and configuration.

Informatica Design Process.

Domain, Node, and Services.

Repository Service.

Integration service.

Configure security.

Informatica Client Tool.

Designer, Workflow Manager, Workflow Monitor.

Designer tools.

Source Analyzer, Target Designer, Transformation Developer.

Mapping and Mapplet designer.

Chapter 2 Mapping Designer

Create a Simple Mapping.

Overview of Transformations.

Types of Transformations.

Active and Passive.

Expression Transformation.

Source Qualifier Transformation.

Introduction to Workflow Manager.

Task Developer.

Workflow Designer.

Creating Connections.

Chapter 3 Flat Files Handling

Create Simple Mappings.

Flat File Analysis.

Chapter 4 Architecture Review

Architecture Review.

Differences between PowerMart and PowerCenter.

Connecting to SAP-FTP Process.

Command Task.

Create a Mapping.

Joiner Transformation.

Chapter 5 Transformations

Create a Mapping.

Using Join in a Source Qualifier Transformation.

Expression Transformation.

Filter Transformation.

Chapter 6 Transformations Session1

Multiple Targets Loading.

Router Transformation.

Difference Between Filter and Router Transformations.

Create a Mapping.

Master Outer Join.

Chapter 7 Transformations Session2

Reusable Transformation.

Lookup Transformation.

Debugger.

Handling Multiple Files as Source.

Chapter 8 Mapplets

Sequence Generator Transformation.

Multiple Pipelines in a Single Mapping.

Target Load Plan-Mapplets.

Chapter 9 Update Strategy

SCD Type 1 mapping Implementation.

Update Strategy Transformation.

Chapter 10 SCD Type 2 and Type 3

SCD Type 2 and Type 3 mapping Implementation.

Chapter 11 Loading Strategy

Sequential and Parallel Loading.

Union Transformation.

Dynamic Lookup Cache.

Introduction to Fact Loading.

Chapter 12 Transformations

Date Dimension Loading.

Lookup Transformation.

Persistent Cache.

Aggregator Transformation.

Sorter Transformation.

Creating Stored Procedure.

Chapter 13 Transformations

Normalizer Transformation.

Stored Procedure Transformation.

Active vs. Passive Transformations.

Connected vs. Unconnected Transformations.

Chapter 14 Implementing Incremental Loading

Importance of Mapping Variable.

Chapter 15 Implementing Incremental Aggregation

Create a Mapping.

Incremental Aggregation.

Chapter 16 – Transaction and Variable Port

Transaction Control Transformation.

Variable Ports.

Chapter 17 – Workflow Tasks

Scheduling.

Task Link Properties.

Event Raise Task.

Event Wait Task.

Worklet.

Timer.

Command.

Email.

Control.

Decision.

Assignment.

SQL Concepts.

Data warehousing Concepts.

DataStage Concepts.

DS Components.

Repository and Table definition.

Built-in Components.

Stage Editors.

Types of Parallelism.

Partition Techniques.

Containers (Local & Shared).

Managing Metadata Environment.

Importing & Exporting Jobs.

General, File, Processing, Database, Debug/Development.

Restructure Stages, Transformer, Lookup.

Job Sequences.

Slowly Changing Dimension (SCD).

Performance and Tuning Aspects, FAQ.

 

Chapter I

Introduction to Ab Initio.

Operating Systems & Hardware Platforms that it supports.

Chapter II

Products of Ab Initio Software Corporation.

The Ab Initio Co-Operating System.

Graphical Development Environment (GDE).

The Ab Initio Enterprise Meta Environment (EME).

Chapter III

Salient Features of Ab Initio.

Chapter IV

Understanding the High Level Architecture of Ab Initio.

What is the Anatomy of Running a Job?

What happens when we push the ‘Run’ Button?

Deployment Process.

Host Process Creation.

Agent.

Process Creation.

Component Process Creation.

Successful Component Termination.

Agent Termination.

Host Termination.

Abnormal Component Termination.

How do we Connect from the GDE to the Server?

Host Profile File (.aih): Connection, Co-Operating System, Shell & Host Directory.

What is Graph Programming? What are the basic parts of an Ab Initio Graph?

Broad Classification of Components.

Dataset Components and Program/Computing Components.

Data Manipulation Language (DML).

Record Formats, DML Expressions, Transform Functions, Key Specifiers.

How do we build a basic Ab Initio Graph? Using .dat (Data File) & .dml (Record Format File) Dataset Components.

Input File.

Output File.

Intermediate File.

Lookup file.

Editors in the GDE.

Record Format Editor.

Expression Editor.

Transform Editor.

Key Specifier Editor.

Package Editor.

Variables Editor.

Sequence Specifier Editor.

Transform Functions.

Transform File (.xfr).

Functions, Variables, Statements & Rules.

Transform Components.

Multistage and Advanced Components.

Filter by Expression.

Dedup Sorted.

Reformat.

Aggregate.

Scan.

Rollup.

Join.

Match Sorted.

Normalize.

Denormalize Sorted.

Lookup Datasets & DML Lookup Functions.

Partition Components.

Broadcast.

Partition by Key.

Partition by Expression.

Partition by Percentage.

Partition by Range.

Partition by Round Robin.

Partition with Load Balancing.

Multifiles and Multidirectories.

De-partition Components.

Concatenate.

Gather.

Interleave.

Merge.

What is Parallelism? – Types.

Component Parallelism.

Pipeline Parallelism.

Data Parallelism.

Repartitioning – When & Why.

Layout Implications.

Sort

Sort within Groups.

Sample.

Partition by Key.

Sort.

Checkpoint Sorted.

Working with Databases.

Introduction to the Database Configuration File (.dbc).

Interactive Database Components (IDB).

Input Table.

Output Table.

Run SQL.

Truncate Table.

Update Table.

Join with DB.

Miscellaneous Components

Redefine Format.

Gather Logs.

Replicate.

Run Program.

Trash.

Deprecated Components.

Find Splitters.

Compress Components.

Compress, Uncompress, Gzip, Gunzip.

File Transfer Protocol (FTP) Components.

FTP From, FTP To.

Testing and Validation of Ab Initio Graphs using Validate Components

Check Order.

Compare Records.

Validate Records.

Generate Records.

Generate Random Bytes.

Compute Checksum.

Compare Checksum.

Phasing & Checkpointing.

Reusability Features.

Subgraphs.

Graph Parameters.

The Ab Initio Enterprise Meta Environment (EME)

High level anatomy of the EME.

Relation to sandbox environment.

Checking in your sandbox.

Checking out a graph or a sandbox.

Locking in order to change an object.

Tagging.

Common projects.

Dependency and Impact analysis.

Performance Tuning Ab Initio Graphs

What is Good Performance?

Going Parallel.

Serial Inputs.

Phases.

Record Formats and Components.

Memory Usage.

Various Tips.

Oracle Warehouse Builder (OWB)

Oracle Warehouse Builder – Installation, Concepts, and Setup

Oracle Warehouse Builder Concepts and Overview.

Installing Warehouse Builder (lesson, demo, and lab).

Describe components.

Creating a Design Repository.

Navigation.

Capturing Data Source Definitions.

Defining Source Metadata.

Defining Staging Metadata.

Mention Data Rules and Mapping (covered in detail later).

Data Profiling.

Introduction to Dimensional Modeling.

Defining a relational dimension model.

Slowly Changing Dimensions.

Overview of Extract, Transform, and Load (ETL)

Usage

Designing Data Objects.

Designing Process Flows.

Introduction to Data Mapping.

Using the Mapping Editor.

OWB Editors.

Object Editor.

Table Editor.

Dimension Editor.

Expert Editor.

Loading Data.

Flat File Loading.

Partition Exchange Loading.

Query Loading.

Deploying Warehouse Builder in Production Environments.

Advanced Concepts

Transferring Metadata from OWB to Discoverer and Oracle OLAP.

Managing Oracle Warehouse Builder.

OLAP Modeling Concepts.

OWB Scripting.

Creating Experts.

Identify and Manage Source Changes.

Manage and Upgrade Warehouse Structures.

 

Introduction

Identifying the Course Units.

What is Oracle Data Integrator?

Why Oracle Data Integrator?

Overview of ODI 11g Architecture.

Overview of ODI 11g Components.

About Graphical Modules.

Types of ODI Agents.

Overview of Oracle Data Integrator Repositories.

Administering ODI Repositories and Agents.

Administering the ODI Repositories.

Creating Repository Storage Spaces.

Creating and Connecting to the Master Repository.

Creating and Connecting to the Work Repository.

Managing ODI Agents.

Creating a Physical Agent.

Launching a Listener, Scheduler and Web Agent.

Example of Load Balancing.

ODI Topology Concepts.

Overview of ODI Topology.

About Data Servers and Physical Schemas.

Defining the Physical Architecture.

Defining the Logical Architecture.

Mapping Logical and Physical Resources.

Defining Agents.

Defining a Topology.

Planning the Topology

Describing the Physical and Logical Architecture.

Overview of Topology Navigator.

Creating Physical Architecture.

Creating a Data Server.

Testing a Data Server Connection.

Creating a Physical Schema.

Creating Logical Architecture.

Overview of Logical Architecture and Context Views.

Linking the Logical and Physical Architecture.

Setting Up a New ODI Project.

Overview of ODI Projects.

Creating a New Project.

Using Folders.

Organizing Projects and Folders.

Understanding Knowledge Modules.

Exchanging ODI Objects.

Exporting and Importing Objects.

Using Markers.

Oracle Data Integrator Model Concepts.

What is a Model?

Understanding Metadata in ODI.

Understanding Reverse Engineering-Creating Models.

Organizing Models.

Creating Data stores.

Using Constraints in ODI.

Creating Keys and References.

Organizing ODI Models and Creating Data stores

What is an Interface?

Business Rules for Interfaces.

What is a Mapping?

What is a Join?

What is a Filter?

What is a Constraint?

What is a Staging Area?

Creating a Basic Interface.

ODI Interface Concepts

What is an Interface?

Business Rules for Interfaces.

What is a Mapping, Filter, Join?

Overview of Integration Process.

What is a Staging Area?

About Execution Location.

Using Knowledge Modules (KM) with ODI Interface.

Creating a Basic Interface.

Designing Interfaces

Designing an Interface.

Multiple Source Data stores.

Creating Joins.

Filtering data.

Disabling Transformations.

Overview of the Flow.

Specifying the Staging Area.

Selecting Knowledge Modules.

Interfaces: Monitoring and Debugging.

Monitoring Interfaces.

Using Operator.

Viewing Sessions and Tasks.

How to Monitor Execution of an Interface.

How to Troubleshoot a Session.

Keys to Reviewing the Generated Code.

Working with Errors.

Tips for Preventing Errors.

Designing Interfaces: Advanced Topics

Using Business Rules in Interfaces.

Overview of Business Rule Elements.

Using variables.

Using User Functions.

Using Substitution Methods.

Modifying a KM.

Developing Your Own KM.

Using RKM for Customized Reverse Engineering.

Using ODI procedures.

What is a Procedure?

Examples of Procedures.

Creating Procedures.

Adding Commands.

Adding Options.

Running a Procedure.

Using Operator to View Results

Using ODI Packages.

What is a package?

Creating a package.

Executing a package.

Creating Advanced Packages.

Error handling.

Controlling an Execution Path.

Creating a Loop.

Using the Advanced tab.

Managing ODI Scenarios and Versions.

What is a Scenario?

Managing Scenarios.

Preparing Scenarios for Deployment.

Automating Scenario Management.

Scheduling the ODI Scenario.

Overview of ODI version management.

Using Version Browser and Version Comparison Tool.

Handling concurrent changes.

Enforcing Data Quality and Auditing Data with ODI.

Why Data Quality?

When to Enforce Data Quality?

Data Quality in Source Applications.

Data Quality Control in the Integration Process.

Data Quality in the Target Applications.

Enforcing Data Quality.

Exploring Your Data.

Auditing Data Quality.

Working with Changed Data Capture.

Overview of ODI version management.

Techniques of Changed Data Capture.

Changed Data Capture in ODI.

CDC Strategies and Infrastructure.

CDC Consistency.

Using CDC.

Viewing Data/Changed data.

Using Journalizing.

Administering ODI Resources: Advanced Topics.

Using Open Tools.

Installing Open Tools.

Using Open Tools in a Package.

Using Open Tools in a Procedure or in a KM.

Developing Your Own Open Tools.

Setting Up ODI Security.

Defining Security Policies.

Defining Password Policies.

Using Web Services and Integration of Oracle Data Integrator with SOA.

Web Services in Action.

Using Data Services.

Setting Up Data Services.

Testing Data Services.

Installing Public Web Services.

Using Public Web Services.

Invoking Web Services.

Integrating ODI with SOA.

Extending ODI with the SDK.

Using SDK Public Interfaces.

Integrating through ODI SDK.

Examining SDK examples.

Introduction to Integration Services

Defining SQL Server Integration Services.

Exploring the need for migrating diverse data.

The role of business intelligence (BI).

Illustrating SSIS architecture.

Distinguishing between data flow pipeline and package runtime.

Executing packages on the client side or hosted in the SSIS service.

Upgrading legacy DTS.

Executing existing DTS packages in the SSIS environment.

Converting DTS packages to SSIS with the migration wizard.

Logging migration results.

Implementing Tasks and Containers.

Utilizing basic SSIS objects.

Configuring connection managers.

Adding data flow tasks to packages.

Reviewing progress with data viewers.

Assembling tasks to perform complex data migrations.

Operating system level tasks.

Copying, moving and deleting files.

Transferring files with the FTP task.

Reading system information with WMI query language (WQL).

Communicating with external sources.

Sending messages through mail.

Detecting system events with WMI.

Processing XML.

Iterating XML nodes.

Writing XML files from databases.

Extending Capabilities with Scripting.

Writing expressions.

Making properties dynamic with variables.

Building expressions in Expression Builder.

Script Task.

Extending functionality with the Script Task.

Debugging, breakpoints, watches.

Transforming with the Data Flow Task.

Performing transforms on columns.

Converting and calculating columns.

Transforming with Character Map.

Combining and splitting data.

Profiling data.

Merge, Union, Conditional Split.

Multicasting and converting data.

Manipulating row sets and BLOB data.

Aggregate, sort, audit, and look up data.

Importing and exporting BLOB data.

Redirecting error rows.

Performing database operations.

Implementing Change Data Capture (CDC).

Executing a SQL task.

Bulk inserting data from text files.

Error Handling, Logging and Transactions.

Organizing package work flow.

Defining success, failure, completion and expression precedence constraints

Handling events and event bubbling.

Designing robust packages.

Choosing log providers.

Adapting solutions with package configurations.

Auditing package execution results.

Administering Business Intelligence.

Managing and securing packages.

Storing packages in the Package Store and msdb.

Encrypting packages with passwords and user keys.

Integrating with other BI components.

Displaying data in Reporting Services.

Accessing package data with ADO.NET

Module 1

Pentaho Data Integration Overview.

Exercise 1- Introducing Pentaho Data Integration

Module 2

Inputs and Outputs

Module 3

Introduction to the Training Data

Exercise 2-Inputs and Outputs

Module 4

Data Warehouse Steps

Exercise 3-Data Warehouse Steps

Module 5

Lookups

Module 6

Field Transformations, Part 1

Exercise 4-Lookups and Field Transformations

Module 7

Set Transformations

Exercise 5-Set Transformations

Module 8

Pivot Transformations

Exercise 6-Pivot Transformations

Module 9

Field Transformations, Part 2

Module 10

Loading the Time Dimension and the Fact Table

Exercise 7-Loading a Fact Table

Module 11

Introduction to Jobs

Exercise 8-Creating a Job

Module 12

Advanced Job Concepts

Exercise 9- Advanced Job Concepts

Module 13

Common Scripting Uses

Exercise 10-Using JavaScript

Module 14

Dynamic Transformations

Module 15

Using XML in Pentaho Data Integration

Exercise 11-Using XML

Module 16

Portable Transformations and Jobs

Exercise 12-Portable Transformations and Jobs

Module 17

Logging

Exercise 13-Configuring Logging

Module 18

Error Handling in Transformations

Exercise 14-Error Handling in Transformations

Module 19

ETL Patterns

Exercise 15-Calculating Time Between Orders

(Optional) Module 20

Pentaho Enterprise Repository

(Optional) Exercise 16- Pentaho Enterprise Repository

Module 21

Scheduling and Monitoring

Exercise 17-Scheduling and Monitoring

Module 22

Pre- and Post-Processing

Exercise 18-Constraint and Index Management

(Optional) Module 23

Tuning and Administration Topics

Module 24

Interpreting Runtime Data

Module 25

Clustering and Partitioning

Exercise 19-Clustering and Partitioning

Informatica Data Quality.

Quality is a relative and never-ending judgment, one that needs to be defined by the business (or business unit) that’s consuming the data. An essential element of holistic data governance, trustworthy data serves critical business needs across the enterprise—from legal to finance to marketing and beyond.

Informatica Data Quality

It provides all your projects and initiatives with clean, trusted data to meet your business objectives, regardless of the size, format, or platform of your data.

Course Duration: 20 days or 8 weekends (45 hours).

Course content: mail to info@geoinsyssoft.com