Apache NiFi Data Flow


Apache NiFi is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. NiFi has a web-based user interface for design, control, feedback, and monitoring of dataflows. It is highly configurable along several dimensions of quality of service, such as loss-tolerant versus guaranteed delivery, low latency versus high throughput, and priority-based queuing. NiFi provides fine-grained data provenance for all data received, forked, joined, cloned, modified, sent, and ultimately dropped upon reaching its configured end-state.
Put simply, NiFi was built to automate the flow of data between systems. While the term dataflow is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data.
This blog post describes a simple dataflow from the local file system to HDFS using Apache NiFi.
Before we get into the exercise, we need to be aware of the basic components of Apache NiFi.
NiFi Components:
NiFi provides several extension points to provide developers the ability to add functionality to the application to meet their needs. The following list provides a high-level description of the most common extension points:
Processor
The Processor interface is the mechanism through which NiFi exposes access to FlowFiles, their attributes, and their content. The Processor is the basic building block used to comprise a NiFi dataflow. This interface is used to accomplish all of the following tasks:
Create FlowFiles
Read FlowFile content
Write FlowFile content
Read FlowFile attributes
Update FlowFile attributes
Ingest data
Egress data
Route data
Extract data
Modify data
ReportingTask
The ReportingTask interface is a mechanism that NiFi exposes to allow metrics, monitoring information, and internal NiFi state to be published to external endpoints, such as log files, e-mail, and remote web services.
ControllerService
A ControllerService provides shared state and functionality across Processors, other ControllerServices, and ReportingTasks within a single JVM. An example use case may include loading a very large dataset into memory. By performing this work in a ControllerService, the data can be loaded once and be exposed to all Processors via this service, rather than requiring many different Processors to load the dataset themselves.
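The load-once idea behind a ControllerService can be illustrated with a small Python model. This is only a conceptual sketch, not NiFi's Java API: the class `SharedLookupService`, the loader `load_country_codes`, and the sample data are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Optional


class SharedLookupService:
    """Hypothetical stand-in for a ControllerService that shares one dataset."""

    def __init__(self, loader: Callable[[], Dict[str, str]]) -> None:
        self._loader = loader
        self._data: Optional[Dict[str, str]] = None
        self.load_count = 0  # how many times the expensive load actually ran

    def lookup(self, key: str) -> Optional[str]:
        # Load lazily on first use, then reuse the in-memory copy.
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1
        return self._data.get(key)


def load_country_codes() -> Dict[str, str]:
    # Placeholder for an expensive load (a large file, a database, ...).
    return {"IN": "India", "US": "United States"}


service = SharedLookupService(load_country_codes)

# Two "processors" use the same service instead of loading the data themselves.
enrich_result = service.lookup("IN")
route_result = service.lookup("US")
```

After both lookups, `service.load_count` is 1: the dataset was loaded once and shared. The sketch ignores thread safety, which a real shared service would have to handle.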
FlowFilePrioritizer
The FlowFilePrioritizer interface provides a mechanism by which FlowFiles in a queue can be prioritized, or sorted, so that the FlowFiles can be processed in an order that is most effective for a particular use case.
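As a rough illustration of what prioritizing a queue means, the following Python sketch sorts simplified FlowFiles by an attribute. It is a conceptual model rather than the NiFi interface itself: the `FlowFile` class, the `priority` attribute convention (lower value first), and the file names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlowFile:
    """Hypothetical, simplified FlowFile: a name plus string attributes."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


def priority_key(flow_file: FlowFile) -> int:
    # Assumption: a lower "priority" value means "process sooner";
    # FlowFiles without the attribute sort last.
    value = flow_file.attributes.get("priority")
    return int(value) if value is not None else 2**31


def prioritize(queue: List[FlowFile]) -> List[FlowFile]:
    # sorted() is stable, so equal priorities keep their arrival order.
    return sorted(queue, key=priority_key)


queue = [
    FlowFile("audit.log", {"priority": "5"}),
    FlowFile("alert.json", {"priority": "1"}),
    FlowFile("bulk.csv"),
]
ordered = [flow_file.name for flow_file in prioritize(queue)]
```

Here `ordered` places `alert.json` first, then `audit.log`, then the unprioritized `bulk.csv`.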
AuthorityProvider
An AuthorityProvider is responsible for determining which privileges and roles, if any, a given user should be granted.
Steps to Work Through:
1. Log in to NiFi
2. Drag a GetFile Processor onto the canvas
3. Drag a PutHDFS Processor onto the canvas
4. Configure the GetFile Processor
5. Configure the PutHDFS Processor
6. Provide the required input file path
7. Make a connection
8. Start the processors
9. Check the output file location
Logging In
Clicking the login link opens the login page. Users who log in with a username and password are presented with a form to do so.
Nifi API
Building a DataFlow:
A DFM is able to build an automated dataflow using the NiFi UI. Simply drag components from the toolbar to the canvas, configure the components to meet specific needs, and connect the components together.
Adding Components to the Canvas
1_2_selecting_getfile_processor
Processor:
components
The Processor is the most commonly used component, as it is responsible for data ingress, egress, routing, and manipulating. There are many different types of Processors. In fact, this is a very common Extension Point in NiFi, meaning that many vendors may implement their own Processors to perform whatever functions are necessary for their use case. When a Processor is dragged onto the canvas, the user is presented with a dialog to choose which type of Processor to use:
Clicking the Add button or double-clicking a Processor Type adds the selected Processor to the canvas at the location where it was dropped. For this flow, add a GetFile Processor first and then a PutHDFS Processor in the same way.
Once you have dragged a Processor onto the canvas, you can interact with it by right-clicking on the Processor and selecting an option from the context menu.
Configure:
This option allows the user to establish or change the configuration of the Processor.
Start or Stop:
This option allows the user to start or stop a Processor; the option will be either Start or Stop, depending on the current state of the Processor.
Stats:
This option opens a graphical representation of the Processor’s statistical information over time.
Data provenance:
This option displays the NiFi Data Provenance table, with information about data provenance events for the FlowFiles routed through that Processor.
Usage:
This option takes the user to the Processor’s usage documentation.
Change color:
This option allows the user to change the color of the Processor, which can make the visual management of large flows easier.
Center in view:
This option centers the view of the canvas on the given Processor.
Copy:
This option places a copy of the selected Processor on the clipboard, so that it may be pasted elsewhere on the canvas by right-clicking on the canvas and selecting Paste. The Copy/Paste actions also may be done using the keystrokes Ctrl-C (Command-C) and Ctrl-V (Command-V).
Delete:
This option allows the DFM to delete a Processor from the canvas.
Configuring a GetFile Processor:
2_selecting_getfile_processor
To configure a processor, right-click on the Processor and select the Configure option from the context menu. The configuration dialog is opened with four different tabs, each of which is discussed below. Once you have finished configuring the Processor, you can apply the changes by clicking the Apply button or cancel all changes by clicking the Cancel button.
3.1_right_click_contextmenu
Note that after a Processor has been started, the context menu shown for the Processor no longer has a Configure option but rather has a View Configuration option. Processor configuration cannot be changed while the Processor is running. You must first stop the Processor and wait for all of its active tasks to complete before configuring the Processor again.
4_configure_processor
Settings Tab
This tab contains several different configuration items. First, it allows the DFM to change the name of the Processor. The name of a Processor by default is the same as the Processor type. This tab also includes other configuration settings such as auto-terminate relationships, penalty duration, yield duration, and bulletin level (refer to the Apache NiFi User Guide for more detail).
Scheduling Tab
The second tab in the Processor Configuration dialog is the Scheduling tab, which offers the different scheduling strategies: Timer Driven, Event Driven, and CRON Driven.
Properties Tab
The Properties Tab provides a mechanism to configure Processor-specific behavior. There are no default properties. Each type of Processor must define which Properties make sense for its use case.
5_configure_processor_properties
Here we configure the properties for the GetFile and PutHDFS Processors.
GetFile Processor Properties:
Description:
Creates FlowFiles from files in a directory. NiFi will ignore files it doesn’t have at least read permissions for.

| Name | Default Value | Allowable Values | Description |
|---|---|---|---|
| Input Directory | | | The input directory from which to pull files. Supports Expression Language: true |
| File Filter | `[^\.].*` | | Only files whose names match the given regular expression will be picked up |
| Path Filter | | | When Recurse Subdirectories is true, then only subdirectories whose path matches the given regular expression will be scanned |
| Batch Size | 10 | | The maximum number of files to pull in each iteration |
| Keep Source File | false | true, false | If true, the file is not deleted after it has been copied to the Content Repository; this causes the file to be picked up continually and is useful for testing purposes. If not keeping the original, NiFi will need write permissions on the directory it is pulling from; otherwise it will ignore the file. |
| Recurse Subdirectories | true | true, false | Indicates whether or not to pull files from subdirectories |
| Polling Interval | 0 sec | | Indicates how long to wait before performing a directory listing |
| Ignore Hidden Files | true | true, false | Indicates whether or not hidden files should be ignored |
| Minimum File Age | 0 sec | | The minimum age that a file must be in order to be pulled; any file younger than this amount of time (according to last modification date) will be ignored |
| Maximum File Age | | | The maximum age that a file must be in order to be pulled; any file older than this amount of time (according to last modification date) will be ignored |
| Minimum File Size | 0 B | | The minimum size that a file must be in order to be pulled |
| Maximum File Size | | | The maximum size that a file can be in order to be pulled |

6_configure_properties_getfile

Here we set the Input Directory to /home/hduser/details_console, the location on the local disk where the input file is stored. The other properties can be overridden as required; here I leave them at their default values.
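To make the effect of the File Filter, Batch Size, and file size properties more concrete, here is a small Python approximation of how such filters select files from a directory. It is a sketch only and not NiFi's implementation: the function `select_files` and its parameters are hypothetical, it assumes the regular expression must match the whole file name, and it leaves out recursion, file age, and polling.

```python
import os
import re
import tempfile
from typing import List, Optional


def select_files(
    input_directory: str,
    file_filter: str = r"[^\.].*",
    batch_size: int = 10,
    min_size: int = 0,
    max_size: Optional[int] = None,
) -> List[str]:
    """Return up to batch_size file names that pass the filters."""
    pattern = re.compile(file_filter)
    selected: List[str] = []
    for name in sorted(os.listdir(input_directory)):
        path = os.path.join(input_directory, name)
        if not os.path.isfile(path):
            continue  # this sketch does not recurse into subdirectories
        if not pattern.fullmatch(name):
            continue  # assumption: the whole name must match the regex
        if not os.access(path, os.R_OK):
            continue  # files without read permission are ignored
        size = os.path.getsize(path)
        if size < min_size or (max_size is not None and size > max_size):
            continue
        selected.append(name)
        if len(selected) >= batch_size:
            break
    return selected


# Try the filters on a throwaway directory with three sample files.
with tempfile.TemporaryDirectory() as directory:
    samples = [("details.txt", "hello"), (".hidden", "x"), ("empty.txt", "")]
    for name, content in samples:
        with open(os.path.join(directory, name), "w") as handle:
            handle.write(content)
    default_pick = select_files(directory)
    non_empty_pick = select_files(directory, min_size=1)
```

With the default filter, the dot-file is skipped because its name starts with a period; raising the minimum size to 1 byte additionally drops the empty file.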

PutHDFS Processor Properties:
3_put_hdfs_processor

| Name | Default Value | Allowable Values | Description |
|---|---|---|---|
| Hadoop Configuration Resources | | | A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a ‘core-site.xml’ and ‘hdfs-site.xml’ file or will revert to a default configuration. |
| Kerberos Principal | | | Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties |
| Kerberos Keytab | | | Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties |
| Kerberos Relogin Period | 4 hours | | Period of time which should pass before attempting a kerberos relogin |
| Additional Classpath Resources | | | A comma-separated list of paths to files and/or directories that will be added to the classpath. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included. |
| Directory | | | The parent HDFS directory to which files should be written. Supports Expression Language: true |
| Conflict Resolution Strategy | fail | replace, ignore, fail, append | Indicates what should happen when a file with the same name already exists in the output directory |
| Block Size | | | Size of each block as written to HDFS. This overrides the Hadoop Configuration |
| IO Buffer Size | | | Amount of memory to use to buffer file contents during IO. This overrides the Hadoop Configuration |
| Replication | | | Number of times that HDFS will replicate each file. This overrides the Hadoop Configuration |
| Permissions umask | | | A umask represented as an octal number which determines the permissions of files written to HDFS. This overrides the Hadoop Configuration dfs.umaskmode |
| Remote Owner | | | Changes the owner of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change owner |
| Remote Group | | | Changes the group of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change group |
| Compression codec | NONE | NONE, DEFAULT, BZIP, GZIP, LZ4, SNAPPY, AUTOMATIC | No Description Provided. |

7_hdfs_properties
Here we set Hadoop Configuration Resources to the core-site.xml and hdfs-site.xml files under /home/hduser/hadoop-2.7.1/etc/hadoop, and set the destination Directory to nifi/nifi_put_hdfs_ex, where the file will be stored in HDFS. The other properties can be overridden as required; here I kept their default values.
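The Conflict Resolution Strategy values (replace, ignore, fail, append) are easier to compare side by side in code. The Python sketch below imitates them against the local file system instead of HDFS; the function `write_with_strategy` is hypothetical, and the relationship returned for each case (in particular "success" for ignore) is my assumption about the behaviour, not something taken from the PutHDFS documentation quoted above.

```python
import os
import tempfile

STRATEGIES = ("replace", "ignore", "fail", "append")


def write_with_strategy(directory: str, name: str, content: bytes,
                        strategy: str = "fail") -> str:
    """Write content and return the relationship the FlowFile would take."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    path = os.path.join(directory, name)
    mode = "wb"
    if os.path.exists(path):
        if strategy == "fail":
            return "failure"  # the existing file is left untouched
        if strategy == "ignore":
            return "success"  # assumption: nothing is written
        if strategy == "append":
            mode = "ab"  # add to the end of the existing file
    with open(path, mode) as handle:
        handle.write(content)
    return "success"


# Local stand-in for the HDFS target directory.
with tempfile.TemporaryDirectory() as target:
    first = write_with_strategy(target, "details.txt", b"v1")
    second = write_with_strategy(target, "details.txt", b"v2")
    third = write_with_strategy(target, "details.txt", b"+v3", "append")
    with open(os.path.join(target, "details.txt"), "rb") as handle:
        final_content = handle.read()
```

The second write uses the default strategy (fail), so the existing file keeps its content until the explicit append adds to it.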
Auto-terminate relationships:
9_auto_terminate_relationship
Each of the Relationships that is defined by the Processor is listed here, along with its description. In order for a Processor to be considered valid and able to run, each Relationship defined by the Processor must be either connected to a downstream component or auto-terminated.
If a Relationship is auto-terminated, any FlowFile that is routed to that Relationship will be removed from the flow and its processing considered complete. Any Relationship that is already connected to a downstream component cannot be auto-terminated.
Here we checked both the success and failure checkboxes, so that the flow is auto-terminated whether it succeeds or fails.
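The rule that every Relationship must be either connected or auto-terminated, and never both, can be written as a small check. This Python sketch is illustrative only: `relationship_errors` is a hypothetical helper, and the relationship names used (success for GetFile, success and failure for PutHDFS) reflect the flow built in this post.

```python
from typing import List, Set


def relationship_errors(defined: Set[str], connected: Set[str],
                        auto_terminated: Set[str]) -> List[str]:
    """List the problems that would keep a Processor invalid."""
    errors: List[str] = []
    for name in sorted(defined):
        if name in connected and name in auto_terminated:
            errors.append(f"'{name}' is connected and cannot be auto-terminated")
        elif name not in connected and name not in auto_terminated:
            errors.append(f"'{name}' is neither connected nor auto-terminated")
    return errors


# GetFile: its "success" relationship is connected to PutHDFS.
getfile_errors = relationship_errors(
    {"success"}, connected={"success"}, auto_terminated=set())

# PutHDFS: both relationships are auto-terminated, as configured above.
puthdfs_errors = relationship_errors(
    {"success", "failure"}, connected=set(),
    auto_terminated={"success", "failure"})

# PutHDFS before the checkboxes are ticked: two validation errors.
puthdfs_unconfigured = relationship_errors(
    {"success", "failure"}, connected=set(), auto_terminated=set())
```

An empty list means the Processor passes this particular check; other validation errors, such as a missing required property, are outside the sketch.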
Comments Tab
The last tab in the Processor configuration dialog is the Comments tab. This tab simply provides an area for users to include whatever comments are appropriate for this component.
Connecting Components
Once processors and other components have been added to the canvas and configured, the next step is to connect them to one another so that NiFi knows what to do with each FlowFile after it has been processed. This is accomplished by creating a Connection between each component. When the user hovers the mouse over the center of a component, a new Connection icon appears.
The user drags the Connection bubble from one component to another until the second component is highlighted. When the user releases the mouse, a Create Connection dialog appears. This dialog consists of two tabs: ‘Details’ and ‘Settings’.
Here we left the Details and Settings tabs of the Create Connection dialog at their default values.
8_create_connection
 
Processor Validation
Before trying to start a Processor, it’s important to make sure that the Processor’s configuration is valid. A status indicator is shown in the top-left of the Processor. If the Processor is invalid, the indicator will show a red Warning indicator with an exclamation mark indicating that there is a problem.
In this case, hovering over the indicator icon with the mouse will provide a tooltip showing all of the validation errors for the Processor. Once all of the validation errors have been addressed, the status indicator will change to a Stop icon, indicating that the Processor is valid and ready to be started but currently is not running.
Command and Control of the DataFlow
When a component is added to the NiFi canvas, it is in the Stopped state. In order to cause the component to be triggered, the component must be started. Once started, the component can be stopped at any time. From a Stopped state, the component can be configured, started, or disabled.
Starting a Component
In order to start a component, the following conditions must be met:
The component’s configuration must be valid.
All defined Relationships for the component must be connected to another component or auto-terminated.
The component must be stopped.
The component must be enabled.
The component must have no active tasks.
Components can be started by selecting all of the components to start and then clicking the Start icon in the Actions Toolbar, or by right-clicking a single component and choosing Start from the context menu.
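The five start conditions listed above amount to a simple checklist. The following Python sketch expresses them as a function; `ComponentStatus` and its field names are hypothetical and exist only to illustrate the conditions, not to mirror any NiFi class.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComponentStatus:
    """Hypothetical snapshot of a component; field names are illustrative."""
    config_valid: bool
    relationships_resolved: bool  # all connected or auto-terminated
    stopped: bool
    enabled: bool
    active_tasks: int


def start_blockers(status: ComponentStatus) -> List[str]:
    """Return the reasons a component cannot be started (empty if none)."""
    blockers: List[str] = []
    if not status.config_valid:
        blockers.append("configuration is invalid")
    if not status.relationships_resolved:
        blockers.append("a relationship is not connected or auto-terminated")
    if not status.stopped:
        blockers.append("component is not stopped")
    if not status.enabled:
        blockers.append("component is disabled")
    if status.active_tasks > 0:
        blockers.append("component still has active tasks")
    return blockers


ready = ComponentStatus(True, True, True, True, 0)
still_draining = ComponentStatus(True, True, True, True, 2)

ready_blockers = start_blockers(ready)
draining_blockers = start_blockers(still_draining)
```

A component can be started only when the returned list is empty; `still_draining` models a Processor that was just stopped and is waiting for its tasks to finish.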
Starting GetFile Processor:
11_starting_getfile
Starting PutHDFS Processor:
12_starting_puthdfs
Input File in Local Ubuntu:
10_input_file_location
HDFS State before starting the Data Flow:
14
Local File Directory in Ubuntu after starting the Data Flow:
13_after_starting
HDFS State after starting the Data Flow:
15
14_in_hdfs
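Besides checking the screenshots, the HDFS target directory can be listed from a script. The Python sketch below wraps the standard `hdfs dfs -ls` command; the helper names are hypothetical, and it assumes the `hdfs` client is on the PATH of the machine running the script and that nifi/nifi_put_hdfs_ex is the Directory configured in PutHDFS.

```python
import shutil
import subprocess
from typing import List


def build_ls_command(hdfs_directory: str) -> List[str]:
    """Build the argument list for listing an HDFS directory."""
    return ["hdfs", "dfs", "-ls", hdfs_directory]


def list_hdfs_directory(hdfs_directory: str) -> str:
    """Return the listing output, or an empty string without an hdfs client."""
    if shutil.which("hdfs") is None:
        return ""  # Hadoop client not installed on this machine
    result = subprocess.run(
        build_ls_command(hdfs_directory),
        capture_output=True,
        text=True,
        check=False,  # a missing directory is reported on stderr, not raised
    )
    return result.stdout


command = build_ls_command("nifi/nifi_put_hdfs_ex")
```

Calling `list_hdfs_directory("nifi/nifi_put_hdfs_ex")` before and after starting the flow should show the input file appearing in the second listing.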
Stopping a Component
A component can be stopped any time that it is running. A component is stopped by right-clicking on the component and clicking Stop from the context menu, or by selecting the component and clicking the Stop icon in the Actions Toolbar.