Introduction

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Task

In this article we will set up Apache Flume to ingest data in real time into BigInsights.
To try this yourself, download the IBM BigInsights Quick Start Edition VM.

We will create a Flume Agent which is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop). Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.
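As a minimal sketch of this format (the agent and component names a1, src1, ch1 and snk1 are illustrative, and the netcat source and logger sink are simple stand-ins from the standard Flume component set), an agent definition wires the pieces together like this:

# one source, one channel, one sink, all owned by agent a1
a1.sources = src1
a1.channels = ch1
a1.sinks = snk1

# a netcat source listening on a local port
a1.sources.src1.type = netcat
a1.sources.src1.bind = localhost
a1.sources.src1.port = 44444
a1.sources.src1.channels = ch1

# an in-memory channel and a sink that simply logs events
a1.channels.ch1.type = memory
a1.sinks.snk1.type = logger
a1.sinks.snk1.channel = ch1

In the steps below we will write a similar configuration, with a spooling-directory source and an HDFS sink.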

Procedure

Do all of the following steps as biadmin, since that user has the correct environment variables set.

Step 1:

– Download a set of sample data from
http://www.briandunning.com/sample-data/
– The free files are sufficient for this exercise.

Step 2:

– Create a folder named flumeingestion inside /home/biadmin using the mkdir command.

– Copy the downloaded files to your BigInsights server into the folder created above (example commands below).
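As a rough sketch (the file name us-500.csv is an assumption; substitute whatever you downloaded, and replace <your-biginsights-host> with your server's host name):

# on the BigInsights server
mkdir /home/biadmin/flumeingestion

# from the machine where you downloaded the sample files
scp us-500.csv biadmin@<your-biginsights-host>:/home/biadmin/flumeingestion/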

Step 3:

In the next steps, we will configure the Flume properties to ingest the files in that folder into HDFS.

– Go into /opt/ibm/biginsights/flume/conf
– Create a copy of flume-conf.properties.template and name it flume.conf:

cp flume-conf.properties.template flume.conf

– Create a copy of flume-env.sh.template and name it flume-env.sh:
cp flume-env.sh.template flume-env.sh

Step 4:

– Edit the flume-env.sh file, uncomment the JAVA_HOME line, and set its value (note there are no spaces around the = in shell assignments):
JAVA_HOME=/opt/ibm/biginsights/jdk/
– Uncomment the JAVA_OPTS line so that it looks like:
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"
– Leave the rest of the file as it is.

Step 5:

– Edit the flume.conf file to point to the source directory (flumeingestion), the sink (an HDFS directory), and the channel to be used to transfer the data between them. Remove all existing content from the file and copy in the configuration below. Before copying, complete the HDFS path by replacing IPAddress with your server's IP address.


agent.sources = flumesource
agent.channels = memoryChannel
agent.sinks = flumeHDFS

# For each one of the sources, the type is defined
agent.sources.flumesource.type = spooldir
agent.sources.flumesource.spoolDir = /home/biadmin/flumeingestion/
agent.sources.flumesource.bufferMaxLineLength = 80000

# Specify the channel the source should use
agent.sources.flumesource.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://IPAddress:9000/user/biadmin/flume
agent.sinks.flumeHDFS.hdfs.fileType = DataStream

# Format to be written
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 10

# Roll over the file once it reaches a maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollSize = 10485760

# Never roll over based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0

# Roll over the file after a maximum of 1 minute
agent.sinks.flumeHDFS.hdfs.rollInterval = 60

# Specify the channel the sink should use
agent.sinks.flumeHDFS.channel = memoryChannel

# Each channel's type is defined
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well; in this case, the capacity of the memory channel
agent.channels.memoryChannel.capacity = 200000
agent.channels.memoryChannel.transactionCapacity = 160000
agent.channels.memoryChannel.byteCapacity = 0

Step 6:

– Create a folder in your HDFS file system, which will be our sink directory. In the configuration above, we specified
hdfs://IPAddress:9000/user/biadmin/flume
as the HDFS path, so create a directory under /user/biadmin and name it flume.
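For example, using the Hadoop command line as biadmin:

hadoop fs -mkdir -p /user/biadmin/flume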

Step 7:

– Run the following command to start the Flume agent (the -n argument must match the agent name used in flume.conf, which is agent in our configuration). Change to /opt/ibm/biginsights/flume/bin and then run:

./flume-ng agent -c /opt/ibm/biginsights/flume/conf -f /opt/ibm/biginsights/flume/conf/flume.conf -n agent
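If you want to watch the agent's log output directly in the terminal while testing (a standard Flume option, not specific to BigInsights), you can append a console logger setting:

./flume-ng agent -c /opt/ibm/biginsights/flume/conf -f /opt/ibm/biginsights/flume/conf/flume.conf -n agent -Dflume.root.logger=INFO,console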

Step 8:

– Your Flume agent is now running and has started to copy data from the flumeingestion directory. Check your HDFS /user/biadmin/flume folder to see the data.
– If you check your server file system, you should see the ingested text files renamed with a .COMPLETED suffix (for example, yourfile.csv.COMPLETED), which indicates that they have been processed.

Additional Step:

Now try putting some files into the flumeingestion directory while the agent is running; you should see them appear in the specified HDFS location within a second or two, as in the example below.
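For example (the file name new-records.csv is illustrative):

# on the BigInsights server, drop a new file into the spool directory
cp new-records.csv /home/biadmin/flumeingestion/

# then list the HDFS sink directory to see the ingested data
hadoop fs -ls /user/biadmin/flume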

Note:
The logs are in /opt/ibm/biginsights/flume/bin/logs
If you don’t see the expected result, please check the logs.
