Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
Apache Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and pretty much any data source imaginable.
In this article we will set up Apache Flume to ingest data in real time into BigInsights.
To try it yourself, download the BigInsights Quick Start Edition VM.
We will create a Flume Agent which is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop). Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.
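As a sketch of the properties-file format, here is a minimal single-agent configuration. The component names are illustrative, and this mirrors the minimal example from the Flume user guide (a netcat source feeding a logger sink through a memory channel), not the configuration we will build below:

```properties
# Declare the agent's components by name
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure each component
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger

# Wire source and sink together through the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
```

Note that a source can write to several channels (`channels`, plural), while a sink drains exactly one (`channel`, singular).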
Perform all the following steps as the biadmin user, since it has the correct environment variables set.
– Download a set of sample data (any freely available text files will do).
– Create a folder named flumeingestion under /home/biadmin using the mkdir command.
– Copy the downloaded files to your BigInsights server, into the folder created above.
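The staging steps above can be sketched as follows. When logged in as biadmin, $HOME is /home/biadmin, so this creates /home/biadmin/flumeingestion; sample.txt is a hypothetical stand-in for your downloaded files:

```shell
# Create the spooling directory; SPOOL_DIR defaults to $HOME/flumeingestion,
# i.e. /home/biadmin/flumeingestion when run as biadmin.
SPOOL_DIR="${SPOOL_DIR:-$HOME/flumeingestion}"
mkdir -p "$SPOOL_DIR"

# Stage a hypothetical sample file (replace with a cp of your real files):
printf 'sample log line\n' > "$SPOOL_DIR/sample.txt"
# cp /path/to/downloaded/*.txt "$SPOOL_DIR"/

ls "$SPOOL_DIR"
```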
In the next steps, we will configure the Flume properties to ingest the files in this folder into HDFS.
– Go into /opt/ibm/biginsights/flume/conf.
– Create a copy of flume-conf.properties.template and name it flume.conf:
cp flume-conf.properties.template flume.conf
– Create a copy of flume-env.sh.template and name it flume-env.sh:
cp flume-env.sh.template flume-env.sh
– Edit the flume-env.sh file, uncomment the JAVA_HOME line, and set its value (note: no spaces around the `=` in a shell assignment):
JAVA_HOME=/opt/ibm/biginsights/jdk/
– Uncomment the JAVA_OPTS line so that it reads:
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"
– Leave the rest of the file as it is.
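After both edits, the relevant lines of flume-env.sh should look like this (comment markers removed, everything else unchanged):

```shell
# flume-env.sh (excerpt) -- both lines uncommented
JAVA_HOME=/opt/ibm/biginsights/jdk/
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"
```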
– Edit the flume.conf file to point to the source directory (flumeingestion), a sink directory (an HDFS directory), and the channel used to transfer the data. Remove all existing content from the file and paste in the configuration below. Before copying, complete the HDFS path by providing your server's IP address.
agent.sources = flumesource
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# For each one of the sources, the type is defined
agent.sources.flumesource.type = spooldir
agent.sources.flumesource.spoolDir = /home/biadmin/flumeingestion/
agent.sources.flumesource.bufferMaxLineLength = 80000
# The channel can be defined as follows.
agent.channels.memoryChannel.type = memory
# The sink writes to HDFS; replace <your-server-ip> with your server's IP address
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://<your-server-ip>:9000/user/biadmin/flume
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# connect source and sink
agent.sources.flumesource.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
– Create a folder in your HDFS file system to serve as the sink directory. The configuration specifies /user/biadmin/flume as the HDFS path, so create a directory named flume under /user/biadmin:
hadoop fs -mkdir -p /user/biadmin/flume
– Finally, run the Flume instance. Go to /opt/ibm/biginsights/flume/bin and then run:
./flume-ng agent -c /opt/ibm/biginsights/flume/conf -f /opt/ibm/biginsights/flume/conf/flume.conf -n agent
– Your Flume instance is now running and has started to copy data from the flumeingestion directory. Check the /user/biadmin/flume folder in HDFS to see the data.
– If you check your server's file system, you should see the ingested text files renamed with a .COMPLETED suffix (for example, yourfile.txt.COMPLETED), which indicates they have been processed.
Now try putting some files into the flumeingestion directory in real time; you should see them appear in the specified HDFS location within a second or two.
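A quick way to exercise this is the hypothetical sketch below, which drops a few files into the spooling directory one second apart. The running agent should pick up and ship each one to HDFS (SPOOL_DIR is assumed to default to $HOME/flumeingestion, i.e. /home/biadmin/flumeingestion as biadmin):

```shell
# Feed three small files into the spooling directory, one per second.
SPOOL_DIR="${SPOOL_DIR:-$HOME/flumeingestion}"
mkdir -p "$SPOOL_DIR"
for i in 1 2 3; do
  printf 'event %s\n' "$i" > "$SPOOL_DIR/batch-$i.txt"
  sleep 1
done
ls "$SPOOL_DIR"
```

While this runs, watch the sink with hadoop fs -ls /user/biadmin/flume to see the files arrive.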
The logs are in /opt/ibm/biginsights/flume/bin/logs. If you don't see the expected result, please check the logs.