Extracting Twitter Data Using Apache Flume

By Bharat Khanna
Talend ETL Developer

What You Need
• Hortonworks Hadoop cluster: HDP 1.3
• Oracle VirtualBox
• PuTTY
• WinSCP
• Maven (for building the Flume snapshot jar)

What Is Flume?
• Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and
flexible architecture based on streaming data flows.

Network Settings in Oracle VirtualBox
[Screenshots: VirtualBox network settings]

Getting Started
• Start your Hadoop cluster in VirtualBox. Once it is running, make sure you
can reach it from your Windows host machine by opening an address such as
http://192.168.56.101:8000 in a browser.
• To find this IP address, run the ifconfig command in the Hadoop cluster VM
once it has started.

File Browser Using Hue
• From the host machine, you can browse your HDFS files through the Hue file
browser. [Screenshot: Hue file browser]

Setting Your .bash_profile in PuTTY
• Set the required environment variables by editing .bash_profile in your home
directory with the command “vi .bash_profile”. The leading dot is needed
because the file is hidden by default. Leave out MAVEN_HOME for now; it is
added once Maven is installed.
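The original screenshot of the profile is not available, so here is a minimal sketch of what such a .bash_profile might contain. The JAVA_HOME value is a hypothetical placeholder (check where Java is installed on your VM), and FLUME_HOME assumes the Flume directory used in the later slides.

```shell
# ~/.bash_profile (sketch; adjust every path to your own VM)

# Hypothetical JDK location: replace with your actual Java 1.6 install path.
export JAVA_HOME=/usr/jdk/jdk1.6.0_31

# Flume install directory, as used in the later slides.
export FLUME_HOME=/etc/flume/apache-flume-1.6.0-bin

# Maven stays commented out until it has been installed.
# export MAVEN_HOME=$HOME/maven

export PATH=$PATH:$JAVA_HOME/bin:$FLUME_HOME/bin
# export PATH=$PATH:$MAVEN_HOME/bin
```

After saving the file, log in again (or run “source ~/.bash_profile”) so the variables take effect.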

Creating the Flume Snapshot Jar
• This jar (flume-sources-1.0-SNAPSHOT.jar) contains the libraries Flume needs
for this exercise. You can download a prebuilt copy or build it yourself;
building it yourself is the better option.
• You need Maven to build it. If your Java version is 1.6, as in Hortonworks
HDP 1.3, download the archived Maven version 3.0.5 from
http://archive.apache.org/dist/maven/maven-3/; otherwise use any recent
version.

Creating the Flume Snapshot Jar (Contd.)
• Once downloaded, unzip the archive in Windows and transfer the folder to your
Hortonworks cluster using WinSCP.
• In your home directory, create a link to the folder with the command
“ln -s apache-maven-3.0.5 maven”.
• Add the path of this link to your .bash_profile, as shown in slide 8.
• After saving .bash_profile, log off and log in again to your Unix session so
the changes take effect. Run “mvn -version” to check that Maven is working.
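The linking and verification steps can be sketched as two small shell functions. The function names link_maven and check_maven are hypothetical helpers, and the sketch assumes the archive was unpacked to apache-maven-3.0.5 in the home directory.

```shell
# Sketch: link an unpacked Maven folder as "maven" and verify the install.
link_maven() {
    home_dir=${1:-$HOME}
    maven_dir=${2:-apache-maven-3.0.5}
    # -f replaces an existing link; -n avoids descending into an old "maven" link.
    (cd "$home_dir" && ln -sfn "$maven_dir" maven)
}

check_maven() {
    # Print the Maven version if mvn is on the PATH, otherwise explain why not.
    if command -v mvn >/dev/null 2>&1; then
        mvn -version
    else
        echo "mvn not found: check MAVEN_HOME and PATH in .bash_profile" >&2
        return 1
    fi
}
```

Calling “link_maven” with no arguments works in your home directory; “check_maven” is meant to be run after logging in again.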

Creating the Flume Snapshot Jar (Contd.)
• Download Cloudera’s Twitter example code as a zip file from
https://github.com/cloudera/cdh-twitter-example.
• Unzip it and transfer it to your home directory in the Hortonworks cluster
using WinSCP.
• Go to the flume-sources folder under cdh-twitter-example-master and run the
command “mvn package” to build the Flume snapshot jar. The jar is created in
the target folder of the same directory.
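As a rough sketch, the build step could be wrapped in a shell function like the one below. The name build_flume_sources is a hypothetical helper, and the default location assumes the zip was unpacked to cdh-twitter-example-master in the home directory with mvn already on the PATH.

```shell
# Sketch: build the Flume source jar from the unpacked Cloudera example.
build_flume_sources() {
    example_dir=${1:-$HOME/cdh-twitter-example-master}
    # Run the build in a subshell so the caller's working directory is unchanged.
    (cd "$example_dir/flume-sources" && mvn package) || return 1
    # List whatever jar(s) the build produced under target/.
    ls "$example_dir"/flume-sources/target/*.jar
}
```

The first build may take a while because Maven downloads its dependencies.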

Configuring Flume
• Transfer flume-sources-1.0-SNAPSHOT.jar to Flume's lib directory, which is
/etc/flume/apache-flume-1.6.0-bin/lib on the Hortonworks 1.3 VM.
• Flume’s configuration directory is /etc/flume/apache-flume-1.6.0-bin/conf.
• Open the flume-env.sh.template file in the vi editor. Set JAVA_HOME to the
path defined in .bash_profile, and set FLUME_CLASSPATH to the path of the
Flume snapshot jar, in double quotes.
• Rename flume-env.sh.template to flume-env.sh using the mv command.
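The two settings and the rename can be sketched as a single shell function. The name write_flume_env is a hypothetical helper, the default JAVA_HOME is a placeholder rather than a known path, and the sketch appends the settings instead of editing the commented-out lines in the template.

```shell
# Sketch: create flume-env.sh from the template and append the two settings.
write_flume_env() {
    conf_dir=${1:-/etc/flume/apache-flume-1.6.0-bin/conf}
    # Hypothetical default: use the JAVA_HOME value from your .bash_profile.
    java_home=${2:-/usr/jdk/jdk1.6.0_31}
    jar_path=${3:-/etc/flume/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar}

    # Rename the template first; stop if it is missing.
    mv "$conf_dir/flume-env.sh.template" "$conf_dir/flume-env.sh" || return 1
    {
        echo "export JAVA_HOME=$java_home"
        echo "FLUME_CLASSPATH=\"$jar_path\""
    } >> "$conf_dir/flume-env.sh"
}
```

Run with no arguments, it uses the directories named in this deck; pass your own conf directory, Java path, and jar path if they differ.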

Configuring Flume (Contd.)
• You also need to copy the following jar files to the Flume lib folder.
Jar                                    From directory
hadoop-core.jar                        HADOOP_HOME, i.e. /usr/lib/hadoop
hadoop-client-1.2.0.1.3.0.0-107.jar    HADOOP_HOME, i.e. /usr/lib/hadoop
jets3t-0.6.1.jar                       /usr/lib/hadoop/lib
commons-httpclient-3.0.1.jar           /usr/lib/hadoop/lib
commons-configuration-1.6.jar          /usr/lib/hadoop/lib
commons-codec-1.4.jar                  /usr/lib/hadoop/lib
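Copying the six jars by hand is error-prone, so a small function such as the following may help. The name copy_hadoop_jars is a hypothetical helper; the jar names and default directories are taken from the table above.

```shell
# Sketch: copy the Hadoop jars listed above into Flume's lib directory.
copy_hadoop_jars() {
    hadoop_home=${1:-/usr/lib/hadoop}
    flume_lib=${2:-/etc/flume/apache-flume-1.6.0-bin/lib}

    # Jars that live directly under HADOOP_HOME.
    for jar in hadoop-core.jar hadoop-client-1.2.0.1.3.0.0-107.jar; do
        cp "$hadoop_home/$jar" "$flume_lib/" || return 1
    done
    # Jars that live under HADOOP_HOME/lib.
    for jar in jets3t-0.6.1.jar commons-httpclient-3.0.1.jar \
               commons-configuration-1.6.jar commons-codec-1.4.jar; do
        cp "$hadoop_home/lib/$jar" "$flume_lib/" || return 1
    done
}
```

The function stops at the first jar it cannot copy, which makes a missing or differently versioned jar easy to spot.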

Creating a Twitter App
• Go to dev.twitter.com and click on "Create a new app".
• Enter a name, a description, and a website, which can be something like
http://yourdomain.com.
• After creating the app, go to "Keys and Access Tokens" and generate your
consumer key, consumer secret, access token, and access token secret.
• Make a note of these values, as you need them in the following steps.

Creating the conf File
• Go to the folder /etc/flume/apache-flume-1.6.0-bin/conf and create a new file
named twitter.conf (the same name used later in the flume-ng command).
• Insert the consumer key, consumer secret, access token, and access token
secret that you got in the previous step.
• Enter the keywords for which you want to analyze the data.
• Finally, set your HDFS path. You can get the HDFS address from the
fs.default.name property in the core-site.xml file under HADOOP_HOME/conf,
i.e. /usr/lib/hadoop/conf.
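The sample image of the conf file is not available, so the following is a sketch of what the file might look like, written out by a hypothetical helper function write_twitter_conf. It is modeled on the configuration shipped with Cloudera's example, with the agent named TwitterAgent to match the run command. Values in angle brackets are placeholders to fill in, and the keywords and the /user/flume/tweets/ directory are only illustrative assumptions.

```shell
# Sketch: write a Flume agent configuration for the Cloudera TwitterSource.
# Replace every <placeholder>; keywords and the HDFS directory are examples.
write_twitter_conf() {
    conf_file=${1:-/etc/flume/apache-flume-1.6.0-bin/conf/twitter.conf}
    cat > "$conf_file" <<'EOF'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://<namenode host>:<port>/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
EOF
}
```

The host and port in hdfs.path should match the value of fs.default.name in core-site.xml.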

Checks Before Running Flume: Setting the Timezone
• Make sure that the time shown in your VM matches the time on your local
machine. If they differ, reset the time as shown below. You can check the time
in your VM with the “date” command.
• If your timezone already matches, you can skip the remaining steps.
• The timezone is controlled by the /etc/localtime file. The available timezones
are listed under the /usr/share/zoneinfo/ directory.
• cd /etc
• mv localtime localtime.bak (only if localtime already exists; otherwise the
link below cannot be created)
• ln -s /usr/share/zoneinfo/US/Eastern localtime
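The same steps can be sketched as a function that checks the zone name before touching anything. The name set_timezone is a hypothetical helper; on a real VM it needs root privileges to write under /etc.

```shell
# Sketch: point /etc/localtime at a zone file, keeping a backup of the old one.
set_timezone() {
    zone=${1:-US/Eastern}
    etc_dir=${2:-/etc}
    zoneinfo_dir=${3:-/usr/share/zoneinfo}

    # Refuse to continue if the requested zone file does not exist.
    [ -e "$zoneinfo_dir/$zone" ] || { echo "unknown zone: $zone" >&2; return 1; }

    # Move an existing file or link aside so that ln -s can succeed.
    if [ -e "$etc_dir/localtime" ] || [ -L "$etc_dir/localtime" ]; then
        mv "$etc_dir/localtime" "$etc_dir/localtime.bak" || return 1
    fi
    ln -s "$zoneinfo_dir/$zone" "$etc_dir/localtime"
}
```

Running “date” afterwards should show the time in the new timezone.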

Checks Before Running Flume: Setting Oracle VirtualBox Properties
• You need to be able to reset the time in your VM whenever required, as you
did in the previous step. For that, set the following properties in
VirtualBox.
• In Windows, open the C:\Program Files\Oracle folder and click the VirtualBox
folder to select it. Then, holding the left Shift key, right-click and choose
"Open command window here". A command prompt should now be running in that
folder.

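Checks Before Running Flume: Setting Oracle VirtualBox Properties (Contd.)
• Run the following commands in the command prompt, replacing ${VMNAME} with
the name of your VM.
VBoxManage setextradata ${VMNAME} "VBoxInternal/Devices/VMMDev/0/Config/GetHostTimeDisabled" 1
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-interval" 10000
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-min-adjust" 100
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-on-restore" 1
VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 1000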

Running Flume
• Go to the Flume bin directory and run the Flume agent using the following
command:
• flume-ng agent -n TwitterAgent -c conf -f /etc/flume/apache-flume-1.6.0-bin/conf/twitter.conf
• After some time, files should start appearing under the HDFS directory
specified in the conf file.

Error Catalog
• You may run into the following frequently occurring errors while running
Flume.
Apache Flume error: java.lang.NoSuchMethodError:
twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
Fix: This happens because FilterQuery.class is present in two different jars
(one of which is the Flume snapshot jar).
Under Flume's lib directory, you can search for the clashing jars using the
command “find . -name "*.jar" | xargs grep FilterQuery.class”.
Rename the other jar by adding the suffix .org to its name, so that Flume no
longer loads it.
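A slightly more robust variant of that search, which also copes with spaces in file names, might look like the sketch below. The function names find_jars_with_class and disable_jar are hypothetical helpers. Like the original command, it relies on class file names being stored as plain text inside a jar.

```shell
# Sketch: list jars under a directory that contain a given class file name.
find_jars_with_class() {
    lib_dir=${1:-.}
    class_name=${2:-FilterQuery.class}
    # grep -l prints only the names of matching files;
    # -F treats the class name as a literal string, not a regular expression.
    find "$lib_dir" -name '*.jar' -exec grep -lF -- "$class_name" {} +
}

# Sketch: disable a clashing jar by adding the .org suffix.
disable_jar() {
    mv "$1" "$1.org"
}
```

For example, run “find_jars_with_class /etc/flume/apache-flume-1.6.0-bin/lib” and then call “disable_jar” on the jar that is not the Flume snapshot jar.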

Error Catalog (Contd.)
• Apache Flume error: java.io.IOException: Callable timed out after 10000
ms on file:
Fix: This happens when there are too many connections to Twitter from your
account. Wait for some time and try again.
