This document summarizes Burak IŞIKLI and Erkan HASPULAT's presentation on Turkcell's use of Hadoop for customer analytics. Turkcell processes 15TB of data and collects 6TB of logs daily with Hadoop, running roughly 350 jobs a day. They use Hadoop for ETL, analytics like fraud detection and recommendations, and a data lake. Location analysis joins call data to determine subscribers' locations, processing 11 billion rows (0.5TB) daily. As data volumes grew they upgraded Hadoop and expanded storage from 15TB to 698TB. They also developed a movement index and industry-specific analyses using Hive.
7. Issues
• Default Python is 2.6, but Spark IPython works with Python 2.7+
• Security & Auditing Issues
  Copying Data by Masking
  Dynamic Data Masking
  SOX Compliance
8. Location Analysis
Find the subscriber's location using cell information.
11 billion rows/day
0.5 TB/day
2.5 hours processing time
Hadoop Streaming w/Perl
Sqoop
16. Upgrade
• No space left on disk!
• Hadoop upgrade: 0.23 -> 1.3.1 -> 2.7.1
• Linear scalability

            Old cluster    New cluster
Perl Job    212 min        104 min
Pig Job     77 min         18 min
Nodes       1M + 4D        1M + 1SM + 16D
Disk        15 TB          698 TB
CPU         20 Core        224 Core
Memory      1024 GB        1.5 TB
Version     0.23           HDP 2.3.2
19. Industry-Specific Analysis
But how?
Hive external partition
ALTER TABLE t1 ADD
PARTITION (DAILY_CALENDAR_ID='20160101')
LOCATION '/user/…/tlc/daily_calendar_id=20160101';
20. Movement Index
Subscriber journeys are analyzed via signaling data to determine how they travel between cities.
• Airline companies
• Bus companies
• Local government
• Survey companies
21. Movement Index
Simple Euclidean Distance
Equ: https://en.wikipedia.org/wiki/Euclidean_distance
Finding the change of location
First, find the closeness of each pair of cells using their coordinates
24. Movement Index
But one problem!
java.io.IOException: java.lang.IllegalArgumentException:
Column [daily_calendar_id] was not found in schema!
Are you kidding me!!
Img: http://bit.ly/1N3QKRQ
25. Movement Index
Just another bug: HIVE-11401
Workaround solution: hive.optimize.index.filter=false;
Permanent solution: Hive 2.0
Img: http://bit.ly/1W1USZM
Hi, I know it's one of the last sessions and you're tired, so I'll be as brief as I can. How do you decide where your customer was? I'm gonna be talking about location analysis and Hadoop architecture. Let me begin by telling you about ourselves and Turkcell.
My name is Burak ISIKLI. I've been a senior software engineer at Turkcell for 7 years, and I'm also a PhD candidate. My colleague is Erkan Haspulat. He's also a senior software engineer at Turkcell. Unfortunately he's not here today because he broke his leg.
Turkcell is one of the leading telecommunication companies in Turkey and the second largest telco in Europe. There are 34.7 million subscribers in Turkey and almost 72 million subscribers across 9 countries. It is the first and only Turkish company ever to be listed on the NYSE.
So let's get started with our Hadoop journey. It started in 2013 with a relatively small cluster. 15 TB of data is processed per day, 6 TB of data is collected from source servers, and approximately 350 jobs run every day.
We've done a lot of things on the ETL side, such as CDR processing and log analysis; on the analytics side, fraud analytics and so on; and also a data lake for our customer journey project. So which architecture do we use for these projects?
This is an abstract view of the architecture at Turkcell. We collect a lot of huge logs. We use Unix and internal tools for collecting logs, and Tuna for extracting data from databases. Tuna is our extraction and loading platform on top of Sqoop. For transforming the data, Hive, Pig, Hadoop streaming with Perl, and Spark are used. After the data is processed, we either load it into a database, which is Oracle for us, for reporting and dashboards, or the data is sent to other systems or triggers some alarm mechanism.
Our system is a Teradata Appliance which includes Hortonworks HDP 2.3.2. It may limit you in some ways; for instance, the system default Python is 2.6 while Spark IPython works with 2.7 or higher. If you do data science, Spark IPython helps you a lot. We still have security issues because the system has to be SOX compliant. As you can see, there are lots of incubating projects such as Sentry, RecordService, Atlas, Ranger and so on, but we couldn't find any solution for copying data to the development cluster with masking, or for dynamic data masking. It's still under development.
I'll present a couple of use cases and how we do them. The first use case is finding the subscriber's location. When your telephone is on, it sends signals including information about which base station it is attached to, even if you're not talking. CDR systems transform this information into logs. After that, all we can do is collect and process them. We process 11 billion rows, 0.5 TB of data, per day. It takes 2.5 hours using Hadoop streaming with Perl. The output is loaded into the database using Tuna. How do we collect these logs?
This is a picture of how a Call Detail Record is generated. A subscriber calls someone, and the system generates an XDR or CDR. When an XDR is generated, it's ready to ship, so a transfer process starts using FTP/SCP and puts it into HDFS. It's a repeating process in order to keep up with the day.
Why don't we use Flume? Flume doesn't allow shipping files unless we install something on the servers. We can't install anything on the servers because their main objective is providing the service, and our subscribers might be affected. So we used Unix tools such as ftp, scp, and rsync with some modifications. I say some modifications because rsync and ftp work serially and we need to transfer the data as soon as possible. Data is also combined to fit the HDFS data blocks.
We did an experiment. We put 1 MB files into HDFS every 15 seconds without combining them. After that we tried to process them, but it failed with an out of memory exception. Easy, right? We increased the heap memory size, but it failed again. The message was "failed reallocation of scalar replaced objects", which is related to a Java 8 bug. This is why we're still continuing with this architecture, while on the other side we keep trying and experimenting with new developments.
How do you join using Perl? Joining is like putting pieces together, just like a puzzle. In the mapper phase, each record is split into two parts: a header and a trailer. The header part holds the join column and, if you're joining more than two files, the filename. The rest of the record is in the trailer part. That's all.
After we finished this project, it got a lot of attention and was used by a lot of other projects. We realized that it's like one tree in a forest: there may be a lot of other projects coming.
But it's not the only project running on this cluster. The data size on this cluster is growing too fast. Just a real-world example: within six months, the signaling data we call sentinel logs grew from 1 TB to 2 TB on a daily basis. Currently the size is 4.5 TB. Furthermore, we're not using LTE yet.
I come from Istanbul. Our nightmare is crossing between the continents. You can see in the picture why I'm saying that. There are lots of options for crossing: by bus, by car, by ferry, by tram, etc. People who live in Istanbul decide which is the best way to go, because there are bottlenecks at the bridges and the bridges can't handle these crowds. So we also must adapt to this volume without creating a bottleneck.
In the old days we still had a relatively small cluster. Moreover, we saw the "no space left on disk" error! We bought a brand new cluster. We left the old system, which had 4 data nodes and a master node, and upgraded to a new system. The new system is a little bit bigger: 16 data nodes, 1 master node, and 1 secondary master node. The main benefit of this system is seeing linear scalability. Our jobs have improved; for instance, a Perl job went from 212 minutes to 104 minutes.
But new projects keep coming. The next scenario is industry-specific analysis. It's a competitor analysis. There are lots of shopping centers in Istanbul, and customers can view them from their own point of view, broken down by age, income, job, sex, and so on. A pretty screenshot is on the left of the slide.
But how should we implement it? As you can imagine, this is an urgent request; it has to be done as soon as possible. But the output is already in files on HDFS. If we do it using Perl, it takes time. Do we need to re-implement the project we've already implemented, this time using Hive or Pig? That still takes more time.
The solution comes with Hive external partitions. We create an external table with partitions pointing to the output directory.
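To make this concrete, here is a minimal sketch of the idea, assuming a hypothetical table layout; only the ALTER TABLE ... ADD PARTITION ... LOCATION pattern comes from the slides.

-- Hypothetical external table; the column names are illustrative assumptions.
CREATE EXTERNAL TABLE t1 (
  msisdn  STRING,
  cell_id STRING
)
PARTITIONED BY (daily_calendar_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Point each daily partition at the directory the existing job already writes.
ALTER TABLE t1 ADD
PARTITION (daily_calendar_id='20160101')
LOCATION '/user/…/tlc/daily_calendar_id=20160101';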
The next use case is the movement index. In this project, subscriber journeys are analyzed via signaling data to determine how they travel between cities. In brief, we need to detect changes of city. The product is used by airline companies, bus companies, local government, and survey companies.
We know the base stations' coordinates, so the closeness of each pair of cells can be found using simple Euclidean distance. For those who do not remember, the formula is on the slide.
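For reference, the two-dimensional Euclidean distance between two cell coordinates $(x_1, y_1)$ and $(x_2, y_2)$ is

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}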
It's a simple cross join; you can see the Hive query on the slide.
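The production query isn't reproduced here, but a minimal sketch of such a cross join might look like this (the cells table and its columns are assumptions):

-- Sketch only: compute pairwise distances between base station cells.
SELECT a.cell_id AS cell_a,
       b.cell_id AS cell_b,
       SQRT(POW(a.x - b.x, 2) + POW(a.y - b.y, 2)) AS distance
FROM cells a
CROSS JOIN cells b
WHERE a.cell_id <> b.cell_id;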
Finally, all that needs to be done is one more simple query, right? Everything will be fine, don't you think?
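As an illustration of the kind of query meant here (not the one from the talk), a change of location per subscriber could be detected with a window function; the table and columns below are hypothetical.

-- Sketch: flag rows where a subscriber's city differs from the previous record.
SELECT msisdn, event_time, prev_city, city
FROM (
  SELECT msisdn,
         event_time,
         city,
         LAG(city) OVER (PARTITION BY msisdn ORDER BY event_time) AS prev_city
  FROM subscriber_locations
  WHERE daily_calendar_id = '20160101'
) t
WHERE prev_city IS NOT NULL
  AND prev_city <> city;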
But there is a problem. Hive doesn't find the partition column. But, it's there. What the hell? Are you kidding me?
When you google it, it turns out that it's a bug related to the Parquet format (HIVE-11401). The workaround is easy: set the index filter to false. But we can't use file format indexes when it's turned off. The permanent fix is in Hive 2.0, so we had to live with the workaround until the next upgrade.
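In practice the workaround from the slide is just a session setting issued before the failing query, for example:

-- HIVE-11401 workaround: disable index filtering for this session
-- (at the cost of losing file-format index pushdown).
SET hive.optimize.index.filter=false;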
Is it enough? Hell no, there are lots of ongoing projects.
Movement prediction which is predicting the subscriber's location.
Real-time location analysis with historical data. Customers want to query the subscribers who are in a shopping mall within the last 15 minutes and who have also been there before, about 1 month ago.
We're trying SQL-on-Hadoop tools such as Spark SQL and Impala for ad-hoc queries.
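Purely as an illustration of the kind of ad-hoc question described above (nothing here is from the talk; the table, columns, and epoch-seconds event_ts are assumptions), such a query might look like:

-- Subscribers seen at a given mall in the last 15 minutes who were also
-- seen there before, up to about a month back.
SELECT DISTINCT cur.msisdn
FROM location_events cur
JOIN location_events hist
  ON cur.msisdn = hist.msisdn
WHERE cur.site_id  = 'mall_x'
  AND hist.site_id = 'mall_x'
  AND cur.event_ts  >= unix_timestamp() - 15 * 60               -- last 15 minutes
  AND hist.event_ts >= unix_timestamp() - 30 * 24 * 60 * 60     -- within the last month
  AND hist.event_ts <  unix_timestamp() - 24 * 60 * 60;         -- but before today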
This is teamwork. I thank my colleagues Caner and Uğur for their help.
That's all. Thank you very much for listening to me.