This document summarizes Burak IŞIKLI and Erkan HASPULAT's presentation on Turkcell's use of Hadoop for customer analytics. Turkcell processes 15TB of data and collects 6TB of logs daily with Hadoop, running roughly 350 jobs a day. They use Hadoop for ETL, analytics like fraud detection and recommendations, and a data lake. Location analysis joins call data to determine subscribers' locations, processing 11 billion rows (0.5TB) daily. As data volumes grew they upgraded Hadoop and expanded storage from 15TB to 698TB. They also developed a movement index and industry-specific analyses using Hive.
7. Issues
• Default Python is 2.6, but Spark IPython works with Python 2.7+
• Security & Auditing Issues
  Copying Data by Masking
  Dynamic Data Masking
  SOX Compliance
8. Location Analysis
Find the subscriber's location using cell information.
11 billion rows/day
0.5 TB/day
2.5 hours processing time
Hadoop Streaming w/Perl
Sqoop
16. Upgrade
• No space left on disk!
• Hadoop upgrade: 0.23 -> 1.3.1 -> 2.7.1
• Linear scalability

            Old cluster    New cluster
Perl Job    212 min        104 min
Pig Job     77 min         18 min
Nodes       1M + 4D        1M + 1SM + 16D
Disk        15 TB          698 TB
CPU         20 Core        224 Core
Memory      1024 GB        1.5 TB
Version     0.23           HDP 2.3.2
19. Industry-Specific Analysis
But how?
Hive external partition
ALTER TABLE t1 ADD
PARTITION (DAILY_CALENDAR_ID='20160101')
LOCATION '/user/…/tlc/daily_calendar_id=20160101';
20. Movement Index
Subscriber journeys are analyzed via signaling data to determine how they travel between cities.
• Airline companies
• Bus companies
• Local government
• Survey companies
21. Movement Index
Simple Euclidean Distance
Equ: https://en.wikipedia.org/wiki/Euclidean_distance
Finding the change of location
First, find the closeness of each pair of cells using their coordinates
24. Movement Index
But one problem!
java.io.IOException: java.lang.IllegalArgumentException:
Column [daily_calendar_id] was not found in schema!
Are you kidding me!!
Img: http://bit.ly/1N3QKRQ
25. Movement Index
Just another bug: HIVE-11401
Workaround solution: hive.optimize.index.filter=false;
Permanent solution: Hive 2.0
Img: http://bit.ly/1W1USZM
Hi, I know it's one of the last sessions and you're tired, so I'll be as brief as I can. How do you decide where your customer was? I'm gonna be talking about location analysis and Hadoop architecture. Let me begin by telling you about ourselves and Turkcell.
My name is Burak ISIKLI. I've been a senior software engineer at Turkcell for 7 years, and I'm also a PhD candidate. My colleague is Erkan Haspulat. He's also a senior software engineer at Turkcell. Unfortunately he's not here today because he broke his leg.
Turkcell is one of the leading telecommunication companies in Turkey and the second largest telco in Europe. There are 34.7 million subscribers in Turkey and almost 72 million subscribers across 9 countries. It is the first and only Turkish company ever to be listed on the NYSE.
So let's get started with our Hadoop journey. It started in 2013 with a relatively small cluster. 15 TB of data is processed per day, 6 TB of data is collected from source servers, and approximately 350 jobs run every day.
We've done a lot of things on the ETL side, such as CDR processing and log analysis; on the analytics side, fraud analytics and so on; and also a data lake for our customer journey project. So which architecture do we use for these projects?
This is an abstract view of the architecture at Turkcell. We collect a lot of huge logs. We use Unix and internal tools for collecting logs, and Tuna for extracting data from databases. Tuna is our extraction and loading platform on top of Sqoop. For transforming the data, Hive, Pig, Hadoop streaming with Perl, and Spark are used. After the data is processed, we either load it into a database, which is Oracle for us, for reporting and dashboards, or the data is sent to other systems or triggers some alarm mechanism.
Our system is a Teradata Appliance which includes Hortonworks HDP 2.3.2. It may limit you in some ways; for instance, the system default Python is 2.6 while Spark IPython works with 2.7 or higher. If you do data science, Spark IPython helps you a lot. We still have security issues because the system has to be SOX compliant. As you can see, there are lots of incubating projects such as Sentry, RecordService, Atlas, Ranger and so on, but we couldn't find any solution for copying data to the development cluster with masking, or for dynamic data masking. It's still under development.
I'll present a couple of use cases and how we do them. The first use case is finding the subscriber's location. When your telephone is on, it sends signals including information about which base station it is attached to, even if you're not talking. CDR systems transform this information into logs. After that, all we can do is collect and process them. We process 11 billion rows, 0.5 TB of data, per day. It takes 2.5 hours using Hadoop streaming with Perl. The output is loaded into the database using Tuna. How do we collect these logs?
This is a picture of how a Call Detail Record is generated. A subscriber calls someone, and the system generates an XDR or CDR. When an XDR is generated, it's ready to ship, so a transfer process starts using FTP/SCP and puts it into HDFS. It's a repeating process in order to keep up with the day.
Why don't we use Flume? Flume doesn't allow shipping files unless we install something on the servers. We can't install anything on the servers because their main objective is providing the service, and our subscribers might be affected. So we used Unix tools such as ftp, scp, and rsync with some modifications. I say some modifications because rsync and ftp work serially and we need to transfer the data as soon as possible. Data is also combined to fit the HDFS data blocks.
We did an experiment. We put 1 MB files into HDFS every 15 seconds without combining them. After that we tried to process them, but it failed with an out of memory exception. Easy, right? We increased the heap memory size, but it failed again. The message was "failed reallocation of scalar replaced objects", which is related to a Java 8 bug. This is why we're still continuing with this architecture, while on the other side we keep trying and experimenting with new developments.
How do you join using Perl? Joining is like putting pieces together, just like a puzzle. In the mapper phase, each record is split into two parts: a header and a trailer. The header part holds the join column and, if you're joining more than two files, the filename. The rest of the record is in the trailer part. That's all.
After we finished this project, it got a lot of attention and was used by a lot of other projects. We realized that it's like one tree in a forest: there may be a lot of other projects coming.
But it's not the only project running on this cluster. The data size on this cluster is growing too fast. Just a real-world example: within six months, the signaling data we call sentinel logs grew from 1 TB to 2 TB on a daily basis. Currently the size is 4.5 TB. Furthermore, we're not using LTE yet.
I come from Istanbul. Our nightmare is crossing between the continents. You can see in the picture why I'm saying that. There are lots of options for crossing: by bus, by car, by ferry, by tram, etc. People who live in Istanbul decide which is the best way to go, because there are bottlenecks at the bridges and the bridges can't handle these crowds. So we also must adapt to this volume without creating a bottleneck.
In the old days we still had a relatively small cluster. Moreover, we saw the "no space left on disk" error! We bought a brand new cluster. We left the old system, which had 4 data nodes and a master node, and upgraded to a new system. The new system is a little bit bigger: 16 data nodes, 1 master node, and 1 secondary master node. The main benefit of this system is seeing linear scalability. Our jobs have improved; for instance, a Perl job went from 212 minutes to 104 minutes.
But new projects keep coming. The next scenario is industry-specific analysis. It's a competitor analysis. There are lots of shopping centers in Istanbul, and customers can view them from their own point of view, broken down by age, income, job, sex, and so on. A pretty screenshot is on the left of the slide.
But how should we implement it? As you can imagine, this is an urgent request; it has to be done as soon as possible. But the output is already in files on HDFS. If we do it using Perl, it takes time. Do we need to re-implement the project we've already implemented, this time using Hive or Pig? That still takes more time.
The solution comes with Hive external partitions. We create an external table with partitions pointing to the output directory.
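To make this concrete, here is a minimal sketch of the idea, assuming a hypothetical table layout; only the ALTER TABLE ... ADD PARTITION ... LOCATION pattern comes from the slides.

-- Hypothetical external table; the column names are illustrative assumptions.
CREATE EXTERNAL TABLE t1 (
  msisdn  STRING,
  cell_id STRING
)
PARTITIONED BY (daily_calendar_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Point each daily partition at the directory the existing job already writes.
ALTER TABLE t1 ADD
PARTITION (daily_calendar_id='20160101')
LOCATION '/user/…/tlc/daily_calendar_id=20160101';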
The next use case is the movement index. In this project, subscriber journeys are analyzed via signaling data to determine how they travel between cities. In brief, we need to detect changes of city. The product is used by airline companies, bus companies, local government, and survey companies.
We know the base stations' coordinates, so the closeness of each pair of cells can be found using simple Euclidean distance. For those who do not remember, the formula is on the slide.
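For reference, the two-dimensional Euclidean distance between two cell coordinates $(x_1, y_1)$ and $(x_2, y_2)$ is

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}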
It's a simple cross join; you can see the Hive query on the slide.
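The production query isn't reproduced here, but a minimal sketch of such a cross join might look like this (the cells table and its columns are assumptions):

-- Sketch only: compute pairwise distances between base station cells.
SELECT a.cell_id AS cell_a,
       b.cell_id AS cell_b,
       SQRT(POW(a.x - b.x, 2) + POW(a.y - b.y, 2)) AS distance
FROM cells a
CROSS JOIN cells b
WHERE a.cell_id <> b.cell_id;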
Finally, all that needs to be done is one more simple query, right? Everything will be fine, don't you think?
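As an illustration of the kind of query meant here (not the one from the talk), a change of location per subscriber could be detected with a window function; the table and columns below are hypothetical.

-- Sketch: flag rows where a subscriber's city differs from the previous record.
SELECT msisdn, event_time, prev_city, city
FROM (
  SELECT msisdn,
         event_time,
         city,
         LAG(city) OVER (PARTITION BY msisdn ORDER BY event_time) AS prev_city
  FROM subscriber_locations
  WHERE daily_calendar_id = '20160101'
) t
WHERE prev_city IS NOT NULL
  AND prev_city <> city;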
But there is a problem. Hive doesn't find the partition column. But, it's there. What the hell? Are you kidding me?
When you google it, it turns out that it's a bug related to the Parquet format (HIVE-11401). The workaround is easy: set the index filter to false. But we can't use file format indexes when it's turned off. The permanent fix is in Hive 2.0, so we had to live with the workaround until the next upgrade.
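In practice the workaround from the slide is just a session setting issued before the failing query, for example:

-- HIVE-11401 workaround: disable index filtering for this session
-- (at the cost of losing file-format index pushdown).
SET hive.optimize.index.filter=false;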
Is it enough? Hell no, there are lots of ongoing projects.
Movement prediction which is predicting the subscriber's location.
Real-time location analysis with historical data. Customers want to query the subscribers who are in a shopping mall within the last 15 minutes and who have also been there before, about 1 month ago.
We're trying SQL-on-Hadoop tools such as Spark SQL and Impala for ad-hoc queries.
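Purely as an illustration of the kind of ad-hoc question described above (nothing here is from the talk; the table, columns, and epoch-seconds event_ts are assumptions), such a query might look like:

-- Subscribers seen at a given mall in the last 15 minutes who were also
-- seen there before, up to about a month back.
SELECT DISTINCT cur.msisdn
FROM location_events cur
JOIN location_events hist
  ON cur.msisdn = hist.msisdn
WHERE cur.site_id  = 'mall_x'
  AND hist.site_id = 'mall_x'
  AND cur.event_ts  >= unix_timestamp() - 15 * 60               -- last 15 minutes
  AND hist.event_ts >= unix_timestamp() - 30 * 24 * 60 * 60     -- within the last month
  AND hist.event_ts <  unix_timestamp() - 24 * 60 * 60;         -- but before today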
This is teamwork. I thank my colleagues Caner and Uğur for their help.
That's all. Thank you very much for listening to me.