Understanding Big Data summarizes big data and popular big data technologies. It discusses how big data is generated from various sources and is too large to be processed by traditional databases. Popular technologies like Hadoop, HDFS, MapReduce, Hive, Pig, HBase, and Mahout are able to collect, store, process, and analyze big data. Companies are using big data to gain insights from customer data, optimize operations, prevent fraud, and make recommendations.
2. First of All, What Is Big Data?
• Big data is collected in unstructured form from different sources such as social media posts, network logs, blogs, photos, videos, log files, etc.
• Big data could not be analyzed before because of its enormous size and/or the diversity of the information in the data.
3. 1) Why Big Data? How did we get to this point?
2) What are the components of Big Data?
3) How will it integrate with existing systems?
4) Are there any Big Data platforms / applications?
4. • Almost all data scientists believe that companies able to analyze the information collected from all their sources gain very important advantages.
• For example, according to researchers, retailers that analyze their data can improve their operating profit margins by 60 percent. Similar rates are also valid for the public sector.
• According to research, US healthcare could save 300 million dollars a year by using data analysis.
• According to estimates, by 2020 about 50 billion devices, from household appliances to cars and phones, will produce data and silently communicate with each other.
• For companies that wish to make forward-looking decisions, the predictions drawn from this «Big Data» must be accurate.
5. ● Energy companies use smart grids and meters to store and process usage data for individual subscribers.
● Banks have become 24/7 branches: using the information they store about customers, the Internet branch recognizes the user, knows when they log in, arranges the main page menu accordingly for the most efficient use, and offers reminders, customizable interfaces, and a rich, fast, and convenient branch experience.
● Hospitals store data in their databases in order to provide effective medical services.
● All of this information is stored as "Big Data".
7. Rather than using conventional methods, Google created its own technologies to meet its requirements, and it succeeded. Google keeps billions of web pages on the Google File System, uses Bigtable as its database, and uses MapReduce to process Big Data.
See http://www.google.com/about/datacenters
Google and Amazon publish academic articles about their work. Developers inspired by these articles, such as Doug Cutting, created similar technologies as open source. The best examples are usually Apache projects such as Lucene, Solr, Hadoop, and HBase. Each of these projects can handle Big Data successfully.
A second generation of companies, such as Facebook, Twitter, and LinkedIn, goes a step further by publishing the projects they develop for Big Data as open source. Cassandra, Hive, Pig, Voldemort, Storm, and IndexTank are examples.
8. So how will Big Data be stored?
How will it be processed?
How will it be analyzed?
9. 4 «components» of being Big
Volume: 12+ terabytes of Tweets created daily.
Velocity: 5+ million trade events per second.
Variety: 100’s of different types of data.
Veracity: Only 1 in 3 decision makers trust their information.
11. Social Networks and Social Work
• Expected to reach 3 billion users. In August 2015: 12 TB of «log data» produced per day.
• Expected to reach 500+ million users by the end of 2014. 160 million users are online.
• 100 million active users. 12+ TB of tweet data per day!
12. Social Networks and Social Work
Google processes 24 petabytes of data every day.
4.6 billion mobile phones exist.
2 billion Internet users; annual traffic in 2014 equals 667 exabytes.
15. “Data generated by machines and sensors
will exceed that generated by social media by
at least a factor of 10.” *
Leon Katsnelson
Program Director, Big Data & Cloud Computing, IBM
16. By the way, we don't have enough space to store all this data!
17. "We appear to be a game company. In fact, we are a data analysis company."
Ken Rudin, Zynga VP of Analytics
• Offers games completely free of charge.
• Earns revenue by selling virtual goods.
• Averages 232 million monthly active users.
• 95% of players have never visited the shop!
• By using Big Data analysis, they shook up the game world.
18. Four Entry Points of Big Data
• Unlock Big Data
• Simplify Your Warehouse
• Preprocess Raw Data
• Analyse Streaming Data
IBM Big Data Platform
• Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
• Visualization & Discovery, Application Development, Systems Management
• Accelerators
• Hadoop System, Stream Computing, Data Warehouse
• Information Integration & Governance
19. Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel Sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
20. "Companies that give importance to social media are winning"
The "Big Data: Great Opportunity" event, organized by Teradata in Turkey, brought together senior company executives in Istanbul. Hermann Wimmer, President of Teradata EMEA, said that analyzing data obtained from social media in a short time helps companies that want to increase their profitability and get ahead of their competitors to make quick decisions.
23. Big Data and Open Source Are Intertwined
• Open source community contributions made over the years
- Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta
- Eclipse: founded by IBM.
- Lucene: IBM Lucene Extension Library (ILEL)
- DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux...
• IBM software built on open source
– WebSphere: Apache
– Rational: Eclipse and Apache
– InfoSphere: Eclipse and Apache
• IBM’s BigInsights (Hadoop) is 100% open source software.
24. Hadoop
• Open source
• Distributed computing
• Put very simply: MapReduce
• Connected computers
27. HDFS
• Based on the Google File System
• Distributed disk file system
• File clustering (files are usually larger than a gigabyte)
• Fault tolerant
• Replication
• HDFS knows the physical location of the data.
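The replication and fault-tolerance bullets above can be illustrated with a small sketch. It is plain Python, not the HDFS API: the function names are hypothetical, the block size and replication factor are assumed values chosen for illustration, and the round-robin placement is a simplification of how a real cluster chooses nodes.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB), for illustration
REPLICATION = 3                 # assumed replication factor, for illustration


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """The file stays readable if each block still has a replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)        # a 300 MiB file
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    print(len(blocks), placement)
    print(readable_after_failure(placement, "dn2"))
```

With more than one replica per block, losing any single node leaves every block readable, which is the point of the replication bullet.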
28. HDFS
• Accessible through the Hadoop shell, the Java API, or the web UI
• Four application nodes:
– NameNode – manages the metadata of the file system
– JobTracker – MapReduce
– TaskTracker – MapReduce
– DataNode – stores the data, in coordination with the NameNode
29. BIG DATA is not just HADOOP
Manage & store huge volume of any data: Hadoop File System, MapReduce
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Structure and control data: Data Warehousing
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
Understand and navigate federated big data sources: Federated Discovery and Navigation
30. MapReduce
• Big Data processing (Google)
• Distributed computation model
• Processes data as key-value pairs
• Easy programming framework
– To use it, you implement the map() and reduce() functions
31. Map: First Step
• Prepares elements for the next stage of processing by pairing each one with a key
– Data cleaning
– Simple calculations
– Splitting strings
32. Reduce: Last Step
• Receives the list of values for the same key through iterators
– Filtering
– Combining
– Sampling
The values are thus reduced.
The result is written to HDFS or HBase
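The map and reduce steps described above can be sketched as a word count in plain Python. This is a single-process illustration of the programming model, not Hadoop code: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the in-memory grouping stands in for the shuffle that a cluster performs between the two steps.

```python
from collections import defaultdict


def map_fn(line):
    """Map: clean the record, split the string, and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: combine the iterator of values that share one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map to every record, group the values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Each reducer call sees one key and an iterator over its values.
    return dict(reduce_fn(key, iter(values)) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["big data", "Big Hadoop"]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

On a real cluster the final dictionary would instead be written out to storage such as HDFS or HBase.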
34. Apache Pig
• Developed at Yahoo!
• Designed for easy analysis of large data sets
• Easier than writing MapReduce functions in Java
– Code is written in Pig Latin
– Can be extended
• Similar to other languages
• 10 lines of Pig code may be equivalent to hundreds of lines of Java
35. Running Pig
• Grunt shell
• Java interface
• Eclipse and IntelliJ IDEA plugins
tweets = LOAD '/today/tweets' AS (user, mention, tweet);
twitters = GROUP tweets BY mention;
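For readers who do not know Pig, the grouping in the two statements above can be sketched in plain Python. This is not Pig or its Java interface; `group_by_mention` is a hypothetical helper, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def group_by_mention(tweets):
    """Collect (user, mention, tweet) rows into one bag of rows per mention."""
    grouped = defaultdict(list)
    for user, mention, tweet in tweets:
        grouped[mention].append((user, mention, tweet))
    return dict(grouped)


if __name__ == "__main__":
    # Made-up sample rows standing in for the loaded tweets.
    rows = [
        ("ann", "hadoop", "trying @hadoop"),
        ("bob", "pig", "learning @pig"),
        ("cem", "hadoop", "@hadoop rocks"),
    ]
    for mention, bag in group_by_mention(rows).items():
        print(mention, len(bag))
```

Each key in the result plays the role of the group value, and each list plays the role of the bag of rows that share it.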
36. Apache Hive
• Data warehouse project for Hadoop
• Data summarization
• Ad hoc queries
• SQL-like language: HiveQL
– Allows custom mappers and reducers to be defined
Special note*: The Hive compiler translates SQL queries into MapReduce operations
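To make the relationship between a SQL-like query and MapReduce concrete, here is a rough sketch that runs one GROUP BY query and then computes the same answer with explicit map and reduce phases. It makes several assumptions: SQLite from the Python standard library stands in for Hive only so that the query can run locally, the table and column names are made up, and the hand-written phases show the idea rather than the plan Hive actually generates.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (author, mention).
ROWS = [("ann", "hadoop"), ("bob", "hive"), ("cem", "hadoop")]
QUERY = "SELECT mention, COUNT(*) FROM tweets GROUP BY mention ORDER BY mention"


def run_sql(rows):
    """Run the declarative query (SQLite is a local stand-in for Hive here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (author TEXT, mention TEXT)")
    conn.executemany("INSERT INTO tweets VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def run_as_mapreduce(rows):
    """Compute the same answer with an explicit map phase and reduce phase."""
    groups = defaultdict(list)
    for _author, mention in rows:          # map: emit (mention, 1)
        groups[mention].append(1)
    return [(key, sum(ones)) for key, ones in sorted(groups.items())]  # reduce


if __name__ == "__main__":
    print(run_sql(ROWS))
    print(run_as_mapreduce(ROWS))
```

Both functions return the same rows, which is why an analyst can write the one-line query and leave the map and reduce work to the compiler.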
37. HBase
We do not always need relational databases
We need scalability
Tables can be very large, so we want very fast access
Distributed, key-sorted, persistent map
38. HBase
A clone of Google Bigtable
Runs on HDFS
*fault tolerance
*scalability
*MapReduce input/output
HBase = HDFS + random read/write
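The idea of a key-sorted map with random reads and writes can be sketched in a few lines of Python. This toy class is neither distributed nor persistent and is not the HBase client API; `SortedKeyMap` and its method names are hypothetical, chosen to show why keeping row keys sorted makes range scans cheap.

```python
import bisect


class SortedKeyMap:
    """In-memory map whose keys are kept sorted, allowing point and range access."""

    def __init__(self):
        self._keys = []     # row keys in sorted order
        self._values = {}   # row key -> value

    def put(self, key, value):
        """Random write: insert or overwrite a single row."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Random read: fetch a single row by key, or None if absent."""
        return self._values.get(key)

    def scan(self, start, stop):
        """Range scan: rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


if __name__ == "__main__":
    table = SortedKeyMap()
    table.put("video:001", "intro.mp4")
    table.put("user:002", "bob")
    table.put("user:001", "ann")
    print(table.get("user:002"))
    print(table.scan("user:", "video:"))
```

Because related rows share a key prefix, one scan over a contiguous key range returns them together without touching the rest of the table.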
39. HBase
Where to Use:
Social Media
Recommendation Systems
Search Engines
Intelligence and Monitoring Services
Financial Systems - fraud
40. Apache Mahout
• Lets us go beyond simple analysis
• Classification: email spam, call centers
• Clustering: finding new news stories
• Recommendation systems
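As a rough idea of what a recommendation system computes, the sketch below scores items by how much the users who chose them overlap with the target user. It is plain Python and not Mahout's API; `recommend` is a hypothetical function, the history data is made up, and the overlap-count score is a deliberately simple stand-in for the similarity measures a real library offers.

```python
from collections import Counter


def recommend(user, history, top_n=2):
    """Suggest items the user has not seen, weighted by overlap with other users."""
    seen = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)     # shared items measure similarity
        if overlap == 0:
            continue                    # ignore users with nothing in common
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Made-up viewing history: user -> set of items.
    history = {
        "ann": {"a", "b"},
        "bob": {"a", "b", "c"},
        "cem": {"b", "d"},
        "dan": {"e"},
    }
    print(recommend("ann", history))
```

Here the item liked by the most similar user ranks first, and the item from a user with no shared history is never suggested.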
45. IBM Big Data Platform
IBM Big Data & Netezza Product Group
Hadoop
• InfoSphere BigInsights: Hadoop-based, low-latency analysis of diverse, high-volume data
MPP Data Warehouse
• IBM Netezza High Capacity Appliance: queryable archive of structured data
• IBM Netezza 1000: BI + ad hoc analysis of structured data
• IBM Smart Analytics System: structured analysis of operational data
• IBM Informix Timeseries: time-structured analytics
• IBM InfoSphere Warehouse: high-volume structured data analysis
Stream Computing
• InfoSphere Streams: streaming analysis for low-latency data
Information Integration
• InfoSphere Information Server: high-volume data integration and transformation
46. Big Data Exploration: Value & Diagram
Sources: File Systems, Relational Data, Content Management, Email, CRM, Supply Chain, ERP, RSS Feeds, Cloud, Custom Sources
Data Explorer: Application / Users
Find, visualize & understand all big data to improve business knowledge
• Greater efficiencies in business
processes
• New insights from combining and
analyzing data types in new ways
• Develop new business models with
resulting increased market presence
and revenue
47. Analysis Across Diverse Data
Analyzing data with mixed characteristics.
Dynamic Data Analysis
High-volume streaming data, ad hoc analysis
Explore and Experiment
Ad hoc analysis, discovery, and inspection of data
Manage and Plan
Data rules, data integrity checking and enforcement
Very High Volume Data Analysis
Analysis of PB-scale data at an appropriate price/performance
What does the IBM platform offer?