1. © 2014 IBM Corporation1
Big Data
Xavier Constant
xavier.constant@es.ibm.com
Lecture at EADA
International Master in Marketing (2014)
2. © 2014 IBM Corporation2
Big Data Concepts
Big Data Technology
Data Scientists
3. © 2014 IBM Corporation3
Traditional DW
Operational systems (ERP, CRM, flat files, spreadsheets) feed the data warehouse(s) and data marts via ETL; a BI server delivers reports and dashboards on top.
BENEFITS:
Mature technology
SQL language (declarative, non-technical)
Skills & resources availability (programmers, DBAs, …)
LIMITATIONS:
Big operational data volumes
– Queries take too long or don’t even finish
– Admin complexity (partitions, archiving, …)
New data types
– Free text, images, video, audio, …
– Data in real time (sensors, logs, geospatial data, etc.)
New analysis types
– Exploratory
– Predictive
4. © 2014 IBM Corporation4
1 in 2 business leaders don’t have access to the data they need
83% of CIOs cited BI and analytics as part of their visionary plans
5.4X more likely that top performers use business analytics
80% of the world’s data today is unstructured
90% of the world’s data was created in the last two years
20% of available data can be processed by traditional systems
Source: GigaOM, Software Group, IBM Institute for Business Value
Intrinsic Property of Data … it grows
5. © 2014 IBM Corporation5
Characteristics of Big Data
Velocity is the game changer: it’s NOT just how fast data is produced or changes, BUT the speed at which it must be received, understood, analyzed, and processed.
6. © 2014 IBM Corporation6
Paradigm shifts enabled by big data I
Leverage more of the data being captured
7. © 2014 IBM Corporation7
Paradigm shifts enabled by big data I
Leverage more of the data being captured
Bank X
8. © 2014 IBM Corporation8
Paradigm shifts enabled by big data II
Reduce effort required to leverage data
9. © 2014 IBM Corporation9
Paradigm shifts enabled by big data II
Reduce effort required to leverage data
10. © 2014 IBM Corporation10
Paradigm shifts enabled by big data III
Data leads the way – and sometimes correlations are good enough
11. © 2014 IBM Corporation11
Paradigm shifts enabled by big data III
Data leads the way – and sometimes correlations are good enough
Hypothesis-based correlation vs. weird correlation
12. © 2014 IBM Corporation12
Paradigm shifts enabled by big data III
Data leads the way – and sometimes correlations are good enough
13. © 2014 IBM Corporation13
Paradigm shifts enabled by big data IV
Leverage data as it is captured
14. © 2014 IBM Corporation14
Paradigm shifts enabled by big data IV
Leverage data as it is captured
15. © 2014 IBM Corporation15
Complementary Analytics
Traditional approach: structured, analytical, logical – structured, repeatable, linear analysis over the data warehouse and traditional databases (internal app data, transaction data, mainframe data, OLTP system data, ERP data).
New approach: creative, holistic thought, intuition – unstructured, exploratory, dynamic analysis with Hadoop and Streams over new sources (multimedia, web logs, social data, sensor data such as images and RFID, text data such as emails).
16. © 2014 IBM Corporation16
Types of Analytic Tools
17. © 2014 IBM Corporation17
Organisations are prioritising internal data sources
Untapped stores of internal data
Size and scope of some internal data, such as
detailed transactions and operational log data,
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected, but not
analyzed, for years
Focus on customer insights
Customers – influenced by digital experiences – often expect that information provided to an organization will then be “known” during future interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions, Emails, Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization.
18. © 2014 IBM Corporation18
Stages of Big Data adoption
Big data adoption
When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. (Total respondents n = 1,061; totals do not equal 100% due to rounding.)
19. © 2014 IBM Corporation19
Hadoop workloads
[Chart: share of Hadoop adopters using it as a staging area, online archive, transformation engine, and for ad hoc queries, scheduled reports, visual exploration, and data mining – today vs. in 18 months (reported values range from 25% to 92%).]
Based on respondents that have implemented Hadoop. BI
Leadership Forum, April, 2012
20. © 2014 IBM Corporation20
Big Data Exploration
Find, visualize, understand all
big data to improve decision
making
Enhanced 360° View
of the Customer
Extend existing customer views
(MDM, CRM, etc) by
incorporating additional
internal and external
information sources
Operations Analysis
Analyze a variety of machine
data for improved business results
Data Warehouse Modernization
Integrate big data and data warehouse
capabilities to increase operational efficiency
Security/Intelligence
Extension
Lower risk, detect fraud and
monitor cyber security in
real-time
Key Big Data Use Cases
21. © 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
22. © 2014 IBM Corporation22
Solution for Big Data
Data at rest:
– Data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion:
– Data are analyzed in real time, at the moment they are generated,
without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams / CEP solutions
23. © 2014 IBM Corporation23
Hardware improvements through the years...
CPU Speeds:
– 1990 - 44 MIPS at 40 MHz
– 2000 - 3,561 MIPS at 1.2 GHz
– 2010 - 147,600 MIPS at 3.3 GHz
RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2000 – 64MB memory
– 2010 - 8-32GB (and more)
Disk Capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
Disk transfer rate (speed of reads and writes) – not much improvement in the last 7–10 years,
currently around 70–80 MB/sec
24. © 2014 IBM Corporation24
How long will it take to read 1TB of data?
1TB (at 80MB/sec):
– 1 disk - 3.4 hours
– 10 disks - 20 min
– 100 disks - 2 min
– 1000 disks - 12 sec
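The arithmetic behind these figures: 1 TB is roughly 1,000,000 MB, so one disk at 80 MB/sec needs about 12,500 seconds, i.e. ~3.5 hours; reading the same data from N disks in parallel divides that time by N (10 disks ≈ 21 min, 100 disks ≈ 2 min, 1,000 disks ≈ 12.5 sec).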
25. © 2014 IBM Corporation25
Parallel Data Processing is the answer!
It has been with us for a while:
– GRID computing - spreads processing load
– Distributed workload - hard to manage applications, overhead on
developer
– Parallel databases – DB2 DPF, Teradata, Netezza, etc (distribute the
data)
26. © 2014 IBM Corporation26
What is Apache Hadoop?
An Apache open source software framework.
Flexible, enterprise-class support for processing large volumes of
data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search
technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel, cost effective manner
– CPU + local disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
27. © 2014 IBM Corporation27
Design principles of Hadoop
New way of storing and processing the data:
– Let system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Meant for heterogeneous commodity hardware
Bring processing to Data!
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
28. © 2014 IBM Corporation28
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the Divide and Conquer paradigm.
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Diagram: a logical file is divided into blocks (1–4); the blocks are spread across the nodes of the cluster, and each block is also replicated on several nodes for resiliency.]
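As a minimal sketch of what this looks like from Java client code (the property names follow the Hadoop 2.x era; the file path, block size, and replication factor are illustrative assumptions, not BigInsights defaults):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    conf.set("dfs.replication", "3");           // keep 3 copies of every block
    conf.set("dfs.blocksize", "134217728");     // split files into 128 MB blocks
    FileSystem fs = FileSystem.get(conf);
    // The "logical file" is written once; HDFS decides which data nodes
    // store and replicate each of its blocks.
    fs.copyFromLocalFile(new Path("/tmp/logical-file.dat"),
                         new Path("/user/demo/logical-file.dat"));
    fs.close();
  }
}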
29. © 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break the job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);   // emit <word, 1> for every token
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();             // add up all the 1s for this word
    }
    result.set(sum);
    context.write(key, result);   // emit <word, total count>
  }
}
Distribute map
tasks to cluster
Hadoop Data Nodes
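For completeness, a hedged sketch of the driver that would wire the mapper and reducer above into a runnable job (the class name and the Hadoop 2.x-style Job.getInstance call are illustrative; older 1.x code builds the job with new Job(conf, ...) instead):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the reducer also works as a local combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}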
30. © 2011 IBM Corporation30
MapReduce Example
Count the number of occurrences of each word
Entry data:
Hello World Bye World
Hello IBM
Map process:
Map 1: < Hello, 1> < World, 1> < Bye, 1> < World, 1>
Map 2: < Hello, 1> < IBM, 1>
Shuffle process: group the interim pairs by key
Reduce process (final output):
< Bye, 1> < IBM, 1> < Hello, 2> < World, 2>
31. © 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
It's not just about runtime; the development phase has to be taken into account too.
Although the Hadoop framework is implemented in Java,
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model, a few
application development languages have emerged that build on top
of Hadoop:
– Pig
– Hive
– Jaql
– ...
32. © 2014 IBM Corporation32
Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind the scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
reads/writes or low-latency queries
33. Pig, Hive, Jaql – Differences
Characteristic              Pig          Hive                          Jaql
Developed by                Yahoo!       Facebook                      IBM
Language                    Pig Latin    HiveQL                        Jaql
Type of language            Data flow    Declarative (SQL dialect)     Data flow
Data structures supported   Complex      Better suited to structured   JSON, semi-structured
Schema                      Optional     Not optional                  Optional
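Because HiveQL statements are compiled into MapReduce jobs by the Hive service, a higher-level query can be issued from ordinary client code; a minimal sketch using the standard HiveServer2 JDBC driver (host, port, table, and column names are illustrative assumptions):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 typically listens on port 10000; the query below is turned
    // into MapReduce jobs by the Hive service, not by this client.
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://hivehost:10000/default", "user", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}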
35. © 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization & Discovery
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store
HBase
Text Processing Engine & Extractor Library
BigSheets JDBC
Applications & Development
Text Analytics MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text
Compression
Enhanced
Security
Flexible
Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive
MapReduce
Hive
Integrated
Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard &
Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
R
Guardium
Platform
Computing
Cognos
IBM / Open Source (legend)
GPFS-FPO
Big SQL
NameNode
High Avail
Avro
36. © 2014 IBM Corporation36
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are
contained within a file, and is validated as the data is written to the file using the Avro APIs. Users can include primitive data
types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications
are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together. This approach is different from a row-oriented
relational database, where all columns of a row are stored together. (A minimal Java client sketch follows at the end of this list.)
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored. You can change how you write data, while still supporting existing data in
older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements,
which are broken down by the Hive service into MapReduce jobs, and then run across a Hadoop cluster. InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software.
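The HBase client sketch referenced above – a hedged example against the 2014-era (pre-1.0) HBase Java API; the table, column family, qualifier, and row key are illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "customers");       // table must already exist
    // Write one cell: row "cust-001", column family "profile", qualifier "email".
    Put put = new Put(Bytes.toBytes("cust-001"));
    put.add(Bytes.toBytes("profile"), Bytes.toBytes("email"),
            Bytes.toBytes("jane@example.com"));
    table.put(put);
    // Read the same cell back.
    Get get = new Get(Bytes.toBytes("cust-001"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"))));
    table.close();
  }
}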
37. © 2014 IBM Corporation37
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a
collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key
component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene
libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build,
scan, and query Lucene indexes
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides
users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the
required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system.
R: A Project for Statistical Computing
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses.
Zookeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper
maintains common objects that are needed in large cluster environments, such as configuration information, distributed
synchronization, and group services.
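As a hedged illustration of the ZooKeeper item, a minimal sketch with the plain ZooKeeper Java client (the connection string, znode path, and payload are illustrative assumptions):
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (30 s session timeout, no-op watcher).
    // Production code would wait for the SyncConnected event before issuing requests.
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) { }
    });
    // Publish a small piece of shared configuration as a persistent znode...
    zk.create("/demo-config", "feature-x=on".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // ...which any node in the cluster can then read back.
    byte[] value = zk.getData("/demo-config", false, null);
    System.out.println(new String(value));
    zk.close();
  }
}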
38. © 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard &
Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store
HBase
Text Processing Engine &
Extractor Library)
JDBC
Applications & Development
Text Analytics MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text
Compression
Enhanced
Security
Flexible
Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive
MapReduce
Hive
Integrated
Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
R
Guardium
Platform
Computing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode
High Avail
Avro
Visualization & Discovery
BigSheets
39. © 2014 IBM Corporation39
BigSheets
Browser-based analytical tool that generates MapReduce jobs over big data stored in Hadoop.
Helps non-programmers work with the Hadoop cluster.
Users model their big data as familiar spreadsheet-like tabular data
structures (collections). Once data is represented in a collection,
business analysts can filter and enrich its content using built-in
functions and macros. Furthermore, analysts can combine data
residing in different collections as well as generate charts and new
“sheets” (collections) to visualize their data. They can even export
data into a variety of common formats with a click of a button.
Much of the technology included in Sheets was derived from the
BigSheets project of IBM’s Emerging Technologies team.
40. © 2014 IBM Corporation40
BigSheets: Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console – e.g., file
system data, output from Web crawl, etc.
41. © 2014 IBM Corporation41
Big Sheets: Collection Operations
Work with built-in “sheets” editor
Add / delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
42. © 2014 IBM Corporation42
BigSheets: Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
43. © 2014 IBM Corporation43
Example of Hadoop Ecosystem
[The same ecosystem diagram as slide 35, repeated here to call out Big SQL before the next slides describe it.]
44. © 2014 IBM Corporation44
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards compliant client drivers (JDBC & ODBC)
– Efficient handling of "point queries"
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC and ODBC compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage
mechanisms allow
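A hedged sketch of what "standards compliant client drivers" and "point queries" mean in practice, using plain JDBC; the driver class, URL, port, credentials, and table are illustrative placeholders rather than the actual Big SQL values (check the BigInsights documentation for those):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BigSqlPointQueryExample {
  public static void main(String[] args) throws Exception {
    // Placeholder for whatever driver class the installed Big SQL JDBC driver provides.
    Class.forName("com.example.bigsql.jdbc.Driver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:bigsql://bihost:7052/default", "biadmin", "password");
         // A "point query": look up a single customer by key, which Big SQL
         // can route to an HBase storage handler instead of a full scan.
         PreparedStatement ps = con.prepareStatement(
             "SELECT name, email FROM customers WHERE customer_id = ?")) {
      ps.setString(1, "cust-001");
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString("name") + " <" + rs.getString("email") + ">");
        }
      }
    }
  }
}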
45. © 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with
Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming
queries
– Separates portion(s) to execute at
the server vs. portion(s) to execute
on the cluster
– Re-writes query if necessary for
improved performance
– Determines appropriate storage
handler for data
– Produces execution plan
– Executes and coordinates query
Server layout and relative sizes
for illustrative purposes only!
[Diagram: an application issues SQL through a JDBC/ODBC driver over a network protocol to the Big SQL server, which runs on a head node of the BigInsights cluster alongside head nodes for the Name Node, Job Tracker, and Hive metastore; the SQL engine dispatches work through storage handlers (delimited files, SEQ files, HBase, RDBMS) to compute nodes, each running a Task Tracker, Data Node, and Region Server.]
46. © 2014 IBM Corporation46
Example of Hadoop Ecosystem
[The same ecosystem diagram as slide 35, repeated here to call out Text Analytics before the next slides describe it.]
47. © 2014 IBM Corporation47
What is Text Analytics?
A high-performance, scalable, rule-based information extraction engine.
Distill structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules.
– Developer tools, an easy to use text analytics language, and a set of
extractors for fast adoption.
– Multilingual support, including support for DBCS languages.
Developed at IBM Research since 2004: System T
BigInsights is the first time IBM has opened up the Text Analytics engine
technology for customization and development
48. © 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics.
SQL-like, fully declarative text analytics language.
Once compiled, it produces an AOG plan that operates on the data.
No “black boxes” or modules that can’t be customized.
Tooling for easy customization, because you are abstracted from the programmatic details.
Competing solutions rely on locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance.
create view AmountWithUnit as
extract pattern <N.match> <U.match>
as match
from Number N, Unit U;
49. © 2014 IBM Corporation49
Text Analytic: Simple Example
Arjen Robben – Striker – Netherlands
Iker Casillas – Keeper – Spain
Andres Iniesta – Winger – Spain
World Cup 2010 Highlights
Football World Cup 2010, one team distinguished well
from the rest winning the final. Early in the second
half, Netherlands’ striker, Arjen Robben, had a chance
to score, but the awesome keeper for Spain, Iker
Casillas made the save. Winner superiority was
reflected when Winger Andres Iniesta scored for Spain
for the win.
50. © 2014 IBM Corporation50
50
Text Analytic: Real Example
51. © 2014 IBM Corporation51
51
One step beyond: Watson
52. © 2014 IBM Corporation52
Example of Hadoop Ecosystem
[The same ecosystem diagram as slide 35, repeated here to call out R before the next slide describes Big R.]
53. © 2014 IBM Corporation53
Big R
• Explore, visualize, transform,
and model big data using
familiar R syntax and
paradigm
• Scale out R with MR
programming
– Partitioning of large data
– Parallel cluster execution of R
code
• Distributed Machine
Learning
– A scalable statistics engine that
provides canned algorithms, and
an ability to author new ones, all
via R
[Diagram: R clients with IBM R packages can (1) pull data or summaries down to the R client, (2) push R functions out to run where the data lives (embedded R execution), or (3) call a scalable machine learning engine directly against the data sources.]
54. © 2014 IBM Corporation54
Where Does BigData Fit?
[Diagram: source systems feed the analytical database (DW) via traditional extract, transform, load (“capture only what’s needed”), and analytical tools report and mine that data; on the big data side (“capture in case it’s needed”), raw data is landed first and then explored, parsed, and aggregated.]
55. © 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists
56. © 2014 IBM Corporation56
Data scientist – The new cool guy in town
Article in Fortune “The unemployment rate in
the U.S. continues to be abysmal (9.1% in
July), but the tech world has spawned a
new kind of highly skilled, nerdy-cool job
that companies are scrambling to fill: data
scientist”
McKinsey Global Institute “Big data Report”
By 2018, the United States alone could
face a shortage of 140,000 to 190,000
people with deep analytical skills as well as
1.5 million managers and analysts with the
know-how to use the analysis of big data to
make effective decisions
57. © 2014 IBM Corporation57
Data Science is Multidisciplinary
58. © 2014 IBM Corporation58
Successful Data Scientist Characteristics
59. © 2014 IBM Corporation59
Data Scientist Qualities
60. © 2014 IBM Corporation60
How Long Does It Take For a Beginner to Become
a Good Data Scientist?
63. © 2014 IBM Corporation63
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere
BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights/
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing/
– DeveloperWorks, forums, demos ...
• http://www.ibm.com/developerworks/wiki/biginsights/
64. © 2014 IBM Corporation64
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery
allows learning @your place
and @your pace
Free courses, free study
materials.
Cloud-based sandbox for
exercises – zero setup
Robust Course
Management System and
Content Distribution
infrastructure
Editor's Notes
Mature transactional systems, with years of history; the data keeps growing, from gigabytes to terabytes.
Obstacles: latency, user concurrency, single-threaded execution.
How to do this without having companies hire specialists who know how to query Hadoop using Java, or without overcoming latency?
Latency: via HCatalog.
Query: better interfaces.
Won't fix things like user concurrency.
This is the aspiration, but there are lots of obstacles in the way:
- Latency, because processing is batch; user concurrency, because there is no workload management, prioritization, or query optimizer.
Need to know coding.
Skills, or more accurately a shortage of skills, is widely recognized as the leading inhibitor to the broader acceptance of Big Data solutions in the enterprise. To address the skills shortage IBM has sponsored community-driven effort to deliver Big Data education regardless of physical location or budget. We call this @your place, @your pace education and it has turned out to be a huge success with well over 8000 registered students and over 1800 students enrolled in the Hadoop Fundamentals class alone. We have seen people gain sufficient skills in a matter of a week to enter and complete the Hadoop Programming Challenge by submitting very innovative solutions. IBM’s sponsorship provides BigDataUniversity.com participants with a comprehensive set of education materials, access to free products for hands on labs and a cloud-based course management system to make the learning process easy and fun.