2. A New Seagate
[Portfolio graphic: hybrid solutions spanning HDD, flash, silicon, branded products, and systems]
Seagate is in a unique position to create even more value for our customers by integrating our 35+ years of storage expertise in HDD with flash, systems, services, and consumer devices to deliver unique solutions that enable our customers to enjoy and get value from their data more than ever before.
3. • $14B annual revenue
• 2 billion drives shipped
• 52,000 employees, 26 countries
• 9 manufacturing plants: US, China, Malaysia, N. Ireland, Singapore, Thailand
• 5 design centers: US, Singapore, South Korea
• Vertically integrated factories, from silicon fabrication to drive assembly
4. Where to start with Hadoop – find a use case
• Experimented with text analysis of Call Center logs
• Proved out the use case, but Big Data text analytics built into Call Center support
applications met the need without in-house costs
• The Marketing organization had some social media Big Data use cases
• These are being met by companies specializing in this kind of Big Data analysis
• Reviewed other potential use cases such as:
• Mining data center support, performance and maintenance logs
• Mining large data sets for eSecurity
• Tested loading a volume of factory test log data and running some analytics
• Compelling use case for Hadoop: deeper and wider analysis of factory and field data (a query sketch follows below)
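As a rough illustration of the kind of analytics run against factory test logs, here is a minimal HiveQL sketch; the table and column names (factory_test_logs, drive_sn, test_station, error_code) are hypothetical, not Seagate's actual schema.

    -- Hypothetical external table over raw factory test logs (all names assumed)
    CREATE EXTERNAL TABLE factory_test_logs (
      drive_sn      STRING,
      test_station  STRING,
      test_date     STRING,
      error_code    INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/factory_test_logs';

    -- Simple analytic: failure counts per test station
    SELECT test_station, COUNT(*) AS failures
    FROM factory_test_logs
    WHERE error_code <> 0
    GROUP BY test_station
    ORDER BY failures DESC;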
5. Traditional Data Architecture Pressured
• 4.4 ZB of data worldwide in 2013
• 85% from new data types
• 15x growth in machine data by 2020
• 44 ZB by 2020
(ZB = zettabyte = 1 billion TB = 10^21 bytes)
6. Seagate’s High-level Plans for Hadoop
• Enterprise Hadoop cluster as extension of EDW
• Ability to store and analyze 10x-20x more Factory and Field data
• Much longer retention of relevant manufacturing data
• Multi-purpose analytic environment supporting hundreds of potential users across
Engineering, Manufacturing and Quality
• Possible local factory Hadoop clusters for special-purpose processing
• Eventual integration across multiple clusters and sites
• At a high level, Hadoop will enable us to:
• Ask questions we could never ask before...
• About data volumes we could never collect and store before...
• Do analysis we could never perform in a reasonable time before...
• And connect data that could never before be retained for combined analysis
7. Hadoop – a dynamic, ever-changing landscape
• When we first started our Hadoop journey, MapReduce was the main way to
access and query HDFS data
• Two years on, the Hadoop world has changed with SQL being a major force in
Hadoop (Hive, Impala, BigSQL)
• SQL on Hadoop helps address three main Hadoop challenges (see the sketch after this list):
• Addresses the skills gap: Hadoop MapReduce needs Java coders, whereas SQL on Hadoop uses existing SQL skills
• SQL provides integration with existing environments and tools (e.g. databases and BI tools)
• Enables Hadoop to move from batch processing to interactive analysis
• New memory-based Apache projects, such as Apache Spark, are being developed that allow for even faster interactive analysis, but SQL is still core to these too
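To make the skills-gap point concrete, a query like the following in Hive, Impala, or Big SQL replaces what would otherwise be a hand-written Java MapReduce job; the table and column names are illustrative only.

    -- Illustrative only: daily event counts expressed in SQL
    -- instead of a custom Java MapReduce job
    SELECT event_date, COUNT(*) AS events
    FROM web_logs          -- hypothetical table
    GROUP BY event_date
    ORDER BY event_date;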
8. Big data for the enterprise
• We put together a five-year Big Data vision statement and strategy plan
• Socialized the strategy plan for feedback
• Decided to conduct a large-scale Hadoop pilot
• We wanted to understand Hadoop’s real capabilities and potential
• Purchased a 60-node cluster (3 management nodes, 57 data nodes)
• Performed an analysis on which Hadoop distribution to use
• Defined what use cases to run in our large scale pilot
9. Choosing a Hadoop software distribution
• Two main flavors: open source oriented or more proprietary distributions
• We believe the more open source oriented solutions are the most beneficial
because:
• More portable - you can more easily move your Hadoop cluster from vendor to vendor
• Avoids vendor lock-in to expensive and proprietary technology
• Open source projects ensure interoperability with other open source projects
• Other important considerations:
• Integration with RDBMSs, BI solutions and other platforms
• R&D investment and support capability
• Consulting and training
• Seagate chose IBM because we believe they currently have the most advanced
SQL “add-on” capability plus some other good tools with solid support services
10. Evolving to a Logical Data Warehouse
• A Logical Data Warehouse combines traditional data warehouses with big data
systems to evolve your analytics capabilities beyond where you are today
• Hadoop does not replace your EDW. EDW is a good “general purpose” data
management solution for integrating and conforming enterprise data to produce
your everyday business analytics
• A typical EDW may have hundreds of data feeds and dozens of integrated applications, and run thousands to hundreds of thousands of queries a day
• Hadoop is more specialized and much less mature. For now it will have only a
few application integration points and run fewer queries at a lower concurrency
for answering different questions
• A Hadoop cluster of 100 nodes is a supercomputer. What would you use a
supercomputer for? Probably to answer the really big questions
11. Some early practices and learnings
• Incremental phased delivery, or use case by use case
• Form a “data lake” or “data reservoir” for all enterprise data
• Data availability must come first; model and transform the data in place within Hadoop
• Resist moving the data again
• There is lots of talk about schema-on-read, but for DW types of uses this is impractical
• Data modeling is still required but can be simplified (see the sketch after this list)
• Have multiple clusters: Development, Test, and two or more Production clusters (one for ad hoc data exploration and experimentation, one for more governed uses)
• Use the existing custom query/analytics solution to provide “transparent” access to Hadoop
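A minimal sketch of modeling and transforming in place, assuming Hive-style tables and hypothetical names: define an external table directly over the landed files (the schema is applied on read), then materialize a cleaned, modeled copy that stays inside Hadoop instead of moving the data again.

    -- Schema applied on read over files already landed in the data lake (names assumed)
    CREATE EXTERNAL TABLE raw_events (
      event_ts   STRING,
      device_id  STRING,
      payload    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/lake/raw/events';

    -- Transform in place: the modeled copy stays inside Hadoop
    CREATE TABLE modeled_events STORED AS PARQUET AS
    SELECT CAST(event_ts AS TIMESTAMP) AS event_ts, device_id, payload
    FROM raw_events
    WHERE device_id IS NOT NULL;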
12. Some early practices and learnings (continued)
• Use partitioned/tiered data sets: raw, modeled/standardized, analytics, history/archive (a sketch follows this list)
• Tier 1: low-latency raw data for power users to access using low-level tooling (MapReduce, Python)
• Tier 2: de-duped, modeled and transformed data used by the majority of Hadoop
users
• Tier 3: specialized analytic data sets for specific needs (e.g. data pivots, aggregations)
• Tier 4: extended history/archives (maybe)
• Copy summarized data, derived analytics to EDW for broader use/analysis with
BI tools
• Do lots of performance testing, run benchmarks, continually optimize
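As a sketch of the tiering idea (hypothetical names, assuming Hive-style tables): Tier 2 is built from Tier 1 raw data by de-duplicating into a partitioned, columnar table that the majority of users query.

    -- Tier 2: de-duped, modeled data partitioned for the majority of users (names assumed)
    CREATE TABLE tier2_test_results (
      drive_sn     STRING,
      test_station STRING,
      error_code   INT
    )
    PARTITIONED BY (test_date STRING)
    STORED AS PARQUET;

    -- Populate from the hypothetical Tier 1 raw table, dropping duplicate rows
    -- (dynamic partitioning must be enabled, e.g. hive.exec.dynamic.partition.mode=nonstrict)
    INSERT OVERWRITE TABLE tier2_test_results PARTITION (test_date)
    SELECT DISTINCT drive_sn, test_station, error_code, test_date
    FROM tier1_raw_test_logs;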
14. Hadoop challenges – an emerging and evolving platform
• Fragmented Hadoop development community, lack of standards on an open
platform
• Vendor solutions interoperability
• Knowing which Hadoop projects to “bet on”, and which data formats and compression codecs to use
• Speed of change: probably more code has been written for Hadoop than for any other IT platform
• Need to upgrade cluster software frequently (once a quarter)
• Immaturity: some things are not ready, e.g. row/column security, real-time queries
• Resource management for different types of applications and workloads
15. Hadoop challenges – an emerging and evolving platform (continued)
• Lack of BI tools that can really take advantage of huge data sets and visualize
them
• Hive tables cannot be updated directly and are not ACID compliant*
• Still very batch-processing oriented, but interactive is gaining traction with Spark etc.
• Provisioning large numbers of machines, dealing with hardware failures
• Integrating remote clusters, cross cluster data movement and inter-cluster
processing
*ACID - a set of properties that guarantee that database transactions are processed reliably
16. Hadoop challenges – setting expectations
• Completely new and an awful lot to learn; design and implementation are huge tasks
• Hadoop is immature and lacks robustness: instability, bugs, new code released too early
• Speed of change: management needs to understand that plans will be dynamic and change with the evolving technology
• Have less formal schedules, manage expectations to the low side
• It’s hard to move fast enough – management, power users and analysts are impatient
17. Hadoop challenges – setting expectations (continued)
• Be flexible and adaptable as technology changes and matures
• Experiment, fail fast, learn and move on
• Be ready to change and adapt when new technology arrives or when support dries up on a Hadoop project
• Developing IT skills quickly
• Finding experienced and talented Hadoop consultants; do lots of knowledge sharing
• Keeping up with the data scientists
• Convincing the security and data center teams to give Hadoop users UNIX-level access
18. About the IBM Open Platform for Apache Hadoop
• IBM Open Platform: a foundation of 100% pure open source Apache Hadoop components
• Standardizing as the Open Data Platform (http://opendataplatform.org)
• All standard Apache open source components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
19. Big SQL At a Glance
• Application portability & integration: data shared with the Hadoop ecosystem; comprehensive file format support; superior enablement of IBM software; enhanced by third-party software
• Performance: modern MPP runtime; powerful SQL query rewriter; cost-based optimizer; optimized for concurrent user throughput
• Federation: distributed requests to multiple data sources within a single SQL statement (a sketch follows below); main data sources supported: DB2, Teradata, Oracle, Netezza, MS SQL Server, Informix
• Enterprise features: advanced security/auditing; resource and workload management; self-tuning memory management; comprehensive monitoring
• Rich SQL: comprehensive SQL support; IBM SQL PL compatibility; extensive analytic functions
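A hedged sketch of what federation enables: once an administrator has registered a remote source (Big SQL, like DB2, uses server and nickname definitions for this), a single statement can join Hadoop data with a remote table. The names below are illustrative, not a verified Big SQL script.

    -- Illustrative federated join: hadoop_sales is a local Big SQL (Hive) table;
    -- ora_customers is a nickname assumed to have been defined for a remote
    -- Oracle table via Big SQL's federation support (all names hypothetical)
    SELECT c.region, SUM(s.amount) AS total_sales
    FROM hadoop_sales s
    JOIN ora_customers c ON s.customer_id = c.customer_id
    GROUP BY c.region;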
20. Big SQL Roadmap 2015
► Support for ODP-compliant distros, including Hortonworks!
► Support for pLinux and zLinux
► Major performance enhancements
► Spark Integration
► Further enhancements to the UI
► Better SQL compatibility with other SQL engines
► HBase update/delete support
► Hive ACID update/delete
► More analytic enhancements (e.g. user-defined aggregates)