Hadoop: Extending Your Data Warehouse
1. Hadoop: Extending Your Data Warehouse
Tony Baer | Principal Analyst, Ovum
Moderated by Matt Brandwein | Product Marketing Manager, Cloudera
May 9, 2013
2. Welcome to the webinar!
• All lines are muted
• Q&A after the presentation
• Ask questions at any time by typing them in the
“Questions” pane on your WebEx panel
• Recording of this webinar will be available
on-demand at cloudera.com
• Join the conversation on Twitter:
@cloudera @TonyBaer #EDWHadoop
3. Who is Cloudera?
What the Enterprise Requires
• The only 100% open source Hadoop-based platform with both batch and real-time processing engines, enterprise-ready with native high availability
• Suite of system and data management software
• Comprehensive support and consulting services
• Broadest Hadoop training and certification programs
Extensive Partner Ecosystem
• Over 600 partners across hardware, software, and services
The Leader in Big Data Management
• Delivers a revolutionary data management platform powered by Apache Hadoop
• World’s leading commercial vendor of Apache Hadoop
• Enables organizations to improve operational efficiency and Ask Bigger Questions of all their data
Customers & Users Across Industries
• More production deployments than all other vendors combined
24. Impala: Cloudera’s Design Strategy
[Architecture diagram: batch processing (MapReduce, Hive & Pig), interactive SQL (Impala), and machine learning/analytics engines sharing common metadata, resource management, and storage integration over HDFS and HBase, with open file formats (text, RCFile, Parquet, Avro, etc.) and records]
An Integrated Part of the Hadoop Platform
• Complements MapReduce with an interactive MPP SQL engine
• One pool of data
• One metadata model
• One security framework
• One set of system resources
• 100% open source
25. Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions
• Data processing with tight SLAs
• Query-able archive with full fidelity
27. Questions?
• Type in the “Questions” panel
• Tweet @cloudera #EDWHadoop
• Recording will be available
on-demand at cloudera.com
• Contact us:
tony.baer@ovum.com
Twitter: @TonyBaer
mbrandwein@cloudera.com
Twitter: @MattBrandwein
Thank you for attending!
Try Cloudera today
cloudera.com/downloads
Learn more about Impala
cloudera.com/impala
Get Hadoop Training
university.cloudera.com
Ready to go?
Check out Cloudera Quickstart
cloudera.com/quickstart
Editor's Notes
The rationale for multi-tiered DW architecture
EDWs are straining to keep up with the new demands placed on them. Data volumes are snowballing, and increasingly familiar analytic problems are consuming new forms of data.
• Customer retention: for mass markets, social networks are generating new insights into what customers really think and who influences them. For telcos, preventing customer churn goes well beyond dropped calls; internet, IM, text, and yes, email are becoming the bulk of mobile carrier traffic, multiplying the volumes and types of log files that must be dissected to understand the customer experience.
• Operational efficiency means tapping into the Internet of Things (M2M) in addition to traditional OLTP systems: for logistics providers to deliver goods on time, airlines to efficiently sequence airport operations, utilities to manage generation and transmission, and manufacturers to fine-tune operations on the shop floor.
• Risk mitigation must expand the range of transactions examined and track externalities so organizations understand their exposure to loss.
• Data retention requirements, especially in regulated industries, are forcing organizations to keep more data longer, and to make hard decisions about what data to keep live.
This is breaking the established EDW model, which was optimized for transforming megabytes or gigabytes of well-understood structured data. It is also breaking the ETL model, as data transformations burst their batch windows. Internet players such as Facebook discovered this years ago when nightly batch loads into their MySQL DWs began exceeding 24 hours; the same issue is now crossing over into mainstream enterprises.
Surging data volumes drove the need to flatten BI architecture, shifting data transformation loads onto the target system in a pattern known as Extract/Load/Transform (ELT). The DW was still separate from source systems, but in place of a staging server where data was drawn in, transformed, and then moved to a target, data transformation was performed inside the EDW itself. The emergence of the ELT pattern reflected the reality of shrinking batch windows and the need to minimize data movement.
The obvious advantage is that data movement is reduced: only a single hop from source to target is needed, and the transformation workload is co-located with where the data is stored and analyzed.
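A minimal sketch of the ELT pattern, in HiveQL-style SQL (all table and column names here are illustrative, not from the webinar; dialect details vary by engine): raw records are loaded as-is into a staging table inside the target system, and the transformation then runs as SQL in the warehouse rather than on a separate staging server.
  -- Extract/Load: raw delimited records land untouched in a staging table.
  CREATE TABLE staging_orders (order_id BIGINT, raw_line STRING);
  CREATE TABLE fact_orders (order_id BIGINT, amount DOUBLE, order_day STRING);
  -- Transform: cleansing and conforming run inside the target system,
  -- co-located with the data, instead of on a separate ETL server.
  INSERT INTO TABLE fact_orders
  SELECT order_id,
         CAST(split(raw_line, ',')[1] AS DOUBLE),
         split(raw_line, ',')[2]
  FROM staging_orders;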
• Inexpensive data transformation platform: compute cycles are cheap because the platform is low-cost, and Hadoop is well suited to batch transformation of data bound for downstream SQL DWs and data marts. It can replace your ETL staging server.
• Accommodates all kinds of data: no need to lay out tables ahead of time, because you are loading to a file system, and HDFS is efficient for sequential loading and processing of all kinds of data.
• Keep your options open: don’t force-fit data and analytics. A late-binding approach to structuring data means the schema can evolve over time as new sources of data become available. Hadoop offers multiple options for representing structured data: you can add a Hive metadata layer and/or persist it in HBase tables (see the sketch below).
• Extensibility: Hadoop is becoming a platform with multiple personalities. It originated with MapReduce processing, but many alternatives are emerging for different styles of processing (stream processing, graph, HPC patterns, etc.) and for applying data mining algorithms to uncover new insights. New frameworks are emerging rapidly.
• Best of both worlds: SQL convergence via Hive and emerging frameworks such as Impala brings SQL querying for exploring and understanding your data through familiar BI tools, while extensibility remains. Low-cost storage allows raw and transformed data to be kept side by side, and the ability to accommodate variably structured data gives organizations visibility into data and data sources traditionally beyond the reach of SQL DWs.
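A minimal HiveQL-style sketch of that late-binding approach (paths and names are hypothetical): raw files sit in HDFS untouched, an external table imposes structure only at read time, and a parsed copy can be derived, and later re-derived, as understanding of the schema evolves.
  -- No upfront table layout: the raw log files already sit in HDFS.
  CREATE EXTERNAL TABLE raw_events (line STRING)
  LOCATION '/data/raw/events';
  -- Bind structure late: derive a parsed table once the fields are
  -- understood; keep the raw data alongside so it can be re-derived.
  CREATE TABLE events_parsed AS
  SELECT regexp_extract(line, '^(\\S+)', 1)         AS host,
         regexp_extract(line, '\\[([^\\]]+)\\]', 1) AS event_time,
         line                                       AS raw_line
  FROM raw_events;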
Hadoop can replace your ETL server
Ovum believes that the hot spot for Hadoop development in 2013 is convergence with SQL. Cloudera has been an active player in making Hadoop SQL-friendly: it has long partnered with leading ETL, BI, and data warehousing platform and tool providers to offer connectivity between Hadoop and SQL platforms. In turn, many of these technology providers are taking connectivity to new levels, extending their offerings beyond merely interfacing with Hadoop to operating natively within it. Cloudera’s introduction of the Impala open source framework takes Hadoop-SQL convergence to the next level. Impala, an Apache-licensed open source project developed by Cloudera, brings interactive SQL query directly to Hadoop. It offers a high-performance, massively parallel processing framework that works against any Hadoop file format. While Impala utilizes the Hive metadata store, it provides a higher-performance alternative to batch-oriented MapReduce and Hive processing. Impala helps business analysts iterate on models for data that may eventually be migrated to a data warehouse, making Hadoop a low-cost platform for iteratively discovering and structuring data.
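An illustrative query of the kind an analyst might issue interactively through Impala (via impala-shell or a BI tool over ODBC/JDBC), reusing the hypothetical events_parsed table from the sketch above. Impala reads the same HDFS data and the same Hive metastore definitions, but executes on its own MPP engine rather than as a MapReduce job:
  -- Summarize activity per host directly against data in HDFS;
  -- results return interactively instead of as a batch job.
  SELECT host,
         COUNT(*)        AS events,
         MIN(event_time) AS first_seen,
         MAX(event_time) AS last_seen
  FROM events_parsed
  GROUP BY host
  ORDER BY events DESC
  LIMIT 20;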
Cloudera Navigator, a new feature of Cloudera Manager, tracks how data is utilized; specifically, it compiles an audit trail detailing what operations were performed against specific pieces of data, by whom, and when. In its initial release, Navigator will track activity against HDFS, Hive, and HBase.
Our design strategy is to tightly integrate Impala within the Hadoop system. Impala (and interactive SQL) is just another application that you bring to your data. It is integrated with Hadoop’s existing security and resource management frameworks and is completely interoperable with existing data formats and processing engines.
• One pool of data: storage platforms (HDFS and HBase) and open data formats (files and records), shared across multiple processing frameworks.
• One metadata model: no synchronization of metadata between two different systems (an analytical DBMS and Hadoop); the same metadata is used by other components within Hadoop itself (Hive, Pig, Impala, etc.), as sketched below.
• One security framework: a single model for all of Hadoop that doesn’t require “turning off” any portion of native Hadoop security.
• One set of system resources: one set of nodes (storage, CPU, memory), one management console, and integrated resource management that scales linearly as capacity or performance needs grow.
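A small sketch of the “one metadata model” point (the table name is hypothetical): a table defined once in the Hive metastore is visible to Hive, Pig (via HCatalog), and Impala alike, with nothing copied or synchronized between systems.
  -- Defined once, e.g., from Hive:
  CREATE TABLE clickstream (user_id BIGINT, url STRING, ts TIMESTAMP);
  -- Queried from Impala after asking it to reload metastore state:
  INVALIDATE METADATA;
  SELECT COUNT(DISTINCT user_id) FROM clickstream;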
• Interactive BI/analytics on more data: raw, full-fidelity data, with nothing lost through aggregation or ETL/ELT; new sources and types, structured and unstructured; and historical data.
• Asking new questions: exploration and data discovery for analytics and machine learning (finding a data set for a model requires lots of simple queries to summarize, count, and validate, as sketched below), and hypothesis testing without having to subset and fit the data to a warehouse just to ask a single question.
• Data processing with tight SLAs: a cost-effective platform that minimizes data movement and reduces strain on the data warehouse.
• Query-able storage: replace the production data warehouse for DR/active archive, and store decades of data cost-effectively (for better modeling or to meet data retention mandates) without sacrificing the ability to analyze it.
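In practice, the “asking new questions” use case is mostly many cheap summarize/count/validate passes over raw, full-fidelity data, a style sketched below against the hypothetical clickstream table from the previous example:
  -- How far back does the data actually go, and how much is there?
  SELECT MIN(ts) AS earliest, MAX(ts) AS latest, COUNT(*) AS total_rows
  FROM clickstream;
  -- Validate a candidate feature before committing it to a model
  -- (no need to subset and load into the warehouse first):
  SELECT url, COUNT(*) AS hits
  FROM clickstream
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;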