CA1 Literature Review: Data Lakes
Are Data Lakes the new Data Warehouse?
	
  
Tom Donoghue v1.0 Page 1
	
  
Are Data Lakes the new Data Warehouse?
Can data lakes provide an organisation with a radical approach to harnessing data,
discovering information and acquiring knowledge, based on Golfarelli’s Business
Intelligence (BI) definition of data, information and knowledge?
Introduction
This paper describes the concept of a data lake and how it compares to a data warehouse.
We review recent research and discuss how both repositories are defined and what types of data
each caters for. Does ingesting data make it available for forging information and, beyond
that, knowledge? What types of people, processes and tools need to be involved to realise the
benefits of using a data lake?
Data Lakes and Data Warehouses
Sharma (2016) points out that organisations are facing a barrage of data, generated internally
and externally (especially via internet-based platforms). Data generation continues to
accelerate, and the breadth of unstructured and semi-structured data is keeping pace with this
acceleration. Current systems and methodologies need to change and adapt to the demands
of big data processing. Two areas impacted are the data lake and the data warehouse, which are
described below and compared in Figure 1.
Halter and Kromer (2016) suggest that a data lake provides an alternative way to store high volumes
of data in its native format (be that unstructured, semi-structured or structured) at relatively
low storage costs. The data schemas are unknown when data is loaded, but are revealed as
data in the lake is accessed.
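The idea that a schema is revealed only on access can be illustrated with a minimal Python sketch. It is not taken from the cited papers: the records, field names and the helper functions `discover_schema` and `read_with_schema` are all hypothetical, and a real data lake would hold files in a distributed store rather than strings in a list.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (native format).
# Nothing about their structure is declared at load time.
raw_lake = [
    '{"user": "alice", "action": "login", "ts": "2016-03-01T09:00:00"}',
    '{"sensor_id": 17, "temp_c": 21.5}',
    '{"user": "bob", "action": "purchase", "amount": 42.0}',
]


def discover_schema(raw_records):
    """Read every record and report which fields appear, and with which types."""
    schema = {}
    for line in raw_records:
        record = json.loads(line)  # structure is only interpreted here, on access
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def read_with_schema(raw_records, wanted_fields):
    """Project records onto the fields a particular query cares about."""
    for line in raw_records:
        record = json.loads(line)
        if all(field in record for field in wanted_fields):
            yield {field: record[field] for field in wanted_fields}


schema = discover_schema(raw_lake)
user_actions = list(read_with_schema(raw_lake, ["user", "action"]))
print(sorted(schema))
print(user_actions)
```

Each query chooses its own view of the data at read time, so the sensor record is simply skipped when user actions are requested, yet it remains in the store for other questions.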
O'Leary (2014) describes a data warehouse as a bolt-on to existing operational systems,
consisting of structured data associated with a specific user base and a specific set of
predefined business queries. The data schema is predefined and structured to facilitate
regular queries. Populating the data warehouse requires multiple extract, transform and
load (ETL) processes, which are also designed in advance.
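By way of contrast, a warehouse-style load can be sketched as follows. This is an illustration under stated assumptions rather than a description of any system discussed in the literature: the `sales` table, the source rows and the `transform` and `load` helpers are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# The schema is fixed at design time (hypothetical sales table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL)")

# Hypothetical extract from an operational system.
source_rows = [
    {"region": " north ", "amount": "100.50"},
    {"region": "South", "amount": "20"},
    {"region": "north", "amount": "n/a"},  # cannot be conformed to the schema
]


def transform(row):
    """Conform one source row to the warehouse schema, or return None."""
    try:
        return (row["region"].strip().lower(), float(row["amount"]))
    except (KeyError, ValueError):
        return None


def load(connection, rows):
    """Load conforming rows and return how many were rejected."""
    rejected = 0
    for row in rows:
        conformed = transform(row)
        if conformed is None:
            rejected += 1
            continue
        connection.execute("INSERT INTO sales VALUES (?, ?)", conformed)
    connection.commit()
    return rejected


rejected = load(conn, source_rows)

# A predefined business query that the schema was designed to answer.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals, rejected)
```

The row that does not fit the predefined schema never reaches the table, which is the trade-off of deciding structure before load: regular queries are simple, but data outside the design is lost to them.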
| Aspect | Data Lake | Data Warehouse |
| Data sources | Many | Few |
| Data types | Unstructured, semi-structured, structured | Structured |
| Schema required on load | No, data loaded without knowledge of data schema | Yes, data schema known prior to load |
| Set-up and configuration | Low implementation cost with open source components. Specialist skills may be scarce | High cost of proprietary software licenses, design, development and maintenance |
| Near real time data | Yes, time between data load and explore is far shorter | Poor, data tends to have historic profile. Data only available once ETL jobs have completed |
| Ad hoc query | Yes, queries authored at run time | No, questions asked in advance, structure must support query. Queries authored at design time |
| Flexible support for cross organisational questions / analysis | Correct approach provides a variety of result sets for a wider and diverse audience | Poor, inflexible predefined structures only support specific demands of a known user base |

Figure 1: Key aspects of data lakes and data warehouses based on O'Leary (2014) and
Watson (2015).
Harnessing Data
Drawing on opinion and understanding gained from conference discussions focused on data
lakes, Watson (2015) considers that a data lake is sometimes used as a precursor data store.
Such a store is capable of ingesting copious amounts of unstructured, semi-structured and
structured data while retaining the original format of the data. This suggests that capturing
multiple data types is possible, and it ties in with the earlier definition of a data lake as
preserving data types and raw format. However, it is not clear that amassing data is the same
as harnessing data.
Fitzgerald (2015), in an interview with General Electric covering their experience of an
operational data lake, notes that at the point of ingestion the data schema is unknown. How
the data will be used in downstream processes, and whether it will add value, is
not yet apparent. Industry case studies conducted by Halter and Kromer (2016) further suggest that
the data lake is a viable staging candidate for data warehouse input, for example when
processing unstructured real-time data sourced from the internet, data streams and social
media.
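The staging role described above might look roughly like the following Python sketch. It is only an illustration: the event shapes, the `WAREHOUSE_COLUMNS` tuple and the `stage_for_warehouse` function are hypothetical and do not come from the cited case studies.

```python
# Hypothetical raw events as they might sit in a lake: mixed shapes, nothing enforced.
lake_events = [
    {"source": "social", "text": "Loving the new app!", "user": "u1"},
    {"source": "stream", "device": "d9", "reading": 3.2},
    {"source": "social", "text": "", "user": "u2"},
    {"source": "web", "html": "<p>promo</p>"},
]

# Hypothetical columns that a downstream warehouse table expects.
WAREHOUSE_COLUMNS = ("source", "user", "text")


def stage_for_warehouse(events, columns=WAREHOUSE_COLUMNS):
    """Split events into rows fit for the warehouse and events that stay lake-only."""
    staged, lake_only = [], []
    for event in events:
        # A column counts only if it is present and non-empty.
        if all(event.get(col) for col in columns):
            staged.append(tuple(event[col] for col in columns))
        else:
            lake_only.append(event)
    return staged, lake_only


staged, lake_only = stage_for_warehouse(lake_events)
print(staged)
print(len(lake_only))
```

Only the events that can be conformed move on to the warehouse. The remainder stay in the lake in raw form, where their value is still undetermined.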
Discovering Information
Through studies and exploration of the concept of big data, Sharma (2016) suggests that the
data lake can provide a rich source of data for rudimentary exploration by skilled data
scientists and analysts. Fitzgerald (2015) reports that at General Electric 80% of the
talented data scientists’ time was spent wrangling data into useful information rather than
building models for exploring the outcomes. This indicates the importance of correct
resource allocation in order to glean information from data whilst keeping costs within
acceptable business limits.
In an exploration of industry and academic approaches to BI, data warehousing and big data,
O'Leary (2014) discusses the use of Master Data Management to help mitigate common data
issues. For instance, data inconsistencies appear due to multiple data sources and data
redundancy occurs owing to multiple copies of the same data item. Identifying master data
and its fitness for purpose provides clarity for the organisation including the multiple
applications which rely on data to be consistent. Creation of meaningful metadata attached
to cleansed lake data assists information discovery. Sharma (2016) suggests that it is
plausible to turn a raw data lake into a “smart” data lake through the use of semantic graph
models. Adding context to data facilitates awareness and usability, which gives rise to
information.
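As a rough illustration of how added context can make lake data discoverable, the sketch below describes data sets with subject-predicate-object triples and finds them by topic. It is a toy stand-in for a semantic graph model, not Sharma's (2016) approach: the triples, the predicate names and the `match` and `datasets_about` helpers are all hypothetical.

```python
# Hypothetical subject-predicate-object triples describing data sets in a lake.
triples = {
    ("sales_2016.csv", "is_a", "dataset"),
    ("sales_2016.csv", "describes", "customer_orders"),
    ("sales_2016.csv", "owned_by", "finance"),
    ("clicks.json", "is_a", "dataset"),
    ("clicks.json", "describes", "web_sessions"),
    ("web_sessions", "related_to", "customer_orders"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in graph
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


def datasets_about(graph, topic):
    """Find data sets that describe a topic directly or one related topic away."""
    related = {s for (s, _, _) in match(graph, predicate="related_to", obj=topic)}
    topics = {topic} | related
    return sorted(
        {s for (s, _, o) in match(graph, predicate="describes") if o in topics}
    )


found = datasets_about(triples, "customer_orders")
print(found)
```

Without the `related_to` link the click data would not surface in a search about customer orders, which is the sense in which context attached to raw data gives rise to information.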
Acquiring Knowledge
Halter and Kromer (2016) suggest that a data lake may present an organisation with a competitive
advantage. This means being capable of conducting data analytics and forming insights to
assist business decision making by deriving meaning from disparate data sources.
From a business perspective, it is worthwhile working with business decision makers
to define processes for deciding what data should populate the lake in the first place
(Watson, 2015). This in turn provides the scope within which to start the search for information,
culminating in knowledge acquisition.
Folding big data in with traditional organisational data for modern data analytics requires the
use of new forms of technology designed specifically to bring about desired results. Given
the speed, volume and variety of data in this context, existing systems will need to adapt or
be replaced. According to Watson (2015), queries required to produce sought-after outcomes
may well be searching for data which does not exist in the data warehouse. Similar big data
pressure to adapt to change is also recognised in Sharma (2016).
Evaluation
A data lake may not be a panacea for resolving the data issues mentioned above, but it is a
technique that could complement the data warehouse. The two have different underlying
structural requirements and varied user bases, which require varied skillsets in order to
extract value from either service (Halter and Kromer, 2016). However, the temptation of lower entry
cost, emerging tool combinations that contextualise data and the expectation of a flexible
and usable way to deal with the surge of big data may attract organisations to build data
lakes. Figure 2 illustrates possible associations between people, process and tools as part
of this evaluation. Examining the suitability of emerging tools in an industry case study,
Armstrong and Barnes (2016) suggest that Hadoop is a common tool of choice for data
lakes due to its low cost of entry and ability to soak up a wide variety of unstructured and
semi-structured data. Tools such as Hadoop combined with NoSQL (Halter and Kromer, 2016) will
support early adopters of data lakes.
Figure 2: People, Process and Tools, based on information compiled from Golfarelli et al. (2004),
Watson (2015) and Fitzgerald (2015).
Further research is required into which emerging tools improve data lake access,
usability, interrogation and security. Skilled resource cost is a common thread; clear role
definition and people management should be examined further to avoid wasteful resource
deployment. People are required to maintain and administer Hadoop-based systems, probe
the data lake and identify valuable data for input to downstream experimentation, discovery and
proof of concept generation. Processes also require attention, including the risk and impact of
new legislation and, as suggested by Fitzgerald (2015), a deeper understanding
of governance, provenance and how data is managed when at rest or in transit across
boundaries. A large investment already made in existing data warehouse architecture and
ETL implementations may preclude the adoption of data lakes. Evidence comparing return
on investment for typical data lake and warehouse use cases is an appealing area for further
research. However, according to Armstrong and Barnes (2016), as tools in this space evolve,
the use of sandboxes and selective migration of ETL processes into the data lake can provide
meaningful feedback to support proof of concept efforts.
If the goal is a unified, consolidated master data store which fully supports integrated
disparate data and is capable of serving various levels of analytics (e.g. real time, predictive and
historical) across the entire organisation, then data lakes could be the first step on that
journey. Implementing one requires skilled resources who create consistent metadata and
data models to ensure meaningful outcomes (O'Leary, 2014). The project also requires a
business-driven strategy (Halter and Kromer, 2016) and buy-in from senior management to align
priorities and to connect the technology road map to defined business objectives (Armstrong and
Barnes, 2016).
Bibliography
Armstrong, R. and Barnes, S. (2016), ‘When It's Time to Hadoop’, Business Intelligence
Journal, Volume 21, Issue 1, pp. 32-38.
Fitzgerald, M. (2015), ‘Gone Fishing - for Data’, MIT Sloan Management Review, Volume 56,
Issue 3, pp. 1-5.
Golfarelli, M., Rizzi, S. and Cella, I. (2004), ‘Beyond data warehousing: what's next in
business intelligence?’, in Proceedings of the 7th ACM international workshop on Data
warehousing and OLAP, DOLAP ’04, Washington, DC, USA, 12-13 November 2004, pp.
1-6.
Halter, O. and Kromer, M. (2016), ‘Dipping a Toe into Data Lake’, Business Intelligence
Journal, Volume 21, Issue 2, pp. 40-46.
O'Leary, D. E. (2014), ‘Embedding AI and Crowdsourcing in the Big Data Lake’, IEEE
Intelligent Systems, Volume 29, Issue 5, pp. 70-73.
Sharma, S. (2016), ‘Expanded cloud plumes hiding Big Data ecosystem’, Future Generation
Computer Systems, Volume 59, pp. 63-92.
Watson, H. J. (2015), ‘Data Lakes, Data Labs, and Sandboxes’, Business Intelligence
Journal, Volume 20, Issue 1, pp. 4-7.

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 

Data Lakes versus Data Warehouses

Are Data Lakes the new Data Warehouse?
Tom Donoghue v1.0

Can data lakes provide an organisation with a radical approach to harnessing data, discovering information and acquiring knowledge, based on Golfarelli’s Business Intelligence (BI) definition of data, information and knowledge?

Introduction
This paper describes the concept of a data lake and how it compares to a data warehouse. We review recent research and discuss the definition of both repositories and the questions that follow from it. What types of data are catered for? Does ingesting data make it available for forging information and, beyond that, knowledge? What types of people, processes and tools need to be involved to realise the benefits of using a data lake?

Data Lakes and Data Warehouses
Sharma (2016) points out that organisations are facing a barrage of data, generated internally and externally (especially via internet-based platforms). Data generation continues to accelerate, and the breadth of unstructured and semi-structured data is growing in step with it. Current systems and methodologies need to change and adapt to the demands of big data processing. Two areas impacted are the data lake and the data warehouse, which are described below and in Figure 1.

Halter et al. (2016) suggest that a data lake provides an alternative way to store high volumes of data in its native format (be that unstructured, semi-structured or structured) at relatively low storage cost. The data schemas are unknown when data is loaded, but are revealed as data in the lake is accessed.

O'Leary (2014) describes a data warehouse as a bolt-on to existing operational systems, consisting of structured data associated with a specific user base and a specific set of predefined business queries. The data schema is predefined and structured to facilitate regular queries. Populating the data warehouse requires multiple extract, transform and load (ETL) processes, which are also designed in advance.
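The contrast between a schema that is fixed before loading and one that is only revealed on access can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any product discussed in the cited papers: the field names, the `WAREHOUSE_SCHEMA` dictionary and all three functions are hypothetical, and plain lists stand in for the two repositories.

```python
import json

# Hypothetical warehouse schema, fixed at design time (schema known prior to load).
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}


def load_into_warehouse(table, record):
    """Schema on load: reject any record that does not match the predefined schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} is not {expected_type.__name__}")
    table.append(record)


def load_into_lake(lake, raw_payload):
    """No schema on load: keep the payload in its native format, with no checks."""
    lake.append(raw_payload)


def read_from_lake(lake):
    """Interpret each payload only when it is accessed; the schema emerges here."""
    discovered_fields = set()
    records = []
    for raw_payload in lake:
        try:
            record = json.loads(raw_payload)
        except json.JSONDecodeError:
            record = {"text": raw_payload}  # unstructured data kept as free text
        if not isinstance(record, dict):
            record = {"value": record}  # bare JSON values get a single field
        discovered_fields.update(record)
        records.append(record)
    return records, discovered_fields


warehouse_table, lake = [], []
load_into_warehouse(warehouse_table, {"customer_id": 7, "amount": 19.99})
load_into_lake(lake, '{"customer_id": 7, "amount": 19.99}')
load_into_lake(lake, '{"user": "tom", "tags": ["bi", "lake"]}')
load_into_lake(lake, "free-form comment from a social media feed")

records, fields = read_from_lake(lake)
print(sorted(fields))
```

The lake accepts all three payloads without complaint, whereas the warehouse loader would raise an error for the second and third. The cost of that flexibility is deferred: the work of discovering and reconciling fields moves from load time to read time.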
Aspect | Data Lake | Data Warehouse
Data sources | Many | Few
Data types | Unstructured, semi-structured, structured | Structured
Schema required on load | No, data loaded without knowledge of data schema | Yes, data schema known prior to load
Set-up and configuration | Low implementation cost with open source components. Specialist skills may be scarce | High cost of proprietary software licenses, design, development and maintenance
Near real time data | Yes, time between data load and explore is far shorter | Poor, data tends to have historic profile. Data only available once ETL jobs have completed
Ad hoc query | Yes, queries authored at run time | No, questions asked in advance, structure must support query. Queries authored at design time
Flexible support for cross organisational questions / analysis | Correct approach provides a variety of result sets for a wider and diverse audience | Poor, inflexible predefined structures only support specific demands of a known user base

Figure 1: Key aspects of data lakes and data warehouses based on O'Leary (2014) and Watson (2015).

Harnessing Data
Drawing on opinion and understanding gained from conference discussions focused on data lakes, Watson (2015) considers that a data lake is sometimes used as a precursor data store. Such a store is capable of ingesting copious amounts of unstructured, semi-structured and structured data, whilst the format of the data is retained. This suggests that capturing multiple data types is possible, and ties in with the definition above on data type and raw format preservation. However, it is not clear that amassing data is actually harnessing data.

Fitzgerald (2015), in an interview with General Electric covering their experience of an operational data lake, notes that at the point of ingestion the data schema is unknown. How the data will be used in downstream processes, and whether it will add value, is not yet apparent.
Industry case studies conducted by Halter et al. (2016) further suggest that the data lake is a viable staging candidate for data warehouse input, for example when processing unstructured real-time data sourced from the internet, data streams and social media.
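As a rough illustration of the staging role described above, the following Python sketch conforms raw events held in a lake to a fixed set of warehouse columns. Everything in it is an assumption made for the example: the event payloads, the `WAREHOUSE_COLUMNS` tuple and the `stage_for_warehouse` function are hypothetical and are not drawn from the cited case studies.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw events as they might sit in a lake: native format, mixed shapes.
RAW_EVENTS = [
    '{"user": "alice", "text": "Great service", "ts": 1456790400}',
    '{"user": "bob", "text": "Late delivery", "ts": 1456794000, "lang": "en"}',
    "not json at all",
    '{"text": "anonymous post without a user"}',
]

# Hypothetical target columns, agreed at design time for the warehouse table.
WAREHOUSE_COLUMNS = ("user", "text", "posted_at")


def stage_for_warehouse(raw_events):
    """Split raw lake events into rows that fit the warehouse columns, and rejects."""
    rows, rejects = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)
            posted_at = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
            rows.append((event["user"], event["text"], posted_at.isoformat()))
        except (json.JSONDecodeError, KeyError, TypeError):
            rejects.append(raw)  # stays in the lake for later exploration
    return rows, rejects


rows, rejects = stage_for_warehouse(RAW_EVENTS)
for row in rows:
    print(dict(zip(WAREHOUSE_COLUMNS, row)))
print(f"{len(rejects)} events left in the lake only")
```

Only the events that can be conformed reach the warehouse; the rest remain in the lake in their native format, which is one way of reading the claim that the two stores complement rather than replace each other.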
Discovering Information
Through studies and exploration of the concept of big data, Sharma (2016) suggests that the data lake can provide a rich source of data for rudimentary exploration by skilled data scientists and analysts. Fitzgerald (2015) found that at General Electric 80% of talented data scientists’ time was spent on wrangling data into useful information rather than on building models to explore the outcomes. This indicates the importance of correct resource allocation in order to glean information from data whilst keeping costs within acceptable business limits.

In an exploration of industry and academic approaches to BI, data warehousing and big data, O'Leary (2014) discusses the use of Master Data Management to help mitigate common data issues. For instance, data inconsistencies appear when there are multiple data sources, and data redundancy occurs when there are multiple copies of the same data item. Identifying master data and its fitness for purpose provides clarity for the organisation, including the multiple applications which rely on data being consistent.

Creation of meaningful metadata attached to cleansed lake data assists information discovery. Sharma (2016) suggests that it is plausible to turn a raw data lake into a “smart” data lake through the use of semantic graph models. Adding context to data facilitates awareness and usability, which gives rise to information.

Acquiring Knowledge
Halter et al. (2016) contend that a data lake may present an organisation with a competitive advantage. This means being capable of conducting data analytics and forming insights to assist business decision making by deriving meaning from disparate data sources. From a business perspective, it is worthwhile discussing and forming processes with business decision makers to define what data to populate the lake with in the first place (Watson, 2015).
This in turn provides the scope on which to start the search for information, culminating in knowledge acquisition.

Folding big data in with traditional organisational data for modern data analytics requires the use of new forms of technology designed specifically to bring about desired results. Given the speed, amount and mix of data in this context, existing systems will need to adapt or be replaced. According to Watson (2015), queries required to produce sought-after outcomes may well be searching for data which does not exist in the data warehouse. Sharma (2016) recognises a similar pressure from big data to adapt to change.

Evaluation
A data lake may not be a panacea for resolving the data issues mentioned above, but it is a technique that could complement the data warehouse. The two have different underlying structural requirements and varied user bases, which call for varied skillsets in order to extract value from each (Halter et al., 2016). However, the temptation of lower entry cost, emerging tool combinations that contextualise data and the expectation of a flexible and usable way to deal with the surge of big data may attract organisations to build data lakes. Figure 2 illustrates possible associations between people, process and tools as part of this evaluation.

Examining the suitability of emerging tools in an industry case study, Armstrong and Barnes (2016) suggest that Hadoop is a common tool of choice for data lakes due to its low cost of entry and ability to soak up a wide variety of unstructured and semi-structured data. Tools such as Hadoop combined with NoSQL (Halter et al., 2016) will facilitate early adopters of data lakes.

Figure 2: People, Process and Tools, based on information compiled from Golfarelli (2004); Watson (2015) and Fitzgerald (2015).

Further research is required into which emerging tools increase data lake access, usability, interrogation and security. Skilled resource cost is a common thread; clear role definition and people management should be examined further to avoid wasteful resource deployment. People are required to maintain and administer Hadoop-based systems, probe the data lake and identify valuable data for input to downstream experimentation, discovery and proof of concept generation. Processes also require attention: the risk and impact of new legislation should be assessed, together with, as suggested by Fitzgerald (2015), a deeper understanding of governance, provenance and how data is managed when at rest or in transit across boundaries.

A large investment already made in existing data warehouse architecture and ETL implementations may preclude the adoption of data lakes. Evidence comparing return on investment for typical data lake and warehouse use cases is an appealing area for further research. However, according to Armstrong and Barnes (2016), as tools in this space evolve, use of sandboxes and selective migration of ETL processes into the data lake provide meaningful feedback to support proof of concept efforts.

If the goal is a unified, consolidated master data store which fully supports integrated disparate data and is capable of serving various levels of analytics (e.g. real time, predictive and historical) across the entire organisation, then data lakes could be the first step on that journey. Their implementation requires skilled resources that create consistent metadata and data modelling to ensure meaningful outcomes (O'Leary, 2014). The project requires a business-driven strategy (Halter et al., 2016) and buy-in by senior management to align priorities and to connect the technology road map to defined business objectives (Armstrong and Barnes, 2016).
Bibliography
Armstrong, R. and Barnes, S. (2016), ‘When It's Time to Hadoop’, Business Intelligence Journal, Volume 21, Issue 1, pp. 32-38.
Fitzgerald, M. (2015), ‘Gone Fishing - for Data’, MIT Sloan Management Review, Volume 56, Issue 3, pp. 1-5.
Golfarelli, M., Rizzi, S. and Cella, I. (2004), ‘Beyond data warehousing: what's next in business intelligence?’, in Proceedings of the 7th ACM international workshop on Data warehousing and OLAP, DOLAP ’04, Washington, DC, USA, 12-13 November 2004, pp. 1-6.
Halter, O. and Kromer, M. (2016), ‘Dipping a Toe into Data Lake’, Business Intelligence Journal, Volume 21, Issue 2, pp. 40-46.
O'Leary, D. E. (2014), ‘Embedding AI and Crowdsourcing in the Big Data Lake’, IEEE Intelligent Systems, Volume 29, Issue 5, pp. 70-73.
Sharma, S. (2016), ‘Expanded cloud plumes hiding Big Data ecosystem’, Future Generation Computer Systems, Volume 59, pp. 63-92.
Watson, H. J. (2015), ‘Data Lakes, Data Labs, and Sandboxes’, Business Intelligence Journal, Volume 20, Issue 1, pp. 4-7.