Every day we create roughly 2.5 quintillion bytes of data; 90% of the world's collected data has been generated in the last 2 years alone. In this slide deck, learn all about big data
in a simple, easy way.
2. What is Big Data?
• Huge Amount of Data (Terabytes or Petabytes)
• Big data is the term for a collection of data sets
so large and complex that it becomes difficult to
process using on-hand database management
tools or traditional data processing applications
• The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and
visualization
3. What is Big Data? (Cont’d)
• Every day we create roughly 2.5 quintillion bytes of data;
90% of the world's collected data has been generated in the
last 2 years alone
• Data sizes are now measured in terabytes, petabytes, exabytes &
zettabytes
• Data is of various types – user content (posts, tweets),
images, audio and video files, chat logs, machine-generated
data
• The speed of data generation has increased – we generate
2,500,000,000,000,000,000 (2.5 quintillion) bytes of data
EVERY DAY
• The data sources have multiplied many times over – social
media, banking, government, e-commerce etc.
4. Purpose of Big Data or Why Big Data?
• In 2015, we created more data than in all past years combined. But 90% of the data we create is never
sorted.
• Big Data gives us a better and different picture of the data collected, but collecting it is not enough; we
also need to find better ways of using it.
• And for all this you need the best data scientists you can possibly get your hands on.
• Companies love data because it provides very lucrative insights to businesses and their clients – data today
is the most important commodity. If companies can harness and analyse data, it provides an unmatched
competitive advantage
• New-age tools like Hadoop have made data handling and processing easier and cheaper
• Cloud computing, hardware and memory are getting cheaper, so data storage is not a problem
• New analytical software provides real-time analysis of data for business decisions
5. Big Data Market Trend
• Market research firm IDC forecasts a 50% increase in revenues from the sale of big data and
business analytics software, hardware, and services between 2015 and 2019.
• It says services will account for the biggest chunk of revenue, with the banking and
manufacturing industries poised to spend the most.
• By 2019, IDC said it expects revenue generated by the US market for big data and business
analytics solutions to exceed $98 billion.
• Edge analytics is the next big thing in Big Data technology for companies that want to
gain real-time insights and impact business through IoT use cases. It will allow companies to
derive the most actionable value from their data.
• The key trend in Big Data is the transition of analytics solutions into the cloud. The cloud
enables vast amounts of computing resources to be applied to data analysis, and lets that
computing scale based on need.
6. Big Data Learning Challenges
• Upgrades: Every 3 to 6 months a new version of a technology is introduced, so by the time you have actually
learned something, a new technology may have emerged, making your knowledge outdated.
• Data Extraction: The challenge is to extract the most important information out of the massive data.
• Lack of Talent: A successful Big Data project requires a sophisticated team of developers, data
scientists and analysts with sufficient knowledge of Big Data, but such skills are in very short supply.
• Data Quality: Dirty data costs the United States an estimated $600 billion every year. The common causes of
dirty data are user input errors, data duplication & incorrect data linking.
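The three causes of dirty data listed above can be caught with simple validation checks. A minimal sketch, using hypothetical records and field names, that flags a malformed user input and a duplicate:

```python
# Hypothetical records illustrating two common dirty-data causes:
# user input errors and data duplication.
records = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "bob@example,com"},    # input error: comma instead of dot
    {"id": 3, "email": "alice@example.com"},  # duplicate of record 1
]

def find_issues(rows):
    """Return (id, reason) pairs for rows that fail basic quality checks."""
    issues = []
    seen = set()
    for row in rows:
        email = row["email"]
        if "," in email or "@" not in email:
            issues.append((row["id"], "malformed email"))
        elif email in seen:
            issues.append((row["id"], "duplicate email"))
        else:
            seen.add(email)
    return issues

print(find_issues(records))  # [(2, 'malformed email'), (3, 'duplicate email')]
```

Real data-quality pipelines apply many more rules, but the shape is the same: validate each record, track what has been seen, and report everything that fails.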
7. Metrics of Data Size
Till date, we've mostly been familiar with data in GB. The world is changing now:
Data Metrics Hierarchy
1024 BYTES = 1 KB
1024 KILOBYTES = 1 MB
1024 MEGABYTES = 1 GB
1024 GIGABYTES = 1 TB
1024 TERABYTES = 1 PB
1024 PETABYTES = 1 EB
1024 EXABYTES = 1 ZB
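The hierarchy above can be walked programmatically. A small sketch (the function name is our own) that converts a raw byte count into the largest sensible unit by repeatedly dividing by 1024:

```python
def human_readable(num_bytes):
    """Convert a raw byte count into the largest whole binary unit."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
    value = float(num_bytes)
    for unit in units:
        # Stop once the value fits under 1024, or we run out of units.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

print(human_readable(1024))          # 1.0 KB
print(human_readable(2.5 * 10**18))  # ~2.2 EB -- the data created every day
```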
8. Why Has Data Generation Increased?
The advent of the internet is the primary reason for the data explosion that has taken place over the last 15 years.
Every minute:
• Facebook - more than 2.5 million pieces of content
• Twitter - 300,000 tweets
• Instagram - 250,000 new pictures
• YouTube - 75 hours of new video content
• Email - 400 million messages
• WhatsApp - 400,000 pictures
• Google - 5 million search requests
9. Other sources of Data generation
• E-commerce – More than 1.3 billion transactions per day
• Data-logging devices – Wearables (e.g. Fitbit), healthcare monitoring, GPS systems, data sensors
• Financial Data – Stock exchanges, Banking transactions
• Aviation Industry – A typical flight generates half a TB of data
• Governments – Citizen data, Tax records etc.
10. Companies Want More Data !!
• Companies love data because it provides very lucrative insights to businesses and their clients – data today is
the most important commodity. If companies can harness and analyze data, it provides an unmatched
competitive advantage
• New-age tools like Hadoop have made data handling and processing easier and cheaper
• Cloud computing, hardware and memory are getting cheaper, so data storage is not a problem
• New analytical software provides real-time analysis of data for business decisions
11. Characteristics of Big Data
• Big Data is characterized by 4 V’s: Volume, Velocity, Variety and Veracity
12. Characteristics of Big Data – Volume
Volume: Refers to the enormous volumes of data
13. Characteristics of Big Data – Velocity
Velocity: It refers to the pace at which data is being generated, processed and consumed
14. Characteristics of Big Data – Variety
Variety: Data can be gathered from infinite sources
15. Characteristics of Big Data – Veracity
Veracity: The quality and authenticity of the data being captured can vary greatly
16. Scope: Testing Aspects in Big Data
• Validation of Structured and Unstructured Data: Data needs to be classified as the structured
and unstructured parts.
(i) Structured Data: Data that can be stored in the form of tables (rows and columns) without any
processing, for example databases, call-detail records and Excel sheets.
(ii) Unstructured Data: Data that does not have a predefined data model or structure, for example data
in the form of weblogs, audio, tweets, and comments.
Adequate time needs to be spent on validating the data at an early stage, since this is the point
where we encounter an abundance of bad data from various sources.
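The structured/unstructured split above can be sketched as a simple classifier. This is an illustrative sketch only; the schema and function names are hypothetical, using the call-detail-record example from the text:

```python
# Hypothetical schema for a structured call-detail record.
EXPECTED_FIELDS = {"caller", "callee", "duration_sec"}

def classify(record):
    """Treat dicts matching the expected schema as structured; everything
    else (free text such as weblog lines, tweets, comments) as unstructured."""
    if isinstance(record, dict) and EXPECTED_FIELDS <= record.keys():
        return "structured"
    return "unstructured"

print(classify({"caller": "A", "callee": "B", "duration_sec": 42}))  # structured
print(classify("GET /index.html 200 [12/Mar/2016]"))                 # unstructured
```

In a real pipeline each branch then gets its own validation: schema and type checks for the structured part, parsing or text processing for the unstructured part.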
• Execution of Non-Functional Testing: Non-functional testing plays a vital role in ensuring the
scalability of the process. Functional testing focuses on the coding and requirement related issues
whereas non-functional testing classifies the performance bottlenecks and validates the non-
functional requirements.
• Handling Non-Relational Databases: Non-relational databases form the backbone of Big Data
storage. Since these are the main sources of data retrieval, they require a good portion of testing to maintain
the accuracy of the system. Commonly known as NoSQL databases, these DBs are designed in such a manner
that they can easily handle Big Data, and they differ from traditional RDBMSs, which are designed on the
table/key model.
17. • Ace Test Environment: An efficient test environment ensures that data from multiple sources is
of acceptable quality for accurate analysis. Replicating the complete set of big data
into the test environment is next to impossible, so a small subset of the data is created for
the test environment to verify the behavior. Careful planning is required to exercise all paths
with subsets of data in a manner that fully verifies the application.
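Building that small subset is typically a sampling problem. A minimal sketch (function name and fraction are our own choices) that draws a reproducible random sample so the test environment sees the same data on every run:

```python
import random

def sample_for_test_env(dataset, fraction=0.01, seed=42):
    """Draw a small, reproducible subset of a large dataset for the test
    environment, since copying the full data set is impractical."""
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

full_data = list(range(1_000_000))  # stand-in for production data
subset = sample_for_test_env(full_data)
print(len(subset))  # 10000
```

Simple random sampling will not by itself "exercise all paths"; in practice teams augment such a sample with hand-picked edge cases for each path under test.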
18. Phases of testing in Big Data and Hadoop
Testing of Big Data and Hadoop is an enormous and complex process, which is segregated into
four phases to squeeze the best results out of testing. These phases are as follows:
1. Pre-Hadoop Processing: It includes the validation of the data which is collated from various sources
before Hadoop processing. This is the phase where we get rid of unwanted data.
2. Processing of MapReduce Job: A MapReduce job in Hadoop is the Java code used to fetch
the data according to the preconditions provided. Verification of the MapReduce job is performed to
monitor the accuracy of the data fetched.
3. Data Extraction and Loading: This phase includes the validation of the data being loaded into and
extracted from HDFS (Hadoop Distributed File System) to ensure that no corrupt data is present in
HDFS.
4. Report Validation: This is the last phase of testing, ensuring that the output we are delivering
meets the accuracy standards and that there is no redundant data present in the reports.
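Production MapReduce jobs on Hadoop are written in Java, as phase 2 above notes; to illustrate just the map/shuffle/reduce flow that such a job is verified against, here is a classic word count sketched in plain Python (all names are ours, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

lines = ["big data is big", "data is everywhere"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Verifying the job then means checking that these counts match what an independent computation over the same input produces.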
20. Become a Big Data expert with Hortonworks certification only at SpringPeople!
Get updates on upcoming classes and webinars at www.springpeople.com
Editor's Notes
Do you guys know what a Tsunami is? Well, that is how BIG Data hit the technology world over the last 8 years. In the last 2 years alone, we've generated 90% of the world's available data, and this is just the beginning.
It is not a technology. It is not a tool. BIG Data is extremely large volumes of data, mostly unstructured in nature, which cannot be stored, processed or managed by traditional RDBMS tools. Till date, we're very familiar with gigabytes & megabytes. However, BIG Data runs in terabytes, petabytes and beyond.
Just to give you a comparison, here is a table showcasing the difference between each data metric. Today, we have data being generated running even into zettabytes. Just to give you a little context, a zettabyte is a one followed by 21 zeroes.
Now comes a very interesting question; what led to this data-explosion over the last 10 years?
There are 3 triggers for this.
1. Access to data became easier after the advent of the internet; never before did the human race have access to such in-depth information.
2. BIG Data provided extremely lucrative insights to business organizations.
3. Due to this cycle of access, insight & results, investments were made to ensure as much data as possible is collected.
This led to the explosion of BIG Data.
Let's see a few examples of this in actual business.
Till now, we’ve established that BIG Data is large, complex and diverse. Those qualities are broken down into the following 4 characteristics.
Volume
Velocity
Variety
Veracity
Let's break down each and every one of them.
As the name suggests, this characteristic refers to the sheer volume of BIG Data generated in the world today from a multitude of sources. For example, an airplane collects 10 terabytes of sensor data for every 30 minutes of flying time.
Today, we're hitting data generation on the scale of zettabytes. This is absolutely unprecedented.
Next is Velocity.
This refers to the speed at which data is generated, managed and processed. BIG Data is characterized by constantly increasing velocity. This is because an increase in data generation has led to the creation of advanced distributed processing networks and even real-time analytics tools.
Due to this, the velocity of BIG Data activities is constantly increasing.
Next.
BIG Data can be structured, semi-structured or unstructured, and comes from infinite sources. In addition, it can be from mediums such as geospatial data, 3D, audio, video, log files, network algorithms & social media.
Due to this, variety is a very crucial characteristic of BIG Data.
This refers to the biases, noise and abnormality in Big Data. Due to this, extreme caution has to be applied in the data-collection process. Companies spend billions of dollars either as penalties for false data collection or on data cleansing.
These are the 4 characteristics of BIG Data. Now let's move on to the types of BIG Data.