This document includes lecture/workshop notes for the BIG DATA SCIENCE workshop at NTI, 6-7 December 2017.
Hint: 1: This is an initial version, and it will be updated.
2: The telecommunication/5G parts were not covered during the workshop; however, I will add a comprehensive analysis of the mentioned cases.
If anyone is interested in working hands-on on the mentioned case study, just drop me an e-mail: m.rahm7n@gmail.com
2. AGENDA
BIG DATA !
BIG DATA: HYPE OR REALITY?
DEEP DIVE INTO THE INFRASTRUCTURE
BIG DATA SCENARIO A 2 Z
DATA ANALYTICS
DATA VISUALIZATION
EMOTION INTELLIGENCE
WORD EMBEDDING IN NLP
DEEP LEARNING IN AUTONOMOUS CAR
PREDICTION MODELS IN OIL AND GAS
MICROSOFT AZURE
5G: IMT 2020
https://www.linkedin.com/in/mrastro
4. “Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. But it’s not
the amount of data that’s important. It’s what organizations do with the data
that matters. Big data can be analyzed for insights that lead to better decisions
and strategic business moves.” [2]
Definition
“Big data is about looking ahead, beyond
what everybody else sees.” [1]
Peter Sondergaard, senior vice president and global head of research at Gartner
Although there’s no fixed number marking the beginning of “big”, we’re talking much bigger
than conventional tools like spreadsheets and relational databases can handle easily. Many
case studies of big data involve datasets of many petabytes—or even exabytes—
made possible only by using high-performance cloud-based computing.
Many big-data applications, such as cancer research, use historical data, but much attention
is being paid to how to leverage real-time data—not just collected in real time, but processed
and accessed in real time too. In many scenarios, users must be able to ask questions
iteratively and get answers in minutes, not days.
Big data covers not just “structured” data neatly normalized into a fixed schema
and exported from ERP or CRM systems. It also includes semi-structured data
(which, although it has no fixed configuration, is categorized using tags or other
metadata) and unstructured data, such as email messages and videos.
MOST DEFINITIONS OF BIG DATA AGREE THAT IT INVOLVES THE “THREE VS” [4]
Any technology is only useful if it solves a problem (or problems).
As we all know, there is data, lots of it: historical data, sure, but also new
data generated from social media apps, click stream data from web
applications, IoT sensor data, and on and on. The amount of data is larger
than ever, coming in at ever-increasing rates, and in many different formats. [3]
The Problem
5. Earlier this year (2017), Gartner published its Hype Cycle for Emerging Technologies [5].
It notes that many of the emerging technologies, including virtual personal
assistants, machine learning, the IoT, and M2M, use data to track performance and
generate big data to define success.
Taking a closer look at the peak, we can see IoT and machine/deep learning at about 2-5 years from
mainstream adoption (expected between 2020-22), which creates a world of connectivity.
Hint: The connected world amplifies big data and its existence everywhere.
7. Traditional Data Management Systems [6]
SHARED I/O
SHARED PROCESSING
LIMITED SCALABILITY
SERVICE BOTTLENECKS
HIGH COST FACTOR
Abstraction of BIG DATA Platform [6]
PARALLEL PROCESSING
LINEAR SCALABILITY
DISTRIBUTED SERVICE
LOW COST FACTOR
Notes: The key advantage of distributed systems is that they are software defined:
the cluster is optimized for software execution (e.g. Hadoop). Files/datasets can be
split into segments and distributed across different nodes (worker nodes)
within the network to be processed in parallel, which in turn gives more performance.
They are also reliable and upgradable: more resources can be added
easily, which also reduces the cost factor.
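Notes: The split-and-process-in-parallel idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for worker nodes, the dataset is a made-up list of text lines, and the function names (split_into_segments, count_words, parallel_word_count) are hypothetical, not part of Hadoop or any other platform.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(lines, n_segments):
    """Split a list of lines into at most n_segments contiguous segments."""
    size = max(1, -(-len(lines) // n_segments))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words(segment):
    """Work done independently by one (simulated) worker node."""
    counts = Counter()
    for line in segment:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, n_workers=3):
    """Process every segment in parallel, then merge the partial results."""
    segments = split_into_segments(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, segments))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up sample dataset, for illustration only.
dataset = [
    "big data is big",
    "data is everywhere",
    "big clusters process data",
    "worker nodes process segments",
]
result = parallel_word_count(dataset, n_workers=3)
print(result["big"], result["data"])
```

The merged result does not depend on how many workers are used, which is what lets such a cluster grow by simply adding nodes.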
SHARED NOTHING
Notes: For any big data file, slice the file into blocks; those blocks are then spread across
the available worker nodes. Hint: the n nodes do not have to be physical nodes;
we can deploy n physical nodes with m VMs (virtual nodes/machines) that together act as a single
cluster. Hint: each node takes one or more blocks (depending on the size).
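Notes: A minimal sketch of the slicing step described above, assuming fixed-size blocks and a simple round-robin placement. The block size, node names and helper names (slice_into_blocks, assign_blocks) are hypothetical; real platforms use their own placement policies and typically also replicate blocks, which is not shown here.

```python
def slice_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(blocks, nodes):
    """Map each block index to a worker node in round-robin order (an assumption)."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


# Made-up 1000-byte "file" and three hypothetical worker nodes.
file_bytes = b"x" * 1000
blocks = slice_into_blocks(file_bytes, block_size=128)
nodes = ["node-1", "node-2", "node-3"]
placement = assign_blocks(blocks, nodes)

for node, block_ids in placement.items():
    print(node, block_ids)
```

With 8 blocks and 3 nodes, each node ends up holding two or three blocks, matching the hint that a node takes one or more blocks depending on the size.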
SCENARIO
8. Selecting a Modeling Technique [6]
DEVELOP YOUR USE CASE [6]
“Formulate a Data-Driven Use Case
High-level description and objectives of the use case
Challenges addressed by the use case
Pain points and impact of each challenge
Goals, success criteria, constraints and assumptions
Available data, data sources and required resources
Modeling approach for each challenge
Overall model structure & workflow
Application of the use case into operational solution”
9. STRUCTURED DATA [6]
“Commonly refers to database tables with a well-defined column structure, including
data types and specifications. It might also include other non-database-managed
formats like OLAP cubes, CSV files and fixed-column files, as long as they are
consistently generated, i.e. exported from a database, generated by an ATM
machine, etc.”
UNSTRUCTURED DATA [6]
“Data NOT following a well-defined structure, either because of the nature of data
generation or the nature of the data format. Most of the data generated around the
globe is unstructured data, to different degrees:
Semi-structured: XML log files, HTML content
Quasi-structured: query strings in websites URLs, log events/alerts
Unstructured: text, pdf, word, social feeds, web content, images, video”
Img src: http://bigdata.black/infrastructure/storage/unstructured-data
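Notes: To make the categories above concrete, here is a small Python sketch that reads one sample of each kind using only the standard library. All sample values (the account rows, the log events, the URL and the free-text note) are invented for illustration and do not come from the workshop material.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Structured: a CSV export with a fixed set of columns.
csv_text = "account_id,amount,currency\n1001,250.00,EGP\n1002,75.50,EGP\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: an XML log; tags and attributes describe the content.
xml_text = (
    '<log>'
    '<event level="ERROR" site="A12">link down</event>'
    '<event level="INFO" site="B07">restart ok</event>'
    '</log>'
)
root = ET.fromstring(xml_text)
errors = [e.get("site") for e in root.findall("event") if e.get("level") == "ERROR"]

# Quasi-structured: a query string inside a website URL.
url = "https://example.com/search?q=big+data&page=2"
params = parse_qs(urlparse(url).query)

# Unstructured: free text; without further tooling only crude checks are possible.
note = "Customer called twice about dropped calls; very unhappy."
mentions_dropped_calls = "dropped calls" in note.lower()

print(total, errors, params, mentions_dropped_calls)
```

The first three samples can be queried by field name, while the free-text note only supports a keyword check here, which hints at why unstructured data is the hardest to analyze.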
“Unfortunately, it’s often very difficult to analyze unstructured data. To help with the
problem, organizations have turned to a number of different software solutions designed
to search unstructured data and extract important information. The primary benefit of
these tools is the ability to glean actionable information that can help a business succeed
in a competitive environment. Because the volume of unstructured data is growing so
rapidly, many enterprises also turn to technological solutions to help them better manage
and store their unstructured data. These can include hardware or software solutions that
enable them to make the most efficient use of their available storage space.” [7]
12. Telecom: Case Study
Leveraging data to better understand and satisfy customer
needs; churn prevention
Monitor and visualize all kinds of site and service alarms,
solve KPI problems, and predict insights in near real time
Predictive Maintenance
15. References
1: Gartner Says Big Data Creates Big Jobs
2: SAS - Big Data: What it is and why it matters
3: IBM - What is big data? More than volume, velocity and variety
4: Verizon - BIG DATA: HYPE OR REALITY?
5: Top Trends in the Gartner Hype Cycle for Emerging Technologies, 2017
6: Digital Transformation Industry Perspective, Eng. Hisham
7: Unstructured Data: BIGDATA