Big Data Driven Solutions to Combat COVID-19

1. Talk on Big Data Driven Solutions to Combat COVID-19
   National Level Webinar at Ethiraj College for Women (Autonomous), Chennai.
   Dr. S. Balakrishnan, Professor and Head, Department of Computer Science and Business Systems, Sri Krishna College of Engineering and Technology, Coimbatore, Tamil Nadu.
2. OUTLINE
    Introduction
    Big Data Preparation
    Types of Tools Used in Big Data
    Top Big Data Technologies
    Research Projects
3. INTRODUCTION
    Big Data may well be the Next Big Thing in the IT world.
    Big data burst upon the scene in the first decade of the 21st century.
    The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
    Like many new information technologies, big data can bring about:
      dramatic cost reductions,
      substantial improvements in the time required to perform a computing task, or
      new product and service offerings.
4. WHAT IS BIG DATA?
    'Big Data' is similar to 'small data', but bigger in size.
    Because the data is bigger, it requires different approaches:
      techniques, tools and architecture,
      with an aim to solve new problems, or old problems in a better way.
    Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
5. WHAT IS BIG DATA
    Walmart handles more than 1 million customer transactions every hour.
    Facebook handles 40 billion photos from its user base.
    Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
6. SOME MAKE IT 4 V'S
9. WHY BIG DATA
   Growth of Big Data is driven by:
     increase of storage capacities,
     increase of processing power,
     availability of data (different data types).
10. WHY BIG DATA
     Facebook generates 10 TB of data daily.
     Twitter generates 7 TB of data daily.
     IBM claims 90% of today's stored data was generated in just the last two years.
11. HOW IS BIG DATA DIFFERENT?
    1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
    2) Typically an entirely new source of data (e.g. use of the internet)
    3) Not designed to be friendly (e.g. text streams)
12. BIG DATA SOURCES
     Users
     Application systems
     Sensors
     Large and growing files (big data files)
13. DATA GENERATION POINTS: EXAMPLES
     Mobile devices
     Microphones
     Readers/scanners
     Science facilities
     Programs/software
     Social media
     Cameras
14. THE STRUCTURE OF BIG DATA
    Structured: most traditional data sources
    Semi-structured: many sources of big data
    Unstructured: video data, audio data
15. BIG DATA TECHNOLOGY
16. BIG DATA PREPARATION: PERCEPTION VS. REALITY
    The perception is that you spend most of your time on analytics.
    In reality, you will devote much more time and effort to importing, profiling, cleansing, repairing, standardizing, and enriching your data.
17. DATA CLEANSING
    Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
    Data cleansing may be performed interactively with data-wrangling tools, or as batch processing through scripting.
    After cleansing, a data set will be consistent with other similar data sets in the system.
18. DATA CLEANSING VS. VALIDATION
   Cleansing
    Data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities.
   Validation
    Validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
19.  Data cleansing may also involve activities like harmonization of data and standardization of data, as sketched below.
     For example, harmonization of short codes (st, rd, etc.) to actual words (street, road, etcetera).
     Standardization of data is a means of changing a reference data set to a new standard, e.g. use of standard codes.
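A minimal pandas sketch of these two steps; the column names, short-code table, and date formats are made up for the example:

```python
# Harmonization and standardization sketch with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "address": ["12 Main st", "4 Oak rd", "9 Elm Street"],
    "order_date": ["2020-03-01", "01/04/2020", "2020/03/15"],
})

# Harmonization: expand short codes (st, rd) to actual words.
short_codes = {r"\bst\b": "street", r"\brd\b": "road"}
df["address"] = df["address"].str.lower().replace(short_codes, regex=True)

# Standardization: coerce every date to one reference format
# (format="mixed" requires pandas >= 2.0).
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed", errors="coerce")

print(df)
```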
20. WHY DATA PREPARATION IS NEEDED
    It significantly reduces the amount of time needed to ingest and prepare new data sets for multiple downstream processes.
    It also shapes and improves your business data, and renders your ecosystem simple, scalable, and automated.
21. DATA BEFORE PROCESSING
    You work with a mishmash of data sources.
    Your content will be inconsistent, incomplete and in a variety of formats.
    It takes you weeks to process your data and write custom scripts to clean up the mess.
    You need an efficient strategy to harvest and analyze data from social media and sales transactions.
    You will have only a vague idea of the categories of information that your data might provide.
22. DATA AFTER PROCESSING
    Provides you with a large set of data repair, transformation, and enrichment options that require zero coding or scripting.
    Enables you to see data transformations and the result of script automation in real time with a set of smart and interactive tools and features.
23. PREPARATION LIFE CYCLE
24. INGEST
    What are your data sources? Are they office documents, social media, or clickstream logs? If so, you need to ingest your data before you can effectively analyze and enrich it.
    To make sense of all the data you have, you must define a structure and correlate the disparate data sets.
    This important step involves both understanding and standardizing your data.
25. WHAT YOU CAN DO TO INGEST AND MEND YOUR DATA
    Statistical profiling: create standard statistical analyses of numerical data, and frequency and term analyses of text data.
    Process: handle multiple formats of data sources, whether their content is structured, semi-structured, or unstructured.
    Cleanse: remove nonessential characters and standardize date formats.
    Repair: find and fix inconsistencies.
    Detect schema: identify schema and metadata that are explicitly defined in headers, fields, or tags.
    Identify duplicates: find and flag duplicates in your data so you can reduce the size of your data pool (see the sketch below).
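A small pandas sketch of the profiling, cleansing, repair, and de-duplication steps just listed; the file name and column names are hypothetical:

```python
# Profiling, cleansing, repair and de-duplication (hypothetical file/columns).
import pandas as pd

df = pd.read_csv("customers.csv")          # hypothetical input file

# Statistical profiling: standard statistics for numeric columns,
# frequency analysis for a text column.
print(df.describe())
print(df["city"].value_counts().head())

# Cleanse: strip whitespace and nonessential characters from a text field.
df["name"] = df["name"].str.strip().str.replace(r"[^\w\s]", "", regex=True)

# Repair: make missing values consistent.
df["city"] = df["city"].fillna("unknown")

# Identify duplicates: flag them, then reduce the size of the data pool.
print(f"duplicates found: {df.duplicated().sum()}")
df = df.drop_duplicates()
```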
26. ENRICH
    After you've cleansed your data, you can leverage patterns and knowledge-based classifications to understand the domains found in your data sets.
    Use the wide variety of known categories and the vast array of reference data to analyze and recognize content without relying on any metadata.
    After classifying the data sets, enrich them with related entities from the reference knowledge service, and extract embedded entities found in your data. This semantically enriches and correlates your data.
27. PUBLISH
28. GOVERN
    As you ingest, enrich, and publish your data, cloud service providers provide an intuitive, UI-driven dashboard page to monitor all transform activity on your data sets.
29. AUTOMATE THE PROCESS
    There are two ways to automate the process.
    First, you can use the scheduler to set your transformations to run on a daily, weekly, or monthly basis against a predetermined data source, as sketched below.
    Second, by using APIs you can automate the entire data preparation process, from file movement to preparation to publishing.
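A minimal sketch of the scheduler option, using the third-party `schedule` package; `prepare_dataset` is a hypothetical stand-in for a real pipeline, and actual cloud services expose their own schedulers and REST APIs:

```python
# Daily scheduled data-preparation job (sketch; `schedule` is a third-party package).
import time
import schedule

def prepare_dataset():
    # Hypothetical placeholder for the ingest -> cleanse -> publish pipeline.
    print("running data preparation job")

# Run the transformation daily against a predetermined data source.
schedule.every().day.at("02:00").do(prepare_dataset)

while True:
    schedule.run_pending()
    time.sleep(60)
```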
30. TYPES OF TOOLS USED IN BIG DATA
    Where is processing hosted? Distributed servers / cloud (e.g. Amazon EC2)
    Where is data stored? Distributed storage (e.g. Amazon S3)
    What is the programming model? Distributed processing (e.g. MapReduce)
    How is data stored and indexed? High-performance schema-free databases (e.g. MongoDB; see the sketch below)
    What operations are performed on data? Analytic / semantic processing
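To make the schema-free storage option concrete, a short pymongo sketch; the connection string, database, and field names are illustrative:

```python
# Schema-free storage and indexing with MongoDB via pymongo (illustrative names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["bigdata_demo"]

# Schema-free: documents in the same collection need not share fields.
db.events.insert_one({"user": "alice", "action": "click", "ts": 1})
db.events.insert_one({"user": "bob", "device": "mobile"})

# Index a field for high-performance lookups, then query it.
db.events.create_index("user")
for doc in db.events.find({"user": "alice"}):
    print(doc)
```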
31. TYPES OF BIG DATA TECHNOLOGIES
    Big Data technology is mainly classified into two types:
    Operational Big Data technologies: concerned with the normal, day-to-day data that we generate.
    Analytical Big Data technologies: like an advanced version of Big Data technologies, a little more complex than the operational side.
32. A FEW EXAMPLES OF OPERATIONAL BIG DATA TECHNOLOGIES ARE AS FOLLOWS:
33. A FEW EXAMPLES OF ANALYTICAL BIG DATA TECHNOLOGIES ARE AS FOLLOWS:
34. BIG DATA ANALYTICS AND DATA SCIENCES
35. TOP BIG DATA TECHNOLOGIES
    Top big data technologies are divided into 4 fields, classified as follows:
     Data Storage
     Data Mining
     Data Analytics
     Data Visualization
37. DATA STORAGE - HADOOP
    The Hadoop framework was designed to store and process data in a distributed data processing environment on commodity hardware, with a simple programming model.
    It stores and analyzes data present on different machines with high speed and low cost.
    Developed by: Apache Software Foundation, 10 December 2011
    Written in: Java
    Current stable version: Hadoop 3.1.1
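Hadoop's programming model is easiest to see in a word-count example. Below is a classic Hadoop Streaming pair sketched in Python: Hadoop pipes input lines through the mapper, sorts the emitted key-value pairs, and pipes them through the reducer (the streaming jar path and job flags vary by installation).

```python
# mapper.py -- emit (word, 1) for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- input arrives sorted by key, so counts can be summed per run.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```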
38. COMPANIES USING HADOOP:
    COMPANIES USING MONGODB:
    COMPANIES USING RAINSTOR:
39. DATA MINING - PRESTO
    Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
    It allows querying data in Hive, Cassandra, relational databases and proprietary data stores.
    Developed by: Facebook, open-sourced in 2013
    Written in: Java
    Current stable version: Presto 0.22
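A sketch of an interactive Presto query from Python, assuming the presto-python-client (prestodb) package; the host, catalog, and table names are placeholders:

```python
# Interactive analytic query against Presto (placeholder host/catalog/table).
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com", port=8080,
    user="analyst", catalog="hive", schema="default",
)
cur = conn.cursor()
cur.execute("SELECT country, count(*) FROM page_views GROUP BY country")
for row in cur.fetchall():
    print(row)
```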
40. COMPANIES USING PRESTO:
41. DATA ANALYTICS - KAFKA
    Apache Kafka is a distributed streaming platform.
    A streaming platform has three key capabilities:
      publish and subscribe to streams of records,
      store streams of records durably,
      process streams of records as they occur.
    In this respect it is similar to a message queue or an enterprise messaging system.
    Developed by: originally at LinkedIn, open-sourced through the Apache Software Foundation in 2011
    Written in: Scala, Java
    Current stable version: Apache Kafka 2.2.0
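A minimal publish/subscribe sketch with the kafka-python package; the broker address, topic name, and payload are illustrative:

```python
# Publish and consume a stream of records with kafka-python (illustrative names).
from kafka import KafkaProducer, KafkaConsumer

# Publish a record to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("covid-cases", b'{"region": "Chennai", "new_cases": 12}')
producer.flush()

# Subscribe and process records as they arrive.
consumer = KafkaConsumer("covid-cases",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # stop after one record in this sketch
```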
42. COMPANIES USING KAFKA:
    OTHER DATA ANALYTICS TOOLS: Splunk, KNIME, Spark, R language, blockchain
43. DATA VISUALIZATION - TABLEAU
    Tableau is a powerful and fastest-growing data visualization tool used in the business intelligence industry.
    Data analysis is very fast with Tableau, and the visualizations created take the form of dashboards and worksheets.
    Developed by: Tableau Software, 17 May 2013
    Written in: Java, C++, Python, C
    Current stable version: Tableau 8.2
44. COMPANIES USING TABLEAU:
    OTHER DATA VISUALIZATION TOOL: PLOTLY
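Since Plotly is scriptable from Python, here is a tiny dashboard-style chart as a sketch; the numbers are made up purely for illustration:

```python
# A small interactive bar chart with Plotly Express (made-up data).
import plotly.express as px

fig = px.bar(
    x=["Jan", "Feb", "Mar"], y=[120, 90, 150],
    labels={"x": "Month", "y": "Sales"},
    title="Monthly sales (illustrative data)",
)
fig.show()
```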
45. EMERGING BIG DATA TECHNOLOGIES - TENSORFLOW
    TensorFlow has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state of the art in machine learning, and lets developers easily build and deploy machine-learning-powered applications.
    Developed by: Google Brain team, first released in 2015
    Written in: Python, C++, CUDA
    Current stable version: TensorFlow 2.0 beta
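A minimal tf.keras sketch of building and compiling a small classifier, to show how TensorFlow applications are assembled; the layer sizes and input shape are placeholders:

```python
# Build and compile a tiny classifier with tf.keras (placeholder sizes).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(features, labels, epochs=5)  # train once real data is available
```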
46. COMPANIES USING TENSORFLOW:
    OTHER EMERGING TECHNOLOGY TOOLS: Beam, Docker, Airflow, Kubernetes
47. APPLICATIONS OF BIG DATA ANALYTICS
     Homeland security
     Smarter healthcare
     Multi-channel sales
     Telecom
     Manufacturing
     Traffic control
     Trading analytics
     Search quality
48. RISKS OF BIG DATA
     You can be overwhelmed by the sheer volume.
     You need the right people, solving the right problems.
     Costs can escalate too fast.
     It isn't necessary to capture 100% of the data.
     Many sources of big data raise privacy concerns:
       self-regulation (data compression),
       legal regulation.
49. HOW BIG DATA IMPACTS IT
    Big data is a disruptive force, presenting opportunities as well as challenges to IT organizations.
    By 2015, big data supported 4.4 million IT jobs, 1.9 million of them in the US alone.
    In 2017, data scientist was the No. 1 job in Harvard's ranking.
50. BENEFITS OF BIG DATA
    Real-time big data isn't just about storing petabytes or exabytes of data in a data warehouse; it's about the ability to make better decisions and take meaningful actions at the right time.
    Fast-forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
    Technologies such as MapReduce, Hive and Impala enable you to run queries without changing the data structures underneath.
51. RESEARCH PROJECTS RELATED TO COVID-19
    Project 1: World-wide COVID-19 Outbreak Data Analysis and Prediction
    Methods:
     Real-time data is queried and visualized; the queried data is then used for Susceptible-Exposed-Infectious-Recovered (SEIR) predictive modelling (see the sketch below).
     SEIR modelling is used to forecast the COVID-19 outbreak within and outside of China based on daily observations.
     The queried news is also analyzed and classified into negative and positive sentiments, to understand the influence of the news on people's behavior, both politically and economically.
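A minimal SEIR sketch with scipy shows the shape of the predictive model used in Project 1; the rate parameters, population, and initial conditions below are illustrative, not the project's fitted values:

```python
# SEIR compartmental model integrated with scipy (illustrative parameters).
import numpy as np
from scipy.integrate import odeint

N = 1_000_000                              # population (illustrative)
beta, sigma, gamma = 0.6, 1 / 5.2, 1 / 7   # transmission, incubation, recovery rates

def seir(y, t):
    S, E, I, R = y
    new_exposed = beta * S * I / N         # susceptible -> exposed
    new_infectious = sigma * E             # exposed -> infectious
    new_recovered = gamma * I              # infectious -> recovered
    return (-new_exposed,
            new_exposed - new_infectious,
            new_infectious - new_recovered,
            new_recovered)

t = np.linspace(0, 180, 181)               # days
y0 = (N - 10, 0, 10, 0)                    # start with 10 infectious cases
S, E, I, R = odeint(seir, y0, t).T
print(f"peak infectious: {I.max():.0f} on day {I.argmax()}")
```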
52. RESEARCH PROJECTS
    Project 2: Short-Term Applications of Artificial Intelligence and Big Data: Tracking and Diagnosing COVID-19 Cases.
    Project 3: Short-Term Applications of Artificial Intelligence and Big Data: A Quick and Effective Pandemic Alert.
    Project 4: Modeling of disease activity, potential growth and areas of spread.
    Project 5: Modeling of the utility of operating theaters and clinics, with manpower projections.
53. THANK YOU