Big Data World

Lead Software Engineer at Maersk
July 21, 2016
  1. Big Data World Hossein Zahed www.hzahed.com www.linkedin.com/in/hosseinzahed 1
  2. Table of Contents • Definitions • Big Data 3V's • Internet Stats • Applications & Examples • Data Science Areas • Identities and Skills • Data Work Flow • Challenges • Data Generation • Data Structure • Cloud Service Providers • Hadoop Ecosystem • Data Visualization • Data Analytics Methods • Data Trends • Programming Languages • NoSQL Databases • Interesting Facts • Interesting Insights • Data Sources • Keywords & Glossary • References 2
  3. Big Data - Definitions 1. The first documented use of the term “big data” appeared in a 1997 paper by scientists at NASA, describing the problem they had with visualization (i.e. computer graphics) which “provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” (NASA) 2. Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges. (Oxford English Dictionary) 3. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. (Wikipedia) 3
  4. Big Data – Every 60 Seconds on the Internet 4
  5. Big Data – Basic 3V’s • Big Data: Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions. (Google) • The three basic V’s: Volume, Velocity, Variety 5
  6. Big Data – Basic 3V’s 1. Volume: Huge amount of data (Terabytes of Records, Transactions, Tables, Files) 2. Velocity: High rate of data and information flowing into and out of our systems (Batch, Real-time, Streams, Near-time) 3. Variety: Complexity, thousands or more features per data item (Structured, Unstructured, Semi-Structured) 6
  7. Big Data – More V’s • Veracity: Accuracy and uncertainty of data • Validity: Data quality, clean/unclean data • Variability: Constantly changing/dynamic data • Value: The potential business value/ROI of data • Venue: Distributed, heterogeneous data from multiple platforms • Vocabulary: Schema, data models, semantics, ontologies, taxonomies, context based • Vagueness: Confusion over the meaning of data • Visibility: Open/Secure data • Visualization: Presentation of data in a readable and accessible way 7
  8. Big Data – Moore’s Law Physical capacity and performance of computers double about every two years! 8
  9. Big Data – Gartner’s Emerging Technologies Hype Cycle (2015) 9
  10. Big Data – Internet Stats • Data volumes are exploding: more data has been created in the past two years than in the entire previous history of the human race. • Data is growing faster than ever before, and by the year 2020 about 1.7 megabytes of new information will be created every second for every human being on the planet. • By then, our accumulated digital universe of data will grow from 4.4 zettabytes (a zettabyte is 10^21 bytes) today to around 44 zettabytes, or 44 trillion gigabytes. • Every second we create new data. For example, we perform 40,000 search queries every second (on Google alone), which makes it 3.5 billion searches per day and 1.2 trillion searches per year. • In Aug 2015, over 1 billion people used Facebook in a single day. 10
  11. Big Data – Internet Stats – Continued • Facebook users send on average 31.25 million messages and view 2.77 million videos every minute. • We are seeing a massive growth in video and photo data, where every minute up to 300 hours of video are uploaded to YouTube alone. • In 2015, a staggering 1 trillion photos will be taken and billions of them will be shared online. By 2017, nearly 80% of photos will be taken on smart phones. • This year, over 1.4 billion smart phones will be shipped – all packed with sensors capable of collecting all kinds of data, not to mention the data the users create themselves. • By 2020, we will have over 6.1 billion smartphone users globally (overtaking basic fixed phone subscriptions). 11
  12. Internet Stats - Continued • Within five years there will be over 50 billion smart connected devices in the world, all developed to collect, analyze and share data. • By 2020, at least a third of all data will pass through the cloud (a network of servers connected over the Internet). • Distributed computing (performing computing tasks using a network of computers in the cloud) is very real. Google uses it every day to involve about 1,000 computers in answering a single search query, which takes no more than 0.2 seconds to complete. • The Hadoop (open source software for distributed computing) market is forecast to grow at a compound annual growth rate of 58%, surpassing $1 billion by 2020. • Estimates suggest that by better integrating big data, healthcare could save as much as $300 billion a year — that’s equal to reducing costs by $1,000 a year for every man, woman, and child. 12
  13. Internet Stats - Continued • The White House has already invested more than $200 million in big data projects. • For a typical Fortune 1000 company, just a 10% increase in data accessibility will result in more than $65 million additional net income. • Retailers who leverage the full power of big data could increase their operating margins by as much as 60%. • 73% of organizations have already invested or plan to invest in big data by 2016. • Favorite fact: at the moment less than 0.5% of all data is ever analyzed and used — just imagine the potential here. • More stats: http://www.internetlivestats.com 13
  14. Big Data – Consumer Applications • Google Search! • iPhone Siri • Microsoft Cortana • Amazon Suggestions • Spotify Suggestions • Yelp Recommendations • Netflix Recommendations • Google Now! 14
  15. Big Data – Business Applications • Google Ads Searches: Showing relevant ads to users • Predictive Marketing: consumer behavior, users demographic info • Banking: Fraud detection, risk reporting, customer data analysis • Financial: Stocks prediction, Forex • Fraud Detection: spam filtering, online payments • Health: self-aware medics, sports analysis, genomics, health records • Smart Cities: IoT, transportation, traffic, governance, energy, economy • Social Media: friends, topics, videos recommendations • Education: LMS tracks & logs, time spent on subjects 15
  16. Big Data – Research Applications • Google Trends: Flu, Zika & Ebola virus, racial justice, supporting refugees and the migrant crisis • National Institutes of Health: BRAIN Initiative (Brain Research through Advancing Innovative Neurotechnologies) to create a full map of brain functionality • NASA: Kepler space telescope searching for exoplanets (planets outside of our solar system) • Facebook Graphs: Revealing relationships, six degrees of separation, psychological and personality data • Google Books: Ngram Viewer — history of words, their usage, different meanings 16
  17. Big Data – Example 1 – UPS • Insight: Optimize the routing and predict the maintenance requirements of vehicles. • System: ORION database: engine performance, speed, number of stops, mileage, miles per gallon, GPS, driver behavior, safety habits, emissions, fuel consumption, deliveries, customers, addresses, routes. 250 million+ data points. • Analysis: Advanced mathematical models that provide additional optimization and navigational capabilities to make drivers more efficient. • Result: Saved over 39 million gallons of fuel, avoided 364 million miles, reduced engine idle time by 10 million minutes. 17
  18. Big Data – Example 2 – Walmart • Insight: Customers stock up on certain products in the days leading up to predicted hurricanes. • System: The Retail Link system records sales and triggers reordering, scheduling, and delivery. Back-office scanners track shipments. Partners use RFID technology to track and coordinate inventories. Data includes daily sales, shipments, returns, purchase orders, invoices. • Analysis: Mines data to get its product mix right under all sorts of varying environmental conditions. • Result: Revenues greater than any firm in the US. RFID boosted sales 20%. Gillette increased sales 19%. 18
  19. Big Data – Example 3 – Fraud at eBay • Insight: Fraud spikes mid-week, enabling fraudsters to receive goods by the weekend. Basic fraud pattern = long-distance, high-dollar, expedited shipping. • System: Names, email, addresses, device fingerprinting, IP address, geolocation lookups, time zones, countries in an Oracle database of 1.3 billion entries. • Analysis: Run transactions against 600 rules and 20-plus machine learning algorithms. Regularly tweak the fraud rules. • Result: In 2014, prevented $55 million worth of fraudulent transactions. 19
  20. Big Data – Example 4 – Kaiser Permanente • Insight: Kaiser Permanente’s HealthConnect exchanges data across all facilities and promotes electronic records. Improved outcomes in cardiovascular disease and saved $1 billion from reduced office visits and lab tests. • System: Pharmaceutical companies have aggregated years of research and development data into medical databases, payors and providers have digitized patient records, public stakeholders have opened data from clinical trials. 4 billion petabytes. • Analysis: Determine whether the standard protocol for a disease produces optimal results. • Result: $300 billion to $450 billion in reduced health-care spending. 20
  21. Big Data vs Small Data 21
      Goals: Small Data – have a specific goal; Big Data – may have a goal
      Location: Small Data – on a single computer; Big Data – on the cloud (multiple servers)
      Structure: Small Data – highly structured; Big Data – semi-structured/unstructured
      File Types: Small Data – SQL, Excel; Big Data – documents, multimedia, graphs, tables
      Data Preparation: Small Data – prepared by one user; Big Data – prepared, analyzed, and used by different groups of users
      Longevity: Small Data – short time period; Big Data – continues for a long time
      Measurements: Small Data – single unit (cm); Big Data – multiple units (cm, inch, …)
      Reproducibility: Small Data – usually reproducible; Big Data – rarely reproducible
      Lost Costs: Small Data – limited; Big Data – huge amount
      Introspection: Small Data – clear meaning; Big Data – complex meaning, sometimes meaningless
      Analysis: Small Data – can be analyzed at once; Big Data – needs an analysis procedure
  22. Big Data – Data Science Venn Diagram 22
  23. Big Data – Professional Identities 23
      Data Developer: Developer, Engineer
      Data Researcher: Researcher, Scientist, Statistician
      Data Creative: Jack of All Trades, Artist, Hacker
      Data Businessperson: Leader, Businessperson, Entrepreneur
  24. Big Data – Five Skill Groups 24
      Business: Product Development, Business
      ML / Big Data: Unstructured Data, Structured Data, Machine Learning, Big and Distributed Data
      Math / OR: Optimization, Math, Graphical Models, Bayesian / Monte Carlo Statistics, Algorithms, Simulation
      Programming: System Administration, Back-End Programming, Front-End Programming
      Statistics: Visualization, Temporal Statistics, Surveys and Marketing, Spatial Statistics, Science, Data Manipulation, Classical Statistics
  25. Big Data – Crossed Identities and Skills 25
  26. Big Data – Scientific Data • Genetic Data (1V): High Volume of data in a structured way • Earthquake Prediction (1V): High Velocity of data, almost real-time • Facial Recognition (1V): High Variety of data • Jet Engine Sensors (2Vs): High Volume + High Velocity (20TB/hour data) • Surveillance Video (2Vs): High Velocity + High Variety of data streaming • Google Books (2Vs): High Volume + High Variety of data (30 Million books) 26
  27. Big Data – Data Science Work Flow 27
  28. Big Data – Common Challenges • Anonymity: danger of de-anonymizing public data, social network graphs, medical data,… • Confidentiality: trying to protect data and access levels; the responsibility that comes with storing unimportant data • Data Quality: Nearly 95% of spreadsheets have errors • Incomplete or corrupted data • Duplicate records • Typographical errors • Data without context/missing context • Incomplete transformations • Data conversion errors 28
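      A hedged, minimal illustration of the data-quality problems listed above (duplicate records, missing values, typographical errors), using Python and pandas; the DataFrame and its column names are hypothetical:

        import pandas as pd

        # Hypothetical customer records exhibiting the quality issues above
        df = pd.DataFrame({
            "customer_id": [1, 2, 2, 3, 4],
            "country":     ["US", "us", "us", None, "Gernany"],   # typo + missing value
            "revenue":     [120.0, 80.0, 80.0, None, 45.0],
        })

        df = df.drop_duplicates()                                      # duplicate records
        df["country"] = df["country"].str.upper()                      # inconsistent formatting
        df["country"] = df["country"].replace({"GERNANY": "GERMANY"})  # typographical error
        df["revenue"] = df["revenue"].fillna(df["revenue"].median())   # incomplete data
        df = df.dropna(subset=["country"])                             # rows missing context
        print(df)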
  29. Big Data – Security Challenges • Secure computations in distributed programming frameworks • Security best practices for non-relational data stores • Secure data storage and transaction logs • End-point input validation/filtering • Real-time security/compliance monitoring • Scalable and composable privacy-preserving data mining and analytics • Cryptographically enforced access control and secure communication • Granular access control • Granular audits • Data provenance 29
  30. Big Data – Human Generated Data • Intentional Data: Chats, photos, videos, comments, likes, web searches, emails, cell phone calls, text messages, online purchases,… • Meta Data: Data about data, second-order data • Photo metadata taken by cameras • Cell phone time and location • Emails To, From, CC, BCC • Social network connectivity • Twitter collects 150 pieces of metadata for each tweet 30
  31. Big Data – iPhone 4S Photo EXIF Metadata 31
      ExifTool Version Number : 8.68
      File Name : IMG_1031.JPG
      Directory : .
      File Size : 3.1 MB
      File Modification Date/Time : 2011:10:05 01:43:44-07:00
      File Permissions : rw-r--r--
      File Type : JPEG
      MIME Type : image/jpeg
      Exif Byte Order : Big-endian (Motorola, MM)
      Make : Apple
      Camera Model Name : iPhone 4S
      Orientation : Rotate 180
      X Resolution : 72
      Y Resolution : 72
      Resolution Unit : inches
      Software : 5.0
      Modify Date : 2011:08:24 13:13:33
      YCbCr Positioning : Centered
      Exposure Time : 1/286
      F Number : 2.4
      Exposure Program : Program AE
      ISO : 64
      Exif Version : 0221
      Date/Time Original : 2011:08:24 13:13:33
      Create Date : 2011:08:24 13:13:33
      Components Configuration : Y, Cb, Cr, -
      Shutter Speed Value : 1/286
      Aperture Value : 2.4
      Brightness Value : 6.992671928
      Metering Mode : Multi-segment
      Flash : Auto, Did not fire
      Focal Length : 4.3 mm
      Subject Area : 1631 1223 881 881
      Flashpix Version : 0100
      Color Space : sRGB
      Exif Image Width : 3264
      Exif Image Height : 2448
      Sensing Method : One-chip color area
      Exposure Mode : Auto
      White Balance : Auto
      Focal Length In 35mm Format : 35 mm
      Scene Capture Type : Standard
      Sharpness : Normal
      GPS Latitude Ref : North
      GPS Longitude Ref : West
      GPS Altitude Ref : Above Sea Level
      GPS Time Stamp : 21:08:30
      GPS Img Direction Ref : True North
      GPS Img Direction : 346.4727273
      Compression : JPEG (old-style)
      Thumbnail Offset : 908
      Thumbnail Length : 12311
      Image Width : 3264
      Image Height : 2448
      Encoding Process : Baseline DCT, Huffman coding
      Bits Per Sample : 8
      Color Components : 3
      YCbCr Sub Sampling : YCbCr4:2:0 (2 2)
      Aperture : 2.4
      GPS Altitude : 1222 m Above Sea Level
      GPS Latitude : 37 deg 44' 10.80" N
      GPS Longitude : 119 deg 35' 58.80" W
      GPS Position : 37 deg 44' 10.80" N, 119 deg 35' 58.80" W
      Image Size : 3264x2448
      Scale Factor To 35 mm Equivalent : 8.2
      Shutter Speed : 1/286
      Thumbnail Image : (Binary data 12311 bytes, use -b option to extract)
      Circle Of Confusion : 0.004 mm
      Field Of View : 54.4 deg
      Focal Length : 4.3 mm (35 mm equivalent: 35.0 mm)
      Hyperfocal Distance : 2.08 m
      Light Value : 11.3
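      A hedged sketch of how this kind of photo metadata can be read programmatically with the Pillow library in Python; the file name matches the listing above but is otherwise hypothetical:

        from PIL import Image
        from PIL.ExifTags import TAGS

        # Open a (hypothetical) JPEG and read its EXIF block
        img = Image.open("IMG_1031.JPG")
        exif = img.getexif()

        # Translate numeric EXIF tag IDs into readable names and print them
        for tag_id, value in exif.items():
            tag = TAGS.get(tag_id, tag_id)
            print(f"{tag}: {value}")

      Command-line tools such as ExifTool (used for the listing above) expose the same fields, including the GPS position that turns every photo into a location data point.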
  32. Big Data – Computer Generated Data • Sources: Cell phones connecting to towers, Satellite radio, GPS connections, Wi-Fi connections, Web Crawlers,… • Internet of Things (IoT): Information collected and transmitted via IoT devices, Production Lines, Smart Meters, Environmental Monitoring, Industrial Applications, Infrastructure Management, Energy Management, Medical and Healthcare Systems, Smart Buildings,… • Machine to Machine: Server to Server connections, Web Services, Cloud Computations, Real-Time Analytics, Network Monitoring, Routing and Switching,… 32
  33. Big Data – Structured vs. Unstructured Data 33
  34. Big Data – Structured vs. Unstructured Data 34
      Representation: Structured – discrete rows and columns; Unstructured – less defined boundaries and easily addressable
      Storage: Structured – relational databases or spreadsheets; Unstructured – unmanaged file structures
      Metadata: Structured – syntax; Unstructured – semantics
      Integration Tools: Structured – ETL or ELT; Unstructured – batch processing or manual data entry that involves codes
      Standards: Structured – SQL, ADO.NET, ODBC,…; Unstructured – OpenXML, JSON, SMTP, SMS, CSV,…
      Databases: Structured – MSSQL, Oracle, Excel,…; Unstructured – Hadoop, HDInsight, MongoDB,…
      Content: Structured – typically text; Unstructured – text, images, audio, video, documents
  35. Big Data – Cloud Computing Services 35 SaaS / DaaS PaaS IaaS
  36. Big Data – Cloud Computing Services Continued • IaaS: Infrastructure as a Service • Servers, Virtual Machines, Storage, Load Balancers, Firewalls, Network • PaaS: Platform as a Service • Web Servers, Databases, Development Tools, Execution Runtime • SaaS: Software as a Service • CRM, ERP, Email, Virtual Desktop, Communications, Games • DaaS: Data as a Service (Free or Commercial) • Stocks, Forex, Google Map, Reddit, Twitter Demographic Data 36
  37. Big Data – Cloud Service Providers • Google Big Data Solutions • Amazon Public Elastic Cloud • Microsoft Azure • OpenStack by Rackspace and NASA • IBM Big Data Solutions • Cloudera • Oracle Cloud Platform • Hortonworks • SAP Big Data 37
  38. Big Data – Cloud Providers Comparison 38
  39. Big Data – Hadoop • Apache Hadoop (pronunciation: /həˈduːp/) is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. (Wikipedia) • History: Doug Cutting, Mike Cafarella and their team took the solution described in Google’s papers and started an open-source project called Hadoop in 2005; Doug named it after his son's toy elephant. Apache Hadoop is now a registered trademark of the Apache Software Foundation. • Hadoop is a free and open-source project 39
  40. Big Data – Hadoop EcosystemArchitecture 40
  41. Big Data – Hadoop Components • HDFS: The Hadoop Distributed File System, used to store files across many computers • MapReduce: • Map splits a task into pieces • Reduce combines the output • In Hadoop 2 the classic MapReduce engine was rebuilt on top of YARN (often called MapReduce 2) • YARN: A resource manager that supports batch processing (MapReduce) as well as stream processing and graph processing, which classic MapReduce could not 41
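      To make the map/reduce split concrete, a minimal, self-contained Python word-count sketch shaped like a MapReduce job: the mapper emits (key, value) pairs and the reducer aggregates them after a sort (the step a framework such as Hadoop performs as its shuffle); the sample input is made up:

        from itertools import groupby

        def mapper(lines):
            # Map: split each line into words and emit (word, 1) pairs
            for line in lines:
                for word in line.strip().split():
                    yield word.lower(), 1

        def reducer(pairs):
            # Reduce: sum the counts per word (pairs must arrive sorted by key)
            for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
                yield word, sum(count for _, count in group)

        if __name__ == "__main__":
            text = ["Big Data World", "big data is big"]
            for word, count in reducer(mapper(text)):
                print(word, count)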
  42. Big Data – Hadoop Components Continued • Pig: Lets you write MapReduce-style data flows using the Pig Latin language • Hive: Summarizes and queries data for analysis using the HiveQL query language • HBase: A NoSQL (non-relational, "not only SQL") database • Storm: Real-time processing of streaming data • Spark: In-memory processing (HDD to RAM) • Giraph: Graph processing for social-network data 42
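      A hedged PySpark sketch of the in-memory processing idea noted for Spark: the same word count expressed as transformations on a cached RDD; it assumes a local pyspark installation and uses a tiny in-memory dataset instead of an HDFS path:

        from pyspark import SparkContext

        sc = SparkContext("local[*]", "WordCount")

        # A tiny in-memory dataset, cached in RAM (in practice this would be an HDFS path)
        lines = sc.parallelize(["big data world", "big data is big"]).cache()

        counts = (lines.flatMap(lambda line: line.split())   # map: line -> words
                       .map(lambda word: (word, 1))          # emit (word, 1) pairs
                       .reduceByKey(lambda a, b: a + b))     # reduce: sum per word

        print(counts.collect())
        sc.stop()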
  43. Big Data – Landscape 43 http://www.hzahed.com/post/big-data-landscape
  44. Big Data – Microsoft HDInsight 44
  45. Big Data – Google Big Data Cloud 45
  46. Big Data – Amazon AWS Big Data 46
  47. Big Data – Who Uses Hadoop • Google • Yahoo! • LinkedIn • Facebook • Quantcast • Amazon • IBM • ISI • Spotify • Twitter • Adobe • eBay • Alibaba • Many others 47
  48. Big Data – ETL Definition • ETL: Stands for Extract, Transform, Load • Extract: The process of pulling data from storage such as a database • Transform: The process of putting data into a common format • Load: The process of loading data into software for analysis 48 Extract Transform Load
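      As a hedged, toy-scale walk through the three steps just defined: the sketch below extracts rows from a small SQLite database, transforms them into a common format, and loads them into pandas for analysis; the table and column names are hypothetical:

        import sqlite3
        import pandas as pd

        # Extract: pull raw rows from a (hypothetical) operational database
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (id INTEGER, amount TEXT, country TEXT)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                         [(1, "12.50", "us"), (2, "7.00", "US"), (3, "99.90", "de")])
        rows = conn.execute("SELECT id, amount, country FROM orders").fetchall()

        # Transform: convert to a common format (numeric amounts, upper-case country codes)
        records = [{"id": i, "amount": float(a), "country": c.upper()} for i, a, c in rows]

        # Load: load into the analysis tool (here a pandas DataFrame) and aggregate
        df = pd.DataFrame(records)
        print(df.groupby("country")["amount"].sum())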
  49. Big Data – ETL in Hadoop • ETL in Hadoop works differently from common databases • Data starts and ends in Hadoop • Hadoop can handle different formats • It doesn’t require as much inspection • No need to be aware of or worry about ETL processes in Hadoop • Make it a point to inspect data 49
  50. Big Data – Monitoring & Anomaly • Monitoring • Detects specific events • Needs specific criterion in advance • Triggers automatic response • Anomaly • Notifies of “unusual activity” • Based on flexible criteria • Doesn’t trigger a response • Instead, invites inspection 50
  51. Big Data – Visualization – Human vs. Computers • Computers spot certain patterns • Computers excel at predictive models • Computers excel at data mining • Humans perceive and interpret better • Human vision still plays an important role • Humans identify visual patterns • Humans identify anomalies • Humans see patterns across groups • Humans interpret the content of images better • Humans perform better on Gestalt tests 51
  52. Big Data – Visualization – Gestalt Test 52
  53. Big Data – Visualization – Best Practices • Prettier graphs are not always better • Never use a false third dimension • Animated and interactive graphs can be distracting • The goal of data visualization is insight • Use proper chart formats for visualization • Choose the right color scheme (Qualitative, Sequential, Diverging) • Make sure the chart alone can tell your story 53
  54. Big Data – Microsoft Excel Role • Excel is the most common data tool • Millions of people use it and know how to deal with it • Professional data miners use it • Excel can do real data science on its own • ODBC interfaces can connect Excel directly to Hadoop • Excel is great for sharing data results • Excel includes interactive PivotTables, Sortable Worksheets, Graphics and Charts 54
  55. Big Data – Data Analytics (DA) Methods • Machine Learning (ML) • Pattern Recognition (PR) • Data Mining (DM) • Natural Language Processing (NLP) • Information Retrieval (IR) • Text Mining (TM) • Predictive Analytics • Business Intelligence (BI) • Prescriptive Analytics 55
  56. Big Data – Machine Learning (ML) • Definition: Machine Learning (ML) is a subfield of computer science (more particularly soft computing) that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed". (Wikipedia) • Examples: Recommendations, Classifications, Linear Regression, Clustering, Neural Networks 56
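      A hedged scikit-learn sketch of "learning without being explicitly programmed": a classifier is fit to labeled examples and then predicts labels for data it has not seen; it uses the library's bundled iris sample, so nothing here reflects a production pipeline:

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score

        # Labeled examples: flower measurements (features) and species (labels)
        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=0)

        # The model learns the feature-to-label mapping from the data alone
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y_train)

        print("accuracy:", accuracy_score(y_test, model.predict(X_test)))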
  57. Big Data – Pattern Recognition (PR) • Definition: Pattern Recognition (PR) is a branch of machine learning that focuses on the recognition of patterns and regularities in data, although it is in some cases considered to be nearly synonymous with machine learning. Pattern recognition systems are in many cases trained from labeled "training" data (supervised learning), but when no labeled data are available other algorithms can be used to discover previously unknown patterns (unsupervised learning). (Wikipedia) • Examples: Face detection, fingerprint verification, screening for tumors and cancers, shape recognition, navigation systems 57
  58. Big Data – Data Mining (DM) • Definition: Data Mining (DM) is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. (Wikipedia) • Examples: Anomaly Detection, Association Rule Learning, Clustering, Classification, Regression, Summarization 58
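      One of the data-mining tasks listed above, clustering, as a short hedged scikit-learn sketch: k-means groups unlabeled points without any predefined categories; the two-blob synthetic data is purely illustrative:

        import numpy as np
        from sklearn.cluster import KMeans

        # Unlabeled 2-D points drawn around two illustrative centers
        rng = np.random.default_rng(0)
        points = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
                            rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))])

        # Discover two clusters purely from the data itself
        kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
        print("cluster centers:\n", kmeans.cluster_centers_)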
  59. Big Data – Natural Language Processing (NLP) • Definition: Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. (Wikipedia) • Examples: Natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation. (SIRI, Cortana) 59
  60. Big Data – Information Retrieval (IR) • Definition: Information Retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. (Wikipedia) • Examples: Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines (Google & Bing) are the most visible IR applications. 60
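      A hedged sketch of the core retrieval idea using scikit-learn: documents and a query are mapped to TF-IDF vectors and ranked by cosine similarity; the three-document collection is made up for illustration:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # A tiny, illustrative document collection
        docs = [
            "Hadoop stores big data on commodity clusters",
            "Spark processes data in memory",
            "Search engines rank documents for a query",
        ]
        query = ["rank documents for a search query"]

        # Index: represent documents and the query as TF-IDF vectors
        vectorizer = TfidfVectorizer()
        doc_vectors = vectorizer.fit_transform(docs)
        query_vector = vectorizer.transform(query)

        # Retrieve: score each document by cosine similarity to the query
        scores = cosine_similarity(query_vector, doc_vectors).ravel()
        for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
            print(f"{score:.2f}  {doc}")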
  61. Big Data – Text Mining (TM) • Definition: Text Mining (TM) also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. (Wikipedia) • Examples: Enterprise Business Intelligence/Data Mining, Competitive Intelligence, National Security/Intelligence, Publishing, Social Media Monitoring, Search/Information Access, Natural Language/Semantic Toolkit or Service, Sentiment Analysis Tools, Listening Platforms 61
  62. Big Data – Predictive Analytics • Definition: Predictive Analytics encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events. In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions. (Wikipedia) • Examples: Actuarial Science, Marketing, Financial Services, Insurance, Telecommunications, Retail, Travel, Healthcare, Child Protection, Pharmaceuticals, Capacity Planning 62
  63. Big Data – Business Intelligence (BI) • Definition: Business Intelligence (BI) can be described as "a set of techniques and tools for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes". The term "data surfacing" is also more often associated with BI functionality. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. (Wikipedia) • Examples: Measurement, Analytics, Enterprise Reporting, Collaboration Platform, Knowledge management 63
  64. Big Data – Prescriptive Analytics • Definition: Prescriptive analytics is the third and final phase of business analytics (BA) which includes descriptive, predictive and prescriptive analytics. Predictive analytics answers the question what will happen. This is when historical performance data is combined with rules, algorithms, and occasionally external data to determine the probable future outcome of an event or the likelihood of a situation occurring. The final phase is prescriptive analytics, which goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the implications of each decision option. (Wikipedia) 64
  65. Big Data – Prescriptive Analytics Continued 65
  66. Big Data – Prescriptive Analytics Continued 66
  67. Big Data – Interest Trends 67
  68. Big Data – Interest Trends 68
  69. Big Data – Programming Languages 69 • R 49.0% • SAS 36.4% • Python 35.0% • SQL 30.6% • Java 12.4% • Unix Shell 8.8% • Pig / HiveQL 8.5% • SPSS 8.1% • MATLAB 6.3%
  70. Big Data – NoSQL Databases 70
      Wide Column Store: Hadoop HBase, Cassandra, Hortonworks, Cloudera, Amazon SimpleDB, IBM Informix
      Document Store: Elastic, MongoDB, Azure DocumentDB, Terrastore, JSON ODM
      Key Value / Tuple Store: Amazon DynamoDB, Azure Table Storage, Oracle NoSQL Database, Genomu
      Graph Databases: Neo4j, Infinite Graph, Sparksee, InfoGrid, GraphBase
      Multimodel Databases: ArangoDB, OrientDB, RockallDB, FoundationDB
      Object Databases: Versant, db4o, Objectivity, Starcounter, Perst, HSS Database, Magma, EyeDB, NDatabase, ObjectDB
  71. Big Data – NoSQL Databases Continued 71
      Grid & Cloud Database Solutions: Crate Data, Oracle Coherence, GigaSpaces, Infinispan
      XML Databases: EMC Documentum xDB, eXist, Sedna, BaseX, QizX, Berkeley DB XML
      Multidimensional Databases: Globals, SciDB, MiniM DB, DaggerDB
      Multivalue Databases: U2, OpenInsight, Reality, OpenQM, ESENT
      Event Sourcing: Event Store, ES4J
      Time Series / Streaming Databases: Axibase, InfluxData, kdb+
      Other NoSQL Databases: IBM Lotus, eXtremeDB, Yserial, BayesDB, GPUdb, CodernityDB
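      As a hedged example of the document-store category listed above, a minimal MongoDB sketch with the official pymongo driver; it assumes a MongoDB server is reachable on the default localhost port, and the database and collection names are made up:

        from pymongo import MongoClient

        # Assumes a local MongoDB instance on the default port 27017
        client = MongoClient("mongodb://localhost:27017/")
        collection = client["demo_db"]["events"]   # hypothetical database and collection

        # Documents are schema-less, JSON-like records, so each can have a different shape
        collection.insert_one({"user": "alice", "action": "login", "device": "phone"})
        collection.insert_one({"user": "bob", "action": "purchase", "amount": 19.99})

        # Query by field value; no fixed table schema is required
        for doc in collection.find({"user": "alice"}):
            print(doc)

        client.close()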
  72. Big Data – 10 Interesting Facts 1. Every 2 days we create as much information as we did from the beginning of time until 2003. 2. Over 90% of all the data in the world was created in the past 2 years. 3. It is expected that by 2020 the amount of digital information in existence will have grown from 3.2 zettabytes today to 40 zettabytes. 4. The total amount of data being captured and stored by industry doubles every 1.2 years. 5. Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and upload 200 thousand photos to Facebook. 72
  73. Big Data – 10 Interesting Facts 6. Google alone processes on average over 40 thousand search queries per second, making it over 3.5 billion in a single day. 7. Around 100 hours of video are uploaded to YouTube every minute and it would take you around 15 years to watch every video uploaded by users in one day. 8. Facebook users share 30 billion pieces of content between them every day. 9. AT&T is thought to hold the world’s largest volume of data in one unique database – its phone records database is 312 terabytes in size, and contains almost 2 trillion rows. 10. The amount of data transferred over mobile networks increased by 81% to 1.5 exabytes (1.5 billion gigabytes) per month between 2012 and 2014. Video accounts for 53% of that total. 73
  74. Big Data – 10 Interesting Insights 1. “The world is one big data problem.” – Andrew McAfee 2. “In God we trust. All others must bring data.” – W. Edwards Deming 3. “Torture the data, and it will confess to anything.” – Ronald Coase 4. “Information is the oil of the 21st century, and analytics is the combustion engine.” - Peter Sondergaard 5. “It’s easy to lie with statistics. It’s hard to tell the truth without statistics.” – Andrejs Dunkels 74
  75. Big Data – 10 Interesting Insights 6. “The goal is to turn data into information, and information into insight.” – Carly Fiorina 7. “The most valuable commodity I know of is information.” – Gordon Gekko 8. “Data really powers everything that we do.” – Jeff Weiner 9. “Numbers have an important story to tell. They rely on you to give them a voice.” – Stephen Few 10.“Data beats emotions.” – Sean Rad 75
  76. Big Data – Free Data Sources • Google Trends: www.google.com/trends/explore • Google Finance: www.google.com/finance • Google Freebase: developers.google.com/freebase • Wikipedia Content: en.wikipedia.org/wiki/Wikipedia:Database_download • U.S. Government Open Data: www.data.gov • Quandl: www.quandl.com • World Health Organization: www.who.int/gho/database/en • Amazon Public Datasets: aws.amazon.com/datasets • Facebook Graph: developers.facebook.com/docs/graph-api • UNICEF: www.unicef.org/statistics/ 76
  77. Big Data – Key Terms & Glossary 77 • Algorithm • Analytics Platform • Apache Hive • Behavioral Analytics • Big Data Analytics • Business Intelligence • Cascading • Cloud Computing • Concurrency / Concurrent computing • Cluster Analysis • Comparative Analysis • Connection Analytics • Correlation Analysis • Data Analyst • Data Cleansing • Data Mining • Data Model / Data Modeling • Data Warehouse • Descriptive Analytics • ETL • Hadoop • Exabyte • Internet of Things (IoT) • Machine Learning • Metadata • Natural Language Processing • Pattern Recognition • Petabyte • Predictive Analytics • Prescriptive Analytics • Semi-structured Data • Sentiment Analysis • Terabyte http://bigdata.teradata.com/US/Big-Data-Quick-Start/Glossary
  78. References • http://www.smartinsights.com/internet-marketing-statistics/happens-online-60-seconds/ • https://www.mapr.com/blog/top-10-big-data-challenges-%E2%80%93-serious-look-10-big-data-v%E2%80%99s • https://en.wikipedia.org/wiki/Moore%27s_law • http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#7f2504de6c1d • http://www.gartner.com/smarterwithgartner/whats-new-in-gartners-hype-cycle-for-emerging-technologies-2015/ • http://google.org/special-programs/ 78
  79. References - Continued • https://datafloq.com/read/ups-spends-1-billion-big-data-annually/273 • http://joelcadwell.blogspot.de/2016/01/a-data-science-solution-to-question.html • http://bigdata-madesimple.com/how-i-chose-the-right-programming-language-for-data-science • http://nosql-database.org • http://www-01.ibm.com/software/data/bigdata/ • http://www.sequentia.in/why-big-data-matters • https://www.linkedin.com/pulse/20140502105616-8781298-25-insightful-and-thought-provoking-quotes-about-big-data 79
  80. References - Continued • http://www.mckinsey.com/insights/health_systems_and_services/the_big-data_revolution_in_us_health_care • http://www.datanami.com/2015/12/21/tis-the-season-to-hunt-fraudsters-with-big-data • https://datafloq.com/read/ups-spends-1-billion-big-data-annually/273 • http://2012books.lardbucket.org/books/getting-the-most-out-of-information-systems-v1.3/s15-07-data-asset-in-action-technolog.html • http://www.gartner.com • http://bigdata.teradata.com/US/Big-Data-Quick-Start/Glossary • https://www.isaca.org/Groups/Professional-English/big-data • Analyzing the Analyzers (book) – Harris, Murphy, Vaisman 80