You know, a great example is radio frequency ID (RFID) tags. These caught a lot of attention when Wal-Mart was redesigning its supply chain with them, and the cost of RFID tags has come down so much that they’ve proliferated all over the world. When you think about the Instrumented characteristic of IBM’s Smarter Planet (Instrumented, Interconnected, and Intelligent), this is just one example of how we’ve become an instrumented world. On this slide you can see that in 2005 there were 1.3 billion RFID tags in circulation; that grew to 30 billion by the end of last year (2011). That’s a pretty significant annual growth rate to get to where we were at the end of 2011; and again, this is just a single example of instrumentation. RFID tags are a good place to start with Big Data, because they are now ubiquitous, as is the opportunity for Big Data. They are used to track cars on a toll route, temperature-sensitive food supplies in transit, livestock, supplies, inventories, luggage, retail items, tickets used for transportation, you name it.
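If someone asks what "pretty significant annual growth rate" means in numbers, the two figures on the slide are enough for a rough compound growth estimate. The sketch below is only a back-of-the-envelope check: the function name is made up for illustration, and treating 2005 to 2011 as six annual steps is an assumption, not something the slide states.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly growth rate that turns start_value into end_value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the notes: 1.3 billion tags (2005) and 30 billion tags (end of 2011).
# Assumption of this sketch: six annual growth steps between the two data points.
rfid_growth = compound_annual_growth_rate(1.3e9, 30e9, 2011 - 2005)
print(f"Implied RFID tag growth: {rfid_growth:.0%} per year")
```

Under those assumptions the tag population grows by a bit under 70% every year, which is the kind of number that makes the instrumentation point land.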
I was on an Airbus plane the other day, and do you realize that these things are hugely sensor-enabled devices that are instrumented to collect data as they operate? They also generate huge volumes of data. +CLICK+ For this particular Airbus, there are over a billion lines of code, and a single engine generates 10 terabytes of data every 30 minutes. And there are four engines there, right? +CLICK+ And, you know, just taking this particular plane from the UK to New York would generate 640 terabytes of data. Now stop and ponder that for a moment. Propose this amount of data ingestion to your client and it becomes obvious: there’s too much data to process, analyze, and store with traditional approaches.
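The Airbus numbers hang together if you do the arithmetic, and it helps to have it ready in case a client challenges the 640 terabyte figure. The short sketch below uses only the three values quoted above; the constant names are illustrative, and the flight duration is derived from them rather than taken from the slide.

```python
# Values quoted in the notes.
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 4
FLIGHT_TOTAL_TB = 640

# Two half-hour intervals per hour, across all four engines.
tb_per_hour = TB_PER_ENGINE_PER_HALF_HOUR * 2 * ENGINES

# Flight time implied by the quoted total (derived here, not stated on the slide).
implied_flight_hours = FLIGHT_TOTAL_TB / tb_per_hour

print(f"All engines together: {tb_per_hour} TB per hour")
print(f"640 TB corresponds to a flight of about {implied_flight_hours:.0f} hours")
```

So the 640 terabyte total corresponds to roughly an eight-hour crossing at 80 terabytes an hour.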
You can see in this slide another example of Big Data in the utilities sector: smart metering. As meter reads have moved from a physical read every other month (with an estimate in between) to monthly, weekly, daily, and hourly reads, you’ve got an immense amount of data streaming into the enterprise, as shown on this slide. Smart metering is also about point-in-time values, so you can spot spikes and adjust accordingly, so data in motion is a play here too. Smart meters are smart because they can communicate: not only with the customer about electricity usage and pricing signals, but also with the utility to indicate fluctuations in power or even accurately pinpoint an outage. For a utility company, smart meters are generating a wealth of new information that is fundamentally changing the way it interacts with its customers.
The notion is that we are always sharing information about ourselves. For example, this particular Hollywood star actually gave away the location of his house, when he heads to work, and more, just by uploading a photo with GPS location enabled (the default for smartphones, by the way). The full story is located at http://nyti.ms/917hRh. The US Army had to send guidance and requirements for military phone lockdowns because the geo-positioning capabilities of service men and women’s BlackBerrys and iPhones gave away sensitive location information when unsuspecting service personnel uploaded pictures of themselves in the Iraqi desert.
Obviously, there are many other forms and sources of data. Let’s start with the hottest topic associated with Big Data today: social networks. Twitter generates about 12 terabytes of tweet data every single day. Now, keep in mind, these numbers are hard to count on, so the point is that they’re big, right? So don’t fixate on the actual numbers, because they change all the time, and realize that even if these numbers are out of date in 2 years, the volume is already too staggering to handle exclusively using traditional approaches. +CLICK+ Facebook over a year ago was generating 25 terabytes of log data every day (Facebook log data reference: http://www.datacenterknowledge.com/archives/2009/04/17/a-look-inside-facebooks-data-center/) and probably about 7 to 8 terabytes of data that goes up on the Internet. +CLICK+ Google, who knows? Look at Google Plus, YouTube, Google Maps, and all that kind of stuff. So that’s the left hand of this chart: the social network layer. +CLICK+ Now let’s get back to instrumentation: there are massive amounts of proliferated technologies that allow us to be more interconnected than at any time in the history of the world, and it isn’t just P2P (people to people) interconnections, it’s M2M (machine to machine) as well. Again, with these numbers, who cares what the current number is? I try to keep them updated, but the point is that even if they are out of date, it’s almost unimaginable how large these numbers are. Over 4.6 billion camera phones that leverage built-in GPS to tag the location of your photos, purpose-built GPS devices, smart meters. If you recall the bridge that collapsed in Minneapolis a number of years ago in the USA, it was rebuilt with smart sensors inside it that measure the contraction and flex of the concrete based on weather conditions, ice build-up, and so much more. So I didn’t realise how true it was when Sam P launched Smarter Planet: I thought it was a marketing play.
But truly the world is more instrumented, interconnected, and intelligent than it’s ever been, and this capability allows us to address new problems and gain new insight never before thought possible, and that’s what the Big Data opportunity is all about!
This slide shows the tweets per second (TPS) record breakers for 2011. As you can see, the record keeps getting broken, and the topics range from news, to safety, to sport, to shocking events, to ‘cult’-like movie followings. The point here is that Twitter is not only growing enormously, but the range of topics runs from emergencies to world events to social commentary to sport to entertainment and all parts in between. Source: http://www.mediabistro.com/alltwitter/twitters-tweets-per-second-record-breakers-of-2011-infochart_b17210.
You can just +CLICK+ through this slide as another example of social media (such as Facebook and Twitter) and the valuable information that can be found within. Note also that in some cases the information is spam and noise, and we want to be able to discard that as well and find the signals in the noise. The reason I am showing social media is that it involves heavy text analytics, and that’s the hardest part of Big Data analytics. There are easier use cases (such as log analysis), and the IBM platform is terrific at those for sure. In addition, there are easier ways to use text analytics: for example, use it to get insight into company earnings as it pores through hundreds of pages on the web to spot trends and patterns.
Most of you know of Watson, our computing system designed to compete on the Jeopardy game show. Watson represents a breakthrough in terms of the volume of information stored and the ability to access it quickly (answering natural language questions). I think Watson is impressive because there are many commercial uses for this technology, and the technology exists today! The game Jeopardy provides the ultimate challenge for Watson because the game’s clues involve analyzing subtle meanings, irony, riddles, and other complexities at which humans excel and computers traditionally do not. If you think about Deep Blue, the 1997 IBM machine that defeated the reigning world chess champion, Watson is yet another major leap in the capability of IT systems to identify patterns, gain critical insight, and enhance decision-making despite daunting complexities. While Deep Blue was amazing, it was an achievement in applying compute power to a computationally well-defined and well-bounded game: chess. Watson, on the other hand, faces a challenge that is open-ended and defies the well-bounded mathematical formulation of a game like chess. Watson has to operate in the near limitless, ambiguous, and highly contextual domain of human language and knowledge. Watson answers a Grand Challenge: can IBM design a computing system that rivals a human’s ability to answer questions posed in natural language by interpreting meaning and context and then retrieving, analyzing, and understanding vast amounts of information in real time? IBM Watson is a breakthrough in analytic innovation, proving that it is possible to harness vast amounts of information and rival a human’s ability to answer questions posed in natural language in real time. But it doesn’t matter how good the machine is if we don’t have good information to feed it.
We live in a time where a computer can compete against humans at answering questions in plain English, based on storing, retrieving, analyzing, and understanding vast amounts of information at real-time speeds. These same capabilities can enable you to improve and optimize your business, too. IBM just showed the value of putting that information to work by creating a computing system capable of competing on Jeopardy. Well, there’s a lot of technology that went into Watson, and a lot of Big Data technology in there as well. Now take a moment and think about how this iconic game show is played: you have to answer a question within three seconds. The technology used to analyze and return answers in Watson was a precursor to the Streams technology; in fact, Streams was invented because the technology used in Watson wasn’t fast enough for some of the in-motion requirements of companies today. Jeopardy questions are not straightforward; they have puns and tricks to make them harder, so some of our text analytics technology with natural language processing, which is part of the IBM Big Data platform, is in there too (that’s yet another MAJOR DIFFERENTIATOR for IBM in Big Data: our Text Analytic Toolkit, which you will hear more about later in this presentation). It wasn’t always smooth sailing for Watson; the big breakthrough came when the team started to use machine learning (ML), and the IBM Big Data platform will further differentiate itself from the field in 2012 when a corresponding toolkit comes to market, just like the text analytics toolkit. Finally, Watson had to have access to a heck of a lot of data, and Big Data technologies were used to load and index over 200 million pages of data; Watson had everything from encyclopedias, to the Bible, to the world-famous music and movie databases, etc. All of the technologies just mentioned had to work together as well.
So IBM clearly has some inflection-point understanding of these technologies and how to get them working together. In the case of text analytics and machine learning, well, we have to make that easier to consume, because you don’t have the world’s largest commercial research organization for math at your fingertips. So we need to build tooling, optimization, and accelerators around that and put these technologies inside consumable toolkits, which we are doing now.
+CLICK+ In 2009, we came close as a world to 0.8 zettabytes (ZB) of data. Now, that’s a number that few people understand (a ZB is a trillion GBs!). We’re not used to working with numbers like this, in the same way that the DBAs you show the Airbus data generation rates from earlier in this presentation can’t fathom ingesting that much into their data centers. +CLICK+ In 2010, we crossed the 1 ZB inflection point in the world’s data: the game is on. +CLICK+ By the end of 2011 it’s estimated we are at 1.8 ZB, so there are some pretty good year-over-year (YoY) growth rates. +CLICK+ And so as you look forward into what the future holds in the next decade, the Big Data era, you can see it’s going to get crazy. Let me put it in perspective for you: more than 4 trillion 8-gigabyte iPods of data by 2020 (35 ZB). And you know what? I’m willing to bet this is conservative, because some new social networking capability is going to pop up, or a faster, smarter, more mobile compute technology that allows us to be even smarter and more interconnected and (hopefully) intelligent, so we are on a pace that is unprecedented in the history of the world. In short, there’s a tremendous amount of data being generated from all kinds of these instrumented and interconnected people and devices.
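Zettabytes are hard to picture, so it can help to have the unit conversions worked out before you present this slide. The sketch below assumes decimal (SI) units, which the slide does not specify, and the variable names are illustrative only.

```python
# Assumption of this sketch: decimal (SI) units, 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
GB = 10**9
ZB = 10**21

gb_per_zb = ZB // GB                           # gigabytes in one zettabyte
ipods_2020 = (35 * ZB) / (8 * GB)              # 8 GB iPods needed to hold 35 ZB
growth_2009_2011 = (1.8 / 0.8) ** (1 / 2) - 1  # yearly growth from 0.8 ZB to 1.8 ZB over two years

print(f"1 ZB = {gb_per_zb:,} GB")
print(f"35 ZB = {ipods_2020 / 1e12:.2f} trillion 8 GB iPods")
print(f"2009 to 2011 growth: {growth_2009_2011:.0%} per year")
```

With these assumptions a zettabyte really is a trillion gigabytes, 35 ZB works out to a little over 4 trillion iPods, and going from 0.8 ZB to 1.8 ZB in two years is about 50% growth per year.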
Think about the suitability of applications for IBM Big Data technologies. I am telling you: every single industry has a Big Data opportunity for you. For example, smarter healthcare, where a hospital can pick up the sensor readings from neonatal babies to try to foreshadow incoming problems based on trends. We work with homeland security today. US President Barack Obama is the Twitter President: when an event happens, he tweets about it, and homeland defence wants to know how people respond and whether there are groups to focus on that are expressing negative sentiment laced with terrorism or wrong-doing. Just look across any industry and you’re going to find some recurring themes. One of those themes is more data, because I (and business, for that matter) believe we can make better decisions when we have access to more data, or can keep that data longer. More data that’s persisted for longer periods of time leads to better models. So that’s definitely a recurring Big Data theme: “I want to keep more and more data to get better and better insight, and I want to be able to run analysis on that data even when it’s NOT only structured.” There’s unstructured and semi-structured data to fold into our mostly structured analytics of today, and ALL industries are facing this challenge today (and can benefit from solving it). Lots of use cases here.
For example:
Financial Services: detect and prevent fraud, model and manage risk, personalize banking and insurance products, compliance, archival, +++
Healthcare: patient monitoring, predictive modeling, compliance, archival, text search, data-driven research, +++
Retail: behavioral analysis, cross selling, recommendation engine (next best offer, NBO), optimize pricing, placement, and design, optimize inventory and distribution, +++
Web/Social/Mobile: sentiment analysis, Web log, image, and video analysis, personalization, billing, reporting, network analysis, +++
Manufacturing: simulation, analysis, design, improve service via product sensor data, “Digital Factory” for lean manufacturing, +++
Government: detect and prevent fraud, homeland security and intelligence, support open data initiatives, +++
This agenda was developed to show the effect of IBM’s big data solutions on the areas of the business that most Chief Marketing Officers consider crucial to the health of the business. We’ll dive into each of the three areas, describe solutions to address retailers’ needs, and present two of the most popular big data use cases.
There are two alternatives for implementing social media analytics. One is CCI and the other is a bespoke solution running on BigInsights. Most will be bespoke because customers already have some of the components they need. CCI is a complete offering with all componentry.
Transactional Analytics

Data Warehousing
Traditionally, POS-based analytics have been sourced through data warehouses. POS data received from the store is cleansed, formatted, and stored in the data warehouse. It is subsequently summarized and used for reporting.

Operational and Ad Hoc Analytics
Teradata has the dominant data warehousing base in retail, with most of its implementations based on storing Point of Sale (POS) data for follow-on summary query and reporting. Teradata's sales mantra while building that base was that retailers should store detailed POS transactions so they could be summarized as needed to support operational and ad hoc analytics. Operational analytics are those used for daily decision making and are repeated as new data is received. An example of operational analytics is a weekly sales report. Ad hoc analytics are those required to answer high-value, point-in-time questions. Once the question is answered it is normally not asked again, or it could become an operational report if deemed to have repeatable value. An example of retail ad hoc analytics would be a prediction model of potential responses to a sales promotion.

Difficulties with Data Warehouse Ad Hoc Analytics
Many retailers now store multiple years of POS transactions to enable ad hoc analytics, but in truth rarely use them. The primary reason is the difficulty of accessing the data for ad hoc questions. In practice the transactions are only used for standard reporting, and usually only recent information is queried. Only the occasional year-over-year report or compliance query will access old data. To provide ad hoc access to historical information in a data warehouse, an analyst must write SQL programs that use the database structure to retrieve the necessary data, and those programs must be run when the warehouse is not busy with normal reporting, which is usually at night.
The inability of analysts to access data to answer normal business questions has resulted in many data warehouses earning the nickname “data cemetery”, because once data goes in it's never seen again. BigInsights, with its Map/Reduce interface, opens access to the older data and allows analysts to quickly build the necessary tables without interfering with operational warehouses. In fact, it offers a spreadsheet-like interface.

Cost Justification Factors
Teradata is an expensive platform, and many retailers put up with poor response times to minimize the cost of service. A Netezza study estimated the cost per terabyte of data in a Teradata warehouse to be approximately $7K/month. The same study calculated the average cost of a Netezza warehouse to be about $5K/month. Industry studies estimate the cost of building a Hadoop platform to be much less because of the use of commodity hardware and the avoidance of the need to store structure (about half the storage in a data warehouse). Moving little-used data to BigInsights could be a self-funding project.

Example Implementation
We recently performed a proof of concept at a premier, large retailer to “prove” the above scenarios. POS data was loaded into a BigInsights platform in a cloud and analysts were given known business problems to solve. The POC was a success and several current business problems were solved. The platform proved so useful that the POC went beyond its design scope and solved problems that “walked through the door” because the business learned about the capabilities being tested. The implementation at that retailer is a model of what will be proposed at other retailers.
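If the audience asks what "building a table with Map/Reduce" looks like, a toy version helps. The sketch below is plain Python that imitates the map, shuffle, and reduce phases in a single process; it is not BigInsights or Hadoop code, and the record layout, store IDs, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical POS records: (store_id, sale_date as "YYYY-MM-DD", sku, amount).
pos_transactions = [
    ("store-01", "2009-11-27", "sku-100", 19.99),
    ("store-01", "2010-11-26", "sku-100", 24.99),
    ("store-02", "2009-11-27", "sku-200", 5.00),
    ("store-02", "2009-12-24", "sku-200", 7.50),
]


def map_sale(record):
    """Map phase: emit one ((store, year), amount) pair per transaction."""
    store_id, sale_date, _sku, amount = record
    yield (store_id, sale_date[:4]), amount


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sales(key, amounts):
    """Reduce phase: total the amounts for one (store, year) key."""
    return key, round(sum(amounts), 2)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    mapped = (pair for record in records for pair in map_sale(record))
    return dict(reduce_sales(key, values) for key, values in shuffle(mapped).items())


yearly_sales = run_job(pos_transactions)
for (store, year), total in sorted(yearly_sales.items()):
    print(store, year, total)
```

The point for the audience is that the analyst only writes the small map and reduce functions; on a real cluster the framework handles distributing the work across years of detailed transactions.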
The 360-degree view of the customer is not a new thought, but it is typically implemented in a way that ignores customer correspondence. So it’s not really a 360-degree view if you are ignoring what your customers actually say, now is it? The idea in these use cases is to look holistically at what customers are telling you about their interests, likes, dislikes, and concerns, and at their risk to you, as a part of how you treat them. BigInsights is used here to gather the data and do the unstructured analytics, Streams can then look for the identified patterns in real time, and, as with all these use cases, Netezza can be the destination for further analytics.
Risk and compliance are key topics today. What we’re doing with our Big Data portfolio is making broader, faster, and more holistic risk management possible. In most cases firms shrink the risk information sources they use to fit conventional processing methods; that simply doesn’t work well, and it no longer needs to be the case. Our stance is: use all the available sources and store, retain, and compute as necessary to deal with the risk, and we, IBM, will provide the platform capabilities necessary to do so.
Much has been said regarding the proliferation of social media: its multiple channels, scope of content, and subject matter. Something for everyone. Available to everyone. Immediate and impactful. But what’s different from the media explosion of television and radio some 50 years ago is both the sheer volume and the influence of social media. 770 million people have visited a social networking site, according to comScore … According to Forrester Research, 4 out of 5 Americans use social media in some capacity. But it’s the power of influence and massive distribution that make social media such a potent force in shaping consumer perceptions. In fact, 78% of consumers trust their peers’ recommendations … And it’s this volume of content, distribution, and influence that is reshaping how organizations engage their customers and broader constituencies through social media, and their relationship to brands, products, services, and issues of the day. Given this, social media analytics is a hot topic for most firms, and while some basic solutions are starting to show up, a lot of work remains. Banks and brokers are keen to better understand the needs and desires of their customers to increase sales. Much of this analytics needs to include social media as one of the sources, not the ONLY source. That makes these systems more ad hoc in nature and cross-information-type, and they are better supported by BigInsights than by data warehouses.
> Click to animate < > Click to animate < This is an EXPLOSION of data in the communications industry: > Click to animate < from adding data at the rate of 500 petabytes per month last year to > Click to animate < 10 times that monthly amount within four years. Here are some examples of where that data comes from in the communications industry. > Click to Next Slide <
The AT&T global backbone network carries just under 24 petabytes of data traffic on an average business day. Put another way, that is 53,549 lbs (24,289 kg) of Blu-ray discs of data every day. Over 550,000 new Android smartphones are activated each day, on top of a slightly smaller number of iPhones. That is driving data versus voice through networks at a ratio of 11:1. IPTV through services like YouTube is becoming the dominant data traffic on communications networks everywhere. These are just examples that illustrate the following points:
- Traffic is beginning to exceed infrastructure capacity, driving network costs up.
- Communications companies are having a hard time increasing revenue to cover those costs.
- Driving profit from this market is becoming much harder every month, let alone every year.
In the next few minutes, I will show you why ALL telecommunications (and media) companies have Big Data problems! > Click to Next Slide <

Public Domain Facts and Notes: "AT&T - News Room". Att.com. 2008-10-23. http://www.att.com/gen/press-room?pid=4800&cdvn=news&newsarticleid=30623. Retrieved 2009-08-16.

The last 12 quarters have seen 30X growth (that's 3,000% growth) in traffic across the AT&T network. The growth in just Q4 2009, that one quarter, was greater than the entire network traffic for the previous year, 2008. The amount of capital investment and corporate effort in 2010, this year (2011), and next year (2012) will be roughly equivalent to that of building the Hoover Dam.

In the field of telecommunications, data retention (or data preservation) generally refers to the storage of call detail records (CDRs) of telephony and internet traffic and transaction data (IPDRs) by governments and commercial organizations. In the case of government data retention, the data stored is usually of telephone calls made and received, emails sent and received, and web sites visited. Location data is also collected. The primary objective in government data retention is traffic analysis and mass surveillance. By analyzing the retained data, governments can identify the locations of individuals, an individual's associates, and the members of a group such as political opponents. These activities may or may not be lawful, depending on the constitutions and laws of each country. In many jurisdictions access to these databases may be made by a government with little or no judicial oversight (e.g. USA, UK, Australia). In the case of commercial data retention, the data retained will usually be on transactions and web sites visited. Data retention also covers data collected by other means (e.g. by automatic numberplate recognition systems) and held by government and commercial organisations.

Telecoms: AT&T transfers about 19 petabytes of data through their networks each day.[9]
Telecoms: The AT&T global backbone network carries 23.7 petabytes of data traffic on an average business day.

Transcript: Now this explosion in data really is quite extraordinary in terms of how much information is out there today. You know, you look at 7 TB of data every day on Twitter. I mean, that's just a remarkable phenomenon. Facebook, probably a lot of you saw the movie that's out on the -- The Social Network -- 10 TB every day. Some of it's certainly entirely useless; some of it people wish never went on Facebook. You know, that expression what happens in Vegas stays in Vegas, it's not true. What happens in Vegas goes on the web and will live on for hundreds of years, all right.
Your, the future members of your family will look back on some of the things that you did that have been digitally recorded and just shake their heads in disgust, so be careful, be careful. So pretty extraordinary in terms of what's happening around data in information. Author’s Original Notes: IBM IOD 2010_GS Day 2
Industry statistics show that it is between four and seven times more profitable to keep a customer than to attract a new one. Over 70% of CMOs interviewed named churn as one of the top three problems facing their enterprise. Addressing current challenges is an important part of a future-focused strategy. The rising costs of acquiring, retaining, and servicing customers need to be controlled, churn reduced, and average revenue per user (ARPU) erosion curtailed. Such efforts depend on differentiating the customer experience and creating new revenue streams by tapping into rich customer data, applying analytics, and working effectively with content providers. > Click to Next Slide <
InfoSphere Streams supports real-time mediation by handling billions of CDRs each day, with linear scalability for growth. CDR mediation for billing systems has been around for decades. The Streams Telecommunications Mediation and Analytics (TMA) offering supports the following:
- A platform for real-time analytics on CDRs
- Offloading CDR processing to the Streams platform, which enhances warehouse performance and improves TCO
- A single platform for mediation and real-time analytics, which reduces IT complexity
The business benefits are substantial and include:
- Real-time CDR processing enables real-time billing; faster billing equals more profit.
- A platform for real-time analytics to drive revenue: for example, location-driven marketing campaigns.
- Data processing time reduced from 12 hours to 1 second.
- Hardware costs reduced by 87%.
- Support for future growth (more data, more analysis) without the need to re-architect.
- Proactively finding and addressing the negative sentiment caused by dropped calls for high-priority customers.
- Addressing terminated calls by location and customer type, for customer service as well as fraud detection.
Real-time mediation can lead to fresh BI applications. Let’s examine what the TMA offering looks like architecturally. > Click to Next Slide <
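To make the dropped-call benefit concrete for a technical audience, you can describe it as a sliding-window count over the CDR feed. The sketch below is plain Python and only illustrates the idea; it is not InfoSphere Streams or TMA code. The CDR layout (timestamp in seconds, cell ID, call status), the "DROPPED" status value, the window length, and the threshold are all assumptions made for this example, and it assumes records arrive in timestamp order.

```python
from collections import defaultdict, deque


def dropped_call_alerts(cdr_stream, window_seconds=60, threshold=3):
    """Yield (timestamp, cell_id, count) when a cell reaches `threshold`
    dropped calls inside the trailing `window_seconds` window.

    Assumes CDRs arrive ordered by timestamp (an assumption of this sketch).
    """
    recent_drops = defaultdict(deque)  # cell_id -> timestamps of recent dropped calls
    for timestamp, cell_id, status in cdr_stream:
        if status != "DROPPED":
            continue
        window = recent_drops[cell_id]
        window.append(timestamp)
        # Evict dropped calls that have aged out of the window.
        while timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) >= threshold:
            yield timestamp, cell_id, len(window)


# Hypothetical CDRs: (timestamp_seconds, cell_id, status).
sample_cdrs = [
    (0, "cell-A", "COMPLETED"),
    (5, "cell-A", "DROPPED"),
    (20, "cell-B", "DROPPED"),
    (30, "cell-A", "DROPPED"),
    (50, "cell-A", "DROPPED"),
    (200, "cell-A", "DROPPED"),
]

alerts = list(dropped_call_alerts(sample_cdrs))
print(alerts)
```

Because each record is examined once as it arrives and only a short window is kept in memory, the decision can be made while the data is still in motion rather than after a nightly batch load.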
CDR analytics can be extended with BigInsights, integrating with existing warehouse and BI infrastructure. By using the IBM Big Data at-rest solutions, huge volumes of CDR and OTHER data can be ingested in the format they arrive in. Because of the cost-effectiveness of BigInsights and its integration capabilities, new analytics and insights can be derived from combining CDRs with social media, clickstream, URLs, and other unstructured data. An example would be discovering relationships between dropped calls and abandoned carts, or how consumer sentiment relates to web navigation and local cell conditions, in order to better predict churn. > Click to Next Slide <
Measuring ad effectiveness is a problem as old as mass media advertising itself. Its complexity has increased dramatically with the advent of the internet in the ’90s and mobile technology in the 2000s. Using social media as a new source for measuring the response to ads quickly, or in real time, is the focus of this case study. Media customers are very eager to find ways to do this cost-effectively. There are many agencies and services that can explore social media and report back on sentiment, but there are three problems with that approach so far:
- Sentiment scores (likes and dislikes) give only a small part of the answer. Social media analytics that do not create actionable insights tied to direct business decisions are just a “buzz meter”.
- The time lag for even simple sentiment scores is too long to take effective action.
- Social media is ever changing. Getting ahead of it with insights that are very specific to your requirements is very difficult and costly.
The solution is a combination of IBM Big Data and IBM’s methodology developed by GBS. First, let’s examine the IBM Big Data technology that supports the ad effectiveness solution. > Click to Next Slide <
This is a closer look at the Big Data platform. You can see the product view and how each product fits into the IBM Big Data platform.
So I’m likely going to start to mention some products here around the IBM Big Data platform. +CLICK+ Hadoop is about bringing all this data into an at-rest, batch-based repository. You can see on this slide that open source Hadoop can be used to analyze semi-structured data, structured data (there are times when this should be done in an EDW and sometimes in a Big Data system), and unstructured data. +CLICK+ The IBM Big Data platform EMBRACES and EXTENDS Hadoop. As I mentioned before, IBM won’t fork the Hadoop code. +CLICK+ The IBM at-rest solution for Big Data is built on Hadoop and it’s called IBM InfoSphere BigInsights (BigInsights). And as I said before, we’re not going to fork that code, we are going to embrace and extend it. BigInsights ‘hardens’ Hadoop and rounds it out to make it more enterprise-worthy. Our at-rest story also includes Netezza as a repository, and this engine includes the ability to run MapReduce (the programming framework around Hadoop) programs IN-DATABASE. For some analytics workloads, MapReduce is a better choice than SQL, and for data that’s a better fit for the EDW (static, structured, repeatable, governed), Netezza is a terrific choice. We also extend Big Data with industry-leading in-motion technology with a product called InfoSphere Streams (I talked about it earlier). None of our competitors are really talking about Big Data in-motion. Some people will talk about complex event processing (CEP), which handles about 10,000 transactions a second; that’s not the speed or the scale that Streams operates at, and CEP can only tolerate simple rules and mostly structured data. +CLICK+ The IBM Big Data platform then focuses on two key value propositions: Operational Excellence and Analytical Excellence. Operational Excellence, on the right, for the BigInsights platform (and assumed in the Netezza platform) details what we do for BigInsights above and beyond what open source Hadoop ships.
I will be honest, there are a lot of vendors doing this today: Cloudera, MapR, Hortonworks; there are a lot of people talking about making Hadoop operationally excellent. Well, we know something about operational excellence, because we’re IBM, so we have this enterprise-grade, proven file system called GPFS. We’ve ported that to work in Hadoop as GPFS SNC, and since it’s POSIX it makes life easier, more secure, and more performant than in an open source Hadoop world. For example, IBM understands that security is important, so we use GPFS SNC and extended capabilities in Hadoop to provide surface area lockdown: BigInsights gives you granular role-based security. You can attach policies around retention and mutability (change rights). These are pretty important things. Adaptive MapReduce is kind of like connection pooling for a database: it makes the system run faster without you having to tune it. There is a workload manager, along with a very fast Hadoop-oriented compression algorithm that is splittable. There is rich tooling for management. In short, operational excellence is going to appeal to the folks managing the Big Data platform. It’s important. IBM does a great job – others do a good to great job – but at the end of the day, you get a well-running Hadoop cluster. That’s it. From an in-motion perspective, the operational excellence is unparalleled and I have yet to see another vendor able to seriously challenge us in this area. The value for the business is on the left side, which I refer to as Analytical Excellence. The IBM Big Data platform provides toolkits that let you build analytics faster, and make them more reliable and more potent than you could otherwise, and we do this for both Big Data at-rest and in-motion. There are industry accelerators, development tooling, visualization tooling, text analytics tooling, and a machine learning toolkit (which is coming in the future), and this is where the magic happens.
It’s where you get a solution – IBM is telling the story on how to get to analytics, and I’m going to use an example later in this presentation to show you just how much of a head start IBM gives you in Big Data.
UOIT: Nosocomial is a term that simply means 'hospital acquired' or 'got this bug/infection while in the hospital'. Carolyn is working to detect blood poisoning (a nosocomial infection), which is also called SEPSIS. The specific test is that the infant’s oxygenation level (aka SpO2, for peripheral oxygenation level) drops below 85% and blood pressure (aka Mean Arterial Pressure, or MAP) drops below the gestational age measured in weeks for the same 20 seconds. Another test Carolyn is running is to determine if the baby is about to crash (crash means the heart stops). Normal people and preemies have variability in their heart rate: it speeds up, has differences in peaks, in the time between the stages of the heart wave, etc. When babies are about to crash, they try to preserve all energy and this variability drops. So, we are analyzing EKG waves to identify each heartbeat wave, using Fast Fourier Transforms (FFTs) to determine the area under the wave, and then comparing the waves to understand variability. This test is known as Heart Rate Variability (HRV). Hospital equipment issues an alert when a vital sign goes out of range, prompting the hospital staff to take immediate action. However, many life-threatening conditions do not reach critical levels right away. Often signs that something is wrong begin to appear long before the situation becomes serious, and even a skilled nurse or physician might not be able to spot and interpret these trends in time to avoid serious complications. What’s more, some of these warning indicators are hard to detect and it’s next to impossible to understand their implications until it is too late. Take, for example, nosocomial infection, a life-threatening illness contracted in hospitals. Research has shown that signs of this infection can appear 12-24 hours before overt trouble/distress is spotted.
Making things more complex, in a baby where this infection has set in, the heart rate stays too normal (it doesn’t rise and fall within the day as it would for a healthy baby), all the while the pulse is within the acceptable limits. While the information needed to detect the infection is present, it’s too subtle, and the nurses are too busy to spot individual out-of-normal events. In a neonatal ward, the ability to absorb and reflect upon everything presented is beyond human capacity; there is just too much data. Information in these hospitals just wasn’t being used. Machines provide up to 1,000 readings per second, which are summarized into a single reading every 30-60 minutes and then discarded 72 hours later. Consequently, a set of rules that reflects the best understanding of the problem has been built, and the rules can be dynamically changed. Now extend this to kids with cancer attending school, and so on. Kinds of things that feed the data: endotracheal tube, nasogastric tube, ventilator hose, oxygen, pulse, heart, skin temperature, body temperature, translucency, bilaterally placed electrodes, reference electrode, and more.
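To make the oxygenation and blood pressure test described above a little more concrete, here is a minimal sketch in Python. It is not the UOIT implementation (which runs on InfoSphere Streams); the function name, the one-reading-per-second input format, and the reading of "the same 20 seconds" as 20 consecutive seconds are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def sepsis_rule_fires(
    readings: Iterable[Tuple[float, float]],
    gestational_age_weeks: float,
    window_seconds: int = 20,
    spo2_threshold: float = 85.0,
) -> bool:
    """Return True if both conditions hold for `window_seconds` readings in a row.

    `readings` is assumed to hold one (SpO2 percent, MAP) pair per second.
    The rule fires when SpO2 is below the threshold AND MAP is below the
    gestational age in weeks, continuously, for the whole window.
    """
    consecutive = 0
    for spo2, mean_arterial_pressure in readings:
        if spo2 < spo2_threshold and mean_arterial_pressure < gestational_age_weeks:
            consecutive += 1
            if consecutive >= window_seconds:
                return True
        else:
            # Any reading back in range resets the window.
            consecutive = 0
    return False
```

A rule written this way is just data plus thresholds, which fits the point that the rules can be changed dynamically as understanding of the problem improves: the threshold and window here are parameters, not hard-coded logic.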
The US Department of Energy is, among other things, a national defense priority, so its research needs to be safeguarded from both above- and below-ground biological and mechanical threats. Their solution has to continually consume and analyze information in-motion, such as movements of animals, humans, and the atmosphere (such as wind). Scientists lacked the time to record the data and listen to it later. The data consumption and analytical requirements would be akin to listening to 1,000 MP3 songs simultaneously and looking for the word “Rocket” in every song, within a fraction of a second. TerraEchos has one of the most robust classification systems in the industry. They use Adelos S4 fiber-optic acoustic sensor technology from the US Navy. They can tell the difference between a human whisper and the pressure of a footstep, and between the sound of a human voice and the whisper of the wind.
Time is of the essence when analyzing customer call data to serve up location-dependent offers/advertisements, identify possible network problems, or provide reps with the latest information on a customer calling with a service problem. Sprint needed to be able to access and analyze call, internet usage, and texting detail records (xDRs) in real time. The company had been using Microsoft SQL Server as part of a homegrown solution to transform data for analysis and feed it to their Netezza warehouse. With the introduction of 3G technologies and the corresponding explosion in data volume, this Microsoft-based solution was unable to meet SLAs and performance requirements set by the business. The technology owners knew this problem would only get worse with the transition to LTE (4G). The latency created by the system meant Sprint was unable to capitalize on new revenue opportunities, and was forced to be reactive, rather than proactive, in addressing customer and network issues. The IBM team worked with the part of Sprint’s organization responsible for running the company’s network (rather than the IT department) to propose InfoSphere Streams as a truly real-time conduit for xDR analysis. A proof of concept using InfoSphere Streams provided Sprint with overwhelming and indisputable evidence that the Microsoft-based solution should be replaced. The POC showed that with Streams, the time to merge data was reduced by 91%, the time to load data was reduced by 92%, and storage requirements were reduced by 93%. A core component of IBM’s platform for big data, Streams provides near-linear growth when adding nodes to the runtime cluster. Applications can be re-deployed without being re-written to take advantage of the extra hardware. This gives Sprint tremendous flexibility to tailor their infrastructure to their business requirements. For example, based on the POC, Sprint can select the number of blades to meet their velocity requirements.
This is a great example of clients
This slide shows a very simple example of the end goal for text analytics. Imagine an application that converts text to speech. In this very simplistic example, it’s a text to speech application that takes a streaming radio broadcast and finds structure within it.
To use an analogy for the text analytics toolkit that comes with the IBM Big Data platform, I will refer to a shopping trip to the art store I had with my daughter the other day. We were looking to buy some colour by numbers paintings, and my gosh, it’s not just Disney and Phineas & Ferb. There are some very rich, detailed, beautiful portraits: Monet, Renoir, you name it – I never imagined. And I looked at what we did and, you know, in the IBM Big Data platform we actually offer this whole toolkit, which reminds me of this. The toolkit has everything you need. And so you go and here’s this really rich and detailed colour by numbers painting (our tools, our Annotation Query Language, our pre-built extractors) that allows you to paint this wonderful picture. I could extend it in the background and put some trees in there (add to the extractors IBM is providing with a further set of rules) and you end up with this vibrant picture. Now imagine decorating a room. All the other vendors seem to leave you with tools to paint the wall and put a hook in the centre to hang some art, and you certainly aren’t going to win any type of decorating awards with that. It’s all left up to you, and this is EXACTLY what Cloudera and MapR are enticing our clients to do. They’re getting clients to go in and say that decorating the wall is painting it, or they’re just saying, here’s the paint, you do the rest. Folks, painting the wall is the easiest part; it’s the operational excellence. It’s easy. I think we do a better job, but it’s easy. So what do you do if you’re on the Cloudera platform? I guess you go and buy some different tools, get some skills, hire them out, and there’s my painting. I’m not much of an artist. Then I go take some courses or pay for expensive skills, and you know what? A lot of your clients, they have smart people, so that’s where they get to on the right.
You can see, well, they kind of misinterpreted the branch, they got the nose different, and this may look okay but it’s not as rich and detailed as the picture on the left. And the point is, Big Data technologies were born in the Yahoos, in the Googles, in the Facebooks of the IT world. These folks have mountains of developers, near unlimited development resources. But, you know, if I’m an insurance company or I’m a credit card company, I don’t have unlimited development resources. Chances are I’m outsourcing some of that development, and I’m going to show you why that won’t work. Our core competency is not development; it’s our business. So why are you asking me to make development my core competency? But everybody’s jumping on the Big Data bandwagon right now because it’s the hottest thing going.
One of the Big Data challenges is “How do I get analysts to go out and analyze this data with zero programming?” If you don’t have such tooling, you create an unnatural dependency on development to go and hand-code and build every piece of visualization and analysis. This is too expensive, inefficient, and just too cumbersome. BigSheets gives you exactly this, with ZERO programming. Your analysts are going to need to be able to go visualize and analyze JSON formats, CSV and text files, and all that kind of stuff; they are going to want a programming-free crawler, and more. All of this is included in BigSheets – to the end user, it looks like a spreadsheet; under the covers, it generates Pig jobs to run on Hadoop.
Here is a screen shot of BigSheets – again, it looks like ubiquitous spreadsheet software, and we all know (for good or bad) that the spreadsheet is the most popular analytic tool in the world, and that’s why we built BigSheets to operate like a spreadsheet. Just like in a spreadsheet, you can pivot data, union it, run some macros, and so on. BigSheets is built with Web 2.0 technologies and runs within a Web browser – it is tightly integrated into the BigInsights management toolset.
Here is another example of something the University of Southern California Annenberg School of Communication did with the IBM Big Data platform’s BigSheets technology. USC Annenberg created the Film Forecaster tool and used it to correctly predict 2011’s summer blockbusters based on scraping Twitter and analyzing that against a simple lexicon that described a positive or negative showing for a movie. They made quite the impact, since this very solution was featured on ABC News (a national news agency in the USA). More striking is the quote: the application was built by a communication master’s student who learned BigSheets in a day.
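If you want to picture what "analyzing tweets against a simple lexicon" means, the idea can be sketched in a few lines of Python. This is only an illustration of lexicon-based scoring in general, not the Film Forecaster itself (which was built in BigSheets with zero programming); the word lists and function names below are made up for the example.

```python
import re

# Hypothetical lexicon: the words used by the real tool are not given in the source.
POSITIVE_WORDS = {"great", "awesome", "loved", "amazing"}
NEGATIVE_WORDS = {"boring", "awful", "hated", "terrible"}


def score_tweet(text: str) -> int:
    """Count positive words minus negative words in one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for word in words if word in POSITIVE_WORDS)
    negatives = sum(1 for word in words if word in NEGATIVE_WORDS)
    return positives - negatives


def score_movie(tweets: list[str]) -> int:
    """Sum the tweet scores for all tweets that mention a movie."""
    return sum(score_tweet(tweet) for tweet in tweets)
```

A movie whose tweets sum to a strongly positive score would be forecast to have a good showing, and a negative sum a poor one. The value of the tooling is that an analyst can build and adjust a lexicon like this without writing the code.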
+CLICK+ Now that we’ve talked about how end users visualize Big Data, and how IT can deploy the applications – let’s talk about the hardest part of all, building them – and let’s start with Big Data in-motion: InfoSphere Streams. How do you do this on your own? If you choose to build it out yourself (and remember, this IS NOT CEP; it’s WAY more scalable, resilient, and available, and has handlers for unstructured data, which is going to be my point), YOU have to worry about event handling, checkpointing, security, availability, provisioning, debugging, and all kinds of other stuff shown (and some not shown) on this slide. +CLICK+ The IBM Big Data platform offers you a toolkit to build in-motion Big Data applications. Inside this toolkit are accelerators, tooling, and more that let you build out something powerful very quickly. For example, it includes the Streams Processing Language (SPL). SPL came out in the 2nd release of InfoSphere Streams. TerraEchos uses this piece of the toolkit extensively, and with it, they are able to build their applications 45 percent faster! So we give you the runtime and infrastructure services that take care of all the hard stuff for you: whether one node is overloaded or one node goes down, you don’t have to worry about that. We’ve got that covered. So you just build the logic. SPL is a declarative language in the same way that SQL and Annotation Query Language (AQL) are. Specifically, parts of SPL are truly declarative, but there are parts and extensions to all of these that are not completely declarative; that’s beyond the scope of the point being made here. IBM has a two-decade history of taking declarative languages and getting them to run in massively parallel processing (MPP) environments, and that’s exactly what Streams and Hadoop clusters are.
This allows you to spend more time on building the application as opposed to fine-tuning it for performance, which is what folks typically have to do. Think ISAS (DB2 with DPF) and SQL – it’s the same concept. Beyond MPP optimization, SPL has a number of local optimizations, which will include new auto-parallelization and pipelining that have not made it into the product as of 1Q12 but will be coming soon. IBM ships a number of accelerators to help you get started and flatten the time to value for the Big Data deployment curve. I will talk about the Text Analytics Toolkit for the remainder of this presentation. There are accelerator kits for Telco, Smarter Energy, Public Transportation, Finance, Data Mining, and more on the way. There are over 100 sample applications, user defined toolkits, and standard toolkits with over 300 functions and operators. Telco: Process call data in real time. This is the foundation for real-time marketing promotions, churn prevention, etc. Finance: Real-time market data ingestion and management; real-time decision support for equities, derivatives, commodity, and Forex trading; incorporating additional contextual awareness (news, weather, etc.) into trading decisions; real-time cross-asset pricing; continuous real-time trade monitoring to identify fraudulent trading; and real-time cross-asset analysis across trading desks and geographies for continuous enterprise risk level and liquidity management. Smarter Energy: Sample applications that monitor electric transmission grids, using phasor measurement units to monitor voltage stability and transient stability to improve availability. Data Mining: Mining data streams to extract relevant information or intelligence is critical for a majority of stream processing applications. The IBM InfoSphere Streams Mining Toolkit integrates with InfoSphere Warehouse using the PMML standard.
PMML is supported by several state-of-the-art statistics and data mining software tools such as InfoSphere Warehouse, R / Rattle, SAS Enterprise Miner, SPSS, and Weka. InfoSphere Streams, used with the Mining Toolkit, can help you detect fraud, prevent customer churn, segment your customers, and simplify market basket analysis. The in-database data mining capabilities integrate with existing systems to provide scalable, high-performing predictive and pattern analysis without moving your data into a proprietary data mining platform. Public Transportation: Sample intelligent transportation system that displays the current location of buses based on GPS readings and estimates time of arrival at future stops based on current traffic. User Defined Toolkits: Create reusable sets of operators and functions. A powerful base for creating cross-domain and domain-specific toolkits. How Text Analytics Works: This slide shows a very simple example of the end goal for text analytics. Imagine an application that converts text to speech. In this very simplistic example, it’s a text to speech application that takes a streaming radio broadcast and finds structure within it. What’s Wrong with Text Analytics Today: There are lots of alternative approaches and infrastructure for text analytics in the marketplace today. They tend to perform poorly in terms of accuracy and speed. They’re very difficult to use. They typically require an army of Java programmers to get stuff going. They’re often characterized on the Internet as inflexible and inefficient, because the programmer has to go to the analyst; the analyst then takes the annotator designed to extract the text; it doesn’t work right; so the two have to get back together again, and it becomes an iterative loop that hurts analyst productivity.
If you’ve ever worked to resolve performance problems in a Java Hibernate environment with developers and DBAs, or perhaps in a SAS and DBA environment trying to implement a model, you know the error-prone, inefficient process here.