Let’s start by looking at some of the pioneers in the big data space. These well-known, highly valuable enterprises have built their businesses on Big Data. The numbers they support are staggering.
But Big Data is for more than just internet companies. This slide shows some Greenplum customers who are leveraging big data to transform their businesses and drive new revenue streams. We will talk about these in more detail today.
Think about what Big Data is for a moment. Share your thoughts with the group and write your notes in the space below.
- Is there a size threshold over which data becomes Big Data?
- How much does the complexity of its structure influence the designation as Big Data?
- How new are the analytical techniques?
There are multiple characteristics of big data, but three stand out as defining characteristics:
- Huge volume of data (for instance, tools that can manage billions of rows and billions of columns)
- Complexity of data types and structures, with an increasing volume of unstructured data (80-90% of the data in existence is unstructured), part of the Digital Shadow or “data exhaust”
- Speed, or velocity, of new data creation
In addition, the data, due to its size or level of structure, cannot be efficiently analyzed using only traditional databases or methods.
There are many examples of emerging big data opportunities and solutions. Netflix suggesting your next movie rental, dynamic monitoring of embedded sensors in bridges to detect real-time stresses and longer-term erosion, and retailers analyzing digital video streams to optimize product and display layouts and promotional spaces on a store-by-store basis are a few real examples of how big data is involved in our lives today. These kinds of big data problems require new tools and technologies to store, manage, and realize the business benefit. The new architectures they necessitate are supported by new tools, processes, and procedures that enable organizations to create, manipulate, and manage these very large data sets and the storage environments that house them.
Big data can come in multiple forms, from highly structured financial data, to text files, to multimedia files and genetic mappings. High volume is a consistent characteristic of big data. As a corollary, because of the complexity of the data itself, the preferred approach for processing big data is parallel computing environments and Massively Parallel Processing (MPP), which enable simultaneous, parallel ingest, data loading, and analysis. As we will see in the next slide, most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze. Let us examine the most prominent characteristic: its structure.
The graphic shows different types of data structures, with 80-90% of future data growth coming from unstructured data types (semi-, quasi-, and unstructured). Although the image shows four different, separate types of data, in reality these can be mixed together at times. For instance, you may have a classic RDBMS storing call logs for a software support call center. In this case, you may have typical structured data such as date/time stamps, machine types, problem type, and operating system, probably entered by the support desk person from a pull-down menu GUI. In addition, you will likely have unstructured or semi-structured data, such as free-form call log information taken from an email ticket of the problem or an actual phone call description of a technical problem and a solution. The most salient information is often hidden in there. Another possibility would be voice logs or audio transcripts of the actual call that might be associated with the structured data. Until recently, most analysts would NOT be able to analyze this unstructured textual information in the call log history, since mining it is very labor intensive and could not be easily automated.
Here are examples of what each of the four main types of data structures may look like. People tend to be most familiar with analyzing structured data, while semi-structured data (shown as XML here), quasi-structured data (shown as a clickstream string), and unstructured data present different challenges and require different techniques to analyze.
For each data type shown, answer these questions:
- What type of analytics are performed on these data?
- Who analyzes this kind of data?
- What types of data repositories are suited for each, or what requirements may you have for storing and cataloguing this kind of data?
- Who consumes the data?
- Who manages and owns the data?
Here are four examples of common business problems that organizations contend with today, where they have an opportunity to leverage advanced analytics to create competitive advantage. Rather than doing standard reporting on these areas, organizations can apply advanced analytical techniques to optimize processes and derive more value from these typical tasks. The first three examples are not new problems: companies have been trying to reduce customer churn, increase sales, and cross-sell customers for many years. What’s new is the opportunity to fuse advanced analytical techniques with big data to produce more impactful analyses for these old problems. The fourth example portrays emerging regulatory requirements. Many compliance and regulatory laws have been in existence for decades, but additional requirements are added every year, which means additional complexity and data requirements for organizations. These laws, such as anti-money laundering and fraud prevention, require advanced analytical techniques to manage well.
The graphic shows a typical data warehouse and some of the challenges that it presents.
(1) For source data to be loaded into the EDW, data needs to be well understood, structured, and normalized with the appropriate data type definitions. While this kind of centralization enables organizations to enjoy the benefits of security, backup, and failover of highly critical data, it also means that data must go through significant pre-processing and checkpoints before it can enter this sort of controlled environment, which does not lend itself to data exploration and iterative analytics.
(2) As a result of this level of control on the EDW, shadow systems emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis. These local data marts do not have the same constraints for security and structure as the EDW does, and allow users across the enterprise to do some level of analysis. However, these one-off systems reside in isolation, often are not networked or connected to other data stores, and are generally not backed up.
(3) Once in the data warehouse, data is fed to enterprise applications for business intelligence and reporting purposes. These are high-priority operational processes getting critical data feeds from the EDW.
(4) At the end of this workflow, analysts get data provisioned for their downstream analytics. Since users cannot run custom or intensive analytics on production databases, analysts create data extracts from the EDW to analyze offline in R or other local analytical tools. Many times these tools are limited to in-memory analytics on desktops, analyzing samples of data rather than the entire population of a data set. Because these analyses are based on data extracts, they live in a separate location, and the results of the analysis, along with any insights on the quality of the data or anomalies, are rarely fed back into the main EDW repository.
Lastly, because of the rigorous validation and data structuring process, data is slow to move into the EDW and the schema is slow to change. An EDW may have been originally designed for a specific purpose and set of business needs, but over time it evolves to house more and more data and to enable business intelligence and the creation of OLAP cubes for analysis and reporting. The EDW provides limited means to accomplish these goals, achieving the objective of reporting, and sometimes the creation of dashboards, but generally limiting the ability of analysts to iterate on the data in a separate environment from the production environment, where they could conduct in-depth analytics or perform analysis on unstructured data.
Today’s typical data architectures were designed for storing mission critical data, supporting enterprise applications, and enabling enterprise level reporting. These functions are still critical for organizations, although these architectures inhibit data exploration and more sophisticated analysis.
(Describe or refer to NoSQL and key-value pair [KVP] stores here.)
Everyone and everything is leaving a digital footprint. The graphic above provides a perspective on sources of big data generated by new applications and the scale and growth rate of the data. These applications provide opportunities for new analytics and for driving value for organizations. These data come from multiple sources, including:
- Medical information, such as genomic sequencing and MRIs
- Increased use of broadband on the Web, including the 2 billion photos each month that Facebook users currently upload as well as the innumerable videos uploaded to YouTube and other multimedia sites
- Video surveillance
- Increased global use of mobile devices; the torrent of texting is not likely to cease
- Smart devices: sensor-based collection of information from smart electric grids, smart buildings, and much other public and industrial infrastructure
- Non-traditional IT devices, including RFID readers, GPS navigation systems, and seismic processing
The Big Data trend is generating an enormous amount of information that requires advanced analytics, and new market players are emerging to take advantage of it.
Big data projects carry with them several considerations that you need to keep in mind to ensure this approach fits with what you are trying to achieve. Because of the characteristics of big data, these projects lend themselves to decision support for high-value, strategic decision making with high processing complexity. The analytic techniques used in this context need to be iterative and flexible (analysis flexibility), due to the high volume of data and its complexity. These conditions give rise to complex analytical projects (such as predicting customer churn rates) that can be performed with some latency (consider the speed of decision making needed), or to operationalizing these analytical techniques using a combination of advanced analytical methods, big data, and machine learning algorithms to provide real-time (requiring high throughput) or near real-time analysis, such as recommendation engines that look at your recent web history and purchasing behavior.
In addition, to be successful you will need a different approach to the data architecture than is seen in today’s typical EDWs. Analysts need to partner with IT and DBAs to get the data they need within an analytic sandbox, which contains raw data, aggregated data, and data with multiple kinds of structure. The sandbox requires a more savvy user to take advantage of it and leverage it for exploring data in a more robust way.
The loan process has been honed to a science over the past several decades. Unfortunately, today’s realities require that lenders take more care to make better decisions with fewer resources than they’ve had in the past. The typical loan process uses a set of data on which pre-approval and underwriting approval is based, including:
- Income data, such as pay and income tax records
- Employment history, to establish the ability to meet loan obligations
- Credit history, including credit scores and outstanding debt
- Appraisal data associated with the asset for which the loan is made (such as a home, boat, or car)
This model works, but it’s not perfect. In fact, the loan crisis in the US is proof that using only these data points may not be enough to gauge the risk associated with making sound lending decisions and pricing loans properly.
Case Study Exercise:
Objectives
- Using additional data sources, dramatically improve the quality of the loan underwriting process
- Streamline the process to yield results in less time
Directions
- Suggest kinds of publicly available data (big data) that you can leverage to supplement the traditional lending process
- Suggest types of analysis you would perform with the data to reduce the bank’s risk and expedite the lending process
This is the standard format we will use for each representative example.
Check http://wiki.apache.org/hadoop/PoweredBy for examples of how people are using Hadoop.
Check this article on large-scale image conversion: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
Check this for an ad for a “computer” from 1892: http://query.nytimes.com/mem/archive-free/pdf?res=9F07E0D81438E233A25751C0A9639C94639ED7CF
Use the space here to record your answers to these questions:
Greenplum is driving the future of Big Data analytics with the industry’s first Unified Analytics Platform (UAP), which delivers:
- Our award-winning Greenplum Database for structured data
- Our enterprise Hadoop offering, Greenplum HD, for the analysis and processing of unstructured data
- Greenplum Chorus, which acts as the productivity layer for the data science team
Greenplum UAP is more than just integrated software working together; it is a single, unified platform enabling powerful and agile analytics that can transform how your organization uses data.
What sets this diagram apart from a typical vendor example is the inclusion of people. That is not a mistake. We have introduced the Unified Analytics Platform, but there is more to the story than technology, and I will talk more about that in a few minutes. UAP is about enabling an emerging group of talent, the new practitioners that we refer to as the Data Science team. This team can include the data platform administrator, data scientist, analysts, engineers, BI teams, and, most importantly, the line-of-business user and how they all participate on the data science team.
We develop, package, and support this as a unified software platform available over your favorite commodity hardware, cloud infrastructure, or from our modular Data Computing Appliance.
Moore’s Law (named after Gordon Moore, the co-founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost. But trends in chip design are changing to face new realities. While we can still double the number of transistors per unit area at this pace, this does not necessarily result in faster single-threaded performance. Newer processors, such as the Intel Core 2 and Itanium 2 architectures, instead focus on embedding many smaller CPUs or "cores" onto the same physical device. This allows multiple threads to process twice as much data in parallel, but at the same speed at which they operated previously.
Greenplum Database’s strengths are on the structured side of the house; its functionality is built around the fact that the data is structured. With GP MapReduce and large text objects, Greenplum Database is also able to do some things that are considered unstructured data analysis.
Unfortunately, people may use the word “Hadoop” to mean multiple things. They may use it to describe the MapReduce paradigm, or they may use it to describe massive unstructured data storage using commodity hardware (although commodity doesn’t mean inexpensive). They may be referring to the Java classes provided by Hadoop that support HDFS file types or provide MapReduce job management, or to HDFS itself, the Hadoop Distributed File System. And they might mean both HDFS and MapReduce.
The point is that Hadoop enables the Data Scientist to create MapReduce jobs quickly and efficiently. As we shall see, one can utilize Hadoop at multiple levels: writing MapReduce modules in Java, leveraging streaming mode to write such functions in one of several scripting languages, or utilizing a higher-level interface such as Pig or Hive. The Web site http://hadoop.apache.org/ provides a solid foundation for unstructured data mining and management.
So what exactly is Hadoop, anyway? The quick answer is that Hadoop is a framework for performing Big Data analytics, and as such is an implementation of the MapReduce programming model. Hadoop comprises two main components: HDFS (the Hadoop Distributed File System), which provides a reliable, redundant, distributed file system optimized for large files, and MapReduce, which provides the analytics functions and consists of a Java API as well as software to implement the services that Hadoop needs to function. Hadoop glues the storage and analytics together in a framework that provides reliability, scalability, and management of the data.
Let’s look a little deeper at HDFS. Between MapReduce and HDFS, Hadoop supports four different node types (a node is a particular machine on the network). The NameNode and the DataNode are part of the HDFS implementation. Apache Hadoop has one NameNode and multiple DataNodes (there may be a secondary NameNode as well, but we won’t consider that here). The NameNode service acts as a regulator/resolver between a client and the various DataNode servers: it manages the namespace by determining which DataNode contains the data requested by the client and redirecting the client to that particular DataNode. DataNodes in HDFS are (oddly enough) where the data is actually stored.
Hadoop is “rack aware”: that is, the NameNode and the JobTracker node utilize a data structure that determines which DataNode is preferred based on the “network distance” between them. Nodes that are “closer” are preferred (same rack, then different rack, then different datacenter). The data itself is replicated across racks, which means that a failure in one rack will not halt data access, at the expense of possibly slower response. Since HDFS isn’t suitable for near real-time access, this is acceptable in the majority of cases.
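The “network distance” preference described above can be sketched in a few lines. This is a simplified, hypothetical illustration in Python (the real logic lives in Hadoop’s Java networking code); the helper names and node paths are made up for the example:

```python
# Simplified sketch of Hadoop-style "rack awareness". Each node is
# identified by a path like "/datacenter/rack/node"; distance is the
# number of hops up to the closest common ancestor and back down.

def network_distance(node_a: str, node_b: str) -> int:
    """Count hops between two nodes in the topology tree."""
    a_parts = node_a.strip("/").split("/")
    b_parts = node_b.strip("/").split("/")
    common = 0
    for x, y in zip(a_parts, b_parts):
        if x != y:
            break
        common += 1
    return (len(a_parts) - common) + (len(b_parts) - common)

def preferred_datanode(client: str, replicas: list) -> str:
    """Pick the replica 'closest' to the client by network distance."""
    return min(replicas, key=lambda node: network_distance(client, node))

client = "/dc1/rack1/node1"
replicas = ["/dc1/rack1/node2", "/dc1/rack2/node5", "/dc2/rack9/node3"]
print(preferred_datanode(client, replicas))  # the same-rack replica
```

Same-rack nodes end up at distance 2, same-datacenter nodes at 4, and cross-datacenter nodes at 6, which reproduces the same-rack, different-rack, different-datacenter preference order.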
MapReduce within Hadoop depends on two further node types: the JobTracker and the TaskTracker. One JobTracker node exists for each MapReduce implementation. The JobTracker is responsible for distributing the Mapper and Reducer functions to available TaskTrackers and monitoring the results, while TaskTracker nodes actually run the jobs and communicate results back to the JobTracker. Communication between nodes is often through files and directories in HDFS, so internode (network) communication is minimized.
Let’s consider the above example. Initially (1), we have a very large data set containing log files, sensor data, or whatnot. HDFS stores replicas of that data (represented here by the blue, yellow, and beige icons) across DataNodes. In Step 2, the client defines a map job and a reduce job on a particular data set and sends them both to the JobTracker, where, in Step 3, the jobs are in turn distributed to the TaskTracker nodes. Each TaskTracker runs the mapper, and the mapper produces output that is itself stored in the HDFS file system. Lastly, in Step 4, the reduce job runs across the mapped data to produce the result.
We’ve deliberately skipped much of the complexity involved in the MapReduce implementation, specifically the steps that provide the “sorted by key” guarantee the MapReduce framework offers to its reducers. Hadoop provides a Web-based GUI for the NameNode, JobTracker, and TaskTracker nodes; we’ll see more of this in the lab associated with this lesson.
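The map, sort-by-key, and reduce phases just described can be imitated in a single process. This toy sketch (a token count over log records) is purely illustrative; in real Hadoop the JobTracker distributes these phases across TaskTrackers rather than running them in one loop:

```python
# Toy, single-process simulation of the MapReduce flow:
# map -> shuffle/sort by key -> reduce.
from itertools import groupby
from operator import itemgetter

def mapper(record: str):
    # Emit a (token, 1) pair for every token in the input record.
    for token in record.split():
        yield (token.lower(), 1)

def reducer(key: str, values):
    # Sum the counts for one key.
    return (key, sum(values))

def run_job(records):
    # Map phase: apply the mapper to every record.
    intermediate = [pair for rec in records for pair in mapper(rec)]
    # Shuffle/sort phase: the framework guarantees reducers see sorted keys.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(k, (v for _, v in group))
            for k, group in groupby(intermediate, key=itemgetter(0))]

logs = ["error warn error", "info error warn"]
print(run_job(logs))  # [('error', 3), ('info', 1), ('warn', 2)]
```

The explicit sort before `groupby` is the stand-in for the shuffle step that gives reducers their “sorted by key” guarantee.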
In Pig and Hive, the presence of HDFS is very noticeable. Pig, for example, directly supports most of the Hadoop file system commands. Likewise, Hive can access data whether it’s local or stored in HDFS. In either case, data can usually be specified via an HDFS URL (hdfs://<namenode>/<path>). In the case of HBase, however, Hadoop is mostly hidden in the HBase framework, and HBase provides data to the client via a programmatic interface (usually Java). Via these interfaces, a Data Scientist can focus on manipulating large datasets without concerning themselves with the inner workings of Hadoop. Of course, a Data Scientist must be aware of the constraints associated with using Hadoop for data storage, but doesn’t need to know the exact Hadoop command to check the file system.
Pig is a data flow language and an execution environment for accessing the MapReduce functionality of Hadoop (as well as HDFS). Pig consists of two main elements:
- A data flow language called Pig Latin (ig-pay atin-lay), and
- An execution environment, either as a standalone system or one using HDFS for data storage.
A word of caution is in order: if you only want to touch a small portion of a given dataset, then Pig is not for you, since it only knows how to read all the data presented to it. Pig also only supports batch processing of data, so if you need an interactive environment, Pig isn’t for you either.
The Hive system is aimed at the Data Scientist with strong SQL skills. Think of Hive as occupying a space between Pig and a DBMS (although that DBMS doesn’t have to be a relational DBMS [RDBMS]). In Hive, all data is stored in tables, and the schema for each table is managed by Hive itself. Tables can be populated via the Hive interface, or a Hive schema can be applied to existing data stored in HDFS.
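The idea of applying a schema to data that already exists in HDFS (often called “schema on read”) can be sketched as follows. This is an illustrative toy in Python, not Hive itself; the field names, types, and tab delimiter are made-up assumptions:

```python
# Sketch of Hive-style "schema on read": a declared schema is applied to
# raw delimited text at query time rather than at load time.
# The schema below (field names and types) is hypothetical.
schema = [("ts", int), ("level", str), ("message", str)]

def apply_schema(line: str, schema):
    """Parse one raw tab-delimited line into a typed row dict."""
    fields = line.split("\t")
    return {name: cast(raw) for (name, cast), raw in zip(schema, fields)}

raw_rows = ["1700000000\tERROR\tdisk full", "1700000005\tINFO\tok"]
table = [apply_schema(row, schema) for row in raw_rows]
print(table[0]["level"])  # ERROR
```

The raw file is never rewritten; only the reader’s declared schema gives it tabular structure, which is the property that lets Hive layer tables over data already sitting in HDFS.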
HBase represents a further layer of abstraction on Hadoop. HBase has been described as “a distributed column-oriented database [data storage system]” built on top of HDFS. Note that HBase is described as managing structured data: each record in a table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It’s not structured in the same sense as an RDBMS is structured.
HBase is a more complex system than what we have seen previously. It uses additional Apache Foundation open source frameworks: ZooKeeper as a coordination system to maintain consistency, Hadoop for MapReduce and HDFS, and Oozie for workflow management. As a Data Scientist, you probably won’t be concerned overmuch with the implementation, but it is useful to at least know the names of all the moving parts. HBase can be run from the command line, but it also supports REST (Representational State Transfer; think HTTP) as well as Thrift and Avro interfaces via the Siteserver daemon. Thrift and Avro both provide an interface to send and receive serialized data (objects whose data is “flattened” into a byte stream).
Although HBase may look like a traditional DBMS, it isn’t. HBase is a “distributed, column-oriented data storage system that can scale tall (billions of rows), wide (billions of columns), and can be horizontally partitioned and replicated across thousands of commodity servers automatically.” The HBase table schemas mirror physical storage for efficiency; an RDBMS schema doesn’t (it is a logical description of the data and implies no specific physical structuring). Most RDBMS systems require that data be consistent after each transaction (the ACID properties). Systems like HBase don’t suffer from these constraints and instead implement eventual consistency, which means that on some systems you cannot write a value into the database and immediately read it back. Strange, but true. Another of HBase’s strengths is its wide-open view of data: HBase will accept almost anything it can cram into an HBase table.
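The record shape described above (a row key plus a set of versioned variables) can be sketched as a small in-memory structure. This is a hypothetical stand-in for illustration only, not the real HBase client API:

```python
# Minimal sketch of the HBase data model: each row key maps to
# column -> list of (timestamp, value) versions.
import time

class VersionedTable:
    def __init__(self):
        # row key (a bytes-like key) -> {column: [(ts, value), ...]}
        self.rows = {}

    def put(self, row, column, value, ts=None):
        # Append a new version rather than overwriting the old one.
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts if ts is not None else time.time(), value))

    def get(self, row, column):
        # Return the most recent version, like a default HBase Get.
        versions = self.rows[row][column]
        return max(versions, key=lambda tv: tv[0])[1]

t = VersionedTable()
t.put(b"user42", "email", "old@example.com", ts=1)
t.put(b"user42", "email", "new@example.com", ts=2)
print(t.get(b"user42", "email"))  # new@example.com
```

Note how both versions of the cell remain stored; reads simply pick the newest timestamp, which is quite unlike the update-in-place model of a typical RDBMS.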
Mahout is a set of machine learning algorithms that leverages Hadoop for both data storage and the MapReduce implementation. The mahout command is itself a script that wraps the Hadoop command and executes a requested algorithm from the Mahout job JAR file (JAR files are Java ARchives, and are very similar to Linux tar files [tape archives]). Parameters are passed from the command line to the class instance. Mahout mainly supports four use cases:
- Recommendation mining takes users’ behavior and tries to find items users might like. An example of this is LinkedIn’s “People You May Know” (PYMK).
- Classification learns, from existing categorized documents, what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category.
- Clustering takes documents and groups them into collections of topically related documents based on word occurrences.
- Frequent itemset mining takes a set of item groups (for example, terms in a query session or shopping cart contents) and identifies which individual items usually appear together.
If you plan on using Mahout, remember that these distributions (Hadoop and Mahout) anticipate running on a *nix machine, although a Cygwin environment on Windows will work as well (as will rewriting the command scripts in another language, say as a batch file on Windows). It goes without saying that a compatible working version of Hadoop is required. Lastly, Mahout requires that you program in Java: no interface other than the command line is supported.
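To get a feel for the recommendation-mining use case, here is a toy item co-occurrence recommender. Mahout runs this kind of computation over Hadoop at scale; this in-memory Python version, with made-up basket data, is purely illustrative:

```python
# Toy co-occurrence recommender: recommend items that frequently
# appear in the same basket as items the user already has.
from collections import Counter
from itertools import combinations

baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
]

def recommend(user_items, baskets, top_n=2):
    # Count how often each pair of items co-occurs in a basket.
    cooccur = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            cooccur[(a, b)] += 1
            cooccur[(b, a)] += 1
    # Score candidate items by co-occurrence with the user's items.
    scores = Counter()
    for item in user_items:
        for (a, b), count in cooccur.items():
            if a == item and b not in user_items:
                scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend({"milk"}, baskets))  # ['bread', 'eggs']
```

A user who has only “milk” gets “bread” first (co-occurs twice) and “eggs” second (co-occurs once); Mahout’s distributed recommenders follow the same basic idea with far more sophisticated similarity measures.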
Greenplum Database utilizes a shared-nothing, massively parallel processing (MPP) architecture that has been designed for complex business intelligence (BI) and analytical processing. Most of today’s general-purpose relational database management systems are designed for Online Transaction Processing (OLTP) applications. The reality is that BI and analytical workloads are fundamentally different from OLTP transaction workloads and require a profoundly different architecture.
The Greenplum Database is fully parallel and highly optimized for executing both SQL and MapReduce queries. Additionally, the system offers a new level of parallel analysis capabilities for data scientists, with support for SAS, R, linear algebra, and machine learning primitives, and includes extensibility for functions written in Java, C, Perl, or Python.
Because of the shared-nothing MPP architecture, the system is linearly scalable: simply add additional nodes and the database’s performance and capacity improve. Expansions are online, keeping the database available for production workloads.
Logical depiction (top portion): Logically, gNet enables data in multiple formats that resides in the Hadoop HDFS file system to be used as though it were a table in Greenplum Database. This is the essence of co-processing: we can select, filter, join, modify, and aggregate (essentially, perform all normal SQL operations) on the combination of RDBMS data in Greenplum Database and data stored in Hadoop, as though all the data were in the database. The results are:
- Real-time: fast access to new data as it arrives; no waiting for reformatting and periodic movement processes to copy data into the database.
- Space efficiency: no duplication of data; big data makes any plan to duplicate data very expensive, even on so-called “cheap storage.”
- Query efficiency: frequently accessed data can be moved for local access in the database, resulting in a desirable reduction in gNet traffic.
- Archival: information lifecycles where data arrives in one platform but, as it ages, is moved to another platform to achieve a lower cost of retention. Consider the cost of HDFS storage: it’s low, so some customers will generate and manipulate data in the database for simplicity, but archive the data in Hadoop. With co-processing over gNet, the data remains available even after it’s been archived in HDFS files.
The Greenplum Database was conceived, designed, and engineered to allow customers to take advantage of large clusters of increasingly powerful and economical general-purpose servers, storage, and Ethernet switches. With this approach, EMC Greenplum customers can gain immediate benefit from the industry’s latest computing innovations.
Greenplum’s MPP shared-nothing architecture delivers industry-leading performance on big data. You can compare the impact to finding a specific card, say the Ace of Spades, in a deck. If you search by yourself, it could take you up to 52 tries to find the Ace of Spades. If you distribute the deck among 26 people, it will take at most 2 tries each. Likewise, Greenplum distributes processing across nodes, and these nodes work independently and in parallel to quickly deliver answers.
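The arithmetic behind the card analogy can be stated directly. This toy snippet (illustrative only, not Greenplum code) computes the worst-case number of tries when a scan is split evenly across independent workers, which is the essence of a shared-nothing parallel scan:

```python
# Worst-case sequential search of a deck vs. splitting the deck among
# parallel workers, each scanning only its own slice.
import math

def worst_case_tries(deck_size: int, workers: int) -> int:
    """Worst-case cards examined by any one worker."""
    return math.ceil(deck_size / workers)

print(worst_case_tries(52, 1))   # 52  (one person searches the whole deck)
print(worst_case_tries(52, 26))  # 2   (26 people, 2 cards each)
```

Doubling the workers halves the worst-case scan per worker, which is the linear scalability claimed for the MPP architecture, provided the data is spread evenly across nodes.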
Please take a moment to answer these questions. Record your answers here.
As background, it is important to understand that Business Intelligence is different from data science and analytics. BI deals with reporting on history: What happened last quarter? How many did we sell?
Data science is about predicting the future and understanding why things happen: What is the optimal solution? What will happen next?
For many companies data science is a new approach to understanding the business, yet an important one to undertake today.
Here are five main competency and behavioral characteristics of Data Scientists:
- Quantitative skills, such as mathematics or statistics.
- Technical aptitude, such as software engineering, machine learning, and programming skills.
- Skeptical: this may seem a counterintuitive trait, but it is important that data scientists can examine their work critically rather than in a one-sided way.
- Curious and creative: data scientists must be passionate about data and about finding creative ways to solve problems and portray information.
- Communicative and collaborative: it is not enough to have strong quantitative or engineering skills. To make a project resonate, you must be able to articulate its business value clearly and work collaboratively with project sponsors and key stakeholders.
In using Greenplum as the foundation for lab work, we’ve started to converge on a standard set of tools for the various stages of our analyses.
For data cleansing and transformation, we do most of our work in SQL. MapReduce is also useful, especially for unstructured data. For data exploration we also use SQL, as well as R, which is particularly useful for generating summary statistics, analyzing significance, and plotting data visualizations such as frequency distributions, densities, scatter plots, and so on.
For model building, we typically use R. It operates very well on file extracts, but these may be cumbersome and may slow down the modeling process, so it is also useful to read data directly from the database into dataframes via RPostgreSQL (which uses the RDbi interface and is therefore considerably faster than RODBC). For very large data sets, it is often best to use Greenplum’s built-in SQL analytics and the Analytics Library.
Models built in R can be executed on file extracts, but in most cases it’s desirable to run them on a complete set of records in the database. In this case, they can run in the database as PL/R after a simple conversion, and for optimal performance they can be converted to SQL.
In many cases we work with legacy models that were built in SAS. We are developing methods to convert these to SQL or PL/Java. We are also working with SAS Engineering to co-develop ‘Accelerator’ functions.
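As a rough illustration of the data exploration step (the notes describe doing this in R; the sketch below uses Python with invented sample data, purely to show the kind of summary statistics and frequency distribution involved):

```python
import statistics
from collections import Counter

# Hypothetical extract: transaction amounts pulled from the database
amounts = [12.0, 15.5, 12.0, 48.0, 15.5, 12.0, 99.9, 15.5, 12.0, 48.0]

# Summary statistics of the kind we'd generate with R's summary() / sd()
mean = statistics.mean(amounts)
median = statistics.median(amounts)
stdev = statistics.stdev(amounts)

# A frequency distribution: counts per distinct value
freq = Counter(amounts)

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f}")
print(freq.most_common(3))
```

In practice the extract would come straight from the database into a dataframe rather than being typed in, as described above.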
We can produce an *animated* view of color-coded traffic volumes on Google Earth over a user-specified period. The file that drives the animation is created within Greenplum. The Google Maps display is similar to this, but it only shows traffic volume at a specific point in time.
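Google Earth animates by reading a KML file of time-stamped, styled placemarks. The sketch below (Python, with made-up coordinates and volume thresholds; the real file is generated inside Greenplum) shows the general shape of such a file:

```python
# Hypothetical rows: (ISO timestamp, longitude, latitude, traffic volume)
rows = [
    ("2011-06-01T08:00:00Z", -122.27, 37.80, 120),
    ("2011-06-01T08:00:00Z", -122.29, 37.81, 640),
]

def volume_color(volume):
    """Map a traffic volume to a KML color (aabbggrr): green/yellow/red."""
    if volume < 200:
        return "ff00ff00"   # green: light traffic
    if volume < 500:
        return "ff00ffff"   # yellow: moderate traffic
    return "ff0000ff"       # red: heavy traffic

def to_kml(rows):
    """Build a minimal KML document with one time-stamped placemark per row."""
    placemarks = []
    for when, lon, lat, vol in rows:
        placemarks.append(
            "<Placemark>"
            f"<TimeStamp><when>{when}</when></TimeStamp>"
            f"<Style><IconStyle><color>{volume_color(vol)}</color></IconStyle></Style>"
            f"<Point><coordinates>{lon},{lat}</coordinates></Point>"
            "</Placemark>"
        )
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
            + "".join(placemarks) + "</Document></kml>")

print(to_kml(rows))
```

Google Earth steps through the `TimeStamp` values with its time slider, which is what produces the animation over the user-specified period.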
- Eight banks become one
- Branches across the US
- Consolidation of products and customers
- Employees faced with new products and customers
- Old does not necessarily equal new
- What to recommend to customers?
- Needs to make the bank money
- Needs to make the customer money
- Overlap with existing products is challenging
- Cost of acquiring a new customer is significantly higher than selling additional products to existing customers
Here’s an example in which we used clustering techniques (grouping similar objects together) and a form of “market basket analysis” (if you bought one set of products, you might be interested in another) to create a simple product recommendation engine.

First, we defined a measurement of customer value. (This particular client already had a way of computing that, but it took 20 hours to run in a separate database. Now it runs in Greenplum in less than an hour, so they run it regularly as part of their ETL process.)

Next, we created groups of customers based on product usage. We did this by defining a “distance” between customers so that those who owned a similar assortment of products would be measured as being close. We then used this notion of distance to identify clusters of customers.
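One common way to define such a distance over sets of products is the Jaccard distance; the notes don’t say which metric was used, so take this Python sketch (with invented customers and products) as illustrative only:

```python
# Hypothetical product holdings per customer
holdings = {
    "cust_a": {"checking", "savings", "credit_card"},
    "cust_b": {"checking", "savings", "mortgage"},
    "cust_c": {"brokerage", "ira"},
}

def jaccard_distance(a, b):
    """Distance = 1 - |intersection| / |union|; 0 means an identical product mix."""
    return 1.0 - len(a & b) / len(a | b)

d_ab = jaccard_distance(holdings["cust_a"], holdings["cust_b"])
d_ac = jaccard_distance(holdings["cust_a"], holdings["cust_c"])
print(d_ab, d_ac)  # 0.5 1.0 : cust_a is closer to cust_b than to cust_c
```

Any clustering algorithm that accepts a pairwise distance (e.g. hierarchical clustering) can then group customers using this measure.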
Then we used various methods, including “association rules” (the technique used in market basket analysis on sites such as Amazon), to identify common product associations. In other words, by looking at product usage across millions of customers, we found that certain groups of products tended to occur together. By restricting our analysis to a certain segment of the population (in this case, based on customer value), we were more likely to find product groupings that made sense for that customer segment.
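The core of association-rule mining is counting co-occurrences and computing a rule’s confidence. A minimal Python sketch, with made-up baskets (in practice this runs in-database over millions of customers):

```python
from itertools import combinations
from collections import Counter

# Hypothetical baskets: the set of products each customer holds
baskets = [
    {"checking", "savings"},
    {"checking", "savings", "credit_card"},
    {"checking", "credit_card"},
    {"savings", "cd"},
]

# Count how often each product, and each pair of products, appears
pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule "checking -> savings":
# among customers holding checking, the fraction who also hold savings
conf = pair_counts[("checking", "savings")] / item_counts["checking"]
print(conf)
```

Rules with high confidence (and enough support, i.e. raw co-occurrence count) become candidate product associations for the segment being analyzed.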
We used these results to make product recommendations. For a given customer, we used the product associations to determine which new products made sense. Then we filtered out products that were disproportionately associated with customers of lower value. The remaining products were then more likely to move the customer into a higher value segment. The client referred to this as “filling incomplete baskets.”

Verticals: This applies to any organization that advertises to a sufficiently large number of customers.
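The value-based filtering step can be sketched as follows (Python, with an invented threshold and invented per-product statistics; the actual cutoffs would come from the customer-value analysis described above):

```python
# Hypothetical candidates from the association rules, with the share of
# each product's holders who fall into the low-value segment
low_value_share = {"credit_card": 0.30, "payday_advance": 0.85, "cd": 0.25}
candidates = ["credit_card", "payday_advance", "cd"]

# Drop products disproportionately held by lower-value customers
# (0.5 is an assumed threshold, chosen only for illustration)
recommendations = [p for p in candidates if low_value_share[p] < 0.5]
print(recommendations)  # ['credit_card', 'cd']
```

What survives the filter is the set of products offered to “fill the incomplete basket.”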
Modern applications need to respond faster and capture more information so the business can perform the analysis needed to make the best business decisions. By combining the best online transaction processing (OLTP) product with the best online analytical processing (OLAP) product, we can create a platform that enables businesses to make the most of both historical and real-time data, with each system covering the other’s weaknesses. Traditionally, OLAP databases excel at handling petabytes of information but are not geared for fine-grained, low-latency access; OLTP databases excel at fine-grained, low-latency access but may fall short when handling large-scale data sets with ad hoc queries.

To address the OLTP side of this problem we have chosen vFabric SQLFire. SQLFire is a memory-optimized, shared-nothing, distributed SQL database delivering dynamic scalability and high performance for data-intensive modern applications. SQLFire’s memory-optimized architecture minimizes time spent waiting for disk access, the main performance bottleneck in traditional databases. SQLFire achieves dramatic scaling by pooling memory, CPU, and network bandwidth across a cluster of machines, and can manage data across geographies.

For the OLAP side we will be looking at EMC Greenplum. Built to support Big Data analytics, Greenplum Database manages, stores, and analyzes terabytes to petabytes of data. Users experience 10 to 100 times better performance over traditional RDBMS products, a result of Greenplum’s shared-nothing massively parallel processing architecture, high-performance parallel dataflow engine, and advanced gNet software interconnect technology.
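The division of labor between the two tiers can be caricatured as a routing decision. The Python sketch below is illustrative only: neither SQLFire nor Greenplum routes queries this way, and the keyword heuristic is an assumption made up for this example.

```python
def route(query):
    """Toy router: send aggregations/scans to the OLAP warehouse,
    key lookups to the in-memory OLTP store."""
    q = query.upper()
    # Heuristic: presence of aggregation constructs suggests analytics
    if "GROUP BY" in q or "SUM(" in q or "AVG(" in q:
        return "olap"
    return "oltp"

print(route("SELECT balance FROM accounts WHERE id = 42"))             # oltp
print(route("SELECT region, SUM(sales) FROM orders GROUP BY region"))  # olap
```

In a real deployment the split is made at design time (which workloads hit which system), not by inspecting query text, but the sketch captures the complementary roles described above.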