9. Agenda JovianDATA Company Overview JovianInsights – The Power of Analytics JovianDATA Cube Storage Innovations in Advanced Analytics using commodity clusters Analytics Lifecycle Management Innovations in Cloud Infrastructure Management
10. Avoiding Expensive Data Processing Usage based Automatic View Materialization Avoid Network I/O Multi-Dimensional Partitioning Reduce Disk I/O By Materializing Expensive Groups
13. Managing CapEx with Role Based Clusters SINGLE CLUSTER FOR DATA CLEANSING, LOAD AND QUERY 15TB 100 NODES Monthly Cost = $28,800
14. Managing CapEx with Role Based Clusters UI Ad Server Data, Search Engine Data 2 hours daily for load on 10 nodes Query on 5 nodes Monthly Cost = $2,052 DATA CLEANSING QUERY LOAD MODEL HIBERNATE MODEL
24. But only when you need it, to hold down operating costs [diagram: partitions P1, P3, P12, P22, P34 distributed across Node1–Node4 (Nodeset1), with replicas placed on temporary nodes Temp1 and Temp2]
26. Provision Tera Scale Applications in Minutes Without Application Isolation Data for all advertisers is kept ‘live’ on 50 nodes Campaign Manager needs to run heavy duty reports for a Big Advertiser 50 live nodes per month = $14,400 FUNNEL ANALYSIS FOR CLIENT
27. Provision Tera Scale Applications in Minutes Application is provisioned in parallel from S3/EBS into EC2 Campaign Manager requests Application Provisioning for a Specific Advertiser 50 nodes for fortnightly analysis = $320 FUNNEL ANALYSIS FOR CLIENT HIBERNATED MODEL
28. Summary Reducing CapEx with Role based Temporary Clusters on EC2 10x Cost Savings with EC2 usage Dynamic Provisioning with Selective Replication on EC2 10x Performance on EC2 replication Application Isolation with Application Hibernation on S3/EBS 100x Cost Savings with EC2-S3
JovianDATA’s mission is to provide a technology platform to help users optimize the entire digital marketing funnel at the lowest cost. <next>
At its core, JovianDATA solves the “Analytics on large data” problem. Our customers have huge amounts of digital data, and they face 3 central challenges while trying to analyze it: huge upfront and ongoing CapEx; heavy maintenance and over-provisioning; and a lack of application richness. All of them want to move to the cloud. <next> But they are not sure about the CapEx benefits, or about the readiness of the cloud to support the complex application stacks typical of such installations, which leads to a second-order problem: application provisioning challenges. <next>
We believe a fully integrated stack built on the cloud, using sophisticated distributed technology on commodity components, is the key to a high-performance, low-cost solution for tackling large data. AWS’s rich cloud functionality combined with JovianDATA’s distributed technology makes analytics on large data in the cloud possible. We can take impression data and site data, and marry them with 3rd-party data and sales data, thereby providing unified, correlated, and sophisticated analysis to the various players in the ad ecosystem. <next>
Let me present a brief overview of the system before we go into the details later in the presentation. The JovianDATA SaaS system takes data, provides rich analytics, and handles everything in between. A distributed ETL layer built on top of Java and MySQL takes raw data and applies single-event filters and transformations. The data is then loaded into our massively parallel warehouse, where more complex rules that look at all the data (as opposed to single rows) are applied. We also collect statistics here, which are used for data cleansing as well as to size the model. <next> Using the statistics, we build a proprietary array-based structure, which provides the illusion of a fully materialized cube. <next> The structure is distributed across the cluster. An MDX engine then takes a multi-dimensional query, breaks it into tuples, and calculates them in parallel across the cluster. One key aspect of the JovianDATA system is that the load involves complex workflows, and we have a framework to manage these workflows very efficiently. The results are apparent: in one test run by a customer, 10 users ran 450 reports on a 2.5 TB warehouse for 6 hours. 90% of the reports returned in less than 10 seconds, and 40% in less than 100 ms. The longest report took 113 seconds. Now Anupam will go into the details of the system. <next>
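The tuple fan-out described above can be sketched in miniature. This is an illustrative toy, not JovianDATA's engine: the partition contents, dimension names, and function names are all invented for the example. A region of the cube is exploded into individual cell tuples, and each tuple is evaluated in parallel against every partition.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical pre-aggregated partitions: (site, geo) -> impression count.
# In a real deployment each dict would live on a different cluster node.
PARTITIONS = [
    {("news", "US"): 120, ("sports", "US"): 80},
    {("news", "EU"): 60, ("sports", "EU"): 40},
]

def explode_query(sites, geos):
    """Break an MDX-style region query into individual cell tuples."""
    return list(product(sites, geos))

def evaluate_tuple(cell):
    """Sum one cell across all partitions (done per-node in practice)."""
    return cell, sum(p.get(cell, 0) for p in PARTITIONS)

def run_query(sites, geos):
    cells = explode_query(sites, geos)
    with ThreadPoolExecutor() as pool:   # stand-in for cluster-wide fan-out
        return dict(pool.map(evaluate_tuple, cells))

result = run_query(["news", "sports"], ["US", "EU"])
```

Because each tuple is independent, adding nodes (threads here) scales the evaluation almost linearly.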
CLICK 1 :- Let’s contrast data processing on the cloud with the conventional enterprise data center. A user comes in to run analytics. CLICK 2 :- One of the biggest misconceptions about the cloud is that all you need to do is let the analyst connect to a datacenter in the cloud: just create an AMI with your favorite software (MySQL, Hadoop, Oracle, etc.) and bring up hundreds of instances. This is like saying EC2 is just about ec2-run-instances. At JovianDATA, we have created an analytics engine that revisits three key expensive operations in classical stacks. Each operation has been rewritten to exploit the cloud. CLICK 3 :- Instead of keeping large clusters to run expensive grouping, we believe in bringing up an extraordinary number of nodes for a short time. CLICK 4 :- This pre-calculation allows us to reduce disk I/O requirements at run time. CLICK 5 :- Most clouds do not excel at inter-processor communication, and a big source of network I/O is the joining of tables in databases. CLICK 6 :- We eliminate joins of large tables for the cube structure by using a patent-pending partitioning technique for multi-dimensional data. CLICK 7 :- To let many users load the same reports again and again, many of our digital media customers use materialized views, which require DBA intervention. CLICK 8 :- Instead, we materialize views based on the customer’s usage, without a DBA. This keeps often-used data available to a multitude of users.
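The usage-based materialization in CLICK 7/8 can be sketched as a counter that promotes a grouping to a materialized view once it has been requested often enough. This is a minimal sketch under assumed names (the class, threshold, and keys are ours, not JovianDATA's):

```python
from collections import Counter

class UsageBasedMaterializer:
    """Materialize the most-requested groupings automatically,
    instead of waiting for a DBA to define views by hand."""

    def __init__(self, threshold=3):
        self.hits = Counter()       # per-grouping request counts
        self.materialized = {}      # grouping -> cached result
        self.threshold = threshold

    def query(self, group_key, compute):
        if group_key in self.materialized:
            return self.materialized[group_key]      # served from the view
        self.hits[group_key] += 1
        result = compute()                           # expensive grouping
        if self.hits[group_key] >= self.threshold:   # hot enough: keep it
            self.materialized[group_key] = result
        return result
```

After the threshold is crossed, repeated report loads skip the expensive computation entirely, with no DBA in the loop.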
JovianDATA works with customers who generate tens of terabytes of data. In 2 years, we have identified 3 major problems that compel enterprise customers to move their analytics to the cloud. Capital expenditure: getting a BI project going on 10 terabytes requires nearly a year of planning, with an upfront six-figure investment just to load and process the data. In this presentation, we will show how you don’t have to sign up for a six-figure sum to try out 10 TB analytics. Over-provisioning: anybody who has worked with a real-world analytics stack knows that half the cluster lies underutilized nearly all the time. To add insult to injury, there are periods, like weekends and nights, when the entire cluster lies unused some 20-30% of the time. All these nodes were allocated because of some peak usage on a Monday morning. We will show you how we avoid provisioning for peak; instead, we provision based on usage. Application isolation: even after assigning hundreds of nodes to a task, applications keep running into each other in real-life deployments. The cloud provides a great opportunity to provision applications in their own sandboxes.
CLICK 1 :- Let’s look at a classical analytics stack for 15 TB of data. A monolithic stack deployed in the cloud seldom exploits the cloud; in most cases, all it can take care of is expansion. CLICK 2 :- If we keep 15 TB up on 100 nodes, it might cost up to $28,800 a month. Is it cheaper than maintaining your own datacenter? Absolutely. Does it really use the cloud’s pay-as-you-go capability? Absolutely not. Let’s look at an architecture that was built for the cloud, rather than just a retro-fitting of an old architecture.
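For reference, the $28,800 figure is consistent with an assumed on-demand rate of $0.40 per node-hour over a 720-hour (30-day) month; the deck does not state the rate, so treat it as our assumption in this back-of-the-envelope check:

```python
def monthly_cost(nodes, hourly_rate, hours=720):
    """Cost of an always-on cluster over a 30-day (720-hour) month."""
    return nodes * hourly_rate * hours

# 100 always-on nodes at an assumed $0.40/node-hour: about $28,800/month.
# The same arithmetic gives about $14,400/month for a 50-node cluster,
# matching the application-isolation slide later in the deck.
cost_100 = monthly_cost(100, 0.40)
cost_50 = monthly_cost(50, 0.40)
```

The point of the exercise: an always-on monolithic cluster pays for all 720 hours, whether or not anyone is querying.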
In JovianDATA, nodes are allocated based on the needs of a particular stage of data processing. CLICK 1 :- Here we show an example flow where data becomes available sometime at night on the DoubleClick FTP server. CLICK 2 :- As the moon rises (so to speak), the data cleansing stage starts off and completes the model building in the hours of the night. CLICK 3 :- The generated model is then hibernated to S3 or EBS, and the data cleansing and load/model-building clusters are terminated. CLICK 4 :- As the sun rises (so to speak), the query cluster is allocated and the model is restored. Our experience here is that even though there are no rampant node failures on EC2, we do see failures during the transportation of data from one cluster to another. For that, we have invented a patent-pending technology which tracks data transportation minutely, to make sure data does not disappear while moving from one stage to another. CLICK 5 :- The main message here is that we never create a ‘full’ cluster; instead, we employ role-based clusters. This has a dramatic effect on cost. We believe that enterprise software stacks need to move to role-based clusters if they want 10x savings. Otherwise, they still carry the CapEx of allocating hundreds of nodes in the cloud.
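The lifecycle above (cleanse and load by night, hibernate, query by day) can be sketched as a control flow. The AWS calls are deliberately stubbed out as log entries here, since the real provisioning/termination calls and node counts are specific to each deployment; everything below is an illustrative assumption:

```python
class RoleBasedPipeline:
    """Sketch of the cleanse -> load -> hibernate -> query flow.
    Cloud operations are recorded as log strings so the ordering
    is visible; no cluster ever exists outside its role's window."""

    def __init__(self):
        self.log = []

    def provision(self, role, nodes):
        self.log.append(f"provision {nodes}-node {role} cluster")

    def terminate(self, role):
        self.log.append(f"terminate {role} cluster")

    def run(self):
        self.provision("cleanse+load", 10)           # moonrise: build model
        self.log.append("hibernate model to S3/EBS")  # park the model
        self.terminate("cleanse+load")                # no idle load nodes
        self.provision("query", 5)                    # sunrise: query cluster
        self.log.append("restore model from S3/EBS")
        return self.log
```

The invariant worth noting: the load cluster is terminated before the query cluster exists, so you never pay for both roles at once.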
Almost everybody talks about cloud computing enabling on-demand performance. But how easy is it to dial up performance based on usage? Is it just about adding more nodes to the cluster? How should the data be redistributed? Should it be done blindly? CLICK 1 :- At JovianDATA, we believe in selective replication to enable performance. Let’s see selective replication in action. CLICK 2 :- Say a customer is running a Site Section report. If the report takes 10 minutes, we will dynamically provision 2 new nodes. But it does not make sense to just copy data over to those nodes. CLICK 3 :- Instead, we create copies of the hottest partitions on these new nodes. The hotness of data is defined by a statistics package that provides complete visibility into the usage of the cluster. CLICK 4 :- With selective replication of hot data, the report will be generated in 30 seconds rather than 10 minutes. CLICK 5 :- Statistics and automatic algorithms are important because we have to do this selective replication on terabytes of data.
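Selective replication reduces to a ranking problem: use the statistics package's access counts to pick the hottest partitions, and copy only those onto the freshly provisioned nodes. A minimal sketch, reusing the partition labels from the slide diagram; the access counts and function names are invented for illustration:

```python
from collections import Counter

def select_hot_partitions(access_counts, top_k):
    """Rank partitions by observed query traffic, keep the top_k."""
    return [p for p, _ in Counter(access_counts).most_common(top_k)]

def replicate(placement, access_counts, new_nodes, top_k=2):
    """Place replicas of only the hottest partitions on new nodes."""
    hot = select_hot_partitions(access_counts, top_k)
    for node in new_nodes:
        placement[node] = list(hot)   # replicas, not a blind full copy
    return placement

# Hypothetical usage statistics: P12 and P34 carry nearly all the traffic.
stats = {"P12": 900, "P34": 850, "P1": 40, "P22": 5}
layout = replicate({"node1": ["P1", "P12"], "node2": ["P22", "P34"]},
                   stats, ["temp1", "temp2"])
```

Copying terabytes of cold partitions to the new nodes would waste the very minutes the provisioning was meant to save; ranking first keeps the transfer proportional to what users actually query.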
Once the usage goes down, the nodes are returned via terminate-instance and the replicas vanish. Thus, we have increased performance by nearly 20 times without increasing the cost of the analytics stack. The key message here is that blind addition of nodes is not dynamic provisioning; dynamic provisioning should work hand in hand with intelligent replication of data.
The next big use case is application isolation. CLICK 1 :- Consider a Campaign Manager who needs to run a segmentation report which requires heavy usage of the cluster. The non-cloud option for a customer is to keep the application running 24x7, because you never know when the Campaign Manager will need to run this intense report. The other option is to ‘share’ the application across operational reporting and intense analytics. A 24x7 cluster is characterized by high cost; a shared cluster is characterized by maintenance headaches, with applications constantly running into each other. CLICK 2 :- Keeping an application running in a large, shared cluster has a dramatic effect on cost: a 50-node application could cost as much as $14,400 per month. JovianDATA brings a third option to the table, which exploits the cloud economics of cheap storage and elastic provisioning of CPUs.
In the scenario we have often seen with our customers, the analyst comes in, let’s say, once every fortnight to do deep analytics. CLICK 1 :- With JovianDATA, the analyst puts in a request for a cluster, and the cluster gets allocated on demand. CLICK 2 :- The application is then brought to life with a parallel restore from S3. With our parallel restore, we have seen times as low as 30 minutes for a 5-10 TB cluster. CLICK 3 :- At JovianDATA, we do these restores by running parallel jobs which transfer data from S3 into EC2. The restore takes care of network failures as well as software issues. Once the application is fully restored, the analyst can run their intense analytics in complete isolation from other analysts. Once their analysis is done, they can de-provision the cluster. CLICK 4 :- The cost savings of running on-demand clusters for deep analytics are dramatic. A 24/7 cluster would mean thousands of dollars in recurring expenditure; an on-demand cluster requires only hundreds of dollars, and only when the analyst really uses the cluster.
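The restore pattern described in CLICK 3 — parallel transfer jobs that survive transient network failures — can be sketched as below. The actual S3 transfer is abstracted behind a `fetch` callback, since the real transfer mechanics are JovianDATA's; the chunking, worker count, and retry limit are our illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def restore_model(chunks, fetch, retries=3):
    """Restore hibernated model chunks in parallel, retrying each
    chunk so a transient network failure doesn't lose data."""
    def fetch_with_retry(chunk):
        for attempt in range(retries):
            try:
                return fetch(chunk)       # e.g. one S3 object -> local disk
            except IOError:
                if attempt == retries - 1:
                    raise                  # give up only after all retries
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map preserves chunk order, so the model reassembles correctly
        return list(pool.map(fetch_with_retry, chunks))
```

Under the same assumed $0.40/node-hour rate used earlier, the slide's $320 fortnightly figure would correspond to roughly 16 hours of 50-node usage per month, which is the whole appeal of restoring on demand instead of running 24/7.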
In conclusion, at JovianDATA, we believe that 3 key things are necessary for a ‘real’ cloud computing solution. EC2 should be exploited to get to near-zero CapEx; if the software is deployed in the classic, CapEx-intensive enterprise model, then you are leaving 10x in cost savings on the table. EC2’s ability to bring up computing in minutes should be exploited to increase performance on demand; if your analytics takes days to improve performance and requires intense DBA intervention, then it is not exploiting the cloud — 10x performance should be available in minutes. And using S3 for full application hibernation allows EC2 nodes to be de-provisioned; EC2 nodes should be provisioned only when a customer needs to run intense analytics on a periodic basis. Idle nodes are the biggest no-no, and you are leaving 100x in cost savings on the table if you are not using S3 effectively.