Lesson 2.docx

  1. Lesson 2 Data Mart, Data Cube and OLAP
  2. Data Marts: What They Are and Why Businesses Need Them Imagine you run a candy store. Some of the goodies are in display cases for quick access while the rest are in the storeroom. Now let’s think of the sweets as the data required for your company’s daily operations. Instead of combing through the vast amounts of organizational data stored in a data warehouse, you can use a data mart, a repository that makes specific pieces of data available quickly to any given business unit, just like the display cases in a store. This article provides an in-depth explanation of what data marts are and how they store data for Business Intelligence purposes. You’ll also find out about the key types of data marts, their structure schemas, implementation steps, and more. What is a data mart? A data mart is a smaller subsection of a data warehouse built specifically for a particular subject area, business function, or group of users. The main idea is to provide a specific part of an organization with the data that is most relevant to its analytical needs. For example, the sales or finance teams can use a data mart containing only sales information to make quarterly or yearly reports and projections. Since data marts
  3. provide analytical capabilities for a restricted area of a data warehouse, they offer isolated security and isolated performance. Data mart vs data warehouse vs data lake vs OLAP cube Data lakes, data warehouses, and data marts are all data repositories of different sizes. Apart from the size, there are other significant characteristics to highlight.
  4. A data warehouse (DW) is a data repository that stores and manages all historical enterprise data coming from disparate internal and external sources such as CRMs, ERPs, and flat files. Initially, DWs dealt with structured data presented in tabular form. Modern cloud warehouses can also store data in its raw format, similar to what data lakes do. While cloud solutions are quicker to set up, on-premises DWs may take months to build. A data mart is a subject-oriented relational database that commonly contains a subset of DW data specific to one business department of an enterprise, e.g., a marketing department. Data marts get information from relatively few sources and are small in size, typically less than 100 GB. They usually contain structured data and take less time to set up, normally 3 to 6 months for on-premises solutions. A data lake is a central repository used to store massive amounts of both structured and unstructured data coming from a great variety of sources. Data lakes accept raw data, eliminating the need for prior cleansing and processing. As for size, they can hold many files, and even a single file can be larger than 100 GB. Depending on the goal, it may take weeks or months to set up a data lake. Moreover, not all organizations use data lakes.
  5. Data marts shouldn’t be confused with OLAP cubes either. An OLAP or Online Analytical Processing cube is the tool used to represent data for analysis in a multidimensional way. So, just like data warehouses, data marts can be used as the foundation for creating an OLAP cube. For example, a company has a data mart containing all the financial data. The company may wish to model an OLAP cube to summarize this data by different dimensions: by time, by product, or by city, to name a few. Types of data marts Based on how data marts are related to the data warehouse as well as external and internal data sources, they can be categorized as dependent, independent, and hybrid. Let’s elaborate on each one.
  6. Dependent data marts are the subdivisions of a larger data warehouse that serves as a centralized data source. This is something known as the top-down approach — you first create a data warehouse and then design data marts on top of it. Within this sort of relationship, data marts do not interact with data sources directly. Based on the subjects, different sets of data are clustered inside a data warehouse, restructured, and loaded into respective data marts from where they can be queried. Dependent data marts are well suited for larger companies that need better control over the systems, improved performance, and lower telecommunication costs.
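The top-down flow described above can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module: the table `warehouse_orders`, its columns, and the helper `build_dependent_mart` are hypothetical names, and a real warehouse would of course be a separate system rather than an in-memory database.

```python
import sqlite3

# Hypothetical warehouse table holding records for several subject areas.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_orders "
    "(order_id INTEGER, department TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_orders VALUES (?, ?, ?, ?)",
    [
        (1, "sales", "East", 120.0),
        (2, "sales", "West", 80.0),
        (3, "marketing", "East", 45.5),
        (4, "finance", "West", 300.0),
    ],
)


def build_dependent_mart(conn, department):
    """Copy one subject area from the warehouse into its own mart table."""
    if not department.isidentifier():
        raise ValueError("department must be a simple identifier")
    table = f"mart_{department}"
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(
        f"CREATE TABLE {table} (order_id INTEGER, region TEXT, amount REAL)"
    )
    # The mart reads only from the warehouse, never from the original sources.
    conn.execute(
        f"INSERT INTO {table} "
        "SELECT order_id, region, amount FROM warehouse_orders "
        "WHERE department = ?",
        (department,),
    )
    return table


sales_table = build_dependent_mart(conn, "sales")
sales_rows = conn.execute(
    f"SELECT order_id, region, amount FROM {sales_table} ORDER BY order_id"
).fetchall()
print(sales_rows)
```

Because the warehouse data is assumed to be already cleansed, the load step is a plain filtered copy with no transformation logic.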
  7. Independent data marts act as standalone systems, meaning they can work without a data warehouse. They receive data directly from external and internal data sources. The data held in independent data marts can then be used to create a data warehouse. This approach is called bottom-up. Often, the motivation behind choosing independent data marts is a shorter time to market. They work well for small to medium-sized companies. So, the key difference between dependent and independent data marts is the way they get data from sources. The step that involves transferring, filtering, and loading data into either a data warehouse or a data mart is called the extract-transform-load (ETL) process. With dependent data marts, the central data warehouse already keeps data formatted and cleansed, so ETL tools have little work to do. Independent data marts, on the other hand, require the complete ETL process before data can be loaded.
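As a rough illustration of the complete ETL process an independent data mart needs, the sketch below extracts rows from a raw CSV export, cleans them, and loads them into a mart table. It uses only Python's standard library; the CSV contents, the `sales_mart` table, and the three function names are assumptions made for the example, not part of any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system (CSV text).
RAW_CSV = """order_id,customer,amount,currency
1, Alice ,120.50,usd
2,Bob,,usd
3,Carol,80.00,USD
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, trim text, normalize types."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        clean.append(
            (
                int(row["order_id"]),
                row["customer"].strip(),
                float(row["amount"]),
                row["currency"].strip().upper(),
            )
        )
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the data mart table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_mart "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO sales_mart VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT * FROM sales_mart ORDER BY order_id").fetchall()
print(loaded)
```

In a dependent data mart the `transform` step would shrink to almost nothing, since the warehouse has already done the cleansing.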
  8. Hybrid data marts integrate data from all existing operational data sources and/or data warehouses. This method combines the benefits of the top-down and bottom-up approaches while addressing the drawbacks of each. Hybrid data marts are a good choice for organizations that have multiple databases. Data mart structure schemas Similar to traditional data warehouses, data marts use a relational approach to data modeling. A relation is the mathematical term for a table, which is a combination of rows and columns containing different values. To logically arrange pieces of data in a data mart, companies use two main schemas: star and snowflake. Both consist of a fact table and dimension tables, but they differ in the number of join levels.
  9. Star schema, as the name suggests, resembles a star. It comprises only one fact table that is placed in the center of the model and breaks down into several dimension tables with denormalized data. This means that the data is redundant and that results in faster data retrieval as fewer joins are needed. The fact table encompasses aggregated data designed to be used for analytical and reporting purposes while the dimension tables contain descriptions of the stored data. The star schema is a simple type of data mart structure as the fact table has only one link to each dimension table. As such, this model makes it easier to accomplish complex queries.
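A minimal star schema might look like the sketch below, again using sqlite3 from the Python standard library. The table names (`fact_sales`, `dim_product`, `dim_city`) and the sample rows are invented for illustration. Note that `dim_city` stores the country name directly, which is the kind of denormalized, redundant attribute the paragraph above describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: denormalized descriptions of the stored data.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    CREATE TABLE dim_city (
        city_id INTEGER PRIMARY KEY, city TEXT, country TEXT
    );
    -- Fact table: one link to each dimension plus numeric measures.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        city_id INTEGER REFERENCES dim_city(city_id),
        units INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Gummy bears', 'Candy'), (2, 'Dark bar', 'Chocolate');
    INSERT INTO dim_city VALUES
        (1, 'Lisbon', 'Portugal'), (2, 'Porto', 'Portugal');
    INSERT INTO fact_sales VALUES
        (1, 1, 10, 25.0), (1, 2, 4, 10.0), (2, 1, 3, 12.0), (2, 1, 2, 8.0);
    """
)

# Each dimension is exactly one join away from the fact table.
report = conn.execute(
    """
    SELECT p.category, c.city, SUM(f.units), SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_city AS c ON c.city_id = f.city_id
    GROUP BY p.category, c.city
    ORDER BY p.category, c.city
    """
).fetchall()

for row in report:
    print(row)
```

The query needs only one join per dimension, which is what keeps star-schema reports simple to write.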
  10. Snowflake schema has the star schema as its base, yet the data in dimension tables is normalized by splitting it into additional dimension tables. Normalization in the snowflake schema is achieved by moving attributes with few unique values (for example, the country of a city) out of a dimension table and into separate tables. Such an arrangement resembles a snowflake, hence the name of the schema. Though the snowflake schema protects data integrity better and takes up less disk space, querying becomes more complex because of the many levels of joins between tables. Data mart implementation steps The process of creating data marts may be complicated and may differ depending on the needs of a particular company. In most cases, there are five core steps: designing a data mart, constructing it, transferring data, configuring access to the repository, and finally managing it. We’ll walk you through each step in more detail. Data mart designing The first thing to do when implementing a data mart is to decide on the scope of the project and its design. Since data marts are subject-oriented databases, this step involves determining the subject or topic to which the data stored in the mart will relate. In addition to collecting information about technical specifications, you need to settle the business requirements during this phase. It is also necessary to identify the data sources related to the subject and design the logical and physical structure of the data mart. Data mart constructing Once the scope of work is established, the second step is to construct the logical and physical structures of the data mart architecture designed during the first phase.
  11. • Logical structure refers to the scenario where data exists in the form of virtual tables or views that are separated from the warehouse logically, not physically. Virtual data marts may be a good option when resources are limited.
      • Physical structure refers to the scenario where a database is physically separated from the warehouse. The database may be cloud-based or on-premises.
      Also, this step requires creating the schema objects (e.g., tables, indexes) and setting up data access structures. It is essential to collect detailed requirements before implementing either scenario since different organizations may need different types of data marts. Data transferring The third step covers all the tasks related to transferring data from sources to data marts:
      • extracting information from the data sources,
      • cleansing and converting data into a suitable format, and
      • loading data into a data mart.
      ETL tools are used to perform the extraction, transformation, and loading. Data access configuring Now that data is in data marts, it’s time to put it to use: making queries, analyzing data, creating reports, etc. The accessing step involves the following tasks:
  12. • setting up the intermediate (meta) layer for the front-end application (the layer converts database structures into business terms so that end users can access data from data marts easily);
      • setting up and managing database structures like summarized tables; and
      • setting up APIs (application programming interfaces) if required.
      Data marts can be accessed via a command line or a GUI (graphical user interface), which is the more user-friendly option. Managing The final step of the data mart implementation process encompasses management tasks such as:
      • providing secure user access to data;
      • optimizing and fine-tuning the system for better performance;
      • adding and managing new data; and
      • ensuring system availability and planning recovery scenarios.
      Data mart use cases Companies can become more agile and data-driven with the right approach to business intelligence and data analytics. Data marts were initially created to help companies make more informed business decisions and address unique organizational problems, those specific to one or several departments. There are quite a few cases where data marts can be used. We’ll cover the typical ones next.
  13. Subject-focused data analytics Data analytics plays a crucial role in any business lifecycle. Data marts allow for more focused data analysis because they only contain records organized around specific subjects such as products, sales, or customers. Since there’s no extraneous information, businesses can derive clearer and more accurate insights. For example, data marts can be used as on-premises or cloud-based destinations to consolidate all the marketing data and store it in a structured format. This gives marketing teams a single source of truth and a better handle on important metrics such as return on investment (ROI), customer acquisition cost (CAC), and return on ad spend (ROAS). Data marts provide easy and fast access to important data points when needed. They can process complex queries and push the required data into the corresponding reporting and data analytics tools. Selective data access Data marts can be used when an organization needs selective privileges for accessing and managing data. This is often the case for big enterprises that can’t expose the entire data warehouse to all users. Building multiple dependent data marts can help protect sensitive data from unauthorized access and accidental writes. Improved resource management Providing each department with a separate data mart can be a good way to manage imbalanced resource use across organizational units. Say the department running logistics operations performs a large number of database operations daily. This heavy load may degrade system performance for other departments that run fewer database queries and, eventually, reduce the effectiveness of the whole company. Separate data marts let each department use resources efficiently without affecting the others.
  14. Time-limited data projects Compared to corporate data warehouses, which require significant time and effort, data marts are much easier and faster to set up: data engineers and developers work with smaller amounts of data, fewer sources, and simpler schemas. On top of that, data marts are cheaper to implement than a DW. So, if you have limited time to complete a data project, data marts may be the way to go. The “cloud-y” future of data marts Businesses face an endless growth of information. Getting actionable, data-driven insights becomes difficult for those still using on-premises solutions. In the Big Data reality, data warehouses are progressively moving to the cloud, and so are data marts. Cloud solutions facilitate storing and sharing massive sets of data, unlocking the true power of effective data analysis. Cloud-based platforms offer flexible architectures with separate storage and compute resources, resulting in better scalability and faster data querying. With a single repository containing all data marts in the cloud, businesses can not only lower costs but also provide all departments with unhindered access to data in real time. In addition, cloud data marts can be a great tool for machine learning purposes. Data marts contain all the relevant information connected to transactions, products, or customers for a given period of time. Because this data is reliable, it can be used to build different ML models, such as propensity models that predict customer churn or models that provide personalized recommendations.
  15. Data Cube What Does Data Cube Mean? A data cube is a three-dimensional (3-D) or higher-dimensional range of values that is generally used to explain the time sequence of an image's data. It is a data abstraction used to evaluate aggregated data from a variety of viewpoints. It is also useful for imaging spectroscopy, as a spectrally resolved image is depicted as a 3-D volume. A data cube can also be described as the multidimensional extension of two-dimensional tables. It can be viewed as a collection of identical 2-D tables stacked upon one another. Data cubes are used to represent data that is too complex to be described by a table of columns and rows. As such, data cubes can go far beyond 3-D to include many more dimensions. Techopedia Explains Data Cube A data cube is generally used to make data easier to interpret. It is especially useful for representing data together with dimensions as measures of business requirements. Every dimension of a cube represents a certain characteristic of the database, for example, daily, monthly or yearly sales. The data included inside a data cube makes it possible to analyze almost all the figures
  16. for virtually any or all customers, sales agents, products, and much more. Thus, a data cube can help to establish trends and analyze performance. Data cubes are mainly divided into two categories:
      • Multidimensional Data Cube: Most OLAP products are built on a structure in which the cube is modeled as a multidimensional array. These multidimensional OLAP (MOLAP) products usually offer better performance than other approaches, mainly because they can index directly into the structure of the data cube to gather subsets of data. As the number of dimensions grows, the cube becomes sparser. That means that many cells representing particular attribute combinations will not contain any aggregated data. This in turn increases the storage requirements, which may reach undesirable levels at times, making the MOLAP solution untenable for huge data sets with many dimensions. Compression techniques might help; however, their use can damage the natural indexing of MOLAP.
      • Relational OLAP: Relational OLAP (ROLAP) makes use of the relational database model. The ROLAP data cube is implemented as a set of relational tables (approximately twice as many as the number of dimensions) rather than as
  17. a multidimensional array. Each one of these tables, known as a cuboid, signifies a specific view.
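The ideas of cuboids and sparsity can be illustrated with a small pure-Python sketch. The fact rows and the names `FACTS`, `cuboid`, and `all_cuboids` are hypothetical. The sketch enumerates every subset of the dimensions, which yields 2^n group-by views for n dimensions; this is a simplifying assumption, since a real ROLAP engine would typically materialize only a selection of them as tables.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "region", "year")

# Hypothetical fact rows with a single measure, "sales".
FACTS = [
    {"product": "ball", "region": "East", "year": 2016, "sales": 10},
    {"product": "ball", "region": "West", "year": 2017, "sales": 5},
    {"product": "kite", "region": "East", "year": 2017, "sales": 7},
]


def cuboid(facts, dims):
    """Aggregate the measure over one subset of dimensions (a group-by)."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row["sales"]
    return dict(totals)


def all_cuboids(facts, dimensions):
    """Build one cuboid per subset of dimensions, keyed by that subset."""
    return {
        dims: cuboid(facts, dims)
        for size in range(len(dimensions) + 1)
        for dims in combinations(dimensions, size)
    }


cuboids = all_cuboids(FACTS, DIMENSIONS)

# Sparsity of the equivalent dense array: possible cells vs. filled cells.
total_cells = 1
for dim in DIMENSIONS:
    total_cells *= len({row[dim] for row in FACTS})
filled_cells = len(cuboids[DIMENSIONS])

print(len(cuboids), "cuboids")
print(cuboids[("region",)])
print(f"{filled_cells} of {total_cells} cells hold data")
```

Even with three dimensions and two members each, most cells of the dense array are empty, which is the storage problem described above for MOLAP with many dimensions.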
  18. OLAP (online analytical processing) OLAP (online analytical processing) is a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view. OLAP business intelligence queries often aid in trend analysis, financial reporting, sales forecasting, budgeting and other planning purposes. For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in the month of July, compare revenue figures with those for the same products in September and then see a comparison of other product sales in Florida in the same time period. How OLAP systems work To facilitate this kind of analysis, data is collected from multiple data sources and stored in data warehouses, then cleansed and organized into data cubes. Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region and time period) derived from dimension tables in the data warehouses. Dimensions are then populated by members (such as customer names, countries and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases. Analysts can then perform five types of OLAP analytical operations against these multidimensional databases:
      • Roll-up. Also known as consolidation or drill-up, this operation summarizes the data along a dimension.
  19. • Drill-down. This allows analysts to navigate deeper among the dimensions of data, for example drilling down from "time period" to "years" and "months" to chart sales growth for a product.
      • Slice. This enables an analyst to take one level of information for display, such as "sales in 2017."
      • Dice. This allows an analyst to select data from multiple dimensions to analyze, such as "sales of blue beach balls in Iowa in 2017."
      • Pivot. Analysts can gain a new view of data by rotating the data axes of the cube.
      OLAP software then locates the intersection of dimensions, such as all products sold in the Eastern region above a certain price during a certain time period, and displays them. The result is the "measure"; each OLAP cube has at least one and perhaps hundreds of measures, which are derived from information stored in fact tables in the data warehouse.
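The five operations listed above can be sketched on a toy cube stored as a Python dictionary. This is a simplified illustration, not how an OLAP engine is implemented: the cube contents and the function names (`roll_up`, `dice`, `slice_`, `pivot`) are assumptions for the example, and drill-down is shown simply as a roll-up that keeps a finer level of detail.

```python
from collections import defaultdict

# Hypothetical cube cells: (color, state, year, quarter) -> units sold.
DIMS = ("color", "state", "year", "quarter")
CUBE = {
    ("blue", "Iowa", 2017, "Q1"): 30,
    ("blue", "Iowa", 2017, "Q2"): 20,
    ("red", "Iowa", 2017, "Q1"): 15,
    ("blue", "Florida", 2017, "Q1"): 40,
    ("blue", "Florida", 2016, "Q4"): 25,
}


def roll_up(cube, dims, keep):
    """Summarize: keep only the dimensions in `keep` and sum over the rest."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def dice(cube, dims, **criteria):
    """Select cells that match a value on each of several dimensions."""
    return {
        key: value
        for key, value in cube.items()
        if all(key[dims.index(d)] == want for d, want in criteria.items())
    }


def slice_(cube, dims, dim, value):
    """Fix a single dimension to one value."""
    return dice(cube, dims, **{dim: value})


def pivot(table):
    """Rotate a two-dimensional result by swapping its axes."""
    return {(b, a): value for (a, b), value in table.items()}


by_year = roll_up(CUBE, DIMS, ("year",))                      # roll-up
by_year_quarter = roll_up(CUBE, DIMS, ("year", "quarter"))    # drill-down
sales_2017 = slice_(CUBE, DIMS, "year", 2017)                 # slice
blue_iowa_2017 = dice(CUBE, DIMS, color="blue", state="Iowa", year=2017)
state_by_year = roll_up(CUBE, DIMS, ("state", "year"))
year_by_state = pivot(state_by_year)                          # pivot

print(by_year)
print(by_year_quarter)
print(sum(blue_iowa_2017.values()))
```

Moving from `by_year` to `by_year_quarter` is the drill-down direction, and going back is the roll-up; both read the same cells at different levels of summarization.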
  20. Types of OLAP systems OLAP (online analytical processing) systems typically fall into one of three types: Multidimensional OLAP (MOLAP) is OLAP that indexes directly into a multidimensional database. Relational OLAP (ROLAP) is OLAP that performs dynamic multidimensional analysis of data stored in a relational database. Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. HOLAP was developed to combine the greater data capacity of ROLAP with the superior processing capability of MOLAP.
  21. Uses of OLAP OLAP can be used for data mining or the discovery of previously undiscerned relationships between data items. An OLAP database does not need to be as large as a data warehouse, since not all transactional data is needed for trend analysis. Using Open Database Connectivity (ODBC), data can be imported from existing relational databases to create a multidimensional database for OLAP. OLAP products include IBM Cognos, Oracle OLAP and Oracle Essbase. OLAP features are also included in tools such as Microsoft Excel and Microsoft SQL Server's Analysis Services. OLAP products are typically designed for multiple-user environments, with the cost of the software based on the number of users.