2. AGENDA
Data warehouse and BI overview
Data warehouse Data Flow
Staging Area
Transformation
Loading
ETL tools
Data Marts
Business Intelligence (BI)
OLAP
BIG DATA
3. DATA WAREHOUSE AND BI OVERVIEW
• A data warehouse is a database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data
derived from transaction data, but it can include data from other sources. It
separates analysis workload from transaction workload and enables an
organization to consolidate data from several sources.
• In addition to a relational database, a data warehouse environment includes
an extraction, transformation, and loading (ETL) solution, an online
analytical processing (OLAP) engine, client analysis tools, and other
applications that manage the process of gathering data and delivering it to
business users.
• Business intelligence (BI) is defined as the ability for an organization to
take all its capabilities and convert them into knowledge. This produces
large amounts of information which can lead to the development of new
opportunities for the organization.
4. A FEW KEY TERMS
• Business Operation
• Business Intelligence
• Business Management
• Operational System
• Data Warehouse
• Operational Data Store
• Data Mart
• Metadata Management
5. STEPS TO CREATE A DATA WAREHOUSE
• Understand the business problem to be solved
• Gather requirements
• Determine appropriate end user technology to support the solution
• Build a prototype
• Develop data warehouse data model
• Map the DW requirements based on the user’s requirement definitions
• Generate ETL code
• Test the DW
• Once validated, move the data and code to Production
6. SUBJECT
• A data warehouse is often described as subject-oriented.
• A subject is a data subject, or major category of data, relevant to the business.
• Each subject is a subset of enterprise data and consists of related entities and relationships.
• Examples: Customers, Products, Sales, Geography
7. ENTITY
• Defined as a person, place, thing, concept, or event in which an enterprise has both the
interest and the capability to capture and store information
• Primary Entity – an entity that does not depend on any other entity for its
existence
• Subtype Entity – a logical division or category of a parent (supertype) entity.
Example: Customers can be Wholesale customers or Retail customers. Both inherit
the attributes of the parent entity.
• Attributive Entity – holds a group of data for an entity that can occur multiple times.
• Associative Entity – depends on two or more entities for its existence. For example, an Order
depends on a Customer and the Items purchased.
• Primary Key – serves as the unique identifier for an entity and is used in the physical
database to locate a record for storage or access
8. CHARACTERISTICS OF A PRIMARY KEY (PK)
• The key is never NULL
• The key is unique by design, not by circumstance
• The key is persistent over time
• The key is manageable – consists of integers or character strings,
with no embedded symbols or odd characters
• The key should not contain any embedded intelligence
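The characteristics above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The customer table and its columns are hypothetical and do not come from the deck; customer_id is assumed to be a surrogate integer key with no embedded meaning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: customer_id is a surrogate key (no business meaning).
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER NOT NULL PRIMARY KEY,"
    " name TEXT NOT NULL)"
)
conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Lennon')")


def try_insert(customer_id, name):
    """Return True if the row is accepted, False if the key constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO customer (customer_id, name) VALUES (?, ?)",
            (customer_id, name),
        )
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert(1, "Johnson")  # same key value: rejected
new_key_accepted = try_insert(2, "Johnson")    # new key value: accepted
```

Uniqueness is enforced here by the table definition itself, which is what "unique by design" means in practice: the database rejects a second row with the same key regardless of the data that happens to be loaded.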
9. RELATIONSHIP
• A relationship documents the business rules associating two entities. It
describes how the two entities are naturally linked to each other.
• Example: Customers can place orders.
• Cardinality – denotes the maximum number of occurrences of one entity that can relate to
a single occurrence of another entity. It is usually expressed as “ONE” or “MANY”.
• Identifying Relationship – the child table cannot be uniquely
identified without the parent.
• Example:
Account (AccountID, AccountNum, AccountTypeID)
PersonAccount (AccountID, PersonID, Balance)
Person(PersonID, Name)
• The Account to PersonAccount relationship and the Person to PersonAccount relationship are identifying because the child
row (PersonAccount) cannot exist without having been defined in the parent (Account or Person). In other words, there is no
PersonAccount row when there is no Person or no Account.
• Non-Identifying Relationship – the child can be identified
independently of the parent.
• Example:
Account (AccountID, AccountNum, AccountTypeID)
AccountType (AccountTypeID, Code, Name, Description)
• The relationship between Account and AccountType is non-identifying because each Account row is identified by its own
AccountID; AccountTypeID is a plain foreign key and is not part of the Account key.
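The two kinds of relationship can be sketched as table definitions. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names follow the examples above, while the column types and the sample rows are assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE AccountType (
    AccountTypeID INTEGER PRIMARY KEY,
    Code TEXT, Name TEXT, Description TEXT
);
-- Non-identifying: AccountTypeID is a foreign key but not part of Account's key.
CREATE TABLE Account (
    AccountID INTEGER PRIMARY KEY,
    AccountNum TEXT,
    AccountTypeID INTEGER REFERENCES AccountType(AccountTypeID)
);
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name TEXT
);
-- Identifying: both parent keys together form the child's primary key.
CREATE TABLE PersonAccount (
    AccountID INTEGER NOT NULL REFERENCES Account(AccountID),
    PersonID INTEGER NOT NULL REFERENCES Person(PersonID),
    Balance REAL,
    PRIMARY KEY (AccountID, PersonID)
);
""")

# Assumed sample rows, for illustration only.
conn.execute("INSERT INTO AccountType VALUES (1, 'CHK', 'Checking', 'Checking account')")
conn.execute("INSERT INTO Account VALUES (10, 'A-10', 1)")
conn.execute("INSERT INTO Person VALUES (7, 'Mary Jones')")
conn.execute("INSERT INTO PersonAccount VALUES (10, 7, 250.0)")


def orphan_rejected():
    """A PersonAccount row that points at a missing Person must be rejected."""
    try:
        conn.execute("INSERT INTO PersonAccount VALUES (10, 999, 0.0)")
        return False
    except sqlite3.IntegrityError:
        return True
```

In the identifying case the parent keys appear inside the child's primary key; in the non-identifying case the parent key is only an ordinary column of the child.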
10. NORMALIZATION
• Normalization is the process of efficiently organizing data in a database. There are two
goals of the normalization process: eliminating redundant data (for example, storing the
same data in more than one table) and ensuring data dependencies make sense (only storing
related data in a table). Both of these are worthy goals as they reduce the amount of
space a database consumes and ensure that data is logically stored.
• The database community has developed a series of guidelines for ensuring that
databases are normalized. These are referred to as normal forms and are numbered from
one (the lowest form of normalization, referred to as first normal form or 1NF) upward.
The first three, through third normal form (3NF), are covered here.
11. FIRST NORMAL FORM (1NF)
• Eliminate duplicative columns from the same table.
• Create separate tables for each group of related data and identify each row with a unique
column or set of columns (the primary key).
• The first rule dictates that we must not duplicate data within the same row of a table.
Within the database community, this concept is referred to as the atomicity of a table.
Tables that comply with this rule are said to be atomic.
• Let’s explore this principle with a classic example – a table within a human resources
database that stores the manager-subordinate relationship. For the purposes of our
example, we’ll impose the business rule that each manager may have one or more
subordinates while each subordinate may have only one manager.
12. Option 1: Make a determinant of the repeating
group (or the multivalued attribute) a part of the
primary key.
STUDENT
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00
Composite primary key: (Stud_ID, Course_ID)
14. Option 2: Remove the entire repeating group from the relation.
Create another relation which would contain all the attributes of
the repeating group, plus the primary key from the first relation.
In this new relation, the primary key from the original relation
and the determinant of the repeating group will comprise a
primary key.
STUDENT
Stud_ID Name Course_ID Units
101 Lennon MSI 250 3.00
101 Lennon MSI 415 3.00
125 Johnson MSI 331 3.00
15. STUDENT
Stud_ID  Name
101      Lennon
125      Johnson
STUDENT_COURSE
Stud_ID  Course_ID  Units
101      MSI 250    3.00
101      MSI 415    3.00
125      MSI 331    3.00
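The split described in Option 2 can be sketched in a few lines of Python. The rows are the ones shown on the slides; the variable names and the use of dictionaries to stand in for tables are assumptions made for illustration.

```python
# Flat STUDENT rows as shown on the slides: (Stud_ID, Name, Course_ID, Units)
flat_rows = [
    (101, "Lennon", "MSI 250", 3.00),
    (101, "Lennon", "MSI 415", 3.00),
    (125, "Johnson", "MSI 331", 3.00),
]

# STUDENT keeps one row per student, keyed by Stud_ID.
student = {stud_id: name for stud_id, name, _, _ in flat_rows}

# STUDENT_COURSE holds the repeating group, keyed by (Stud_ID, Course_ID):
# the original primary key plus the determinant of the repeating group.
student_course = {
    (stud_id, course_id): units for stud_id, _, course_id, units in flat_rows
}
```

After the split, each student name is stored once, and the course rows carry only the student key rather than a copy of the name.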
16. SECOND NORMAL FORM (2NF)
• Goal: Remove partial dependencies
STUDENT
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00
Composite primary key: (Stud_ID, Course_ID)
Partial dependencies: Name depends only on Stud_ID; Units depends only on Course_ID
17. STUDENT (before decomposition)
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00
STUDENT_COURSE
Stud_ID  Course_ID
101      MSI 250
101      MSI 415
125      MSI 331
STUDENT
Stud_ID  Name
101      Lennon
125      Johnson
COURSE
Course_ID  Units
MSI 250    3.00
MSI 415    3.00
MSI 331    3.00
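A partial dependency can be spotted by checking whether part of the composite key alone determines a non-key column. The helper below is a minimal sketch; the function name determines is hypothetical, and a check over sample rows can only refute a dependency, not prove that it holds for all future data.

```python
def determines(rows, lhs, rhs):
    """Return True if, in these rows, the columns in lhs determine the columns in rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # The first value seen for a key is remembered; a different one breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


rows = [
    {"Stud_ID": 101, "Name": "Lennon", "Course_ID": "MSI 250", "Units": 3.00},
    {"Stud_ID": 101, "Name": "Lennon", "Course_ID": "MSI 415", "Units": 3.00},
    {"Stud_ID": 125, "Name": "Johnson", "Course_ID": "MSI 331", "Units": 3.00},
]

name_on_stud_id = determines(rows, ["Stud_ID"], ["Name"])       # part of the key determines Name
units_on_course = determines(rows, ["Course_ID"], ["Units"])    # part of the key determines Units
course_on_stud_id = determines(rows, ["Stud_ID"], ["Course_ID"])  # fails: 101 has two courses
```

Because Name and Units each depend on only one half of the composite key, they move into their own STUDENT and COURSE tables, leaving STUDENT_COURSE with just the key columns.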
18. THIRD NORMAL FORM (3NF)
• Goal: Get rid of transitive dependencies.
EMPLOYEE
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg
Transitive dependency: Emp_ID → Dept_ID → Dept_Name
19. THIRD NORMAL FORM (3NF)
• Remove the attributes, which are dependent on a non-key
attribute, from the original relation. For each transitive
dependency, create a new relation with the non-key attribute
which is a determinant in the transitive dependency as a
primary key, and the dependent non-key attribute as a
dependent.
EMPLOYEE
Emp_ID F_Name L_Name Dept_ID Dept_Name
111 Mary Jones 1 Acct
122 Sarah Smith 2 Mktg
20. THIRD NORMAL FORM (3NF)
EMPLOYEE (before)
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg
EMPLOYEE (after)
Emp_ID  F_Name  L_Name  Dept_ID
111     Mary    Jones   1
122     Sarah   Smith   2
DEPARTMENT
Dept_ID  Dept_Name
1        Acct
2        Mktg
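The 3NF decomposition above loses no information: a join on Dept_ID rebuilds the original five-column rows. The sketch below uses Python's built-in sqlite3 module with the table names and rows from the slides; the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (
    Dept_ID INTEGER PRIMARY KEY,
    Dept_Name TEXT NOT NULL
);
CREATE TABLE EMPLOYEE (
    Emp_ID INTEGER PRIMARY KEY,
    F_Name TEXT,
    L_Name TEXT,
    Dept_ID INTEGER REFERENCES DEPARTMENT(Dept_ID)
);
INSERT INTO DEPARTMENT VALUES (1, 'Acct'), (2, 'Mktg');
INSERT INTO EMPLOYEE VALUES (111, 'Mary', 'Jones', 1), (122, 'Sarah', 'Smith', 2);
""")

# A join reconstructs the original five-column EMPLOYEE rows on demand.
original_view = conn.execute("""
    SELECT e.Emp_ID, e.F_Name, e.L_Name, e.Dept_ID, d.Dept_Name
    FROM EMPLOYEE AS e
    JOIN DEPARTMENT AS d ON d.Dept_ID = e.Dept_ID
    ORDER BY e.Emp_ID
""").fetchall()
```

Each department name is now stored once in DEPARTMENT, so renaming a department is a single-row update instead of a change to every employee row.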
22. ZACHMAN FRAMEWORK FOR ENTERPRISE ARCHITECTURES
• As you can see from Figure 4, there are 36 intersecting cells in a Zachman grid, one for each
meeting point between a player's perspective (for example, business owner) and a descriptive
focus (for example, data). As we move horizontally (for example, left to right) in the grid, we
see different descriptions of the system, all from the same player's perspective. As we move
vertically in the grid (for example, top to bottom), we see a single focus, but change the player
from whose perspective we are viewing that focus.
• The first suggestion of the Zachman taxonomy is that every architectural artifact should live in
one and only one cell. There should be no ambiguity about where a particular artifact lives. If it
is not clear in which cell a particular artifact lives, there is most likely a problem with the artifact
itself.
• The second suggestion of the Zachman taxonomy is that an architecture can be considered
a complete architecture only when every cell in that architecture is complete. A cell is complete
when it contains sufficient artifacts to fully define the system for one specific player looking at
one specific descriptive focus.
• The third suggestion of the Zachman grid is that cells in columns should be related to each
other. Consider, for example, the data column (the first column) of the Zachman grid. From the
business owner's perspective, data is information about the business. From the
database administrator's perspective, data is rows and columns in the database.
23. ZACHMAN GRID
Five ways in which the Zachman grid can help in the development of an enterprise architecture:
• Ensure that every stakeholder's perspective has been considered for every descriptive focal
point.
• Improve the client’s artifacts by sharpening the focus of each one on one
particular concern for one particular audience.
• Ensure that all of the client’s business requirements can be traced down to some technical
implementation.
• Convince the client’s business side that the technical team isn't planning to build a bunch of useless functionality.
• Convince the client’s IT leadership that the business folks are including the IT folks in their planning.
24. THE OPEN GROUP ARCHITECTURE FRAMEWORK (TOGAF)
• The core of TOGAF is the Architecture Development Method (ADM)
• TOGAF divides an enterprise architecture into four categories, as follows:
• Business architecture—Describes the processes the business uses to meet its goals
• Application architecture—Describes how specific applications are designed and how
they interact with each other
• Data architecture—Describes how the enterprise datastores are organized and accessed
• Technical architecture—Describes the hardware and software infrastructure that
supports applications and their interactions
• Zachman tells you how to categorize your artifacts. TOGAF gives you a process for
creating them.
25. DAY-TO-DAY EXPERIENCE OF CREATING AN ENTERPRISE ARCHITECTURE
WILL BE DRIVEN BY THE ADM
A high-level view
26. PHASE A & PHASE B
• The culmination of Phase A will be a Statement of Architecture Work, which must be
approved by the various stakeholders before the next phase of the ADM begins. The
goal of this phase is to create an architectural vision for the first pass through the
ADM cycle. The Architect will guide Client in choosing the project, validating the project
against the architectural principles established in the Preliminary Phase, and ensuring that
the appropriate stakeholders have been identified and their issues have been addressed.
• The Architectural Vision created in Phase A will be the main input into Phase B. Client’s
goal in Phase B is to create a detailed baseline and target business architecture and
perform a full analysis of the gaps between them.
• Phase B is quite involved: it includes business modeling, highly detailed business
analysis, and technical-requirements documentation. A successful Phase B requires input
from many stakeholders. The major outputs will be a detailed description of the baseline
and target business objectives, and gap descriptions of the business architecture.
27. PHASE C
• Develop baseline data-architecture description
• Review and validate principles, reference models, viewpoints, and tools
• Create architecture models, including logical data models, data-management process models, and
relationship models that map business functions to CRUD (Create, Read, Update, Delete) data operations
• Select data-architecture building blocks
• Conduct formal checkpoint reviews of the architecture model and building blocks with stakeholders
• Review qualitative criteria (for example, performance, reliability, security, integrity)
• Complete data architecture
• Conduct checkpoint/impact analysis
• Perform gap analysis
• The most important deliverable from this phase will be the Target Information
and Applications Architecture.
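One of the relationship models mentioned above, a matrix that maps business functions to CRUD data operations, can be sketched as a small Python structure. The business functions, entities, and helper names below are hypothetical and serve only to show the shape of such a matrix.

```python
# Hypothetical business functions and data entities, for illustration only.
# Each cell lists the operations performed: C(reate), R(ead), U(pdate), D(elete).
crud_matrix = {
    "Take order":      {"Customer": "R", "Order": "C", "Product": "R"},
    "Update customer": {"Customer": "RU"},
    "Cancel order":    {"Order": "RD"},
}


def functions_touching(entity, operation):
    """List the business functions that perform `operation` on `entity`."""
    return sorted(
        function
        for function, entities in crud_matrix.items()
        if operation in entities.get(entity, "")
    )


def entities_never_created():
    """Entities that no function creates: a gap a checkpoint review might flag."""
    all_entities = {e for ops in crud_matrix.values() for e in ops}
    created = {e for ops in crud_matrix.values() for e, o in ops.items() if "C" in o}
    return sorted(all_entities - created)
```

A matrix like this supports the gap analysis step: an entity with no creating function, or a function that touches no entity, points at a missing requirement or a missing piece of the data model.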
28. PHASE D & PHASE E
• Phase D completes the technical architecture—the infrastructure necessary to support the
proposed new architecture. This phase is completed mostly by engaging with Client’s
infrastructure and technical team.
• Phase E evaluates the various implementation possibilities, identifies the major
implementation projects that might be undertaken, and evaluates the business opportunity
associated with each. The TOGAF standard recommends that Client’s first pass at Phase
E "focus on projects that will deliver short-term payoffs and so create an impetus for
proceeding with longer-term projects."
• A good starting place to look for such projects is the organizational pain points that
initially convinced Client’s CEO to adopt an enterprise-architecture-based strategy.
29. PHASE F, PHASE G & PHASE H
• Phase F is closely related to Phase E. In this phase, the Architect works with Client’s
governance body to sort the projects identified in Phase E into a priority order that reflects
not only the costs and benefits (identified in Phase E), but also the risk factors.
• In Phase G, Client takes the prioritized list of projects and creates architectural
specifications for the implementation projects. These specifications will include
acceptance criteria and lists of risks and issues.
• The final phase is H. In this phase, Client modifies the architectural change-management
process with any new artifacts created in this last iteration and with new information that
becomes available.