This document provides an overview of a SQL Server 2008 for Business Intelligence short course. It discusses the course instructor's background and specialties. The course will cover creating a data warehouse, OLAP cubes, and reports. It will also discuss data mining concepts like why it's used, common algorithms, and include a hands-on lab. Data mining algorithms that will be covered include classification, clustering, decision trees, and neural networks.
2. Peter Gfader Specializes in C# and .NET (Java not anymore) Testing Automated tests Agile, Scrum Certified Scrum Trainer Technology aficionado Silverlight ASP.NET Windows Forms
3. Admin Stuff Attendance You initial sheet Hands On Lab You get me to initial sheet Homework Certificate At end of 5 sessions If I say you have completed successfully
6. Last week(s) Other cube browsers Microsoft Data Analyzer Proclarity Excel 2003/2007/2010 Excel services Thinslicer Performance Point Power Pivot
7. Create report on top of Northwind Top 10 customers (Table) Top 10 products (Table) Top 10 employees (Table) 1 chart that shows the top 10 customers 1 usage of the gauge control (surprise me) Homework
9. Step by step to BI Create Data Warehouse Copy data to data warehouse Create OLAP Cubes Create Reports Browse the cube Do some Data Mining Discovering relationships Predict future events
10. Agenda What is Data Mining? Why? Uses Algorithms Demo Hands on Lab
11. What is Data Mining? “Data mining is the use of powerful software tools to discover significant traits or relationships, from databases or data warehouses and often used to predict future events”
12. What is Data Mining? It exploits statistical algorithms Once the “knowledge” is extracted it: Can be used to discover Can be used to predict values of other cases
13. Why Data Mining? Marketing Who picks the movie? The kids, the wife, me Who are our Customers and what sort of films do they hire? Is a 30 year old woman with 2 children going to hire Arnie’s latest film Validation Is this data sensible? Terminator 2 and Toy Story Prediction Sales Next Year
14. Get new information from data: future trends, past trends, outliers, maximums, minimums Analyse data from different perspectives and summarize it into useful information New information to increase revenue, cut costs, or both :-) Why? It's all about money
15. Who are our biggest customers? What are customers buying with cigars? What are the customer retention levels of our branches? Which customers have bought olives, feta cheese but no ciabatta bread? Which regions have the highest male/female ratio of single 20 somethings? Which region has lowest customer retention levels and list out lost customers? Which Questions are Data Mining?
16. Ad hoc query Drill through to details Business Intelligence tool What’s not data mining
18. Good raw material, good data mining Samples should be representative Samples "similar" to domain Not an all-seeing crystal ball Verify and Validate! Data - Uncover patterns in samples
19. OLAP Is about fast ad hoc querying Analysis by dimensions and measures Gives precise answers Data Mining May use RDBMS or OLAP source Is about discovering and predicting Gives imprecise answers OLAP is not a prerequisite for data mining, but it almost always comes first OLAP versus Data Mining (learning to ride a bike before a car)
20. Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset Segmentation algorithms divide data into groups, or clusters, of items that have similar properties Association algorithms find correlations between different attributes in a dataset Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow Types of Data Mining Algorithms
21. Clustering Time Series Decision Trees Naïve Bayes Association Linear Regression Complete Set Of Algorithms Ways to analyze your data Neural Network Sequence Clustering Logistic Regression
22. Split data Each branch is like an attribute Brightness = amount of data Decision trees
23. Decision Trees (1) Decision Trees assign (classify) each case to one of a few (discrete) broad categories of the selected attribute (variable) and explain the classification with a few selected input variables The process of building is recursive partitioning – splitting data into partitions and then splitting it up more Initially all cases are in one big box
24. Decision Trees (2) The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions data to the purest classes of the searched variable Several measures of purity Then it repeats splitting for each new class Again testing all possible breaks Useless branches of the tree can be pre-pruned or post-pruned
25. Decision Trees (3) Decision trees are used for classification and prediction Typical questions: Predict which customers will leave Help in mailing and promotion campaigns Explain reasons for a decision What are the movies young female customers like to buy?
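The split selection described on the Decision Trees slides can be sketched in a few lines. This is a minimal illustration, not the Microsoft Decision Trees algorithm: it assumes Gini impurity as the purity measure (the slides only say there are several), tests only "attribute equals value" breaks, and performs a single partitioning step rather than the full recursion. The function names and the video-store cases are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 0.0 means perfectly pure."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, target):
    """Try every (attribute, value) equality test and return the purest split.

    rows is a list of dicts; target is the key of the attribute to predict.
    Returns (attribute, value, weighted_impurity), or None if no split exists.
    """
    best = None
    attributes = [key for key in rows[0] if key != target]
    for attribute in attributes:
        for value in {row[attribute] for row in rows}:
            left = [row[target] for row in rows if row[attribute] == value]
            right = [row[target] for row in rows if row[attribute] != value]
            if not left or not right:
                continue  # this test does not actually partition the data
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or weighted < best[2]:
                best = (attribute, value, weighted)
    return best

# Hypothetical video-store cases, invented for illustration only.
cases = [
    {"age_group": "young", "has_kids": "yes", "rents_action": "no"},
    {"age_group": "young", "has_kids": "no", "rents_action": "yes"},
    {"age_group": "older", "has_kids": "yes", "rents_action": "no"},
    {"age_group": "older", "has_kids": "no", "rents_action": "yes"},
]
split = best_split(cases, "rents_action")
```

In this toy data the split on has_kids separates the two outcomes completely, so it is chosen over age_group. A full tree would call best_split again on each resulting partition until the partitions are pure or too small.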
27. Naïve Bayes Bayes Formula Uses statistics to say whether a case falls into a certain category or not, with a probability Spam filtering: score of spam (Bayes) Testing only a particular attribute
28. Naïve Bayes Quickly builds mining models that can be used for classification and prediction It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute This can later be used to predict an outcome of the predicted attribute based on the known input attributes This makes the model a good option for exploring the data
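The probability counting described on the Naïve Bayes slides might look roughly like the sketch below. It assumes discrete attributes and add-one smoothing (the smoothing is an assumption of this example, not something stated in the slides), and the spam-style training rows are invented.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count class frequencies and, per class, the frequency of each input state."""
    class_counts = Counter(row[target] for row in rows)
    state_counts = defaultdict(Counter)  # (attribute, class) -> state frequencies
    states = defaultdict(set)            # attribute -> every state seen in training
    for row in rows:
        for attribute, state in row.items():
            if attribute == target:
                continue
            state_counts[(attribute, row[target])][state] += 1
            states[attribute].add(state)
    return class_counts, state_counts, states

def predict(model, inputs):
    """Return a normalised probability for each class, given known input attributes."""
    class_counts, state_counts, states = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, state in inputs.items():
            seen = state_counts[(attribute, label)][state]
            # Add-one smoothing so an unseen state does not zero the whole product.
            score *= (seen + 1) / (count + len(states[attribute]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

# Hypothetical messages, invented for illustration only.
messages = [
    {"has_link": "yes", "mentions_prize": "yes", "label": "spam"},
    {"has_link": "yes", "mentions_prize": "no", "label": "spam"},
    {"has_link": "no", "mentions_prize": "no", "label": "ham"},
    {"has_link": "no", "mentions_prize": "no", "label": "ham"},
    {"has_link": "yes", "mentions_prize": "no", "label": "ham"},
]
model = train(messages, "label")
result = predict(model, {"has_link": "yes", "mentions_prize": "yes"})
```

The "naïve" part is the multiplication inside predict: each input attribute is treated as independent of the others once the class is known.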
29. Cluster Analysis (1) Grouping data into clusters Objects within a cluster have high similarity based on the attribute values The class label of each object is not known Several techniques Partitioning methods Hierarchical methods Density based methods Model based methods And more…
30. Cluster Analysis (2) Segments a heterogeneous population into a number of more homogenous subgroups or clusters Some typical questions: Discover distinct groups of customers Identification of groups of houses in a city In biology, derive animal and plant taxonomies Find outliers
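As one example of the partitioning methods listed on the Cluster Analysis slides, here is a minimal k-means sketch. It is not the Microsoft Clustering algorithm; it assumes numeric attributes, squared Euclidean distance, and a deterministic start from the first k points. The customer data (age, rentals per month) is hypothetical.

```python
def k_means(points, k, iterations=10):
    """Group numeric tuples into k clusters; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # deterministic start: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, clusters

# Hypothetical customers as (age, rentals per month), invented for illustration.
customers = [(22, 8), (25, 9), (24, 7), (51, 2), (48, 1), (55, 3)]
centroids, clusters = k_means(customers, k=2)
```

No class label is supplied anywhere: the two groups (younger frequent renters, older occasional renters) emerge from the attribute values alone, and naming them is left to the analyst.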
33. Sequence clustering Numbers order stronger associations Direction of association (not necessarily the other direction)
34. If you own certain stocks, you may own other ones as well Probability = thickness of line Association
35. Let system learn how to classify data Neural Network adapts to the new data Formulate statement/hypothesis Outcome is known (Data / Surveys) 1. 70% data to train network (outcome is known) 2. 30% of data to test network (outcome is known) 3. New data (no survey needed, predict from network) Other example: OCR Neural Nets
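The 70% / 30% procedure on the Neural Nets slide is a hold-out split, and it can be sketched independently of the model being trained. In the example below a simple threshold rule stands in for the neural network (an assumption made to keep the sketch short), and the survey data, function names, and seed are all hypothetical.

```python
import random

def split_cases(cases, train_fraction=0.7, seed=42):
    """Shuffle cases with a known outcome and split them into train and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def train_threshold(train_cases):
    """Stand-in for network training: learn one cut-off from the training cases."""
    loyal = [visits for visits, outcome in train_cases if outcome]
    not_loyal = [visits for visits, outcome in train_cases if not outcome]
    threshold = (min(loyal) + max(not_loyal)) / 2
    return lambda visits: visits >= threshold

def accuracy(predict, test_cases):
    """Share of held-out cases where the prediction matches the known outcome."""
    correct = sum(1 for inputs, outcome in test_cases if predict(inputs) == outcome)
    return correct / len(test_cases)

# Hypothetical survey: (visits per month, customer is loyal), invented for illustration.
survey = [(visits, visits > 10) for visits in range(1, 21)]
train_cases, test_cases = split_cases(survey)
model = train_threshold(train_cases)
score = accuracy(model, test_cases)
```

Only after the score on the held-out 30% looks acceptable would the model be applied to new cases whose outcome is unknown, which is step 3 on the slide.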
36. Both have directions Sequence Clustering has probability number and colour They are very similar. The difference is that Association analyses items that occur together whereas sequence clustering analyses items that follow one another. An example is that Sequence Clustering might be used by credit card companies to spot fraud, e.g. a petrol station refill followed by another petrol station refill followed by a big purchase = fraud (different transactions) Whereas Association will be more like: when someone buys popcorn at the cinemas, they also buy a drink (same transaction) Difference between algorithms: Association and Sequence
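The difference between the two algorithms can be made concrete with a small counting sketch. It only shows the underlying idea (items together versus items in order) and is not how the Microsoft Association Rules or Sequence Clustering algorithms are implemented; the baskets and card histories are invented.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(transactions):
    """Association view: count unordered item pairs inside the same transaction."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

def transitions(sequences):
    """Sequence view: count ordered pairs where one event directly follows another."""
    counts = Counter()
    for events in sequences:
        for current, following in zip(events, events[1:]):
            counts[(current, following)] += 1
    return counts

# Hypothetical cinema baskets (same transaction), invented for illustration.
baskets = [["popcorn", "drink"], ["popcorn", "drink", "candy"], ["drink"]]
together = co_occurrence(baskets)

# Hypothetical card histories (separate transactions, in time order).
card_histories = [["petrol", "petrol", "big purchase"], ["groceries", "petrol"]]
follows = transitions(card_histories)
```

Here together[("drink", "popcorn")] counts how often the two items share a basket, regardless of order, while follows[("petrol", "big purchase")] and follows[("big purchase", "petrol")] are different counts because the order of events matters.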
38. Visual Numerics 3rd party algorithms http://www.vni.com/company/whitepapers/MicrosoftBIwithNumericalLibraries.pdf There is more...
39. Excel Data Mining Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 http://www.microsoft.com/downloads/en/details.aspx?familyid=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en
40. Train station / airport Who is the bad guy? Farmers Find the best crops Supermarket Figure out how to get you to buy more and where to put the expensive items Other usages of data mining Find patterns - Profiling
41. SSIS 2008 - Data profiling task Get a profile of the data in a table potential candidate keys length of data values in columns Null percentage of rows distribution of values .... Tip
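To give a feel for what such a profile contains, here is a plain Python sketch that computes the same kinds of figures for one column. It is not the SSIS Data Profiling task and does not use any SSIS API; the function name, the output keys, and the sample columns are assumptions made for illustration.

```python
from collections import Counter

def profile_column(values):
    """Profile one column given as a list of values, with None standing for NULL."""
    non_null = [v for v in values if v is not None]
    distribution = Counter(non_null)
    return {
        # Share of rows where the column is NULL.
        "null_percentage": 100.0 * (len(values) - len(non_null)) / len(values),
        # A candidate key has no NULLs and no repeated values.
        "is_candidate_key": len(non_null) == len(values)
        and len(distribution) == len(values),
        # Shortest and longest value when rendered as text.
        "min_length": min((len(str(v)) for v in non_null), default=0),
        "max_length": max((len(str(v)) for v in non_null), default=0),
        # How often each distinct value occurs.
        "distribution": dict(distribution),
    }

# Hypothetical columns from a customer table, invented for illustration.
customer_id = ["A1", "A2", "A3", "A4"]
city = ["Sydney", None, "Sydney", "Perth"]
id_profile = profile_column(customer_id)
city_profile = profile_column(city)
```

Running a profile like this before building a mining model is a cheap way to spot columns that are mostly NULL or that hold unexpected values.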
42. Video: Simple data mining model http://www.sqlservercentral.com/articles/Video/65055/ Video: Data mining and Reporting Services http://www.sqlservercentral.com/articles/Video/64190/ Data Mining Algorithms http://msdn.microsoft.com/en-us/library/ms175595.aspx Resources 1
43. Jamie MacLennan http://blogs.msdn.com/b/jamiemac/ Richard Lees on BI http://richardlees.blogspot.com/ Book Data Mining with Microsoft SQL Server 2008 http://www.amazon.com/gp/product/0470277742?ie=UTF8&tag=sqlserverda09-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0470277742 Resources 2
46. Thank You! Gateway Court Suite 10 81 - 91 Military Road Neutral Bay, Sydney NSW 2089 AUSTRALIA ABN: 21 069 371 900 Phone: + 61 2 9953 3000 Fax: + 61 2 9953 3105 info@ssw.com.au www.ssw.com.au
Editor's Notes
Peter Gfader shows SQL Server
Java current version 1.6 Update 21; 1.7 released next year (2010); dynamic languages; parallel computing; maybe closures
3. Create the following report on top of Northwind Top 10 customers (Table) Top 10 products (Table) Top 10 employees (Table) 1 chart that shows the top 10 customers 1 usage of the gauge control (surprise me) a. Download Report Builder 2 from http://www.microsoft.com/downloads/en/details.aspx?FamilyID=9f783224-9871-4eea-b1d5-f3140a253db6&displaylang=en b. Send me the screenshot of the final report
Data mining can be used to uncover patterns in data samples, but it is important to be aware that non-representative samples may produce results that are not indicative of the domain. Similarly, data mining will not find patterns that may be present in the domain if those patterns are not present in the sample being "mined". There is a tendency for insufficiently knowledgeable "consumers" of the results to attribute "magical abilities" to data mining, treating the technique as a sort of all-seeing crystal ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that pattern is representative of the whole population from which that data was drawn. Hence, an important part of the process is the verification and validation of patterns on other samples of data.
http://msdn.microsoft.com/en-us/library/ms175595.aspx
Ways to analyze your data
DT = split data
Each branch is like an attribute
Brightness = amount of data
TODO: Check out bars
Clustering = mapping of popular points
Number of children
Darkness = Lines are links between clusters (associations)
Time series
Time-based data prediction
Sequence clustering
Numbers order stronger associations
Direction of association (not necessarily the other direction)
Association
If you own certain stocks, you may own other ones as well
Probability = thickness of line
Naive Bayes
Bayes Formula
Uses statistics to say whether a case falls into a certain category or not (with a probability)
Spam filtering: score of spam (Bayes)
Testing only a particular attribute
Neural Nets
Let system learn how to classify data
Formulate statement/hypothesis
Outcome is known (Data / Surveys)
1. 70% data to train network (outcome is known)
2. 30% of data to test network (outcome is known)
3. New data (no survey needed, predict from network)
Ex: OCR
Example above = get loyalty of customers
Neural Network adapts to the new data
What attributes am I interested in? Algorithm splits data for me
Pruned = cut back (German: gestutzt)
Different colour = relationship User clicked on Toy Story 2
Very easy to set up Classifies and gives a score prediction
Class label: Combination of different attributes Name clusters yourself