Understanding Association Rule Mining

These slides help build an understanding of the intuition and the mathematics and statistics behind association rule mining. The presentation starts by highlighting the difference between causation and correlation. This is followed by the Apriori algorithm and the metrics used with it, each of which is discussed in detail. Finally, a formulation in a classification setting is developed that can be used to generate rules, i.e. rule mining.

Other Reference: https://www.slideshare.net/JustinCletus/mining-frequent-patterns-association-and-correlations

  1. 1. Association Rule Mining
  2. 2. Understanding Association Rules Mining Concepts
  3. 3. Association Rule Mining Association rule mining is a procedure for finding frequent patterns, correlations, associations, or causal structures in datasets held in various kinds of databases, such as relational databases, transactional databases, and other data repositories. Put simply: when this, then also this.
  4. 4. Association Rule Mining
      Used to identify:
      ● Frequent Patterns
      ● Correlations
      ● Associations
      ● Causal Structures
      Where these are applied → movie recommendations, grocery item placement, product recommendations, etc.
  5. 5. Algorithm - Apriori - Metrics
      The following three metrics are generally used:
      Support: The percentage of transactions that contain all of the items in an itemset.
      ● The higher the support, the more frequently the itemset occurs.
      ● Rules with high support are preferred, since they are likely to apply to a large number of future transactions.
      Confidence: The probability that a transaction that contains the items on the left-hand side of the rule also contains the item on the right-hand side.
      ● The higher the confidence, the greater the likelihood that the item on the right-hand side will be purchased or, in other words, the greater the return rate we can expect for a given rule.
      Lift: The probability of all of the items in a rule occurring together, divided by the product of the probabilities of the left-hand-side and right-hand-side items occurring as if there were no association between them.
      ● Overall, lift summarizes the strength of association between the products on the left-hand and right-hand sides of the rule; the larger the lift, the stronger the link between the two products.
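The three metrics on this slide can be computed directly from a list of transactions. The following is a minimal sketch, not the presenter's code: the toy transactions and the function names `support`, `confidence`, and `lift` are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List

# Toy transactions, invented purely for illustration.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk", "eggs"}),
]


def support(itemset: Iterable[str], transactions=TRANSACTIONS) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str], transactions=TRANSACTIONS) -> float:
    """Estimated P(rhs | lhs): support of lhs and rhs together over support of lhs."""
    both = frozenset(lhs) | frozenset(rhs)
    return support(both, transactions) / support(lhs, transactions)


def lift(lhs: Iterable[str], rhs: Iterable[str], transactions=TRANSACTIONS) -> float:
    """Confidence divided by the support of rhs; 1 indicates independence."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


if __name__ == "__main__":
    print("support(bread, milk)  =", support({"bread", "milk"}))
    print("confidence(bread→milk) =", confidence({"bread"}, {"milk"}))
    print("lift(bread→milk)       =", lift({"bread"}, {"milk"}))
```

In this toy data, bread and milk each appear in four of five transactions and together in three, so the lift of bread → milk comes out slightly below 1: the pair co-occurs a little less often than independence would predict, even though the confidence looks high.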
  6. 6. Apriori - Support
  7. 7. Apriori - Support
  8. 8. Apriori - Confidence
  9. 9. Apriori - Confidence
  10. 10. Apriori - Confidence
  11. 11. Apriori - Confidence
  12. 12. Apriori - Lift
  13. 13. Apriori - Lift
  14. 14. Apriori - Lift
  15. 15. Algorithm
      Step 1 Set a minimum & maximum Support and Confidence.
      Step 2 Take all the subsets of items in the transactions that have higher support than the minimum support.
      Step 3 Take all the rules from these subsets that have higher confidence than the minimum confidence.
      Step 4 Generate other rule assessment measures for the rules.
      Step 5 Sort the rules using an appropriate filter.
      Cons → Slow algorithm: it is a bottom-up approach that builds item combinations from all available items and computes the related statistics for each.
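The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: it applies only minimum thresholds (no maximum), uses `>=` comparisons, computes lift as the single extra assessment measure, and the names `apriori` and `generate_rules` are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Steps 1-2: return {itemset: relative support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates are the individual items.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that meet the threshold.
        level = {}
        for c in candidates:
            supp = sum(c <= t for t in transactions) / n
            if supp >= min_support:
                level[c] = supp
        frequent.update(level)
        k += 1
        # Join: merge frequent (k-1)-itemsets into k-itemset candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def generate_rules(frequent, min_confidence):
    """Steps 3-5: build rules, attach lift, and sort by lift (descending)."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), r):
                lhs = frozenset(lhs_items)
                rhs = itemset - lhs
                conf = supp / frequent[lhs]  # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    rules.append((lhs, rhs, supp, conf, conf / frequent[rhs]))
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


if __name__ == "__main__":
    data = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
            {"milk"}, {"bread", "milk", "eggs"}]
    for lhs, rhs, supp, conf, lft in generate_rules(apriori(data, 0.4), 0.7):
        print(set(lhs), "→", set(rhs), f"supp={supp:.2f} conf={conf:.2f} lift={lft:.2f}")
```

The prune step is what keeps the bottom-up search tractable: a candidate is counted only if all of its smaller subsets were already frequent. The cost noted on the slide still applies, because each level requires another pass over the transactions.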
  16. 16. Another Example
  17. 17. Other Rule Assessment Measures ● Added Value ● All-confidence ● Causal Confidence ● Causal Support ● Certainty Factor ● Chi-Squared ● Cross-Support Ratio ● Collective Strength ● Confidence ● Conviction ● Cosine ● Coverage ● Descriptive Confirmed Confidence ● Difference of Confidence ● Example & Counter-Example Rate ● Fisher's Exact Test ● Gini Index ● Hyper-Confidence ● Hyper-Lift ● Imbalance Ratio ● Improvement ● Jaccard Coefficient ● J-Measure ● Kappa ● Klosgen ● Kulczynski ● Goodman-Kruskal Lambda ● Laplace Corrected Confidence ● Least Contradiction ● Lerman Similarity ● Leverage ● Lift ● MaxConf ● Mutual Information ● Odds Ratio ● Phi Correlation Coefficient ● Ralambondrainy Measure ● Relative Linkage Disequilibrium ● Relative Support ● Rule Power Factor ● Sebag-Schoenauer Measure ● Support ● Varying Rates Liaison ● Yule's Q ● Yule's Y
  18. 18. Support, Relative Support
      Support:
      ● Support of a rule is defined as the number of transactions that contain both X and Y.
      ● Used as a measure of the significance of a rule.
      Symmetric Measure
      Range: [0, INF)
      Formula:
      Relative Support:
      ● Relative Support is the fraction of transactions that contain both X and Y.
      ● ⇒ Empirical Joint Probability of the items comprising the rule.
      ● Used as a measure of the significance of a rule.
      Symmetric Measure
      Range: [0, 1]
      Formula:
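One standard way to write these two quantities, consistent with the definitions on this slide. The notation is an assumption introduced here: T is the set of transactions, and the hat marks an empirical estimate.

```latex
\operatorname{supp}(X \Rightarrow Y) = \bigl|\{\, t \in T : X \cup Y \subseteq t \,\}\bigr|
\qquad
\operatorname{rsupp}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \Rightarrow Y)}{|T|} = \hat{P}(X, Y)
```

For example, if 3 of 5 transactions contain both X and Y, the support is 3 and the relative support is 3/5 = 0.6.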
  19. 19. Support, Relative Support
  20. 20. Confidence (a.k.a. Strength)
      Confidence:
      ● Confidence of a rule is the conditional probability that a transaction contains the consequent Y given that it contains the antecedent X.
      ● The problem with confidence is that it is sensitive to the frequency of the consequent Y in the database.
      ● Because of the way confidence is calculated, consequents with higher support automatically produce higher confidence values, even if there is no association between the items.
      Asymmetric Measure
      Range: [0, 1]
      Formula:
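A standard form of the confidence formula, matching the conditional-probability definition above. Here P(·) denotes relative support, a notation assumed for this sketch.

```latex
\operatorname{conf}(X \Rightarrow Y) = \frac{P(X, Y)}{P(X)} = P(Y \mid X)
```

The sensitivity to the consequent follows directly. If X and Y are statistically independent, then P(X, Y) = P(X) P(Y), so

```latex
\operatorname{conf}(X \Rightarrow Y) = \frac{P(X)\,P(Y)}{P(X)} = P(Y)
```

A consequent that appears in 90% of all transactions therefore yields a confidence of 0.9 for any independent antecedent, with no real association behind it.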
  21. 21. Confidence (a.k.a. Strength)
  22. 22. Lift (a.k.a. Interest)
      Lift:
      ● Lift is defined as the ratio of the observed joint probability of X and Y to the expected joint probability if they were statistically independent.
      ● Lift is susceptible to noise in small databases.
      ● Because of the way lift is calculated, rare itemsets with low counts (low probability) that by chance occur together a few times (or only once) will produce enormous lift values.
      Symmetric Measure
      Range: [0, INF) (1 means independence)
      Formula:
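A standard form of the lift formula, consistent with the definition above, followed by a small worked case for the noise problem. P(·) denotes relative support and N the number of transactions; both are notation assumed here, and the counts in the example are invented.

```latex
\operatorname{lift}(X \Rightarrow Y) = \frac{P(X, Y)}{P(X)\,P(Y)} = \frac{\operatorname{conf}(X \Rightarrow Y)}{P(Y)}
```

Suppose X and Y each occur exactly once among N = 10,000 transactions, and that single occurrence happens to be the same transaction:

```latex
\operatorname{lift}(X \Rightarrow Y) = \frac{1/N}{(1/N)(1/N)} = N = 10{,}000
```

A single coincidence produces a lift of 10,000, which is why lift is usually read together with support or leverage.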
  23. 23. Lift (a.k.a. Interest)
  24. 24. Coverage (a.k.a. antecedent support or LHS support)
      Coverage:
      ● Coverage is defined as the relative support of the antecedent X, i.e. it is the fraction of transactions that contain X.
      ● ⇒ Empirical Probability of the item X.
      ● Used as a measure of the significance of a rule.
      Asymmetric Measure
      Range: [0, 1]
      Formula:
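Written out under the same conventions (T is the set of transactions and P(·) is relative support, both assumed notation), coverage and its relation to the other basic measures look like this:

```latex
\operatorname{cover}(X \Rightarrow Y) = P(X) = \frac{\bigl|\{\, t \in T : X \subseteq t \,\}\bigr|}{|T|}
\qquad
P(X, Y) = \operatorname{cover}(X \Rightarrow Y) \cdot \operatorname{conf}(X \Rightarrow Y)
```

The second identity shows that relative support factors into how often the rule can be applied (coverage) and how often it is right when applied (confidence).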
  25. 25. Difference of Confidence Difference of Confidence: ● . ● . ● . Asymmetric Measure Range: [-1, 1] Formula:
  26. 26. Certainty Factor (a.k.a. Loevinger)
      Certainty Factor:
      ● It is a measure of the variation of the probability that Y is in a transaction when only considering transactions with X.
      ● An increasing CF means a decrease of the probability that Y is not in a transaction that X is in. Negative CFs have a similar interpretation.
      Asymmetric Measure
      Range: [-1, 1] (0 means independence)
      Formula:
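A commonly used piecewise form of the certainty factor that matches the stated range of [-1, 1]. This is offered as an assumption about which variant the slide intends; some references give only the first branch. P(Y) denotes the relative support of Y.

```latex
\operatorname{CF}(X \Rightarrow Y) =
\begin{cases}
\dfrac{\operatorname{conf}(X \Rightarrow Y) - P(Y)}{1 - P(Y)} & \text{if } \operatorname{conf}(X \Rightarrow Y) > P(Y) \\[2ex]
\dfrac{\operatorname{conf}(X \Rightarrow Y) - P(Y)}{P(Y)} & \text{if } \operatorname{conf}(X \Rightarrow Y) < P(Y) \\[2ex]
0 & \text{otherwise}
\end{cases}
```

With conf(X ⇒ Y) = 0.9 and P(Y) = 0.6, the first branch gives (0.9 − 0.6) / (1 − 0.6) = 0.75: knowing X removes three quarters of the remaining uncertainty about Y.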
  27. 27. Leverage
      Leverage:
      ● Leverage measures the difference between the observed and expected joint probability of XY, assuming that X and Y are independent.
      ● Leverage gives an absolute measure of how surprising a rule is and should be used together with lift.
      ● Can be interpreted as the gap to independence.
      Symmetric Measure
      Range: [-1, 1] (0 means independence)
      Formula:
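A standard form of the leverage formula, consistent with the "observed minus expected" definition above. P(·) denotes relative support, an assumed notation.

```latex
\operatorname{leverage}(X \Rightarrow Y) = P(X, Y) - P(X)\,P(Y)
```

For the rare-pair case where X and Y each occur once in N = 10,000 transactions and always together, leverage is 1/N − 1/N² ≈ 0.0001. The value stays close to zero where lift would be enormous, which illustrates why the two are meant to be used together.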
  28. 28. Leverage Rule A→ E may be preferable over the first two because it is simpler and has higher leverage
  29. 29. Jaccard Coefficient (a.k.a. Coherence)
      Jaccard Coefficient:
      ● This coefficient measures the similarity between two sets.
      Symmetric Measure
      Range: [0, 1] (0 means X and Y never occur together)
      Formula:
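Applied to a rule, the two sets being compared are the transactions containing X and the transactions containing Y. A standard form, with P(·) as relative support (assumed notation):

```latex
\operatorname{jaccard}(X \Rightarrow Y) = \frac{P(X, Y)}{P(X) + P(Y) - P(X, Y)}
```

The numerator is the share of transactions containing both X and Y, and the denominator is the share containing at least one of them. The value is therefore 1 only when X and Y always appear together.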
  30. 30. Jaccard Coefficient (a.k.a. Coherence)
  31. 31. Contingency Table for X and Y
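The 2×2 contingency table for a rule counts the transactions in each combination of "contains X" and "contains Y". Below is a minimal sketch of how those four counts might be computed; the `Contingency` type, its field names, and the toy data are assumptions made for illustration.

```python
from typing import Iterable, NamedTuple, Set


class Contingency(NamedTuple):
    """Transaction counts for each cell of the 2x2 table."""
    x_y: int          # contains X and Y
    x_not_y: int      # contains X but not Y
    not_x_y: int      # contains Y but not X
    not_x_not_y: int  # contains neither


def contingency(transactions: Iterable[Set[str]], x: Set[str], y: Set[str]) -> Contingency:
    """Count transactions by whether they contain itemset x and itemset y."""
    x, y = frozenset(x), frozenset(y)
    cells = [0, 0, 0, 0]
    for t in transactions:
        has_x, has_y = x <= t, y <= t
        # Index 0..3 matches the field order of Contingency.
        cells[(0 if has_x else 2) + (0 if has_y else 1)] += 1
    return Contingency(*cells)


if __name__ == "__main__":
    data = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
            {"milk"}, {"bread", "milk", "eggs"}]
    table = contingency(data, {"bread"}, {"milk"})
    print("            Y    !Y")
    print(f"   X   {table.x_y:5d} {table.x_not_y:5d}")
    print(f"  !X   {table.not_x_y:5d} {table.not_x_not_y:5d}")
```

Measures such as conviction and the odds ratio are functions of these four cells, so computing the table once per rule is a convenient intermediate step.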
  32. 32. Conviction
      Conviction:
      ● Conviction measures the expected error of the rule, i.e. how often X occurs in a transaction where Y does not.
      ● Thus it can be said to measure the strength of the rule with respect to the complement of the consequent.
      ● If the joint probability of X and !Y is less than that expected under independence of X and !Y, then conviction is high, and vice versa.
      ● An alternative to confidence, which was found not to capture the direction of associations adequately.
      Asymmetric Measure
      Range: [0, INF) (1 means independence; rules that always hold have INF)
      Formula:
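A standard form of the conviction formula, consistent with the description above. P(·) denotes relative support and ¬Y the absence of Y, both assumed notation.

```latex
\operatorname{conv}(X \Rightarrow Y)
= \frac{1 - P(Y)}{1 - \operatorname{conf}(X \Rightarrow Y)}
= \frac{P(X)\,P(\lnot Y)}{P(X, \lnot Y)}
```

The right-hand form is the ratio of how often X would appear without Y under independence to how often it actually does. When the rule never fails, P(X, ¬Y) = 0 and conviction is infinite. With P(Y) = 0.6 and conf(X ⇒ Y) = 0.9, conviction is 0.4 / 0.1 = 4.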
  33. 33. Conviction
  34. 34. Odds Ratio
      Odds Ratio:
      ● It is defined as the odds of finding X in transactions which contain Y divided by the odds of finding X in transactions which do not contain Y.
      ● Lift is susceptible to noise in small databases.
      ● Odds ratios greater than 1 imply higher odds of Y occurring in the presence of X as opposed to its complement !X, whereas odds ratios smaller than 1 imply higher odds of Y occurring with !X.
      Symmetric Measure
      Range: [0, INF) (1 means independence)
      Formula:
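In terms of the four cell counts of the contingency table for X and Y, the definition above works out as follows. The symbols n with subscripts are notation assumed here for the number of transactions in each cell.

```latex
\operatorname{OR}(X \Rightarrow Y)
= \frac{n_{XY} / n_{\lnot X Y}}{n_{X \lnot Y} / n_{\lnot X \lnot Y}}
= \frac{n_{XY}\; n_{\lnot X \lnot Y}}{n_{X \lnot Y}\; n_{\lnot X Y}}
```

The numerator of the first fraction is the odds of X among transactions containing Y, and the denominator is the odds of X among transactions without Y. The result is undefined or infinite when an off-diagonal cell is zero, so small counts need special handling.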
  35. 35. Odds Ratio
  36. 36. Mining the patterns to Develop Rules
  37. 37. Filter Used
      (CASE
         WHEN Itemset only present on CRITICALCLASS Side THEN (FLOAT(CriticalClass_oddsRatio) - 0)
         WHEN Itemset present on BOTH Side THEN (FLOAT(CriticalClass_oddsRatio) - FLOAT(Gen_oddsRatio))
         WHEN Itemset only present on GENERAL Side THEN (0 - FLOAT(Gen_oddsRatio))
       END) AS Diff_CriticalClassGen_OddsRatio,
      Diff_CriticalClassGen_Conviction,
      Diff_CriticalClassGen_Supp,
      Diff_CriticalClassGen_Certainty,
      *
      FROM {
        | #Handling INFINITY value
        |FROM
        | Table
        |WHERE
        | #viewing entries ONLY present on GEN side OR viewing entries ONLY present on CriticalClass side
      }
      ORDER BY
        #rule_rhs desc,
        #rule_lhs desc,
        Diff_CriticalClassGen_OddsRatio DESC,
        Diff_CriticalClassGen_Conviction DESC,
        Diff_CriticalClassGen_Supp DESC,
        #Diff_CriticalClassGen_Certainty desc,
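The idea behind this filter can be sketched outside SQL: compare each rule's odds ratio in the critical class against its odds ratio in the general population, treat a missing side as 0, and rank by the difference. The sketch below is an interpretation, not the presenter's query. The function name, the dictionary inputs, and in particular the `cap` used to replace infinite values are assumptions; the slide only notes that infinity is handled, not how.

```python
import math
from typing import Dict, List, Tuple


def rank_by_odds_ratio_diff(
    critical: Dict[str, float],
    general: Dict[str, float],
    cap: float = 1e6,  # hypothetical ceiling substituted for infinite odds ratios
) -> List[Tuple[str, float]]:
    """Return (rule, critical - general) pairs, largest difference first.

    A rule missing from one side contributes 0 for that side, mirroring
    the three branches of the CASE expression.
    """
    def finite(value: float) -> float:
        # Replace +inf so that inf - inf cannot produce NaN.
        return cap if math.isinf(value) else value

    rows = []
    for rule in set(critical) | set(general):
        diff = finite(critical.get(rule, 0.0)) - finite(general.get(rule, 0.0))
        rows.append((rule, diff))
    # Sort by difference descending; break ties by rule name for stable output.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


if __name__ == "__main__":
    # Invented odds ratios for illustration only.
    critical_rules = {"A => C": 4.0, "B => C": 2.0}
    general_rules = {"B => C": 3.0, "D => C": 1.5}
    for rule, diff in rank_by_odds_ratio_diff(critical_rules, general_rules):
        print(f"{rule}: {diff:+.2f}")
```

Rules at the top of the ranking are much more strongly associated in the critical class than in general, which makes them the natural starting points for the manual pattern mining described on the next slide.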
  38. 38. Mining Pattern
      Step 1: Run the query.
      Step 2: Be creative and, using some intuition, select an item.
      Step 3: Modify the query so that it returns pairs containing the selected item, and again use creativity and intuition to select an item.
      Using the discovered pair to extend the pattern further
      Step a: Use the discovered pair as the lhs part and run the query on the table with increased rule length.
      Step b: Be creative and, using some intuition, select the next item.
      Using the discovered pair for further analysis
      Step a: Use the existing pair to get the raw data and analyze it.
      Step b: Use the existing pair to get the derived parameter data and analyze it (also check for an existing critical class signature + location).
      Step c: If the discovered pair is indeed adequate and is finding some critical class, use this signature.
      - Test for FP
      - If adequate, use it for blocking
      Rule Developed
  39. 39. Mining the patterns to Develop Rules Limitation and Further Work
  40. 40. Issues and Fine tuning
      ● Issues because of data inconsistency in streaming data
      ● Modifying data preprocessing for the itemset
      ● Version on Derived Parameters