1. Hiding sensitive items in association rule mining: exploration of knowledge and privacy preserving. Presented by: M. Swarna Rekha, K. Reshma, Ch. Savanth, Vishnu Babu, Nag Santhosh, Moses.
2. Talk overview: Growing privacy concerns. Why privacy preserving data mining? Approaches. Problem statement. Apriori algorithm. Problem description. Proposed algorithms. Illustrative examples. Analysis. Conclusions. Software and hardware requirement specifications.
7. Why privacy preserving data mining? Multinational corporations: a company would like to mine its data for globally valid results, but national laws may prevent transborder data sharing. Public use of private data: data mining enables research studies of large populations, but those populations are reluctant to release personal information.
8. Example: patient records. Patient health records are split among providers: insurance company, pharmacy, doctor, hospital. Each agrees not to release the data without the patient's consent. A medical study wants correlations across providers, e.g. rules relating complaints/procedures to "unrelated" drugs. Does this need patient consent, and that of every other patient? It shouldn't: the rules should not disclose any individual patient's data.
9. Approaches: The first approach is to alter the data before delivery to the data miner, so that real values are obscured. The second approach assumes the data is distributed between two or more sites, and these sites cooperate to learn the global data mining results without revealing the data at their individual sites.
10. Introduction: Our technique of altering the data is to selectively modify individual values in a database to prevent the discovery of a set of rules. Here we apply a group of heuristic solutions to reduce the number of occurrences of some frequent itemsets below a user-specified minimum threshold. The second approach is to allow users access to only a subset of the data, while global data mining results can still be discovered.
11. Problem statement: Mining of association rules. Let I = {i1, i2, …, im} be a set of literals, called items. Given a set of transactions D, where each transaction T is a set of items such that T ⊆ I, an association rule is an expression X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. An example of such a rule is that 90% of customers who buy hamburgers also buy coke. The 90% here is called the confidence of the rule, meaning that 90% of the transactions that contain X also contain Y. The support of the rule is the percentage of transactions that contain both X and Y. The problem of mining association rules is to find all rules whose support and confidence exceed the user-specified minimum support and minimum confidence.
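The two measures just defined can be computed directly. A minimal sketch, using a made-up transaction set (not the database from the later slides):

```python
def support(D, itemset):
    """Fraction of transactions in D containing every item in `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(D, X, Y):
    """confidence(X => Y) = support(X ∪ Y) / support(X)."""
    return support(D, X | Y) / support(D, X)

# Made-up database for illustration only
D = [
    {"hamburger", "coke"},
    {"hamburger", "coke", "fries"},
    {"hamburger"},
    {"coke"},
]

print(support(D, {"hamburger", "coke"}))       # 0.5  (2 of 4 transactions)
print(confidence(D, {"hamburger"}, {"coke"}))  # 2/3  (2 of the 3 hamburger transactions)
```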
14. To find Lk, a set of candidate k-itemsets, denoted Ck, is generated by joining Lk-1 with itself. Two members of Lk-1 are joinable if their first (k-2) items are in common.
17. Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets; min_sup: minimum support)
1. for each itemset l1 ∈ Lk-1
2.   for each itemset l2 ∈ Lk-1
3.     if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
4.       c = l1 join l2;           // join step: generate candidates
5.       if has_infrequent_subset(c, Lk-1) then
6.         delete c;               // prune step: remove unfruitful candidate
7.       else add c to Ck;
8.     }
9. return Ck;

Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)  // use prior knowledge
1. for each (k-1)-subset s of c
2.   if s ∉ Lk-1 then
3.     return TRUE;
4. return FALSE;
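The pseudocode above can be made runnable. A sketch in Python (function names mirror the pseudocode; itemsets are represented as sorted tuples, which are my own convention):

```python
from itertools import combinations

def has_infrequent_subset(c, L_prev):
    """Prune step: c is unfruitful if any (k-1)-subset is not frequent."""
    k = len(c)
    return any(s not in L_prev for s in combinations(c, k - 1))

def apriori_gen(L_prev):
    """Join L_{k-1} with itself: two itemsets join if their first k-2
    items agree and the last items are ordered; then prune candidates
    that have an infrequent (k-1)-subset."""
    L_prev = set(L_prev)
    Ck = set()
    for l1 in L_prev:
        for l2 in L_prev:
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:  # join condition
                c = l1 + (l2[-1],)
                if not has_infrequent_subset(c, L_prev):
                    Ck.add(c)
    return Ck

L2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(sorted(apriori_gen(L2)))  # [('A', 'B', 'C'), ('B', 'C', 'E')]
```

Note that ("A", "B", "E") is never emitted: although ("A", "B") and ("B", "E") are frequent, ("A", "E") is not, so the prune step removes it.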
18. Example: transaction database D. [Table lost in extraction.] Scan D for the count of each candidate in C1, then compare each candidate's support count with the minimum support count to obtain L1.
19. Generate the C2 candidates from L1 and scan D for the count of each candidate to obtain L2; generate the C3 candidates from L2 and scan D again to obtain L3. This illustrates the generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.
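The level-wise procedure just described can be sketched end to end. Since the slides' example tables did not survive extraction, the database below is an assumed stand-in with the same minimum support count of 2:

```python
def apriori(D, min_count):
    """Level-wise Apriori: one scan of D per level, keeping itemsets
    whose support count meets min_count."""
    items = sorted({i for t in D for i in t})
    Lk = {(i,) for i in items
          if sum(1 for t in D if i in t) >= min_count}
    frequent = set(Lk)
    while Lk:
        # Join step: extend (k)-itemsets sharing their first k-1 items,
        # then prune by counting support in a fresh scan of D.
        Ck = {l1 + (l2[-1],) for l1 in Lk for l2 in Lk
              if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]}
        Lk = {c for c in Ck
              if sum(1 for t in D if set(c) <= t) >= min_count}
        frequent |= Lk
    return frequent

# Assumed toy database (the slides' table was lost)
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(apriori(D, min_count=2)))
```

With this database, item D drops out at level 1, pairs such as (A, B) drop out at level 2, and the only frequent 3-itemset is (B, C, E).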
32. Proposed algorithms: To hide an association rule, we can either decrease its support or decrease its confidence below the pre-specified minimum support or minimum confidence. To decrease the confidence of a rule, we propose two algorithms: Increase Support of LHS First (ISLF) and Decrease Support of RHS First (DSRF). The first algorithm tries to increase the support of the left-hand side of the rule; if that is not successful, it tries to decrease the support of the right-hand side.
33. Algorithm ISLF
Input: (1) a source database D, (2) a min_support, (3) a min_confidence, (4) a set of hidden items H
Output: a transformed database D', where rules containing H on the RHS are hidden
Algorithm:
1. Find large 1-itemsets from D;
2. For each hidden item h ∈ H
3.   If h is not a large 1-itemset, then H := H − {h};
4. If H is empty, then EXIT;  // no association rule contains H in its RHS
5. Find large 2-itemsets from D;
6. For each h ∈ H {
7.   For each large 2-itemset containing h {
34.
8.    Compute the confidence of rule U, where U is a rule x → h;
9.    If confidence > min_conf, then {  // Increase Support of LHS
10.     Find T1 = {t in D | t partially supports LHS(U)};
11.     Sort T1 in descending order by the number of supported items;
12.     Repeat {
13.       Choose the first transaction t from T1;
14.       Modify t to support LHS(U);
15.       Compute the support and confidence of U; }
16.     Until (confidence(U) < min_conf or T1 is empty);
17.   }  // end if confidence > min_conf
18.   If confidence > min_conf, then {  // Decrease Support of RHS
19.     Find T2 = {t in D | t supports RHS(U)};
20.     Sort T2 in descending order by the number of supported items;
21.     Repeat {
35.
22.       Choose the first transaction t from T2;
23.       Modify t to partially support RHS(U);
24.       Compute the support and confidence of U; }
25.     Until (confidence(U) < min_conf or T2 is empty);
26.   }  // end if confidence > min_conf
27.   If confidence > min_conf, then
28.     CANNOT HIDE h;
29.   Else
30.     Update D with the new transaction t;
31. }  // end for each large 2-itemset
32. Remove h from H;
33. }  // end for each h ∈ H
Output the updated D as the transformed D';
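The ISL step of the algorithm above can be sketched compactly in Python. This is a minimal sketch with helper names of my own; transactions are sets, and since the slides' examples use single-item LHSs (where "partially supports" cannot apply literally), candidates here are transactions not yet supporting the LHS, matching Example 1 where T5 = 100 gains item B:

```python
def conf(D, lhs, rhs):
    """confidence(lhs => rhs) over transactions represented as sets."""
    supp_lhs = [t for t in D if lhs <= t]
    both = [t for t in supp_lhs if rhs <= t]
    return len(both) / len(supp_lhs) if supp_lhs else 0.0

def isl_hide(D, lhs, h, min_conf):
    """ISL step: enlarge supp(lhs) by modifying non-supporting
    transactions to support lhs, which lowers conf(lhs => h).
    Returns True once the confidence drops below min_conf."""
    T1 = [t for t in D if not lhs <= t]
    T1.sort(key=lambda t: len(t & lhs), reverse=True)  # most lhs items first
    for t in T1:
        t |= lhs                       # modify t (in place) to support lhs
        if conf(D, lhs, {h}) < min_conf:
            return True
    return False

# Assumed toy database: B => C holds with confidence 3/4 = 0.75
D = [{"A", "B", "C"}, {"B", "C"}, {"B", "C"}, {"B"}, {"A"}, {"A"}]
isl_hide(D, {"B"}, "C", min_conf=0.70)
print(conf(D, {"B"}, {"C"}))           # 0.6 -- the rule is now hidden
```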
36. Example running the ISLF algorithm. Example 1: To hide item C, the rule B ⇒ C (support 50%, confidence 75%) is hidden if transaction T5 is modified from 100 to 110 using ISL. To hide item B, the rule A ⇒ B (67%, 83%) is hidden if transaction T1 is modified from 111 to 101 using DSR. [Table lost: database before and after hiding items C and B using ISLF.]
37. Example 2: Here we reverse the order of hiding the items. To hide item B, the rule C ⇒ B (50%, 75%) is hidden if transaction T5 is modified from 100 to 101 using ISL. To hide item C, the rule A ⇒ C (83%, 83%) is hidden if transaction T1 is modified from 111 to 110 using DSR. [Table lost: database before and after hiding items B and C using ISLF.]
38. Examples running the DSRF algorithm. Example 3: To hide item C, the rule B ⇒ C (50%, 75%) is hidden if transaction T1 is modified from 111 to 110 using DSR. To hide item B, the rule C ⇒ B (50%, 67%) is hidden as a side effect of modifying transaction T1. [Table lost: database before and after hiding items C and B using DSRF.]
39. Example 4: Here we reverse the order of hiding the items. To hide item B, the rule C ⇒ B (50%, 75%) is hidden if transaction T1 is modified from 111 to 101 using DSR. To hide item C, the rule B ⇒ C is hidden as a side effect of modifying transaction T1. [Table lost: database before and after hiding items B and C using DSRF.]
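The DSR step used in these examples can be sketched the same way (helper names are mine; transactions are sets): remove the hidden item from transactions that support the whole rule, so that they only partially support the RHS, as when T1 goes from 111 to 110 in Example 3:

```python
def conf(D, lhs, rhs):
    """confidence(lhs => rhs) over transactions represented as sets."""
    supp_lhs = [t for t in D if lhs <= t]
    both = [t for t in supp_lhs if rhs <= t]
    return len(both) / len(supp_lhs) if supp_lhs else 0.0

def dsr_hide(D, lhs, h, min_conf):
    """DSR step: remove h from transactions supporting the whole rule,
    decreasing supp(lhs ∪ {h}) and hence conf(lhs => h)."""
    T2 = [t for t in D if (lhs | {h}) <= t]
    T2.sort(key=len, reverse=True)     # most supported items first
    for t in T2:
        t.discard(h)                   # t now only partially supports RHS
        if conf(D, lhs, {h}) < min_conf:
            return True
    return False

# Assumed toy database: B => C holds with confidence 3/4 = 0.75
D = [{"A", "B", "C"}, {"B", "C"}, {"B", "C"}, {"B"}, {"A"}, {"A"}]
dsr_hide(D, {"B"}, "C", min_conf=0.70)
print(conf(D, {"B"}, {"C"}))           # 0.5 -- the rule is now hidden
```

One modified transaction suffices here: dropping C from the largest supporting transaction cuts the rule's joint support from 3 to 2 while leaving supp(B) at 4.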
40. Analysis: The first characteristic is that the transformed databases differ under different orderings of the hidden items: in the examples above, databases D2 and D4 are generated using ISLF, while D5 and D6 are generated using DSRF. The second characteristic is the efficiency of the proposed algorithms compared with Dasseni's algorithm: ISLF and DSRF require fewer database scans and prune more association rules than Dasseni's algorithm. [Table lost: DB scans and rules pruned when hiding item C using ISLF.]
41. One reason that Dasseni's approach does not prune rules is that the hidden rules are given in advance: our approach must hide all rules containing the hidden items on the right-hand side, whereas Dasseni's approach can hide just some of those rules. The third characteristic is the relative efficiency of ISLF and DSRF: DSRF appears more effective when the support count of the hidden item is large. When the support of the right-hand side of a rule is large, increasing the support of the left-hand side usually does not reduce the rule's confidence, whereas decreasing the support of the right-hand side usually does.
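This asymmetry can be seen numerically. In the made-up ten-transaction database below, the hidden item h is very frequent, so a transaction newly modified to support the LHS usually already contains h and the confidence does not move; deleting h from a supporting transaction, by contrast, always shrinks the numerator:

```python
def conf(D, lhs, rhs):
    """confidence(lhs => rhs) over transactions represented as sets."""
    supp_lhs = [t for t in D if lhs <= t]
    return (sum(1 for t in supp_lhs if rhs <= t) / len(supp_lhs)
            if supp_lhs else 0.0)

# h appears in 9 of 10 transactions; X => h holds with confidence 5/5 = 1.0
D = ([{"X", "h"} for _ in range(5)]
     + [{"h"} for _ in range(4)]
     + [set()])
print(conf(D, {"X"}, {"h"}))   # 1.0

# ISL-style change: add X to a transaction lacking it -- it already has h
D[5].add("X")
print(conf(D, {"X"}, {"h"}))   # 1.0 -- confidence unchanged

# DSR-style change: remove h from a transaction supporting the whole rule
D[0].discard("h")
print(conf(D, {"X"}, {"h"}))   # 5/6 -- confidence drops immediately
```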
42. Conclusions: We have examined the database privacy problems caused by data mining technology and proposed two algorithms for hiding sensitive data in association rule mining. The proposed algorithms modify database transactions so that the confidence of the association rules is reduced. Examples demonstrating the proposed algorithms have been shown, and their efficiency has been compared with Dasseni's approach: our approach requires fewer database scans and prunes more hidden rules. However, our approach must hide all rules containing the hidden items on the right-hand side, whereas Dasseni's approach can hide some of the specified rules.
43. Software requirement specification: The proposed algorithms can be implemented using Java as the front end and Oracle 9i as the back end under a Windows environment. Hardware requirement specification: Intel Core 2 Duo processor; RAM size; RAM speed.