A FIRST STEP TOWARDS
CONTENT PROTECTING
PLAGIARISM DETECTION
Cornelius Ihle, Moritz Schubotz,
Norman Meuschke, Bela Gipp
1. Problem
Detecting academic plagiarism without revealing a document’s plaintext
2. Methodology
Similarity detection in arXiv documents using Bibliographic Coupling computed from
hashed reference combinations
3. Results
Hashed reference combinations are effective in preventing preimage attacks and can be
used to compare document features in a content protecting manner.
OUTLINE
PROBLEM
“The use of ideas, concepts, words, or
structures without appropriately
acknowledging the source to benefit in
a setting where originality is expected.”
ACADEMIC PLAGIARISM
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud,
and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Current Plagiarism Detection Systems:
• Perform sophisticated text analysis
• Centralized systems run by individual, typically commercial providers
• Require disclosing the full content of input documents
• Need explicit approval from the author to comply with data protection
laws (e.g., GDPR)
Prior research:
• Analyzing citation-based document features [1]
• Analyzing image-based document features [2]
• Analyzing mathematical document features [3]
PROBLEM SUMMARY
HyPlag: Document view with features
[1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and
Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011.
[2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism
Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
[3] N. Meuschke, V. Stange, M. Schubotz, M. Kramer, and B. Gipp, “Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical
Content and Citations,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2019.
Content protection is needed for
• Research grant proposals
• Documents prepared in cooperation with companies under
non-disclosure agreements
• Student work subject to data protection and copyright laws
• Work that is not openly accessible and requires explicit consent from
the rights owner before it is transferred to a third party
PROBLEM SUMMARY CONT.
Necessary advancement:
Similarity detection methods that do not require plaintext inputs
METHODOLOGY
NON-TEXTUAL DOCUMENT FEATURES
BIBLIOGRAPHIC DOCUMENT FEATURES
REFERENCES
[1] ea5afcae6d27e161d33706bd5de5d6dcb7d84453
[2] 1dd94bca629645d86ca56f9ea95e74910e50fdb6
[3] 83a6b299bd122d80e493e7720fd1dcad8180e3e2
[4] e614048f6bc6e8c8769d1833c0b5d90433093eca
[5] d6c4222e38eb51a1cae754eb9b0ed103effcbbe1
[6] 01be490050587de42ac8e020b7549475ee14e1bf
[7] 4c5bb8e9f3c6b38dd9755a529a4cef85c412df4b
[8] 265da1591b31a9c68f2f8b37c12162d444ae2df7
[9] 74e8befc7a13a5f0881abbe6e67b5feddf9dbae2
[10] ff54b90c67fd213f354069b2294dfa6286bb3e70
[11] 3a21952904aba93d310e44cd53cb5eec3587e292
[12] 2308743b3a0186d420ab36ab1149cb11877d8a83
[13] 6919e2b27c7cea13db3981753656f9066a135c0d
[14] e00551c1ad8f5805d078c818208eafe15e66d855
[15] 631e1e9d88a46dfb397f5b5f0e4bf9e4878116b5
DATASET
[4] Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, and Bela Gipp. 2019. Improving Academic Plagiarism Detection for STEM Documents by
Analyzing Mathematical Content and Citations. In Proceedings ACM/IEEE Joint Conference on Digital Libraries. https://doi.org/10.1109/JCDL.2019.00026
Ten confirmed cases (each a plagiarized
document and its source)
embedded in the dataset
Task Dataset
(105,120 arXiv
docs)
105,140 Documents [4]
The final dataset contained
• 92,082 documents
• 1,726,359 unique bibliographic references
We excluded documents
• without processable reference data
• with more than 150 references
Preventing preimage attacks and preserving
privacy through hash combinations
1. Form reference sets of k elements each
2. Generate all possible k-element combinations of a document's references
3. Hash each combination
SUBSET HASH GENERATION
Hashing single references is
vulnerable to a dictionary/preimage
attack
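The subset hash generation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `subset_hashes`, the separator character, the sample reference identifiers, and the choice of SHA-1 are all assumptions (SHA-1 is used only because the example digests on the references slide are 40 hex characters long).

```python
import hashlib
from itertools import combinations


def subset_hashes(references, k):
    """Return one hex digest per k-element combination of the references.

    Hypothetical helper: name, separator, and hash function are assumptions.
    """
    # Deduplicate and sort so a given combination always serializes the same
    # way, regardless of the order of references in the document.
    unique_refs = sorted(set(references))
    digests = set()
    for combo in combinations(unique_refs, k):
        # Join with a separator that is unlikely to occur in an identifier.
        payload = "\x1f".join(combo).encode("utf-8")
        digests.add(hashlib.sha1(payload).hexdigest())
    return digests


# Made-up reference identifiers for illustration only.
refs = ["ref-a", "ref-b", "ref-c", "ref-d"]
for k in (1, 2, 3):
    print(k, len(subset_hashes(refs, k)))
```

With k = 1 this reduces to hashing single references, which is the case the slide describes as vulnerable to a dictionary attack; larger k multiplies the number of candidate inputs an attacker has to enumerate.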
PRIVATE SET INTERSECTION
Detect matching hashes between the input document and the hashes 𝐻′ of previously
submitted documents to compute the private Bibliographic Coupling strength (BCS).
The fundamental operation is computed similarly to the
non-private Bibliographic Coupling strength:
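As a rough sketch of this comparison: standard Bibliographic Coupling strength is the number of references two documents share, and the private variant intersects hash sets instead of plaintext references. The helper names, the SHA-1 choice, and in particular the step that converts a shared-hash count back into a number of shared references (inverting the binomial coefficient) are assumptions made for illustration, not taken from the slides.

```python
import hashlib
from itertools import combinations
from math import comb


def subset_hashes(references, k):
    """Hash every k-element combination of the references (assumed scheme)."""
    unique_refs = sorted(set(references))
    return {
        hashlib.sha1("\x1f".join(combo).encode("utf-8")).hexdigest()
        for combo in combinations(unique_refs, k)
    }


def bc_strength(refs_a, refs_b):
    """Non-private BC strength: size of the shared reference set."""
    return len(set(refs_a) & set(refs_b))


def private_bc_strength(hashes_a, hashes_b, k):
    """Recover the number of shared references from shared k-subset hashes.

    Assumption: m shared references produce comb(m, k) shared hashes,
    so we search for the smallest m that explains the observed count.
    """
    shared = len(hashes_a & hashes_b)
    if shared == 0:
        return 0  # fewer than k shared references are not observable
    m = k
    while comb(m, k) < shared:
        m += 1
    return m


# Made-up reference identifiers for illustration only.
doc_a = ["ref-a", "ref-b", "ref-c", "ref-d", "ref-e"]
doc_b = ["ref-c", "ref-d", "ref-e", "ref-f"]
for k in (1, 2, 3):
    pbc = private_bc_strength(subset_hashes(doc_a, k), subset_hashes(doc_b, k), k)
    print(k, bc_strength(doc_a, doc_b), pbc)
```

One consequence of this sketch is that an overlap of fewer than k references yields no shared hashes at all, so under these assumptions k also acts as a lower bound on detectable overlap.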
PRIVATE SET INTERSECTION & PRIVATE BCS
RESULTS
To compare the effectiveness of PBC to the original BC method,
we computed s_PBC and s_BC for all ten test cases in our dataset using subset sizes
of k = 1, k = 2, and k = 3.
EFFECTIVENESS
The PBC and BC scores were equal for all test cases, showing
that private Bibliographic Coupling is as effective
as standard BC.
As expected, the storage space required for a preimage attack grows with the k-th power of the number of candidate references.
RESOURCE CONSUMPTION
We use the dblp bibliography [5] to estimate the number of existing references in the field of computer
science. (As of May 2020, dblp contained 5.05 million records.)
RESISTANCE TO PREIMAGE ATTACKS
[5] https://dblp.uni-trier.de
Computing the preimages (1ms/hash):
k = 1: 1.4 hours
k = 2: 404 years
k = 3: 680 million years
*Leading to a runtime complexity of O(n^k).
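The runtime estimates can be reproduced with a short calculation. The sketch below assumes that an attacker must hash every k-element combination of the 5.05 million dblp records at 1 ms per hash, and that a year has 365.25 days; the constant and function names are illustrative.

```python
from math import comb

DBLP_RECORDS = 5_050_000               # from the slide: dblp size as of May 2020
SECONDS_PER_HASH = 0.001               # from the slide: 1 ms per hash
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def preimage_seconds(n, k, seconds_per_hash=SECONDS_PER_HASH):
    """Time to hash all k-element combinations of n candidate references."""
    return comb(n, k) * seconds_per_hash


for k in (1, 2, 3):
    seconds = preimage_seconds(DBLP_RECORDS, k)
    hours = seconds / 3600
    years = seconds / SECONDS_PER_YEAR
    print(f"k = {k}: {hours:.3g} hours, {years:.3g} years")
```

Under these assumptions the results agree with the slide: about 1.4 hours for k = 1, about 404 years for k = 2, and about 680 million years for k = 3. Since comb(n, k) is approximately n^k / k! for n much larger than k, the growth matches the stated O(n^k) complexity.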
CONCLUSION
• BC and Private BC are equally effective
– Private BCS is less efficient due to the k-dependent overhead
• Hashed sets can prevent preimage attacks on Private BC
– The appropriate degree of domain inflation through combinations depends on the
hardware capabilities and required level of security
Decentralization
✓ Content protecting similarity detection methods
○ Peer-to-peer computation networks
OUTLOOK
Necessary advancement:
Integration into a peer-to-peer academic
cooperative network
CONTACT
Cornelius Ihle
@CorneliusIhle
ihle.cornelius@gmail.com
PAPER, DATA, CODE, PROTOTYPE
github.com/ag-gipp/20CppdData
OTHER PROJECTS & PUBLICATIONS
dke.uni-wuppertal.de

PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 

A First Step Towards Content Protecting Plagiarism Detection

  • 1. A FIRST STEP TOWARDS CONTENT PROTECTING PLAGIARISM DETECTION Cornelius Ihle, Moritz Schubotz, Norman Meuschke, Bela Gipp
  • 2. OUTLINE
      1. Problem: Detecting academic plagiarism without revealing a document’s plaintext
      2. Methodology: Similarity detection in arXiv documents using Bibliographic Coupling computed from hashed reference combinations
      3. Results: Hashed reference combinations are effective in preventing preimage attacks and can be used to compare document features in a content protecting manner.
  • 4. ACADEMIC PLAGIARISM
      “The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected.”
      Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
  • 5. PROBLEM SUMMARY
      Current plagiarism detection systems:
      • Perform sophisticated text analysis
      • Are centralized systems run by individual, typically commercial, providers
      • Require disclosure of the full content of input documents
      • Need explicit approval from the author to comply with data protection laws (e.g., GDPR)
      Prior research:
      • Analyzing citation-based document features [1]
      • Analyzing image-based document features [2]
      • Analyzing mathematical document features [3]
      Figure: HyPlag document view with features
      [1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011.
      [2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
      [3] N. Meuschke, V. Stange, M. Schubotz, M. Kramer, and B. Gipp, “Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2019.
  • 6. PROBLEM SUMMARY CONT.
      Content protection is needed for:
      • Research grant proposals
      • Documents compiled in cooperation with companies and covered by non-disclosure agreements
      • Student work subject to data protection and copyright laws
      • Work that is not openly accessible and requires explicit consent from the rights owner before it is transferred to a third party
      Necessary advancement: Similarity detection methods that do not require plaintext inputs
  • 9. BIBLIOGRAPHIC DOCUMENT FEATURES
      REFERENCES
      [1] ea5afcae6d27e161d33706bd5de5d6dcb7d84453
      [2] 1dd94bca629645d86ca56f9ea95e74910e50fdb6
      [3] 83a6b299bd122d80e493e7720fd1dcad8180e3e2
      [4] e614048f6bc6e8c8769d1833c0b5d90433093eca
      [5] d6c4222e38eb51a1cae754eb9b0ed103effcbbe1
      [6] 01be490050587de42ac8e020b7549475ee14e1bf
      [7] 4c5bb8e9f3c6b38dd9755a529a4cef85c412df4b
      [8] 265da1591b31a9c68f2f8b37c12162d444ae2df7
      [9] 74e8befc7a13a5f0881abbe6e67b5feddf9dbae2
      [10] ff54b90c67fd213f354069b2294dfa6286bb3e70
      [11] 3a21952904aba93d310e44cd53cb5eec3587e292
      [12] 2308743b3a0186d420ab36ab1149cb11877d8a83
      [13] 6919e2b27c7cea13db3981753656f9066a135c0d
      [14] e00551c1ad8f5805d078c818208eafe15e66d855
      [15] 631e1e9d88a46dfb397f5b5f0e4bf9e4878116b5
  • 10. DATASET
      Task dataset (105,120 arXiv docs) + ten confirmed cases (each a plagiarized document and its source) embedded in the dataset = 105,140 documents [4]
      We excluded documents
      • without processable reference data
      • with more than 150 references
      The final dataset contained
      • 92,082 documents
      • 1,726,359 unique bibliographic references
      [4] Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, and Bela Gipp. 2019. Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations. In Proceedings ACM/IEEE Joint Conference on Digital Libraries. https://doi.org/10.1109/JCDL.2019.00026
  • 11. SUBSET HASH GENERATION
      Preventing preimage attacks and preserving privacy through hash combinations
      Hashing single references is vulnerable to a dictionary/preimage attack. Instead:
      1. We form reference sets of k elements
      2. We form all possible combinations
      3. We hash the sets
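
The three steps on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `subset_hashes`, the newline-joined serialization, and the example identifiers are hypothetical, and SHA-1 is assumed only because the example digests on the references slide are 40 hexadecimal characters long.

```python
import hashlib
from itertools import combinations


def subset_hashes(references, k):
    """Return the SHA-1 hex digests of all k-element reference combinations."""
    # Deduplicate and sort so every combination is serialized in one canonical order.
    unique_refs = sorted(set(references))
    hashes = set()
    for combo in combinations(unique_refs, k):
        # Assumed serialization: identifiers joined by a newline, UTF-8 encoded.
        payload = "\n".join(combo).encode("utf-8")
        hashes.add(hashlib.sha1(payload).hexdigest())
    return hashes


# Hypothetical reference identifiers, for illustration only.
refs = ["doi:10.1000/a", "doi:10.1000/b", "doi:10.1000/c", "doi:10.1000/d"]
pair_hashes = subset_hashes(refs, k=2)
print(len(pair_hashes))  # 6 combinations of 2 out of 4 references
```

Sorting before combining means two documents that cite the same references in a different order still produce identical hashes, which the later comparison step relies on.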
  • 12. PRIVATE SET INTERSECTION & PRIVATE BCS
      Private set intersection: detection of matching hashes between the input document's hashes and the hashes of previously submitted documents, H′, to compute the private BCS.
      The fundamental operation is computed analogously to the non-private Bibliographic Coupling strength.
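
As an illustration of the comparison described on the slide above, the following sketch counts the combination hashes two documents share. The slide text does not reproduce the exact formula, so this is an assumption-laden sketch: `subset_hashes`, `private_bc_strength`, the use of SHA-1, and the toy reference lists are all hypothetical. It relies on one fact that holds whenever no hash collisions occur: if two documents share s references, they share exactly C(s, k) hashed k-combinations.

```python
import hashlib
from itertools import combinations
from math import comb


def subset_hashes(references, k):
    """SHA-1 digests of all k-element combinations (assumed serialization)."""
    unique_refs = sorted(set(references))
    return {
        hashlib.sha1("\n".join(combo).encode("utf-8")).hexdigest()
        for combo in combinations(unique_refs, k)
    }


def private_bc_strength(input_hashes, stored_hashes):
    """Count the combination hashes that occur in both documents."""
    return len(input_hashes & stored_hashes)


# Toy documents that share three references (ref2, ref3, ref4).
doc_a = ["ref1", "ref2", "ref3", "ref4"]
doc_b = ["ref2", "ref3", "ref4", "ref5"]

# Plain Bibliographic Coupling strength on the plaintext references.
shared_refs = len(set(doc_a) & set(doc_b))

for k in (1, 2, 3):
    shared = private_bc_strength(subset_hashes(doc_a, k), subset_hashes(doc_b, k))
    # Without collisions, the shared hash count equals C(shared_refs, k).
    print(k, shared, comb(shared_refs, k))
```

For k = 1 the shared hash count is the ordinary coupling strength itself; for larger k the count grows with C(s, k), so the same underlying overlap is recoverable without exchanging plaintext references.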
  • 14. EFFECTIVENESS
      To compare the effectiveness of PBC with the original BC method, we computed s_PBC and s_BC for all ten test cases in our dataset using subset sizes of k = 1, k = 2, and k = 3.
      The PBC and BC scores were equal for all test cases, showing that private Bibliographic Coupling is as effective as standard BC.
  • 15. RESOURCE CONSUMPTION
      As expected, the storage space required for a preimage attack grows with the power of k.
  • 16. RESISTANCE TO PREIMAGE ATTACKS
      We use the dblp bibliography [5] to estimate the existing references in the field of computer science. (As of May 2020, dblp contains 5.05 million records.)
      Computing the preimages (1 ms/hash):
      k = 1: 1.4 hours
      k = 2: 404 years
      k = 3: 680 million years
      This leads to a runtime complexity of O(n^k).
      [5] https://dblp.uni-trier.de
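
The runtimes on the slide above can be reproduced with a short calculation. This sketch assumes the attacker enumerates every unordered k-combination of the 5.05 million dblp records, i.e. C(n, k) hashes, which is on the order of n^k for small k; the function name `preimage_seconds` and the 365.25-day year are choices made here, not taken from the slides.

```python
from math import comb

DBLP_RECORDS = 5_050_000   # record count quoted on the slide (May 2020)
SECONDS_PER_HASH = 0.001   # 1 ms per hash, as stated on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def preimage_seconds(n, k, seconds_per_hash=SECONDS_PER_HASH):
    """Time to hash every k-combination of n known references."""
    return comb(n, k) * seconds_per_hash


hours_k1 = preimage_seconds(DBLP_RECORDS, 1) / 3600
years_k2 = preimage_seconds(DBLP_RECORDS, 2) / SECONDS_PER_YEAR
years_k3 = preimage_seconds(DBLP_RECORDS, 3) / SECONDS_PER_YEAR

print(f"k = 1: {hours_k1:.1f} hours")
print(f"k = 2: {years_k2:.0f} years")
print(f"k = 3: {years_k3 / 1e6:.0f} million years")
```

Under these assumptions the script prints 1.4 hours, 404 years, and 680 million years, matching the figures on the slide.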
  • 17. CONCLUSION
      • BC and Private BC are equally effective
        – Private BCS is less efficient due to the k-dependent overhead
      • Hashed sets can prevent preimage attacks on Private BC
        – The appropriate degree of domain inflation through combinations depends on the hardware capabilities and the required level of security
  • 18. OUTLOOK
      Decentralization
      ✓ Content protecting similarity detection methods
      ○ Peer-to-peer computation networks
      Necessary advancement: Integration in a peer-to-peer academic cooperative network
  • 19. CONTACT
      Cornelius Ihle, @CorneliusIhle, ihle.cornelius@gmail.com
      PAPER, DATA, CODE, PROTOTYPE: github.com/ag-gipp/20CppdData
      OTHER PROJECTS & PUBLICATIONS: dke.uni-wuppertal.de