Presentation of our short paper
"A First Step Towards Content Protecting Plagiarism Detection"
at the Joint Conference on Digital Libraries (JCDL) 2020, taking place in Wuhan, China, on August 2, 2020.
Pre-print of the paper: https://arxiv.org/pdf/2005.11504.pdf
Code and Data: https://github.com/ag-gipp/20CppdData
1. A FIRST STEP TOWARDS
CONTENT PROTECTING
PLAGIARISM DETECTION
Cornelius Ihle, Moritz Schubotz,
Norman Meuschke, Bela Gipp
2. 1. Problem
Detecting academic plagiarism without revealing a document’s plaintext
2. Methodology
Similarity detection in arXiv documents using Bibliographic Coupling computed from
hashed reference combinations
3. Results
Hashed reference combinations are effective in preventing preimage attacks and can be
used to compare document features in a content protecting manner.
OUTLINE
4. “The use of ideas, concepts, words, or
structures without appropriately
acknowledging the source to benefit in
a setting where originality is expected.”
ACADEMIC PLAGIARISM
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that transcends theft, fraud,
and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
5. Current Plagiarism Detection Systems:
• Perform sophisticated text analysis
• Centralized systems run by individual, typically commercial providers
• Require disclosing the full content of input documents
• Need explicit approval from the author to comply with data protection
laws (e.g., GDPR)
Prior research:
• Analyzing citation-based document features [1]
• Analyzing image-based document features [2]
• Analyzing mathematical document features [3]
PROBLEM SUMMARY
HyPlag: Document view with features
[1] B. Gipp and N. Meuschke, “Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and
Longest Common Citation Sequence,” in Proc. ACM Symp. on Document Engin. (DocEng), 2011.
[2] N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, and B. Gipp, “An Adaptive Image-based Plagiarism
Detection Approach,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2018.
[3] N. Meuschke, V. Stange, M. Schubotz, M. Kramer, and B. Gipp, “Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical
Content and Citations,” in Proc. Joint Conf. on Digital Libraries (JCDL), 2019.
6. Content protection is needed for
• Research grant proposals
• Documents compiled in cooperation with companies which entail
non-disclosure agreements
• Student work subject to data protection and copyright laws
• Work that is not openly accessible and requires explicit consent from
the rights owner before the document is transferred to a third party
PROBLEM SUMMARY CONT.
Necessary advancement:
Similarity detection methods that do not require plaintext inputs
10. DATASET
[4] Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, and Bela Gipp. 2019. Improving Academic Plagiarism Detection for STEM Documents by
Analyzing Mathematical Content and Citations. In Proceedings ACM/IEEE Joint Conference on Digital Libraries. https://doi.org/10.1109/JCDL.2019.00026
Ten confirmed cases (each a plagiarized
document and its source)
embedded in the Task Dataset
(105,120 arXiv docs),
giving 105,140 documents in total [4]
The final dataset contained
• 92,082 documents
• 1,726,359 unique bibliographic references
We excluded documents
• without processable reference data
• with more than 150 references
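The two exclusion rules above can be sketched as a simple filter. This is only an illustration: the function name `keep_document`, the toy corpus, and the choice to model "no processable reference data" as a missing or empty reference list are assumptions, not part of the presented pipeline. The last two lines show how many k-element subsets a document at the 150-reference limit would produce.

```python
from math import comb

MAX_REFERENCES = 150  # threshold stated on the slide


def keep_document(references):
    """Return True if a document passes both exclusion rules."""
    # Assumption: "no processable reference data" is modeled as None or an empty list.
    return bool(references) and len(references) <= MAX_REFERENCES


# Hypothetical toy corpus: document id -> list of reference identifiers.
corpus = {
    "doc_a": ["r1", "r2"],
    "doc_b": [],                               # no processable references
    "doc_c": [f"r{i}" for i in range(151)],    # more than 150 references
}

kept = {doc_id for doc_id, refs in corpus.items() if keep_document(refs)}
print(kept)

# Number of k-element reference subsets for a document at the limit.
print(comb(MAX_REFERENCES, 2))
print(comb(MAX_REFERENCES, 3))
```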
11. Preventing preimage attacks and preserving
privacy through hash combinations
1. Form reference sets of k elements from the document's references
2. Generate all possible such combinations
3. Hash each resulting set
SUBSET HASH GENERATION
Hashing single references is
vulnerable to a dictionary/preimage
attack
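A minimal sketch of the subset hash generation described on this slide, under stated assumptions: references are represented by identifier strings, subsets are serialized in sorted order so that the same subset always yields the same hash, and SHA-256 is used as the hash function. The function name `subset_hashes` and the separator character are illustrative choices, not taken from the paper.

```python
import hashlib
from itertools import combinations


def subset_hashes(references, k):
    """Return the SHA-256 hex digests of all k-element subsets of the references."""
    # Deduplicate and sort so every subset is serialized in one canonical order.
    canonical = sorted(set(references))
    hashes = set()
    for subset in combinations(canonical, k):
        # Assumption: the unit separator never occurs inside a reference identifier.
        payload = "\x1f".join(subset).encode("utf-8")
        hashes.add(hashlib.sha256(payload).hexdigest())
    return hashes


# Hypothetical reference identifiers for one document.
refs = ["ref:a", "ref:b", "ref:c", "ref:d"]
print(len(subset_hashes(refs, 1)))  # one hash per reference
print(len(subset_hashes(refs, 2)))  # one hash per pair of references
```

With k = 1 this reduces to hashing single references, which is the case the slide flags as vulnerable to a dictionary attack; larger k enlarges the space an attacker has to enumerate.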
12. PRIVATE SET INTERSECTION
Detection of matching hashes between the input document's hashes 𝐻 and the hashes 𝐻′ of previously
submitted documents to compute the private Bibliographic Coupling strength (private BCS).
The fundamental operation is computed similarly to the
non-private Bibliographic Coupling strength:
PRIVATE SET INTERSECTION & PRIVATE BCS
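The formula shown on this slide is not preserved in the transcript. As a hedged sketch, the standard Bibliographic Coupling strength is the number of references two documents share, and a private counterpart can be taken as the number of subset hashes they share. The function names, the SHA-256 choice, and the toy reference lists below are assumptions for illustration only.

```python
import hashlib
from itertools import combinations


def subset_hashes(references, k):
    """Hash every k-element subset of the references (canonical sorted order)."""
    canonical = sorted(set(references))
    return {
        hashlib.sha256("\x1f".join(subset).encode("utf-8")).hexdigest()
        for subset in combinations(canonical, k)
    }


def bc_strength(refs_a, refs_b):
    """Non-private Bibliographic Coupling strength: number of shared references."""
    return len(set(refs_a) & set(refs_b))


def private_bc_strength(hashes_a, hashes_b):
    """Private variant: number of shared subset hashes, computed without plaintext."""
    return len(hashes_a & hashes_b)


# Hypothetical input document and previously submitted document.
doc = ["r1", "r2", "r3", "r4", "r5"]
prev = ["r2", "r3", "r4", "r5", "r6"]

print(bc_strength(doc, prev))
for k in (1, 2, 3):
    print(k, private_bc_strength(subset_hashes(doc, k), subset_hashes(prev, k)))
```

For k = 1 the hash intersection has the same size as the plaintext intersection. For k > 1 the raw number of shared hashes equals the number of k-element subsets of the shared references; how the slides relate that count back to the BC score is not recoverable from this transcript.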
14. To compare the effectiveness of PBC to the original BC method,
we computed s_PBC and s_BC for all ten test cases in our dataset using subset sizes
of 1, 2, and 3 (k = 1, k = 2, k = 3).
EFFECTIVENESS
The PBC and BC scores were equal for all test cases, showing
that private Bibliographic Coupling is as effective
as standard BC.
15. As expected, the storage space required for a preimage attack grows with the k-th power of the number of references.
RESOURCE CONSUMPTION
16. We use the dblp bibliography [5] to estimate the number of existing references in the field of computer
science (as of May 2020, dblp contains 5.05 million records).
RESISTANCE TO PREIMAGE ATTACKS
[5] https://dblp.uni-trier.de
Computing the preimages (1ms/hash):
k = 1: 1.4 hours
k = 2: 404 years
k = 3: 680 million years
*Leading to a runtime complexity of O(n^k).
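The runtimes above can be reproduced with a short calculation. This sketch assumes the attacker must hash every unordered k-element combination of the 5.05 million dblp records at 1 ms per hash, and that a year has 365.25 days; the function name `preimage_seconds` is illustrative.

```python
from math import comb

DBLP_RECORDS = 5_050_000                # dblp size as of May 2020 (from the slide)
SECONDS_PER_HASH = 0.001                # 1 ms per hash (from the slide)
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year


def preimage_seconds(n, k, seconds_per_hash=SECONDS_PER_HASH):
    """Time to hash every k-element combination of n known references."""
    # comb(n, k) grows like n**k / k!, which is O(n^k) for fixed k.
    return comb(n, k) * seconds_per_hash


for k in (1, 2, 3):
    seconds = preimage_seconds(DBLP_RECORDS, k)
    print(f"k = {k}: {seconds / 3600:.3g} hours, {seconds / SECONDS_PER_YEAR:.3g} years")
```

Under these assumptions the results match the slide: about 1.4 hours for k = 1, about 404 years for k = 2, and about 680 million years for k = 3.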
17. CONCLUSION
• BC and Private BC are equally effective
– Private BCS is less efficient due to the k-dependent overhead
• Hashed sets can prevent preimage attacks on Private BC
– The appropriate degree of domain inflation through combinations depends on the
hardware capabilities and required level of security
18. Decentralization
✓ Content-protecting similarity detection methods
○ Peer-to-peer computation networks
OUTLOOK
Necessary advancement:
Integration into a peer-to-peer academic
cooperative network