6. Many Roads to Plagiarism
Paraphrased plagiarism
Back-translation: the latest form of plagiarism
Michael Jones University of Wollongong, Australia
4th Asia Pacific Conference on Educational Integrity (4APCEI) 28–30 September 2009
Paraphrased plagiarism is not new either. However, there are new
tools to aid in automatically paraphrasing text which are accelerating
this form of detection avoidance.
Paraphrase plagiat n'est pas nouveau non plus. Toutefois, il existe
de nouveaux outils pour l'aide dans le texte paraphrase
automatiquement qui sont l'accélération de cette forme d'évasion de
détection.
Paraphrase plagiarism is not new either. However, there are
new tools to help in paraphrasing the text automatically, which are
accelerating this form of escape detection.
9. Paraphrasing vs Textual Entailment
Two sentences are paraphrased if they
“mean the same thing”:
1) Similarity: they share a substantial
amount of information
2) Dissimilarities are extraneous: if
extra information in the sentences
exists, the effect of its removal is not
significant.
10. Paraphrasing vs Textual Entailment
A paraphrase is a special case of textual
entailment. A paraphrase is reflexive
whereas textual entailment indicates
that two sentences overlap to a degree
with one sentence being subsumed by
the other.
11. Ways to Paraphrase
Lexical substitution/synonymy
• Hyponym/synonym/hypernym replacement: article, paper or red, crimson
• Acronym replacement: Mr., mister
• Contractions: do not, don’t
• Compounding/decompounding: ballgame, ball game
• Numeric/alphabetic numbers: 11, eleven; 12/1/2010, December first two-thousand-ten
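Several of the lexical substitutions listed above can be neutralized by mapping surface variants to one canonical form before comparing sentences. The following is a minimal sketch, assuming tiny hand-written lookup tables; the tables and the `normalize` name are illustrative only and are not taken from any system described in this talk.

```python
import re

# Illustrative lookup tables (assumptions); a real system needs far larger resources.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not"}
ABBREVIATIONS = {"mr.": "mister", "dr.": "doctor"}
COMPOUNDS = {"ball game": "ballgame"}
NUMBER_WORDS = {"eleven": "11", "twelve": "12", "three": "3"}


def normalize(text):
    """Map surface variants to one canonical form and return a token list."""
    # Lowercase and unify curly apostrophes so the table keys match.
    text = text.lower().replace("’", "'")
    for table in (CONTRACTIONS, ABBREVIATIONS, COMPOUNDS):
        for variant, canonical in table.items():
            text = text.replace(variant, canonical)
    tokens = re.findall(r"[a-z0-9]+", text)
    # Spelled-out numbers are replaced by digits.
    return [NUMBER_WORDS.get(tok, tok) for tok in tokens]


a = normalize("Mr. Smith don't like the ball game.")
b = normalize("Mister Smith do not like the ballgame.")
same = a == b
```

Plain substring replacement is crude (it can fire inside longer words), which is one reason production systems rely on proper tokenization and dictionaries instead.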
12. Ways to Paraphrase
Active and passive exchange
The gangster killed 3 innocent people. vs
Three innocent people were killed by the gangster.
Re-ordering of sentence components
Tuesday they met vs They met Tuesday
Zhou, Ming and Niu, Cheng. “Principled Approach to Paraphrasing.” U.S. Patent 11,934,010. 1 Nov 2007.
13. Ways to Paraphrase
Realization in different syntactic
components
Palestinian leader Arafat vs
Arafat, Palestinian leader
Prepositional phrase attachment
The Alabama plant vs
A plant in Alabama
14. Ways to Paraphrase
Change into different sentence types
Who drew this picture? vs
Tell me who drew this picture.
Morphological derivation
He is a good teacher. vs
He teaches well. vs
He is good at teaching.
15. Ways to Paraphrase
Light verb construction
The film impressed him. vs
The film made an impression on him.
Comparatives vs. superlatives
He is smarter than everyone else. vs
He is the smartest one.
16. Ways to Paraphrase
Converse word substitution
John is Mary's husband. vs
Mary is John's wife.
Verb nominalization
He wrote the book. vs
He was the author of the book.
17. Ways to Paraphrase
Substitution using words with
overlapping meanings
Bob excels at mathematics. vs
Bob studies mathematics well.
Inference
He died of cancer. vs
Cancer killed him.
18. Ways to Paraphrase
Different semantic role realization
He enjoyed the game. vs
The game pleased him.
Subordinate clauses vs separate
sentences linked by anaphoric pronouns.
The tree healed its wounds by growing
new bark. vs
The tree healed its wounds. It grew
new bark.
19. Tools of the Trade
Microsoft paraphrase corpus
Used to test algorithms
WordNet: English only :(
Synonyms, hypernyms, hyponyms,
and antonyms.
Algorithms: Finite State Transducers
(FSTs) and/or iterative Longest Common
Subsequence (LCS) on sets.
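To make the LCS idea concrete, here is a minimal sketch of the textbook dynamic-programming longest common subsequence over word tokens. It is a generic illustration, not the iterative set-based variant mentioned on the slide, and the function name `lcs_length` is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    # prev[j] is the LCS length of the tokens of `a` processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the best subsequence that ends before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


s1 = "tuesday they met".split()
s2 = "they met tuesday".split()
shared = lcs_length(s1, s2)  # "they met" is the longest shared subsequence
```

Because a subsequence preserves order, re-ordering sentence components lowers the score even when every word is shared, which is one reason LCS alone does not settle whether two sentences are paraphrases.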
20. Tools of the Trade
Stemming or lemmatization
am, are, is → be
car, cars, car's, cars' → car
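The two mappings on this slide can be reproduced with a toy normalizer that combines a lemma lookup for irregular forms with crude suffix stripping. This is only a sketch under stated assumptions: the `LEMMAS` table and the `normalize_token` name are hypothetical, and the plural rule is deliberately naive (it would mangle a word such as "this").

```python
# Toy lemma table (assumption); real lemmatizers use a dictionary and part of speech.
LEMMAS = {"am": "be", "are": "be", "is": "be"}


def normalize_token(token):
    """Look up irregular forms, else strip possessive and plural endings."""
    token = token.lower()
    if token in LEMMAS:
        return LEMMAS[token]
    # Strip possessive markers: car's -> car, cars' -> cars
    if token.endswith("'s"):
        token = token[:-2]
    elif token.endswith("'"):
        token = token[:-1]
    # Naive plural handling: cars -> car, but leave words ending in "ss" alone.
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token


verbs = [normalize_token(t) for t in ["am", "are", "is"]]
nouns = [normalize_token(t) for t in ["car", "cars", "car's", "cars'"]]
```

The lookup step is what a context-free suffix stripper cannot do: no rule about endings turns "am" into "be".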
21. Word Alignment Examples
According to the MS paraphrase corpus:
This is a paraphrase
12/14 = 86%
12/16 = 75%
Not Paraphrased (However, the first sentence is textually entailed by the second.
Turnitin would currently match this.)
18/19 = 95%
18/26 = 69%
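The percentages above are the number of aligned words divided by the length of each sentence. A small sketch of that arithmetic follows; the `classify` rule and its 0.75 threshold are assumptions chosen only so that the two examples come out as labeled here. The corpus labels themselves came from human raters, not from such a rule.

```python
def coverage(aligned, len_first, len_second):
    """Fraction of each sentence's words that align with the other sentence."""
    return aligned / len_first, aligned / len_second


def classify(aligned, len_first, len_second, threshold=0.75):
    """Label a pair using a hypothetical threshold on both coverage values."""
    first, second = coverage(aligned, len_first, len_second)
    if first >= threshold and second >= threshold:
        # Both sentences are mostly covered: overlap in both directions.
        return "paraphrase"
    if max(first, second) >= threshold:
        # Only one sentence is mostly covered: it is largely contained in the other.
        return "possible entailment"
    return "unrelated"


pair_one = classify(12, 14, 16)
pair_two = classify(18, 19, 26)
```

The asymmetry in the second pair (high coverage for one sentence, low for the other) is what separates one-way entailment from a two-way paraphrase.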
24. Translated Plagiarism
Non-English markets, in particular, are
concerned about their English-as-a-
second-language students submitting
English documents that have been
translated into their native language.
25. Translated Plagiarism
Initial approach:
Non-English documents searched as
they are now
Additional search performed:
Translate document to English, search
English documents, and then display
English matches with translations (or
vice versa)
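The two-pass search described on this slide can be sketched as follows. Everything here is hypothetical: `search` and `translate` are placeholder callables supplied by the caller, and each match is assumed to be a dict with a "text" key. It illustrates the flow only, not the production implementation.

```python
def search_with_translation(document, language, search, translate):
    """Run a same-language search plus a search on the English translation.

    `search(text, language)` and `translate(text, source, target)` are
    hypothetical callables; each match is assumed to be a dict with a
    "text" key.
    """
    matches = []
    # 1) Search the document in its original language, as is done now.
    for m in search(document, language):
        matches.append({**m, "kind": "intra-language"})
    if language != "en":
        # 2) Translate to English and search the English collection.
        english = translate(document, language, "en")
        for m in search(english, "en"):
            # 3) Show each English match alongside a translation of it.
            shown = translate(m["text"], "en", language)
            matches.append({**m, "kind": "inter-language", "translated": shown})
    return matches
```

Passing the search and translation back ends in as arguments keeps the sketch independent of any particular vendor or index.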
27. Translated Plagiarism: Need for Paraphrasing?
Machines and humans translate text in
many different ways.
Paraphrase detection allows us to
match the variations.
Google translate: The zeitgeist is thinking and feeling one age. The term describes
the characteristics of a particular period, or an attempt to remind us it. The German
word Zeitgeist is transferred through English as a loanword into numerous other
languages been.
Bing translate: Zeitgeist is thinking and feeling how an age. Is the nature of a
particular era or trying to understand them. The German word Zeitgeist is taken from
English as a loanword in many other languages.
http://de.wikipedia.org/wiki/Zeitgeist
How many people know who Sergey Brin and Larry Page are? For those of you who didn’t raise your hands, they are the founders of Google. Did you know that some of Sergey’s original research was on a plagiarism detection system he wrote with his collaborators? This so-called ‘cat and mouse game’ is commonplace. Rules are meant to be broken. For instance, people who like to drive fast will buy radar detectors instead of abiding by the speed limit. At Turnitin.com we find the same to be true of plagiarism detection: students will find or develop new ways to circumvent the system. This talk explores some of the new counter-detection methods being used and what we are doing to counter them. The details are very technical, so I’m going to stay away from them so as not to bore you to death.
First I would like to give a quick survey of the different methods being employed today to avoid detection. None of these are new per se. However, new digital tools are accelerating their use, just as the digital authoring of documents, email and the many other modes of document sharing, and, most importantly, the internet made plagiarism a pandemic. Then I’ll switch gears and get a little more technical to discuss the finer points of paraphrasing and how it compares to textual entailment.
I believe the most common method of plagiarizing remains copying text from one or more sources, where the ‘author’ edits the text to sew the pieces into their paper. Along these lines, you might not be surprised that Wikipedia is the number one source of internet matches we find in the Turnitin service, whereas peer collusion accounts for the largest number of matches.
Translated plagiarism, or the repurposing of content translated from a foreign language, isn’t a new phenomenon.
However, growing anecdotal evidence suggests that students and researchers are using this method of plagiarizing to avoid current detection technologies. In the US, the growing population of foreign students and the wide availability of machine translation technologies, e.g., Google Translate, are thought to be contributing to the rise of this phenomenon.
Tools that paraphrase for you have grown in abundance and sophistication. A simple Google search for ‘article spinner’ turns up 1.3 million results and a large number of ads for companies promoting their services. This sort of service also goes by other names, such as synonymizers.
Typically these services are aimed at online marketers looking to produce many versions of a document that search engines will treat as unique, in order to artificially promote their site by increasing the number of backlinks to it. However, they are equally effective at rewriting a student paper.
Having computers understand how similar two sentences are to one another is a rich area of academic and corporate research. The utility of this technology is widespread: everything from a question-and-answer system like Wolfram Alpha being able to respond to a query the same way despite the multitude of ways you can phrase your question, to Google being able to understand the relatedness of web pages at the phrase or sentence level instead of just at a ‘bag of words’ level.
Now I would like to outline the myriad ways that an author can paraphrase. Although the details of each method aren’t so important, on the whole they demonstrate how rich languages are and, to a lot of people’s surprise, how unique writing is. There are many, many ways to deliver the same information through writing. Initial research is focused on English, but the algorithmic framework is being generalized to work with all languages.
Hopefully, by now you can see how hard a natural language processing problem detecting paraphrases is.
I won’t go into detail regarding the algorithms and methods we are exploring, but I will highlight some of the tools of the trade. I would also like to point out how different creating production-quality code that can deal with enormous scale is from prototyping a solution. Most solutions simply can’t scale to processing hundreds of thousands of documents against a collection of tens of billions of documents. Microsoft was gracious enough to produce a paraphrase test corpus consisting of 5,800 sentence pairs, of which 3,900 were considered “semantically equivalent” by two human raters. What is interesting about this is that 83% of the sentence pairs were judged the same way by the two raters, but the remaining 17% required a third rater to break the stalemate. This elucidates another issue with paraphrase detection: there is a certain level of subjectivity in ascertaining whether two sentences are equivalent or not. The same situation holds in plagiarism detection. When it comes to small matches, one person’s plagiarism is another person’s noise.
The textbook definition of lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. The difference between lemmatization and stemming is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. http://en.wikipedia.org/wiki/Lemmatisation
So Turnitin currently does a certain level of paraphrase and textual entailment matching. To that end we’ve spent a lot of time adjusting the algorithms so that they are effective at matching text while not producing ‘noisy’ or spurious matches. This is one of the hard problems that we are trying to solve. If done incorrectly, paraphrased or textually entailed matches, which allow for a much higher degree of change, could swamp a report with spurious matches.
One way to visualize this is to compare the sentences of a document against each other. A document typically discusses a particular topic and has a consistent voice. For this presentation, I took a news article about Google’s new Buzz messaging service and did a document-wise comparison of its sentences. In this example you can see that the sentences are somewhat semantically related but wouldn’t be deemed paraphrases of each other.
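A self-comparison of this kind can be sketched as a square matrix of pairwise sentence similarities. The sketch below uses Jaccard word overlap as a simple stand-in; it is not the measure used to produce the figure in this talk, the function names are assumptions, and the three sentences are made up for illustration.

```python
import re


def tokens(sentence):
    """Lowercased word set for one sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words in the two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def self_similarity(sentences):
    """Square matrix of pairwise sentence similarities within one document."""
    sets = [tokens(s) for s in sentences]
    return [[jaccard(x, y) for y in sets] for x in sets]


# Made-up sentences standing in for the news article.
doc = [
    "Google launched Buzz",
    "Buzz is a Google service",
    "The weather is nice",
]
matrix = self_similarity(doc)
```

On-topic sentences share vocabulary and so score above zero without being paraphrases, which is exactly the false-positive risk that a looser matching threshold introduces.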
So what if I changed each sentence so that they became more similar? At what point do they become semantically equivalent enough to go from a false positive to a true positive? Although maybe not the best example, I think it illustrates the pitfalls of developing such a system. I believe the answer has to do with context, parts of speech, and the importance of the entailed words. Generalizing this type of AI model is very difficult because it is easy to overfit the model to the training data.
Translated plagiarism is a particular type of cross-lingual information retrieval aimed at finding similar documents across languages.
The first step in offering a translated plagiarism detection service is to find a partner that offers machine translation. Language Weaver currently offers 37 language pairs. Although Google and others offer free translation services, their use is contractually limited and not bound by service level agreements. Furthermore, we feel the quality of Language Weaver’s technology is at the moment superior to the ‘free’ services.
Inter- and intra-language matches will be displayed together. We are considering offering a confidence level of the inter-language match.