Work Description

Title: Detecting Machine-obfuscated Plagiarism (Open Access)

http://creativecommons.org/licenses/by/4.0/
Methodology
  • To create training sets, we used all 4,012 featured articles from the English Wikipedia because they cover a wide range of topics in great breadth and depth (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment). Senior Wikipedia editors select articles of superior quality as featured articles (approx. 0.1% of all articles). Featured articles typically have numerous authors and undergo many revisions. Thus, they are written in high-quality English and are unlikely to exhibit a bias towards the writing style of specific persons. Lastly, the articles are publicly available, which increases the reproducibility of our research. To obtain a training set for documents, we machine-paraphrased ("spun") all articles using the SpinBot API (https://spinbot.com/API). The service is the technical backbone of several widely used online paraphrasing tools, e.g., Paraphrasing Tool (https://paraphrasing-tool.com/) and Free Article Spinner (https://free-article-spinner.com/). Thus, the training set comprises 8,024 articles (4,012 original, 4,012 spun). To create a test set for documents, we selected 1,990 Wikipedia articles labeled as good articles at random. To receive this label, articles must be well-written, verifiable, neutral, broad in coverage, stable, and illustrated by media. We paraphrased all articles using the SpinBot API to obtain the test set of 3,980 articles (1,990 original, 1,990 spun). To obtain the training and test sets for paragraphs, we split the original and spun articles from the document training set (8,024) and the document test set (3,980) into paragraphs. We discarded paragraphs with fewer than three sentences, as these typically represented titles or subtitles. The resulting training set consists of 241,187 paragraphs (117,445 original, 123,742 spun); the test set consists of 79,970 paragraphs (39,241 original, 40,729 spun).
Description
  • This data set comprises multiple folders. The corpus folder contains the raw text used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool (https://spinbot.com/API). The paragraph split was generated by selecting only paragraphs with three or more sentences from the document split. Each folder is divided into mg (i.e., machine-generated through SpinBot) and og (i.e., original) files.

  • The human judgement folder contains the human evaluation comparing original and spun documents (a sample). It also contains the answer keys and survey results.

  • The models folder contains the machine learning classifier models for each word embedding technique used (document split training only). The models were exported using pickle (Python 3.6). The grid search for hyperparameter tuning is described in the paper.

  • The vector folders (train and test) contain the average of all word vectors for each document and paragraph. Each line holds as many values as the dimensionality of the word embedding technique used (see the paper for details), followed by its respective class label (i.e., mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The files carry the .arff extension but can be read as plain .txt files.
Creator
Depositor
  • truas@umich.edu
Contact information
Discipline
Keyword
Citations to related material
  • Foltýnek, T., Ruas, T., Scharpf, P., Meuschke, N., Schubotz, M., Grosky, W., & Gipp, B., “Detecting Machine-obfuscated Plagiarism,” in Sustainable Digital Communities, vol. 12051 LNCS, Springer, 2020, pp. 816–827. https://doi.org/10.1007/978-3-030-43687-2_68
Related items in Deep Blue
Resource type
Curation notes
  • A citation to the related conference paper was added on March 23, 2020
Last modified
  • 03/23/2020
Published
  • 12/13/2019
Language
DOI
  • https://doi.org/10.7302/bewj-qx93
License
To Cite this Work:
Foltynek, T., Ruas, T., Scharpf, P., Meuschke, N., Schubotz, M., Grosky, W., Gipp, B. (2019). Detecting Machine-obfuscated Plagiarism [Data set]. University of Michigan - Deep Blue. https://doi.org/10.7302/bewj-qx93

Relationships

Files (Count: 6; Size: 2.8 GB)

Title: Detecting Machine-obfuscated Plagiarism
Authors: Tomas Foltynek, Terry Ruas, Philipp Scharpf, Norman Meuschke, Moritz Schubotz, William Grosky, and Bela Gipp
contact email: truas@umich.edu; ruas@uni-wuppertal.de; foltynek@uni-wuppertal.de
Venue: iConference 2020 - co-hosted by the University of Borås: Swedish School of Library and Information Science, and Oslo Metropolitan University: Department of Archivistics, Library and Information Science.
Year: 2020
===============================================================================
Research Statement:

Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available.

===============================================================================
Dataset Description:

Training:
4,012 featured articles from the English Wikipedia, chosen because they cover a wide range of topics in great breadth and depth (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment)

Testing:
1,990 Wikipedia articles labeled as "good articles", selected at random. To receive this label, articles must be well-written, verifiable, neutral, broad in coverage, stable, and illustrated by media (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment).

==============================================================================

==============================================================================
Dataset Structure:

[corpus] folder: contains the raw text (no pre-processing) used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool (https://spinbot.com/API). The paragraph split was generated by selecting only paragraphs with three or more sentences from the document split. Each folder is divided into mg (i.e., machine-generated through SpinBot) and og (i.e., original) files.

wikipedia_documents_train.tar - 4,012 original documents; 4,012 spun documents
wikipedia_paragraphs_train.gz - 98,282 original paragraphs; 102,485 spun paragraphs
wikipedia_documents_test.zip - 1,990 original documents, 1,990 spun documents
wikipedia_paragraphs_test.zip - 39,241 original paragraphs, 40,729 spun paragraphs
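
The paragraph split described above keeps only paragraphs with three or more sentences. A minimal sketch of that filter is shown below; it assumes paragraphs are separated by blank lines and uses a naive regex-based sentence count (the exact tokenizer used to build the data set is not specified here):

```python
import re

def split_into_paragraphs(text, min_sentences=3):
    """Split raw article text into paragraphs, keeping only those with at
    least `min_sentences` sentences. Shorter chunks are typically titles
    or subtitles and are discarded, as in the data set construction."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    kept = []
    for p in paragraphs:
        # Naive sentence count: split after ., !, or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", p) if s]
        if len(sentences) >= min_sentences:
            kept.append(p)
    return kept
```

For example, a one-line heading paragraph is dropped, while a paragraph containing three sentences is retained.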

[vector.train|vector.test] folders: contain the average of all word vectors for each document and paragraph. Each line holds as many values as the dimensionality of the word embedding technique used (see the paper for details), followed by its respective class label (i.e., mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The files carry the .arff extension but can be read as plain .txt files.

The word embedding technique used is described in the file name, which follows the structure <technique>-<dataset>-mean-<set>.arff, where:

- d2v - doc2vec
- google - word2vec
- fasttextnw - fastText without subwording
- fasttextsw - fastText with subwording
- glove - GloVe
- use - Universal Sentence Encoder

Details for each technique used can be found in the paper.

- wikipediapar - Wikipedia paragraph split
- wikipediadoc - Wikipedia document split

- train or test
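
Since each line of these files holds a comma-separated feature vector followed by its class label, they can be read with plain text tooling despite the .arff extension. A minimal sketch, assuming only the line format described above (the helper name is ours, not part of the data set):

```python
import csv

def load_vectors(path):
    """Read a vector file in which each comma-separated line holds the
    averaged word-embedding dimensions followed by the class label
    ("mg" or "og"). Returns parallel lists of feature vectors and labels."""
    features, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            *values, label = row
            features.append([float(v) for v in values])
            labels.append(label.strip())
    return features, labels
```

Because each file belongs to a single class, the returned labels of one file should all be "mg" or all be "og".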

[models] folder: contains the machine learning classifier models for each word embedding technique used (only for document split training). The models were exported using pickle (Python 3.6).

The grid search for hyperparameter adjustments is described in the paper.

Machine Learning models: KNN, SVM, Random Forest, Naive Bayes, and Logistic Regression
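
Loading one of the pickled models might look like the sketch below. The file name and the predict call are illustrative assumptions, not names from the data set; since the models were exported with pickle under Python 3.6, a compatible version of the library used to train them (likely scikit-learn) is probably required for unpickling:

```python
import pickle

def load_classifier(path):
    """Load a classifier that was exported with pickle (Python 3.6).
    Unpickling trained models generally requires a compatible version of
    the library that produced them to be installed."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical usage (file name is illustrative):
# clf = load_classifier("models/svm_d2v.pkl")
# predictions = clf.predict(document_vectors)  # "mg"/"og" labels
```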


[Human judgments] folder - contains the human evaluation comparing original and spun documents (a sample). It also contains the answer keys and survey results.
NNNNN.txt - whole document from which an extract was taken for human evaluation
key.txt.zip - information about each case (ORIG/SPUN)
results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
results-corrected.xlsx - early in the survey, one question contained a mistake (a wrong extract); the affected results were excluded
