Work Description

Title: Detecting Machine-obfuscated Plagiarism (Open Access, Deposited)
  • To create training sets, we used all 4,012 featured articles from the English Wikipedia because they objectively cover a wide range of topics in great breadth and depth. Senior Wikipedia editors select articles of superior quality as featured articles (approx. 0.1% of all articles). Featured articles typically have numerous authors and undergo many revisions. Thus, they are written in high-quality English and are unlikely to exhibit a bias towards the writing style of specific persons. Lastly, the articles are publicly available, which increases the reproducibility of our research. To obtain a training set for documents, we machine-paraphrased (spun) all articles using the SpinBot API. The service is the technical backbone of several widely used online paraphrasing tools (OPT), e.g., Paraphrasing Tool and Free Article Spinner. Thus, the training set comprises 8,024 articles (4,012 original, 4,012 spun). To create a test set for documents, we randomly selected 1,990 Wikipedia articles labeled as good articles. To receive this label, articles must be well-written, verifiable, neutral, broad in coverage, stable, and illustrated by media. We paraphrased all articles using the SpinBot API to obtain the test set of 3,980 articles (1,990 original, 1,990 spun). To obtain the training and test sets for paragraphs, we split the original and spun articles from the document training set (8,024) and the document test set (3,980) into paragraphs. We discarded paragraphs with fewer than three sentences, as these typically represented titles or subtitles. The resulting training set consists of 241,187 paragraphs (117,445 original, 123,742 spun); the test set consists of 79,970 paragraphs (39,241 original, 40,729 spun).
  • This data set comprises multiple folders. The corpus folder contains the raw text used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool. The paragraph split was generated by selecting only those paragraphs in the document split that have three or more sentences. Each folder is divided into mg (i.e., machine-generated through SpinBot) and og (i.e., original generated files).

  • The human judgement folder contains the human evaluation between original and spun documents (sample). It also contains the answers (keys) and survey results.

  • The models folder contains the machine learning classifier models for each word embedding technique used (only for document split training). The models were exported using pickle (Python 3.6). The grid search for hyperparameter adjustments is described in the paper.

  • The vector folders (train and test) contain the average of all word vectors for each document and paragraph. Each line holds as many values as the word embedding technique used has dimensions (see paper for more details), followed by the respective class (i.e., label mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (as in .csv). The file extension is .arff, but the files can be read as normal .txt files.
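The paragraph filter described above (keep only paragraphs with three or more sentences) can be approximated with a short sketch. The blank-line paragraph separator, the naive punctuation-based sentence counter, and the function names `count_sentences` and `split_paragraphs` are assumptions made for illustration; the splitter actually used to build the corpus is not specified here, so this sketch will not necessarily reproduce the published paragraph counts.

```python
import re
from typing import List

# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace or
# the end of the paragraph. This is naive (abbreviations such as "approx."
# are counted as sentence ends) and is not the authors' documented method.
_SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def count_sentences(paragraph: str) -> int:
    """Count runs of sentence-ending punctuation in a paragraph."""
    return len(_SENTENCE_END.findall(paragraph.strip()))


def split_paragraphs(document: str, min_sentences: int = 3) -> List[str]:
    """Split a document on blank lines and keep the longer paragraphs.

    Paragraphs with fewer than `min_sentences` sentences (typically titles
    or subtitles) are discarded.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document)]
    return [p for p in paragraphs if p and count_sentences(p) >= min_sentences]
```

A heading such as "History" contains no sentence-ending punctuation and is therefore dropped, while a paragraph with three full sentences is kept.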
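As a minimal sketch of how the vector files might be read, the following parser treats each line as comma-separated floats followed by a class label (mg or og), as described above. The function name `parse_vector_lines` is hypothetical, and skipping lines that start with "@" or "%" is a precaution in case the .arff files carry ARFF header or comment lines; the description does not say whether they do.

```python
import csv
from typing import Iterable, List, Tuple


def parse_vector_lines(lines: Iterable[str]) -> Tuple[List[List[float]], List[str]]:
    """Parse lines of the form 'v1,v2,...,vN,label' into vectors and labels."""
    vectors: List[List[float]] = []
    labels: List[str] = []
    for row in csv.reader(lines):
        first = row[0].strip() if row else ""
        # Skip blank lines and possible ARFF header/comment lines (assumption).
        if not first or first.startswith(("@", "%")):
            continue
        # Everything before the last field is the averaged word vector;
        # the last field is the class label ("mg" or "og").
        *values, label = row
        vectors.append([float(v) for v in values])
        labels.append(label.strip())
    return vectors, labels
```

Any iterable of text lines works as input, so an open file handle can be passed directly, for example `with open(path, newline="") as handle: vectors, labels = parse_vector_lines(handle)`, where `path` points to one of the vector files.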
Last modified
  • 12/16/2019
  • 12/13/2019
To Cite this Work:
Foltynek, T., Ruas, T., Scharpf, P., Meuschke, N., Schubotz, M., Grosky, W., Gipp, B. (2019). Detecting Machine-obfuscated Plagiarism [Data set]. University of Michigan - Deep Blue.


Files (Count: 6; Size: 2.8 GB)
