Work Description

Title: Detecting Machine-obfuscated Plagiarism (Open Access)

http://creativecommons.org/licenses/by/4.0/
Methodology
  • To create training sets, we used all 4,012 featured articles from the English Wikipedia because they cover a wide range of topics in great breadth and depth (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment). Senior Wikipedia editors select articles of superior quality as featured articles (approx. 0.1% of all articles). Featured articles typically have numerous authors and undergo many revisions. Thus, they are written in high-quality English and are unlikely to exhibit a bias towards the writing style of specific persons. Lastly, the articles are publicly available, which increases the reproducibility of our research. To obtain a training set for documents, we machine-paraphrased ("spun") all articles using the SpinBot API (https://spinbot.com/API). The service is the technical backbone of several widely used online paraphrasing tools, e.g., Paraphrasing Tool (https://paraphrasing-tool.com/) and Free Article Spinner (https://free-article-spinner.com/). Thus, the training set comprises 8,024 articles (4,012 original, 4,012 spun). To create a test set for documents, we selected 1,990 Wikipedia articles labeled as good articles at random. To receive this label, articles must be well-written, verifiable, neutral, broad in coverage, stable, and illustrated by media. We paraphrased all articles using the SpinBot API to obtain the test set of 3,980 articles (1,990 original, 1,990 spun). To obtain the training and test sets for paragraphs, we split the original and spun articles from the document training set (8,024) and the document test set (3,980) into paragraphs. We discarded paragraphs with fewer than three sentences, as these typically represented titles or subtitles. The resulting training set consists of 241,187 paragraphs (117,445 original, 123,742 spun); the test set consists of 79,970 paragraphs (39,241 original, 40,729 spun).
Description
  • This data set comprises multiple folders. The corpus folder contains the raw text used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool (https://spinbot.com/API). The paragraph split was generated by selecting only paragraphs with three or more sentences from the document split. Each folder is divided into mg (i.e., machine-generated through SpinBot) and og (i.e., original) files.

  • The human judgement folder contains the human evaluation comparing original and spun documents (a sample). It also contains the answer keys and survey results.

  • The models folder contains the machine learning classifier models for each word embedding technique used (document split training only). The models were exported using pickle (Python 3.6). The grid search for hyperparameter tuning is described in the paper.

  • The vector folders (train and test) contain the average of all word vectors for each document and paragraph. Each line holds as many values as the dimensionality of the word embedding technique used (see the paper for details), followed by its respective class label (i.e., mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The files carry the .arff extension but can be read as plain .txt files.
Creator
Depositor
  • truas@umich.edu
Contact information
Discipline
Keyword
Citations to related material
  • Foltýnek, T., Ruas, T., Scharpf, P., Meuschke, N., Schubotz, M., Grosky, W., & Gipp, B., “Detecting Machine-obfuscated Plagiarism,” in Sustainable Digital Communities, vol. 12051 LNCS, Springer, 2020, pp. 816–827. https://doi.org/10.1007/978-3-030-43687-2_68
Related items in Deep Blue
Resource type
Curation notes
  • A citation to the related conference paper was added on March 23, 2020
Last modified
  • 03/23/2020
Published
  • 12/13/2019
Language
DOI
  • https://doi.org/10.7302/bewj-qx93
License
To Cite this Work:
Foltynek, T., Ruas, T., Scharpf, P., Meuschke, N., Schubotz, M., Grosky, W., Gipp, B. (2019). Detecting Machine-obfuscated Plagiarism [Data set]. University of Michigan - Deep Blue. https://doi.org/10.7302/bewj-qx93

Relationships

Files (Count: 6; Size: 2.8 GB)

Title: Detecting Machine-obfuscated Plagiarism
Authors: Tomas Foltynek, Terry Ruas, Philipp Scharpf, Norman Meuschke, Moritz Schubotz, William Grosky, and Bela Gipp
contact email: truas@umich.edu; ruas@uni-wuppertal.de; foltynek@uni-wuppertal.de
Venue: iConference 2020 - co-hosted by the University of Borås: Swedish School of Library and Information Science, and Oslo Metropolitan University: Department of Archivistics, Library and Information Science.
Year: 2020
===============================================================================
Research Statement:

Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available.

===============================================================================
Dataset Description:

Training:
4,012 featured articles from the English Wikipedia, chosen because they cover a wide range of topics in great breadth and depth (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment)

Testing:
1,990 Wikipedia articles labeled as "good articles", selected at random. To receive this label, articles must be well-written, verifiable, neutral, broad in coverage, stable, and illustrated by media (https://en.wikipedia.org/wiki/Wikipedia:Content_assessment).

==============================================================================

==============================================================================
Dataset Structure:

[corpus] folder: contains the raw text (no pre-processing) used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool (https://spinbot.com/API). The paragraph split was generated by selecting only paragraphs with three or more sentences from the document split. Each folder is divided into mg (i.e., machine-generated through SpinBot) and og (i.e., original) files.

wikipedia_documents_train.tar - 4,012 original documents; 4,012 spun documents
wikipedia_paragraphs_train.gz - 98,282 original paragraphs; 102,485 spun paragraphs
wikipedia_documents_test.zip - 1,990 original documents, 1,990 spun documents
wikipedia_paragraphs_test.zip - 39,241 original paragraphs, 40,729 spun paragraphs
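
The paragraph split described above keeps only paragraphs with three or more sentences. A minimal sketch of that filter is shown below; it assumes paragraphs are separated by blank lines and uses a naive regex-based sentence count (the exact tokenizer used to build the data set is not specified here):

```python
import re

def split_into_paragraphs(text, min_sentences=3):
    """Split raw article text into paragraphs, keeping only those with at
    least `min_sentences` sentences. Shorter chunks are typically titles
    or subtitles and are discarded, as in the data set construction."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    kept = []
    for p in paragraphs:
        # Naive sentence count: split after ., !, or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", p) if s]
        if len(sentences) >= min_sentences:
            kept.append(p)
    return kept
```

For example, a one-line heading paragraph is dropped, while a paragraph containing three sentences is retained.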

[vector.train|vector.test] folders: contain the average of all word vectors for each document and paragraph. Each line holds as many values as the dimensionality of the word embedding technique used (see the paper for details), followed by its respective class label (i.e., mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The files carry the .arff extension but can be read as plain .txt files.

The word embedding technique used is described in the file name, which follows the structure <technique>-<dataset>-mean-<set>.arff, where:

- d2v - doc2vec
- google - word2vec
- fasttextnw - fastText without subwording
- fasttextsw - fastText with subwording
- glove - GloVe
- use - Universal Sentence Encoder

Details for each technique used can be found in the paper.

- wikipediapar - Wikipedia paragraph split
- wikipediadoc - Wikipedia document split

- train or test
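
Since each line of these files holds a comma-separated feature vector followed by its class label, they can be read with plain text tooling despite the .arff extension. A minimal sketch, assuming only the line format described above (the helper name is ours, not part of the data set):

```python
import csv

def load_vectors(path):
    """Read a vector file in which each comma-separated line holds the
    averaged word-embedding dimensions followed by the class label
    ("mg" or "og"). Returns parallel lists of feature vectors and labels."""
    features, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            *values, label = row
            features.append([float(v) for v in values])
            labels.append(label.strip())
    return features, labels
```

Because each file belongs to a single class, the returned labels of one file should all be "mg" or all be "og".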

[models] folder: contains the machine learning classifier models for each word embedding technique used (only for document split training). The models were exported using pickle (Python 3.6).

The grid search for hyperparameter adjustments is described in the paper.

Machine Learning models: KNN, SVM, Random Forest, Naive Bayes, and Logistic Regression
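
Loading one of the pickled models might look like the sketch below. The file name and the predict call are illustrative assumptions, not names from the data set; since the models were exported with pickle under Python 3.6, a compatible version of the library used to train them (likely scikit-learn) is probably required for unpickling:

```python
import pickle

def load_classifier(path):
    """Load a classifier that was exported with pickle (Python 3.6).
    Unpickling trained models generally requires a compatible version of
    the library that produced them to be installed."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical usage (file name is illustrative):
# clf = load_classifier("models/svm_d2v.pkl")
# predictions = clf.predict(document_vectors)  # "mg"/"og" labels
```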


[Human judgments] folder - contains the human evaluation comparing original and spun documents (a sample). It also contains the answer keys and survey results.
NNNNN.txt - whole document from which an extract was taken for human evaluation
key.txt.zip - information about each case (ORIG/SPUN)
results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
results-corrected.xlsx - early in the survey, one question contained a mistake (a wrong extract); the affected results were excluded
