Title: Detecting Machine-obfuscated Plagiarism
Authors: Tomas Foltynek, Terry Ruas, Philipp Scharpf, Norman Meuschke, Moritz Schubotz, William Grosky, and Bela Gipp
Contact email: truas@umich.edu; ruas@uni-wuppertal.de; foltynek@uni-wuppertal.de
Venue: iConference 2020 - co-hosted by the University of Borås: Swedish School of Library and Information Science, and Oslo Metropolitan University: Department of Archivistics, Library and Information Science.
Year: 2020
===============================================================================
Research Statement:
Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available.
===============================================================================
Dataset Description:
Training: 4,012 featured articles from the English Wikipedia, chosen because they objectively cover a wide range of topics in great breadth and depth (https://en.wikipedia.org/wiki/Wikipedia:Content/assessment).
Testing: 1,990 randomly selected English Wikipedia articles labeled as "good articles". To receive this label, articles must be well-written, verifiable, neutral, broad in coverage, stable, and illustrated by media (https://en.wikipedia.org/wiki/Wikipedia:Content/assessment).
===============================================================================
Dataset Structure:

[corpus] folder: contains the raw text (no pre-processing) used for training and testing, in two splits: "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool (https://spinbot.com/API). The paragraph split was derived from the document split by selecting only paragraphs with 3 or more sentences. Each folder is divided into mg (i.e., machine-generated through SpinBot) and og (i.e., original) files.
- wikipedia_documents_train.tar - 4,012 original documents; 4,012 spun documents
- wikipedia_paragraphs_train.gz - 98,282 original paragraphs; 102,485 spun paragraphs
- wikipedia_documents_test.zip - 1,990 original documents; 1,990 spun documents
- wikipedia_paragraphs_test.zip - 39,241 original paragraphs; 40,729 spun paragraphs
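For convenience, the snippet below sketches one way to load the corpus into memory once the archives are unpacked. It is a minimal sketch, assuming each extracted split contains the documented mg/ and og/ subfolders of plain-text files; the local path CORPUS_DIR and the file layout beyond that are illustrative assumptions, not part of the dataset's documented interface.

    import os

    # Hypothetical local path; adjust to wherever wikipedia_documents_train.tar
    # was extracted. The mg/ and og/ subfolder names follow the dataset layout.
    CORPUS_DIR = "corpus/wikipedia_documents_train"

    def load_split(split_dir):
        """Return a list of (text, label) pairs from the mg/ and og/ subfolders."""
        samples = []
        for label in ("mg", "og"):  # mg = machine-generated (SpinBot), og = original
            folder = os.path.join(split_dir, label)
            for name in sorted(os.listdir(folder)):
                with open(os.path.join(folder, name), encoding="utf-8") as fh:
                    samples.append((fh.read(), label))
        return samples

    train_docs = load_split(CORPUS_DIR)
    print(len(train_docs), "documents loaded")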
[vector.train | vector.test] folders: contain the average of all word vectors for each document and paragraph. Each line contains one averaged vector (one value per dimension of the word embedding technique used; see the paper for details) followed by its class label (mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv). The files carry the .arff extension but can be read as normal .txt files. The word embedding technique used is encoded in the file name, which follows the structure <technique>-<split>-mean-<set>.arff, where:
- <technique>:
  - d2v - doc2vec
  - google - word2vec
  - fasttextnw - fastText without subwording
  - fasttextsw - fastText with subwording
  - glove - GloVe
  - use - Universal Sentence Encoder
  Details for each technique can be found in the paper.
- <split>:
  - wikipediapar - Wikipedia paragraph split
  - wikipediadoc - Wikipedia document split
- <set>: train or test

[models] folder: contains the machine learning classifier models for each word embedding technique used (document split training only). The models were exported using pickle (Python 3.6). The grid search for hyperparameter tuning is described in the paper. A loading sketch follows at the end of this file.
Machine learning models: KNN, SVM, Random Forest, Naive Bayes, and Logistic Regression.

[Human judgments] folder: contains the human evaluation of original vs. spun documents (sample). It also contains the answers (keys) and survey results.
- NNNNN.txt - the whole document from which an extract was taken for human evaluation
- key.txt.zip - information about each case (ORIG/SPUN)
- results.xlsx - raw results downloaded from the survey tool (the extracts that humans judged are in the first line)
- results-corrected.xlsx - at the very beginning, there was a mistake in one question (wrong extract); these results were excluded
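To illustrate how the vector files and the pickled models fit together, here is a minimal sketch that reads one vector file and evaluates one classifier on it. The concrete file names, the skipping of ARFF header lines, and the assumption that the pickled classifiers expose a scikit-learn-style predict() method are illustrative assumptions; consult the folder contents and the paper for the exact names.

    import pickle
    import numpy as np

    # Hypothetical file names built from the naming scheme above; the exact
    # names in vector.test/ and models/ may differ.
    VECTORS = "vector.test/glove-wikipediadoc-mean-test.arff"
    MODEL = "models/glove-svm.pkl"

    # Each data line holds the comma-separated embedding values followed by
    # the class label ("mg" or "og"); any ARFF header lines are skipped.
    X, y = [], []
    with open(VECTORS, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(("@", "%")):
                continue
            *values, label = line.split(",")
            X.append([float(v) for v in values])
            y.append(label)
    X = np.array(X)

    # Models were pickled with Python 3.6; a classifier with a scikit-learn
    # style predict() method is assumed here.
    with open(MODEL, "rb") as fh:
        clf = pickle.load(fh)

    accuracy = (clf.predict(X) == np.array(y)).mean()
    print(f"accuracy: {accuracy:.3f}")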