Work Description
Title: Detecting Machine-obfuscated Plagiarism (Open Access)
(2019). Detecting Machine-obfuscated Plagiarism [Data set], University of Michigan - Deep Blue Data. https://doi.org/10.7302/bewj-qx93
Relationships
- This work is not a member of any user collections.
Files (Count: 6; Size: 2.8 GB)
Title | Original Upload | Last Modified | File Size | Access |
---|---|---|---|---|
README.txt | 2019-12-04 | 2019-12-13 | 4.77 KB | Open Access |
corpus.zip | 2019-12-04 | 2019-12-10 | 213 MB | Open Access |
Human_judgement.zip | 2019-12-16 | 2019-12-16 | 177 KB | Open Access |
models.zip | 2019-12-04 | 2019-12-04 | 330 MB | Open Access |
vectors.test.zip | 2019-12-04 | 2019-12-10 | 718 MB | Open Access |
vectors.train.zip | 2019-12-04 | 2019-12-11 | 1.57 GB | Open Access |
Title: Detecting Machine-obfuscated Plagiarism
Authors: Tomas Foltynek, Terry Ruas, Philipp Scharpf, Norman Meuschke, Moritz Schubotz, William Grosky, and Bela Gipp
Contact email: truas@umich.edu; ruas@uni-wuppertal.de; foltynek@uni-wuppertal.de
Venue: iConference 2020 - co-hosted by the University of Borås: Swedish School of Library and Information Science, and Oslo Metropolitan University: Department of Archivistics, Library and Information Science.
Year: 2020
===============================================================================
Research Statement:
Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available.
===============================================================================
Dataset Description:
Training:
4,012 featured articles from the English Wikipedia, chosen because they objectively cover a wide range of topics in great breadth and depth (https://en.wikipedia.org/wiki/Wikipedia:Content/assessment).
Testing:
1,990 Wikipedia articles selected at random from those labeled as "good articles". To receive this label, articles must be well-written, verifiable, neutral, broad in coverage, stable, and illustrated by media (https://en.wikipedia.org/wiki/Wikipedia:Content/assessment).
==============================================================================
Dataset Structure:
[corpus] folder: contains the raw text (no pre-processing) used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs were generated using the SpinBot tool (https://spinbot.com/API). The paragraph split was generated by selecting only the paragraphs with 3 or more sentences from the document split. Each folder is divided into mg (i.e., machine-generated through SpinBot) and og (i.e., original) files.
wikipedia_documents_train.tar - 4,012 original documents; 4,012 spun documents
wikipedia_paragraphs_train.gz - 98,282 original paragraphs; 102,485 spun paragraphs
wikipedia_documents_test.zip - 1,990 original documents, 1,990 spun documents
wikipedia_paragraphs_test.zip - 39,241 original paragraphs, 40,729 spun paragraphs
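The paragraph selection rule described above (keep only paragraphs with 3 or more sentences) can be sketched as follows. This is a minimal illustration, not the script used to build the corpus: the blank-line paragraph delimiter, the punctuation-based sentence splitter, and the function names are all assumptions.

```python
import re
from typing import List

# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
# The original pipeline may have used a different sentence splitter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def count_sentences(paragraph: str) -> int:
    """Count sentences in a paragraph with a naive punctuation-based split."""
    parts = [p for p in _SENTENCE_END.split(paragraph.strip()) if p]
    return len(parts)


def select_paragraphs(document: str, min_sentences: int = 3) -> List[str]:
    """Return the paragraphs of `document` with at least `min_sentences` sentences.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [p for p in paragraphs if count_sentences(p) >= min_sentences]
```

A naive splitter like this miscounts abbreviations such as "e.g."; a dedicated sentence tokenizer would be more reliable on real Wikipedia text.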
[vector.[train|test]] folder: contains the average of all word vectors for each document and paragraph. Each line holds one value per dimension of the word embedding technique used (see the paper for more details), followed by its class label (mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (as in a .csv file). The files have the .arff extension but can be read as normal .txt files.
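The averaging and line layout described above can be illustrated with a small sketch. It assumes that tokens missing from the embedding vocabulary are skipped and uses toy 2-dimensional vectors; the function names are hypothetical, and the paper remains the reference for how each technique was actually applied.

```python
from typing import Dict, List, Optional


def mean_vector(tokens: List[str],
                embeddings: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of all tokens found in `embeddings`.

    Assumption: out-of-vocabulary tokens are skipped.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # nothing to average
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def to_line(vector: List[float], label: str) -> str:
    """Serialize one averaged vector and its class label as a comma-separated line."""
    return ",".join(str(x) for x in vector) + "," + label


# Toy 2-dimensional embeddings, purely illustrative.
toy = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
line = to_line(mean_vector(["cat", "dog", "unknown"], toy), "og")
```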
The word embedding technique used is described in the file name with the following structure: --mean-.arff, where the components are:
- d2v - doc2vec
- google - word2vec
- fasttextnw - fastText without subwording
- fasttextsw - fastText with subwording
- glove - GloVe
- use - Universal Sentence Encoder
Details for each technique used can be found in the paper.
- wikipediapar - Wikipedia paragraph split
- wikipediadoc - Wikipedia document split
- train or test
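Since the vector files are comma-separated text with the class label in the last position, they can be read without any ARFF-specific library. The sketch below is one possible reader; it assumes that empty lines and lines starting with '@' or '%' (the ARFF header and comment markers) carry no data, and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_vector_lines(lines: List[str]) -> Tuple[List[List[float]], List[str]]:
    """Parse comma-separated lines of the form v1,v2,...,vN,label.

    Assumption: lines that are empty or start with '@' or '%' are
    header or comment lines and are skipped.
    """
    features: List[List[float]] = []
    labels: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line[0] in "@%":
            continue
        *values, label = line.split(",")
        features.append([float(v) for v in values])
        labels.append(label.strip())
    return features, labels


def read_vector_file(path: str) -> Tuple[List[List[float]], List[str]]:
    """Read one vector file from disk; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as handle:
        return parse_vector_lines(handle.readlines())
```

Because each file belongs to a single class, a training set would be built by reading the "mg" and "og" files separately and concatenating the results.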
[models] folder: contains the machine learning classifier models for each word embedding technique used (only for document split training). The models were exported using pickle (Python 3.6).
The grid search for hyperparameter adjustments is described in the paper.
Machine Learning models: KNN, SVM, Random Forest, Naive Bayes, and Logistic Regression
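Loading one of the pickled models might look like the sketch below. It assumes the unpickled object exposes a predict method (as scikit-learn estimators do) and that the library versions used for export are installed; the helper names are hypothetical, and no actual file name from models.zip is implied.

```python
import pickle
from typing import Any, List, Sequence


def load_model(path: str) -> Any:
    """Load a pickled model from `path`.

    Only unpickle files from a source you trust: unpickling can execute
    arbitrary code.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def classify(model: Any, vectors: Sequence[Sequence[float]]) -> List[str]:
    """Predict a class label for each averaged vector.

    Assumption: `model` has a predict(vectors) method returning one
    label per input row.
    """
    return list(model.predict(vectors))
```

Pickles written under Python 3.6 may not load cleanly under a different Python or library version, so reproducing the original environment is the safer route.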
[Human judgments] folder: contains the human evaluation of a sample of original and spun documents. It also contains the answers (keys) and the survey results.
NNNNN.txt - whole document from which an extract was taken for human evaluation
key.txt.zip - information about each case (ORIG/SPUN)
results.xlsx - raw results downloaded from the survey tool (the extracts that humans judged are in the first line)
results-corrected.xlsx - corrected results. At the very beginning of the survey, one question contained a mistake (wrong extract); the affected results were excluded.