Search Results
-
- Creator:
- Ruas, Terry, Ferreira, Charles H. P., Grosky, William, França, Fabrício O., and Medeiros, Débora M. R.
- Description:
- The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its individual words do. Recent publications in the natural language processing arena, more specifically those using word embeddings, try to incorporate semantic aspects into their word vector representations by considering the context of words and how they are distributed in a document collection. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II, that combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings into a single decoupled system. In short, our approach has three main contributions: (i) unsupervised techniques that fully integrate word embeddings and lexical chains; (ii) a more solid semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embedding models that can be extended to any natural language task. Knowledge-based systems that use natural language text can benefit from our approach to mitigate the ambiguous semantic representations produced by traditional statistical approaches. The proposed techniques are tested against seven word embedding algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration between lexical chains and word embedding representations sustains state-of-the-art results, even against more complex systems. A minimal sketch of the chunk-averaging idea appears after this record. GitHub: https://github.com/truas/LexicalChain_Builder
- Keyword:
- document classification, lexical chains, word embeddings, synset embeddings, chain2vec, and natural language processing
- Citation to related publication:
- Terry Ruas, Charles Henrique Porto Ferreira, William Grosky, Fabrício Olivetti de França, Débora Maria Rossi de Medeiros, "Enhanced word embeddings using multi-semantic representation through lexical chains", Information Sciences, 2020, https://doi.org/10.1016/j.ins.2020.04.048
- Discipline:
- Other, Science, and Engineering
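
The chunk-averaging idea behind the fixed chains above can be sketched in a few lines of Python. This is only an illustration under assumed names (a gensim KeyedVectors model and a hypothetical chunk_size parameter); the actual Fixed Lexical Chain II algorithm operates over synsets and lexical databases, as described in the paper and the GitHub repository.

```python
# Minimal sketch of the fixed-chunk idea: split a token stream into
# fixed-size chunks and represent each chunk by the average of its word
# vectors. Illustrative only; the paper's algorithm works over synsets.
import numpy as np
from gensim.models import KeyedVectors

def fixed_chunk_vectors(tokens, kv, chunk_size=4):
    """Return one averaged vector per fixed-size chunk of `tokens`."""
    chains = []
    for i in range(0, len(tokens), chunk_size):
        chunk = [t for t in tokens[i:i + chunk_size] if t in kv]
        if chunk:  # skip chunks with no in-vocabulary words
            chains.append(np.mean([kv[t] for t in chunk], axis=0))
    return chains

# kv = KeyedVectors.load_word2vec_format("model.bin", binary=True)
# doc_vectors = fixed_chunk_vectors("the quick brown fox ...".split(), kv)
```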
-
- Creator:
- Ruas, Terry, Grosky, William, and Aizawa, Akiko
- Description:
- This data set is a collection of word similarity benchmarks (RG65, MEN3K, WordSim-353, SimLex-999, SCWS, YP-130, SimVerb-3500) in their original format and converted into a cosine similarity scale (an illustrative conversion sketch follows this record). In addition, we provide two Wikipedia dumps, from April 2010 and January 2018, in their original format (raw words) and converted using the techniques described in the related publication (MSSA, MSSA-D, and MSSA-NR), along with 300d and 1000d word embedding models trained with a word2vec implementation. A readme.txt is provided with more details for each file.
- Keyword:
- multi-sense embeddings, MSSA, word2vec, wikipedia dump, synset, and natural language processing
- Citation to related publication:
- Terry Ruas, William Grosky, Akiko Aizawa, "Multi-sense embeddings through a word sense disambiguation process", Expert Systems with Applications, 2019, https://doi.org/10.1016/j.eswa.2019.06.026
- Discipline:
- Other
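
The exact mapping from human ratings to a cosine similarity scale is documented in the dataset's readme.txt; as a hedged illustration only, a simple linear rescaling onto [-1, 1] would look like this in Python:

```python
# An assumed linear rescaling of human similarity ratings onto the
# cosine-similarity range [-1, 1]. Consult the dataset's readme.txt for
# the conversion actually used; this function is only illustrative.
def to_cosine_scale(score: float, lo: float, hi: float) -> float:
    """Map a rating in [lo, hi] (e.g., 0-10 for WordSim-353) to [-1, 1]."""
    return 2.0 * (score - lo) / (hi - lo) - 1.0

print(to_cosine_scale(7.35, 0.0, 10.0))  # ~0.47
```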
-
- Creator:
- Foltynek, Tomas, Ruas, Terry, Scharpf, Philipp, Meuschke, Norman, Schubotz, Moritz, Grosky, William, and Gipp, Bela
- Description:
- This data set comprises multiple folders. The corpus folder contains the raw text used for training and testing in two splits, "document" and "paragraph"; the spun documents and paragraphs were generated with the SpinBot tool (https://spinbot.com/API), and the paragraph split was produced by selecting only paragraphs with 3 or more sentences from the document split. Each folder is divided into mg (i.e., machine generated through SpinBot) and og (i.e., original) files. The human judgement folder contains the human evaluation of a sample of original versus spun documents, together with the answer keys and the survey results. The models folder contains the machine learning classifier models for each word embedding technique used (trained on the document split only); the models were exported using pickle (Python 3.6), and the grid search for hyperparameter adjustment is described in the paper. The vector folders (train and test) contain the average of all word vectors for each document and paragraph: each line has as many values as the word embedding technique has dimensions (see the paper for more details), followed by its class label (mg or og), and each file belongs to a single class. The values are comma-separated (.csv); the files carry the .arff extension but can be read as plain .txt files. A minimal loading sketch follows this record.
- Keyword:
- paraphrase detection, plagiarism detection, document classification, and word embeddings
- Citation to related publication:
- Foltýnek, T. & Ruas, T. & Scharpf, P. & Meuschke, N. & Schubotz, M. & Grosky, W. & Gipp, B., “Detecting Machine-obfuscated Plagiarism,” in Sustainable Digital Communities, vol. 12051 LNCS, Springer, 2020, pp. 816–827. https://doi.org/10.1007/978-3-030-43687-2_68
- Discipline:
- General Information Sources
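
As a hedged illustration of the vector file layout described above (comma-separated values, final field the mg/og label, .arff extension readable as plain text), the following Python sketch loads one file and scores a pickled classifier. The file names are hypothetical.

```python
# Minimal sketch: read one vector file (comma-separated, last field is
# the class label "mg" or "og") and score a pickled classifier from the
# models folder. Paths below are hypothetical placeholders.
import pickle
import numpy as np

def load_vectors(path):
    X, y = [], []
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            *values, label = line.strip().split(",")
            X.append([float(v) for v in values])
            y.append(label)  # "mg" (machine generated) or "og" (original)
    return np.array(X), np.array(y)

# X_test, y_test = load_vectors("test/glove_document.arff")
# with open("models/glove_document.pkl", "rb") as fh:
#     clf = pickle.load(fh)  # exported with pickle under Python 3.6
# print(clf.score(X_test, y_test))
```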