Work Description

Title: Enhanced word embeddings using multi-semantic representation through lexical chains (Open Access, Deposited)

Attribute Value
Methodology
  • Title: Enhanced word embeddings using multi-semantic representation through lexical chains
    document-vectors: The benchmark datasets (documents) were converted into vectors using the word embeddings models referenced in this work. The proposed synset embeddings are located under the synset-models folder.
    Word embeddings used to parse documents into document-vectors: word2vec (Google News), LDA, GloVe, fastText, USE, ELMo. Details and descriptions are in the original paper linked to this dataset.
    synset-models: synset corpus trained with a word2vec implementation (300 dimensions, CBOW training model, window size 15, minimum count 10, hierarchical softmax). Parameters not referenced use their default values (https://radimrehurek.com/gensim/models/word2vec.html).
    Techniques used: FLLC + MSSA-0R, FLLC + MSSA-1R, FLLC + MSSA-2R; FXLC2 + MSSA-0R, FXLC2 + MSSA-1R, FXLC2 + MSSA-2R; FXLC4 + MSSA-0R, FXLC4 + MSSA-1R, FXLC4 + MSSA-2R; FXLC8 + MSSA-0R, FXLC8 + MSSA-1R, FXLC8 + MSSA-2R.
    The MSSA techniques used are based on the paper "Multi-Sense Embeddings Through a Word Sense Disambiguation Process" by Ruas, Terry; Grosky, William; Aizawa, Akiko. https://github.com/truas/MSSA
Description
  • The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. Recent publications in the natural language processing arena, more specifically using word embeddings, try to incorporate semantic aspects into their word vector representation by considering the context of words and how they are distributed in a document collection. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II, that combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings into a single decoupled system. In short, our approach has three main contributions: (i) unsupervised techniques that fully integrate word embeddings and lexical chains; (ii) a more solid semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. Knowledge-based systems that use natural language text can benefit from our approach to mitigate ambiguous semantic representations provided by traditional statistical approaches. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration between lexical chains and word embeddings representations sustains state-of-the-art results, even against more complex systems. Github: https://github.com/truas/LexicalChain_Builder
Creator
Depositor
  • truas@umich.edu
Contact information
Discipline
Funding agency
  • Other Funding Agency
Other Funding agency
  • This work was partially supported by the Science Without Borders Brazilian Government Scholarship Program, CNPq [grant number 205581/2014-5]; Charles Henrique Porto Ferreira was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (Capes), Programa de Doutorado Sanduíche no Exterior (PDSE), [grant number 88881.186965/2018-01].
Keyword
Date coverage
  • 2019-04-25
Citations to related material
  • Terry Ruas, Charles Henrique Porto Ferreira, William Grosky, Fabrício Olivetti de França, Débora Maria Rossi de Medeiros, "Enhanced word embeddings using multi-semantic representation through lexical chains", Information Sciences, 2020, https://doi.org/10.1016/j.ins.2020.04.048
Resource type
Curation notes
  • Added a citation in the "Citation to Related Materials" metadata field on May 8, 2020.
Last modified
  • 05/08/2020
Published
  • 05/15/2019
Language
DOI
  • https://doi.org/10.7302/3bp3-wa36
License
To Cite this Work:
Ruas, T., Ferreira, C. H. P., Grosky, W., França, F. O., Medeiros, D. M. R. (2019). Enhanced word embeddings using multi-semantic representation through lexical chains [Data set], University of Michigan - Deep Blue Data. https://doi.org/10.7302/3bp3-wa36

Relationships

This work is not a member of any user collections.

Files (Count: 3; Size: 6.78 GB)

AUTHORSHIP
Title: Enhanced word embeddings using multi-semantic representation through lexical chains

Paper published in Information Sciences (Elsevier); see "Citations to related material" above.

Authors: Terry Ruas* [1], Charles Henrique Porto Ferreira* [2], William Grosky[1], Fabricio Olivetti de Franca[2], Debora Maria Rossi de Medeiros [2]

Affiliation: [1] University of Michigan - Dearborn, 4901 Evergreen Rd, Dearborn, MI 48128, USA
[2] Federal University of ABC, Av. dos Estados, 5001 - Bangu, Santo Andre, SP, 09210-580, Brazil


*Corresponding authors: truas@umich.edu (Terry Ruas), charles.ferreira@ufabc.edu.br (Charles Henrique Porto Ferreira)
==================================================================================

==================================================================================
STRUCTURE/INVENTORY

Folder: documents.zip
Content: --mean.arff (train/test when available)
.arff can be read as a text (.txt) file (UTF-8)
comma separated (",")
.arff file contains the vector representation for each dataset used
Each line represents a document in the referenced dataset, in which all words are cleaned (lowercased, common English stopwords removed, punctuation removed), transformed into vectors (according to the corresponding word embeddings model), and averaged.
Do not use .arff in Weka (these files have no header)
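
For illustration only (not part of this deposit or the authors' code), the Python sketch below reads one of these headerless, comma-separated .arff files and splits the averaged document vectors from the class label in the last column; the file name is a placeholder.

    # Hedged sketch: read a headerless, comma-separated .arff document-vector file.
    import numpy as np

    rows = []
    with open("bbc-google-mean.arff", encoding="utf-8") as f:  # hypothetical file name
        for line in f:
            line = line.strip()
            if line:
                rows.append(line.split(","))

    X = np.array([row[:-1] for row in rows], dtype=float)  # averaged document vectors
    y = [row[-1] for row in rows]                          # class label (last column)
    print(X.shape, len(y))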

Datasets: 20Newsgroups (train/test) - http://qwone.com/~jason/20Newsgroups/
BBC - http://mlg.ucd.ie/datasets/bbc.html
Ohsumed - http://disi.unitn.it/moschitti/corpora.htm
Reuters (train/test) - http://disi.unitn.it/moschitti/corpora.htm
ScyClusters and ScyGenes - [3], [4]


The last column contains the label for that line-document, representing the class of the document.
Word embeddings models and techniques used to build the document vectors:
ELMo https://tfhub.dev/google/elmo/2
fastText https://fasttext.cc/docs/en/english-vectors.html
GloVe https://nlp.stanford.edu/projects/glove/
Google (word2vec) https://code.google.com/archive/p/word2vec/
LDA implemented using gensim (https://radimrehurek.com/gensim/models/ldamodel.html), using [5] as a training corpus
USE https://tfhub.dev/google/universal-sentence-encoder/2
MSSA: https://github.com/truas/MSSA and https://github.com/truas/MSSA_Parser, using [5] as a training corpus
FXLC/FLLC: https://github.com/truas/LexicalChain_Builder, using [5] as a training corpus

Folder: synset-models.zip
Content: pre-trained word embeddings models trained with a synset corpus (MSSA, FLLC and FXLC output) in a word2vec implementation
.model
.model.vectors.npy

Use: These files can be loaded using the gensim word2vec API (Word2Vec.load)
https://radimrehurek.com/gensim/models/word2vec.html
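
A minimal loading sketch, assuming gensim 3.x or later; the model file name is a placeholder, not a file name from this deposit:

    from gensim.models import Word2Vec

    # Keep the matching *.model.vectors.npy file in the same folder as the *.model file.
    model = Word2Vec.load("fllc_mssa_1r.model")  # hypothetical file name
    print(model.wv.vector_size)                  # 300 for these models
    # Keys in model.wv are synset identifiers produced by the MSSA/FLLC/FXLC pipelines;
    # see the GitHub repositories below for their exact format before querying the model.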

Techniques used: MSSA-NR: https://github.com/truas/MSSA and https://github.com/truas/MSSA_Parser
FXLC/FLLC: https://github.com/truas/LexicalChain_Builder
FXLC uses chunk sizes of 2 and 4
The GitHub repositories have detailed information on how to generate/apply these techniques.
*.model: binary model exported by the word embeddings techniques.
*.model.vectors.npy: NumPy vector file that needs to be in the same folder as its corresponding *.model

=================================================================================
INFORMATION

Overview:
The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. Recent publications in the natural language processing arena, more specifically using word embeddings, try to incorporate semantic aspects into their word vector representation by considering the context of words and how they are distributed in a document collection. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II, that combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings into a single decoupled system. In short, our approach has three main contributions: (i) unsupervised techniques that fully integrate word embeddings and lexical chains; (ii) a more solid semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. Knowledge-based systems that use natural language text can benefit from our approach to mitigate ambiguous semantic representations provided by traditional statistical approaches. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration between lexical chains and word embeddings representations sustains state-of-the-art results, even against more complex systems.

Keywords: Lexical chains, natural language processing, word embeddings,
document classification, synsets

METHODOLOGY
Each line in the .arff represents a document in the dataset, in which all words are cleaned (lowercased, common English stopwords removed, punctuation removed), transformed into vectors (according to the corresponding word embeddings model), and averaged. The final result is a vector average that represents the entire document. More information on the techniques used can be found in the DETAILS section.
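
As a rough sketch of this preprocessing and averaging step (not the authors' implementation; the stopword list and the embeddings lookup are placeholders):

    import string
    import numpy as np

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # illustrative subset only

    def document_vector(text, embeddings, dim=300):
        """embeddings: any mapping from word to a NumPy vector (e.g., gensim KeyedVectors)."""
        # lowercase and remove punctuation
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        # remove common English stopwords
        words = [w for w in text.split() if w not in STOPWORDS]
        # look up each remaining word and average the vectors (the "mean" representation)
        vectors = [embeddings[w] for w in words if w in embeddings]
        return np.mean(vectors, axis=0) if vectors else np.zeros(dim)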

The MSSA, FLLC, and FXLC synset models are produced using the English Wikipedia dump from 2010 [5].

The pre-trained word embeddings models are referenced in the STRUCTURE/INVENTORY section above.

DEFINITIONS
model - word embeddings model applied
mean - vectors for a given document are averaged instead of summed
synset - set of synonyms
csv - comma separated values

=================================================================================

DETAILS

document-vectors: The benchmark datasets (documents) were converted into vectors using the word embeddings models referenced in this work. The proposed synset embeddings are located under the synset-models folder.
Word embeddings used to parse documents -> document-vectors: word2vec (Google News), LDA, GloVe, fastText, USE, ELMo. Details and descriptions are in the original paper linked to this dataset.
300 dimensions: word2vec (Google News), LDA, GloVe, fastText, MSSA, FLLC and FXLC vectors
512 dimensions: USE
1024 dimensions: ELMo
Each line represents a document
The last column represents the label (class) of each document

synset-models: synset corpus trained with a word2vec implementation (300 dimensions, CBOW training model, window size 15, minimum count 10, hierarchical softmax). Parameters not referenced use their default values (https://radimrehurek.com/gensim/models/word2vec.html).
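
A minimal training sketch with the hyperparameters listed above, assuming gensim's Word2Vec implementation; the corpus path is a placeholder (each line would be one document of synset tokens produced by MSSA/FLLC/FXLC), and gensim 4.x renames the size argument to vector_size:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    corpus = LineSentence("synset_corpus.txt")  # hypothetical path to a synset-token corpus
    model = Word2Vec(
        corpus,
        size=300,      # 300 dimensions
        sg=0,          # CBOW training model
        window=15,     # window size 15
        min_count=10,  # minimum count 10
        hs=1,          # hierarchical softmax
    )
    model.save("synset_corpus.model")  # gensim also writes a *.model.vectors.npy file for large models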

Techniques used*: FLLC + MSSA-0R, FLLC + MSSA-1R, FLLC + MSSA-2R
FXLC2 + MSSA-0R, FXLC2 + MSSA-1R, FXLC2 + MSSA-2R
FXLC4 + MSSA-0R, FXLC4 + MSSA-1R, FXLC4 + MSSA-2R
FXLC8 + MSSA-0R, FXLC8 + MSSA-1R, FXLC8 + MSSA-2R - the models for this last group are not provided (poor results)

*The MSSA techniques used are based on the paper "Multi-Sense Embeddings Through a Word Sense Disambiguation Process" by Ruas, Terry; Grosky, William; Aizawa, Akiko. https://github.com/truas/MSSA

REFERENCES
[3] D. M. R. Medeiros, A. C. P. L. F. Carvalho, Gene clusters analysis using text mining, in: WOB - Third Workshop on Bioinformatics, SBC, Brasilia - DF, 2004, pp. 141–144.

[4] D. M. R. Medeiros, A. C. P. L. F. Carvalho, Applying text mining and machine learning techniques to gene clusters analysis, in: ICCIMA '05: Proceedings of the Sixth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'05), IEEE Computer Society, Washington, DC, USA, 2005, pp. 23–28.

[5] C. Shaoul, C. Westbury, The Westbury Lab Wikipedia Corpus, 2010. URL: http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html.

