Work Description

Title: Enhanced word embeddings using multi-semantic representation through lexical chains (Open Access)
  • Title: Enhanced word embeddings using multi-semantic representation through lexical chains
  • Description: document-vectors: the benchmark datasets (documents) were converted into vectors using the word embeddings models referenced in this work; the models used to parse documents into document-vectors were word2vec (Google News), LDA, GloVe, fastText, USE, and ELMo (details and descriptions are in the original paper linked to this dataset). synset-models: the proposed synset embeddings, trained from the synset corpus with a word2vec implementation (300 dimensions, CBOW training model, window size 15, minimum count 10, hierarchical softmax); parameters not listed use their default values. Techniques used: FLLC + MSSA-0R, FLLC + MSSA-1R, FLLC + MSSA-2R; FXLC2 + MSSA-0R, FXLC2 + MSSA-1R, FXLC2 + MSSA-2R; FXLC4 + MSSA-0R, FXLC4 + MSSA-1R, FXLC4 + MSSA-2R; FXLC8 + MSSA-0R, FXLC8 + MSSA-1R, FXLC8 + MSSA-2R. The MSSA techniques are based on the paper "Multi-Sense Embeddings through a Word Sense Disambiguation Process" by Ruas, Terry; Grosky, William; and Aizawa, Akiko.
  • Abstract: The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its individual words do. Recent publications in natural language processing, more specifically those using word embeddings, try to incorporate semantic aspects into their word vector representations by considering the context of words and how they are distributed in a document collection. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II, that combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings into a single decoupled system. In short, our approach has three main contributions: (i) unsupervised techniques that fully integrate word embeddings and lexical chains; (ii) a more solid semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. Knowledge-based systems that use natural language text can benefit from our approach to mitigate the ambiguous semantic representations produced by traditional statistical approaches. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration of lexical chains and word embeddings representations sustains state-of-the-art results, even against more complex systems. Github:
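The description above says each benchmark document was converted into a single vector ("document-vectors") using the listed word embeddings models. The conversion code is not reproduced on this page, but one common way to build such a document vector is to average the embeddings of the document's tokens. The sketch below illustrates that idea only; the function name and the toy 3-dimensional embedding table are illustrative assumptions, not the authors' code or their actual 300-dimensional models.

```python
# Hypothetical sketch: build a document vector by averaging token embeddings.
# The toy embedding table stands in for e.g. word2vec or the synset models.

def document_vector(tokens, embeddings):
    """Average the embedding vectors of the tokens present in `embeddings`.

    Tokens without an embedding (out-of-vocabulary) are skipped.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token in the document has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Made-up 3-dimensional embeddings for illustration only.
toy_embeddings = {
    "lexical": [1.0, 0.0, 2.0],
    "chain":   [3.0, 2.0, 0.0],
}

print(document_vector(["lexical", "chain", "oov"], toy_embeddings))
# -> [2.0, 1.0, 1.0]  ("oov" has no embedding and is skipped)
```

Other pooling strategies (e.g. weighted averages) are also common; the dataset's linked paper describes the exact procedure used for these document vectors.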
Contact information
Funding agency
  • Other Funding Agency
Other Funding agency
  • This work was partially supported by the Science Without Borders Brazilian Government Scholarship Program, CNPq [grant number 205581/2014-5]; Charles Henrique Porto Ferreira was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (Capes), Programa de Doutorado Sanduíche no Exterior (PDSE), [grant number 88881.186965/2018-01].
Date coverage
  • 2019-04-25
Citations to related material
Resource type
Last modified
  • 08/21/2019
  • 05/15/2019
To Cite this Work:
Ruas, T., Ferreira, C., Grosky, W., França, F., Medeiros, D. (2019). Enhanced word embeddings using multi-semantic representation through lexical chains [Data set]. University of Michigan - Deep Blue.


Files (Count: 3; Size: 6.78 GB)


Total work file size of 6.78 GB may be too large to download directly. Consider using Globus (see below).

Globus is best for data sets larger than 3 GB; it is the platform Deep Blue Data uses to make large data sets available.