Work Description

Title: Multi-Sense embeddings through a word sense disambiguation process

Methodology
Description
  • This data set is a collection of word similarity benchmarks (RG65, MEN3K, WordSim-353, SimLex-999, SCWS, YP-130, SimVerb-3500) in their original format and converted to a cosine similarity scale. In addition, we provide two Wikipedia dumps, from April 2010 and January 2018, in their original format (raw words), converted using the techniques described in the paper of the same title as this repository (MSSA, MSSA-D, and MSSA-NR), together with 300d and 1000d word embedding models trained using a word2vec implementation. A readme.txt with more details on each file is provided.
Creator
Depositor
  • truas@umich.edu
Contact information
Discipline
Keyword
Date coverage
  • 2010-04-08
Citations to related material
Resource type
Last modified
  • 08/21/2019
Published
  • 05/15/2019
Language
DOI
  • https://doi.org/10.7302/96kr-q988
License
To Cite this Work:
Ruas, T., Grosky, W., Aizawa, A. (2019). Multi-Sense embeddings through a word sense disambiguation process [Data set], University of Michigan - Deep Blue Data. https://doi.org/10.7302/96kr-q988

Relationships

This work is not a member of any user collections.

Files (Count: 4; Size: 61.3 GB)

Please reference the paper if you are using any of its contents:

Multi-Sense Embeddings Through a Word Sense Disambiguation Process
Ruas, T.; Grosky, W.; Aizawa, A.

Contents:
Dataset:
- original: all original datasets used in this paper
- cosine parsed: similarity scores normalized to [0,1] (see the sketch below)
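
The deposit does not spell out how the original benchmark scores were converted, so the sketch below shows one plausible conversion: min-max rescaling each benchmark's human ratings onto [0, 1]. The score ranges in SCORE_RANGES are the commonly published ones for these benchmarks and are assumptions, not values taken from this data set.

```python
# Minimal sketch of one plausible way the "cosine parsed" scores could be
# produced: rescaling each benchmark's raw similarity ratings to [0, 1].
# The per-benchmark ranges below are assumptions, not taken from this deposit.
SCORE_RANGES = {
    "rg65": (0.0, 4.0),
    "wordsim353": (0.0, 10.0),
    "simlex999": (0.0, 10.0),
    "men3k": (0.0, 50.0),
}

def to_unit_interval(score, benchmark):
    """Min-max rescale a raw benchmark score into [0, 1]."""
    low, high = SCORE_RANGES[benchmark]
    return (score - low) / (high - low)

print(to_unit_interval(7.35, "wordsim353"))  # -> 0.735
```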

Wikipedia Dump (April) 2010:
- wd10_raw_combined - one Wikipedia article per line
- wd10_raw_sep - one Wikipedia article per document
- wikidump20100408_nbsd_synsets_single.tar - used to generate synset embeddings (MSSA) - has the notation word#offset#pos
- wikidump20100408_nbsd_synsets_refined.tar - used to generate synset embeddings (MSSA1R) - has the notation word#offset#pos
- wikidump20100408_nbsd_synsets_refined2.tar - used to generate synset embeddings (MSSA2R) - has the notation word#offset#pos
- wikidump20100408_nbsd_synsets_single.tar - output of the MSSA algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikidump20100408_nbsd_synsets_refined.tar - output of the MSSA1R algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikidump20100408_nbsd_synsets_refined2.tar - output of the MSSA2R algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikipedia_cdump_20100408.tar - the actual dump (XML cleaned) from 2010
- models - contains the synset embeddings models trained using word2vec (window: 15; minimum count: 10; hierarchical sampling; CBOW; 300 and 1000 dimensions) - see the loading sketch below
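
The 300d and 1000d models in the models archive were trained with word2vec on the word#offset#pos tokens. Below is a minimal sketch of loading and querying such a model with gensim, assuming the files are in the standard word2vec format; the file name and the example key are placeholders, not names taken from this deposit.

```python
# Minimal sketch: querying a synset embedding model with gensim.
# Assumptions (not confirmed by the deposit): the model file is in the
# standard word2vec binary/text format and its vocabulary keys follow the
# word#offset#pos notation used in the training corpora.
from gensim.models import KeyedVectors

# Load a 300-dimensional model extracted from the "models" archive.
vectors = KeyedVectors.load_word2vec_format(
    "mssa_wd10_300d.bin",  # placeholder path
    binary=True,           # set to False if the model is in text format
)

# Keys are multi-sense tokens, e.g. one WordNet sense of the noun "bank".
key = "bank#09213565#n"    # illustrative key; actual offsets come from the corpus
if key in vectors:
    print(vectors[key][:5])                    # first 5 dimensions of the synset vector
    print(vectors.most_similar(key, topn=3))   # nearest synset embeddings
```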

Wikipedia Dump (January) 2018:
- wikidump20180120_nbsd_synsets_single.tar - used to generate synset embeddings (MSSA) - has the notation word#offset#pos
- wikidump20180120_dbsd_synsets_single.tar - used to generate synset embeddings (MSSA-D) - has the notation word#offset#pos
- wikidump20180120_nbsd_synsets_separate.tar - output of the MSSA algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikidump20180120_dbsd_synsets_separate.tar - output of the MSSA-D (Dijkstra) algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikipedia_cdump_20180120.tar - the actual dump (XML cleaned) from 2018
- models - contains the synset embeddings models trained using word2vec (window: 15; minimum count: 10; hierarchical sampling; CBOW; 300 and 1000 dimensions) - see the parsing and training sketch below
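
The per-document MSSA outputs use a four-column tab-separated layout (word, synset, offset, pos), while the embedding models were trained on word#offset#pos tokens with word2vec (window 15, minimum count 10, CBOW, 300 and 1000 dimensions). The sketch below is not the authors' code; it shows one way to rebuild those tokens from a document file and train a comparable gensim model. File names are placeholders, and mapping "hierarchical sampling" to gensim's hierarchical softmax is an assumption.

```python
# Sketch (not the authors' code): rebuild word#offset#pos tokens from the
# per-document MSSA output and train a comparable word2vec model with gensim.
from pathlib import Path
from gensim.models import Word2Vec

def read_mssa_document(path):
    """Return one document as a list of word#offset#pos tokens."""
    tokens = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")   # word \t synset \t offset \t pos
            if len(fields) != 4:
                continue                             # skip blank or malformed lines
            word, synset, offset, pos = fields
            tokens.append(f"{word}#{offset}#{pos}")
    return tokens

# One "sentence" per document, mirroring the one-document-per-file layout.
corpus = [read_mssa_document(p) for p in Path("wikidump20180120_nbsd_docs").glob("*.txt")]  # placeholder dir

model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # the deposit also ships 1000d models
    window=15,
    min_count=10,
    sg=0,              # CBOW
    hs=1,              # hierarchical softmax (assumed reading of "hierarchical sampling")
)
model.wv.save_word2vec_format("mssa_wd18_300d.bin", binary=True)  # placeholder output path
```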


Total work file size of 61.3 GB is too large to download directly. Consider using Globus (see below).

Globus is the platform Deep Blue Data uses to make large data sets available; it is best for data sets larger than 3 GB.
