Work Description

Title: Multi-Sense embeddings through a word sense disambiguation process

Methodology
Description
  • This data set is a collection of word similarity benchmarks (RG65, MEN3K, WordSim-353, SimLex-999, SCWS, YP-130, SimVerb-3500) in their original format and converted to a cosine similarity scale. In addition, we provide two Wikipedia dumps, from April 2010 and January 2018, in their original format (raw words), converted using the techniques described in the paper of the same title as this repository (MSSA, MSSA-D, and MSSA-NR), together with 300d and 1000d word embedding models trained using a word2vec implementation. A readme.txt with more details on each file is provided.
Creator
Depositor
  • truas@umich.edu
Contact information
Discipline
Keyword
Date coverage
  • 2010-04-08
Citations to related material
Resource type
Last modified
  • 08/21/2019
Published
  • 05/15/2019
Language
DOI
  • https://doi.org/10.7302/96kr-q988
License
To Cite this Work:
Ruas, T., Grosky, W., Aizawa, A. (2019). Multi-Sense embeddings through a word sense disambiguation process [Data set], University of Michigan - Deep Blue Data. https://doi.org/10.7302/96kr-q988

Relationships

This work is not a member of any user collections.

Files (Count: 4; Size: 61.3 GB)

Please reference the paper if you are using any of its contents:

Multi-Sense Embeddings Through a Word Sense Disambiguation Process
Ruas, T.; Grosky, W.; Aizawa, A.

Contents:
Dataset:
- original: all original datasets used in this paper
- cosine parsed: similarity scores normalized to [0,1] (see the sketch below)
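
The deposit does not spell out how the original benchmark scores were converted, so the sketch below shows one plausible conversion: min-max rescaling each benchmark's human ratings onto [0, 1]. The score ranges in SCORE_RANGES are the commonly published ones for these benchmarks and are assumptions, not values taken from this data set.

```python
# Minimal sketch of one plausible way the "cosine parsed" scores could be
# produced: rescaling each benchmark's raw similarity ratings to [0, 1].
# The per-benchmark ranges below are assumptions, not taken from this deposit.
SCORE_RANGES = {
    "rg65": (0.0, 4.0),
    "wordsim353": (0.0, 10.0),
    "simlex999": (0.0, 10.0),
    "men3k": (0.0, 50.0),
}

def to_unit_interval(score, benchmark):
    """Min-max rescale a raw benchmark score into [0, 1]."""
    low, high = SCORE_RANGES[benchmark]
    return (score - low) / (high - low)

print(to_unit_interval(7.35, "wordsim353"))  # -> 0.735
```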

Wikipedia Dump (April) 2010:
- wd10_raw_combined - one Wikipedia article per line
- wd10_raw_sep - one Wikipedia article per document
- wikidump20100408_nbsd_synsets_single.tar - used to generate synset embeddings (MSSA) - has the notation word#offset#pos
- wikidump20100408_nbsd_synsets_refined.tar - used to generate synset embeddings (MSSA1R) - has the notation word#offset#pos
- wikidump20100408_nbsd_synsets_refined2.tar - used to generate synset embeddings (MSSA2R) - has the notation word#offset#pos
- wikidump20100408_nbsd_synsets_single.tar - output of the MSSA algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikidump20100408_nbsd_synsets_refined.tar - output of the MSSA1R algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikidump20100408_nbsd_synsets_refined2.tar - output of the MSSA2R algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikipedia_cdump_20100408.tar - the actual dump (XML cleaned) from 2010
- models - contains the synset embeddings models trained using word2vec (window: 15; minimum count: 10; hierarchical sampling; CBOW; 300 and 1000 dimensions) - see the loading sketch below
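
The 300d and 1000d models in the models archive were trained with word2vec on the word#offset#pos tokens. Below is a minimal sketch of loading and querying such a model with gensim, assuming the files are in the standard word2vec format; the file name and the example key are placeholders, not names taken from this deposit.

```python
# Minimal sketch: querying a synset embedding model with gensim.
# Assumptions (not confirmed by the deposit): the model file is in the
# standard word2vec binary/text format and its vocabulary keys follow the
# word#offset#pos notation used in the training corpora.
from gensim.models import KeyedVectors

# Load a 300-dimensional model extracted from the "models" archive.
vectors = KeyedVectors.load_word2vec_format(
    "mssa_wd10_300d.bin",  # placeholder path
    binary=True,           # set to False if the model is in text format
)

# Keys are multi-sense tokens, e.g. one WordNet sense of the noun "bank".
key = "bank#09213565#n"    # illustrative key; actual offsets come from the corpus
if key in vectors:
    print(vectors[key][:5])                    # first 5 dimensions of the synset vector
    print(vectors.most_similar(key, topn=3))   # nearest synset embeddings
```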

Wikipedia Dump (January) 2018:
- wikidump20180120_nbsd_synsets_single.tar - used to generate synset embeddings (MSSA) - has the notation word#offset#pos
- wikidump20180120_dbsd_synsets_single.tar - used to generate synset embeddings (MSSA-D) - has the notation word#offset#pos
- wikidump20180120_nbsd_synsets_separate.tar - output of the MSSA algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikidump20180120_dbsd_synsets_separate.tar - output of the MSSA-D (Dijkstra) algorithm. One document per file - has the notation word \t synset \t offset \t pos
- wikipedia_cdump_20180120.tar - the actual dump (XML cleaned) from 2018
- models - contains the synset embeddings models trained using word2vec (window: 15; minimum count: 10; hierarchical sampling; CBOW; 300 and 1000 dimensions) - see the parsing and training sketch below
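
The per-document MSSA outputs use a four-column tab-separated layout (word, synset, offset, pos), while the embedding models were trained on word#offset#pos tokens with word2vec (window 15, minimum count 10, CBOW, 300 and 1000 dimensions). The sketch below is not the authors' code; it shows one way to rebuild those tokens from a document file and train a comparable gensim model. File names are placeholders, and mapping "hierarchical sampling" to gensim's hierarchical softmax is an assumption.

```python
# Sketch (not the authors' code): rebuild word#offset#pos tokens from the
# per-document MSSA output and train a comparable word2vec model with gensim.
from pathlib import Path
from gensim.models import Word2Vec

def read_mssa_document(path):
    """Return one document as a list of word#offset#pos tokens."""
    tokens = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")   # word \t synset \t offset \t pos
            if len(fields) != 4:
                continue                             # skip blank or malformed lines
            word, synset, offset, pos = fields
            tokens.append(f"{word}#{offset}#{pos}")
    return tokens

# One "sentence" per document, mirroring the one-document-per-file layout.
corpus = [read_mssa_document(p) for p in Path("wikidump20180120_nbsd_docs").glob("*.txt")]  # placeholder dir

model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # the deposit also ships 1000d models
    window=15,
    min_count=10,
    sg=0,              # CBOW
    hs=1,              # hierarchical softmax (assumed reading of "hierarchical sampling")
)
model.wv.save_word2vec_format("mssa_wd18_300d.bin", binary=True)  # placeholder output path
```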


Total work file size of 61.3 GB is too large to download directly. Consider using Globus (see below).

Globus is the platform Deep Blue Data uses to make large data sets available; it is best for data sets larger than 3 GB.
