Work Description
Title: Multi-Sense embeddings through a word sense disambiguation process (Open Access)
(2019). Multi-Sense embeddings through a word sense disambiguation process [Data set], University of Michigan - Deep Blue Data. https://doi.org/10.7302/96kr-q988
Files (Count: 4; Size: 61.3 GB)
Title | Date | File Size | Access |
---|---|---|---|
readme.txt | 2019-05-02 | 2.3 KB | Open Access |
Datasets.zip | 2019-05-02 | 1.82 MB | Open Access |
wikpedia-2010.zip | 2019-05-02 | 34.2 GB | Open Access |
wikpedia-2018.zip | 2019-05-02 | 27.2 GB | Open Access |
Please reference the paper if you are using any of its contents:
Multi-Sense Embeddings Through a Word Sense Disambiguation Process
Ruas, T.; Grosky, I.; Aizawa, A.
Contents:
Dataset:
- original: all original datasets used in this paper
- cosine parsed: similarity scores normalized to the range [0, 1]
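The readme does not say how the scores were normalized. As a minimal sketch, assuming a plain linear rescaling from each dataset's own rating scale to [0, 1], the step could look like this. The function name `rescale` and the 0 to 10 scale in the example are illustrative assumptions, not taken from the dataset.

```python
from typing import List


def rescale(scores: List[float], low: float, high: float) -> List[float]:
    """Linearly map scores from the scale [low, high] onto [0, 1].

    Assumption: `low` and `high` are the bounds of the rating scale
    used by the benchmark, not the observed minimum and maximum.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    span = high - low
    return [(s - low) / span for s in scores]


# Hypothetical ratings on a 0-10 scale.
print(rescale([0.0, 5.0, 10.0], 0.0, 10.0))
```

Rescaling against the scale bounds rather than the observed extremes keeps scores comparable across files that share the same rating scale.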
Wikipedia Dump (April) 2010:
- wd10_raw_combined - one Wikipedia article per line
- wd10_raw_sep - one Wikipedia article per document
- wikidump20100408_nbsd_synsets_single.tar - used to generate synset embeddings (MSSA); uses the notation word#offset#pos
- wikidump20100408_nbsd_synsets_refined.tar - used to generate synset embeddings (MSSA1R); uses the notation word#offset#pos
- wikidump20100408_nbsd_synsets_refined2.tar - used to generate synset embeddings (MSSA2R); uses the notation word#offset#pos
- wikidump20100408_nbsd_synsets_single.tar - output of the MSSA algorithm. One document per file; uses the notation word \t synset \t offset \t pos
- wikidump20100408_nbsd_synsets_refined.tar - output of the MSSA1R algorithm. One document per file; uses the notation word \t synset \t offset \t pos
- wikidump20100408_nbsd_synsets_refined2.tar - output of the MSSA2R algorithm. One document per file; uses the notation word \t synset \t offset \t pos
- wikipedia_cdump_20100408.tar - the actual dump (XML cleaned) from 2010
- models - contains the synset embedding models trained using word2vec (window: 15; minimum count: 10; hierarchical sampling; CBOW; 300 and 1000 dimensions)
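Tokens in the word#offset#pos notation can be split back into their three fields. The following is a minimal sketch: it assumes tokens on a line are separated by whitespace, and the sample tokens are made up for illustration. `SynsetToken`, `parse_token`, and `parse_line` are hypothetical helper names, not part of the dataset or of any library.

```python
from typing import List, NamedTuple


class SynsetToken(NamedTuple):
    word: str
    offset: str  # kept as a string so any zero padding is preserved
    pos: str


def parse_token(token: str) -> SynsetToken:
    """Split one word#offset#pos token into its fields."""
    # Split from the right so a '#' inside the word is not misread.
    parts = token.rsplit("#", 2)
    if len(parts) != 3:
        raise ValueError(f"not in word#offset#pos form: {token!r}")
    return SynsetToken(*parts)


def parse_line(line: str) -> List[SynsetToken]:
    """Parse a whitespace-separated line of tokens (an assumption)."""
    return [parse_token(t) for t in line.split()]


# Illustrative tokens only; real offsets come from the archives.
for tok in parse_line("dog#02084071#n bark#07376621#n"):
    print(tok.word, tok.offset, tok.pos)
```

Keeping the offset as a string avoids silently dropping leading zeros if the files store zero-padded identifiers.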
Wikipedia Dump (January) 2018:
- wikidump20180120_nbsd_synsets_single.tar - used to generate synset embeddings (MSSA); uses the notation word#offset#pos
- wikidump20180120_dbsd_synsets_single.tar - used to generate synset embeddings (MSSA-D); uses the notation word#offset#pos
- wikidump20180120_nbsd_synsets_separate.tar - output of the MSSA algorithm. One document per file; uses the notation word \t synset \t offset \t pos
- wikidump20180120_dbsd_synsets_separate.tar - output of the MSSA-D (Dijkstra) algorithm. One document per file; uses the notation word \t synset \t offset \t pos
- wikipedia_cdump_20180120.tar - the actual dump (XML cleaned) from 2018
- models - contains the synset embedding models trained using word2vec (window: 15; minimum count: 10; hierarchical sampling; CBOW; 300 and 1000 dimensions)
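The tab-separated algorithm output (word, synset, offset, pos) and the word#offset#pos notation carry the same word, offset, and pos fields. As a sketch of how one might map the former onto the latter, assuming one four-column record per line, the conversion could look like this. The helper name `rows_to_tokens` and the sample rows, including the synset name format, are illustrative assumptions.

```python
import io
from typing import Iterable, List


def rows_to_tokens(lines: Iterable[str]) -> List[str]:
    """Turn word<TAB>synset<TAB>offset<TAB>pos lines into word#offset#pos tokens."""
    tokens = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        word, _synset, offset, pos = fields
        tokens.append(f"{word}#{offset}#{pos}")
    return tokens


# Made-up sample standing in for one extracted document file.
sample = io.StringIO(
    "dog\tdog.n.01\t02084071\tn\n"
    "bark\tbark.n.04\t07376621\tn\n"
)
print(" ".join(rows_to_tokens(sample)))
```

Passing an open file handle instead of the `io.StringIO` sample would process a document from one of the extracted archives line by line.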