chiVe: Japanese Word Embedding with Sudachi

日本語 README

Abstract

"chiVe" (Sudachi Vector) is a Japanese pre-trained word embedding resource using large-scale corpus and multi-granular tokenization.

The vectors were trained with word2vec (gensim), using the skip-gram algorithm.

For v1.0-v1.2, we used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages, as the training corpus. For v1.3, we used texts taken from CommonCrawl.

We used Sudachi by Works Applications for tokenization, and trained the word vectors on Sudachi's multi-granular tokenization results of the corpus.
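To illustrate, here is a minimal sketch of this kind of pipeline with SudachiPy and gensim; the actual chiVe training setup is described under training and may differ, and corpus.txt is a hypothetical file name. Sudachi's three split modes (A, B, C) provide the multiple granularities; the sketch trains on a single mode for brevity, whereas chiVe trains on the multi-granular results.

from gensim.models import Word2Vec
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()

# Sudachi's three granularities for the same string
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print([m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)])
# ['国家', '公務', '員'] / ['国家', '公務員'] / ['国家公務員']

# Skip-gram training on tokenized, normalized sentences (one mode here for brevity)
mode = tokenizer.Tokenizer.SplitMode.C
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [[m.normalized_form() for m in tokenizer_obj.tokenize(line.strip(), mode)]
                 for line in f]
model = Word2Vec(sentences, vector_size=300, sg=1, min_count=5)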

Data

Data are generously hosted by AWS through their Open Data Sponsorship Program.

| Version | Min Count | Normalized | Vocab | Text | gensim | Magnitude |
| --- | --- | --- | --- | --- | --- | --- |
| v1.3 mc5 | 5 | o | 2,530,791 | 3.6GB (tar.gz) | 2.9GB (tar.gz) | - |
| v1.3 mc15 | 15 | o | 1,186,019 | 1.7GB (tar.gz) | 1.3GB (tar.gz) | - |
| v1.3 mc30 | 30 | o | 759,011 | 1.1GB (tar.gz) | 0.8GB (tar.gz) | - |
| v1.3 mc90 | 90 | o | 410,533 | 0.6GB (tar.gz) | 0.5GB (tar.gz) | - |
| v1.2 mc5 | 5 | o | 3,197,456 | 9.2GB (tar.gz) | 3.8GB (tar.gz) | 5.5GB (.magnitude) |
| v1.2 mc15 | 15 | o | 1,454,280 | 5.0GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
| v1.2 mc30 | 30 | o | 912,550 | 3.1GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
| v1.2 mc90 | 90 | o | 482,223 | 1.7GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
| v1.1 mc5 | 5 | o | 3,196,481 | 11GB (tar.gz) | 3.6GB (tar.gz) | 5.5GB (.magnitude) |
| v1.1 mc15 | 15 | o | 1,452,205 | 4.7GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
| v1.1 mc30 | 30 | o | 910,424 | 3.0GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
| v1.1 mc90 | 90 | o | 480,443 | 1.6GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
| v1.0 mc5 | 5 | x | 3,644,628 | 12GB (tar.gz) | 4.1GB (tar.gz) | 6.3GB (.magnitude) |

All vectors have 300 dimensions.

"Min Count" indicates the number of minimum appearance count in the training corpus (min_count in gensim).

"Normalized" indicates if the text is normalized using the tokenizer Sudachi. For example, words 空き缶, 空缶, 空き罐, 空罐, 空きカン, 空きかん will all be normalized to 空き缶.

| Version | Sudachi | SudachiDict | Training Corpus |
| --- | --- | --- | --- |
| v1.3 | v0.6.8 | 20240109-core | CommonCrawl (CC-MAIN-2022-40, warc, first 20k files) |
| v1.2 | v0.4.3 | 20200722-core | NWJC |
| v1.1 | v0.3.0 | 20191030-core | NWJC |
| v1.0 | v0.1.1 | 0.1.1-dictionary-full | NWJC |

The training algorithm is the same across all versions. See training for details.

"A Unit Only" Resources

These files contain only the SudachiDict A unit words (no re-training; B unit words, C unit words, and OOV (out-of-vocabulary) words are simply excluded from the original resources above; see the sketch after the table below).

v1.1 mc90 aunit is used for the Japanese models of the natural language processing library spaCy.

| Version | Vocab | Text | gensim | Magnitude |
| --- | --- | --- | --- | --- |
| v1.1 mc5 aunit | 322,094 (10.1%) | 1.1GB (tar.gz) | 0.4GB (tar.gz) | 0.5GB (.magnitude) |
| v1.1 mc15 aunit | 276,866 (19.1%) | 1.0GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
| v1.1 mc30 aunit | 242,658 (26.7%) | 0.8GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
| v1.1 mc90 aunit | 189,775 (39.5%) | 0.7GB (tar.gz) | 0.2GB (tar.gz) | 0.3GB (.magnitude) |
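For reference, here is a hedged sketch of how such a subset could be derived from the original vectors; treating "remains a single in-dictionary token under SplitMode.A" as the A-unit test is an approximation for illustration, not the official criterion, and the file paths are hypothetical.

from gensim.models import KeyedVectors
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()
mode_a = tokenizer.Tokenizer.SplitMode.A

vectors = KeyedVectors.load("./chive-1.1-mc90_gensim/chive-1.1-mc90.kv")

def is_a_unit(word):
    # approximate test: the word stays a single, in-dictionary token in SplitMode.A
    morphemes = tokenizer_obj.tokenize(word, mode_a)
    return len(morphemes) == 1 and not morphemes[0].is_oov()

a_unit_words = [w for w in vectors.index_to_key if is_a_unit(w)]
aunit = KeyedVectors(vector_size=vectors.vector_size)
aunit.add_vectors(a_unit_words, [vectors[w] for w in a_unit_words])
aunit.save("chive-1.1-mc90-aunit.kv")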

Continue Training chiVe

Although chiVe can be used as is, you can also continue training it on your own data to improve performance on your tasks.

A full model is required for further training. See the tutorial for details on how to use it; a minimal sketch follows the table below.

| Version | gensim (full) |
| --- | --- |
| v1.3 mc5 | 5.5GB (tar.gz) |
| v1.3 mc15 | 2.6GB (tar.gz) |
| v1.3 mc30 | 1.7GB (tar.gz) |
| v1.3 mc90 | 0.9GB (tar.gz) |
| v1.2 mc5 | 6.7GB (tar.gz) |
| v1.2 mc15 | 3.0GB (tar.gz) |
| v1.2 mc30 | 1.9GB (tar.gz) |
| v1.2 mc90 | 1.0GB (tar.gz) |
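A minimal sketch of continued training with gensim, assuming the full-model archive contains a loadable Word2Vec model (the file names below are hypothetical; the tutorial has the exact usage):

from gensim.models import Word2Vec

# load the full model (path and file name are hypothetical)
model = Word2Vec.load("./chive-1.2-mc90_gensim-full/chive-1.2-mc90.model")

# your own sentences, tokenized with Sudachi so they match chiVe's vocabulary
sentences = [["徳島", "の", "酢橘", "を", "買う"]]

model.build_vocab(sentences, update=True)  # extend the vocabulary with new words
model.train(sentences, total_examples=len(sentences), epochs=model.epochs)

vectors = model.wv  # the updated KeyedVectors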

Usage

We provide data in 3 formats: Text, gensim, and Magnitude.

Text

Data in plain text (original word2vec C format).

480443 300
の -0.08274004 -0.091033645 -0.08744463 -0.14393683 -0.053159036 ...
、 -0.014216528 -0.1027064 -0.07763326 -0.16008057 -0.16116066 ...
て -0.06049706 -0.15483096 0.052628547 -0.12448246 -0.14404581 ...
...
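The text format can also be loaded with gensim (the file name is hypothetical, after extracting the tar.gz):

from gensim.models import KeyedVectors

# plain-text word2vec C format: the first line is "<vocab size> <dimensions>"
vectors = KeyedVectors.load_word2vec_format("./chive-1.1-mc90.txt", binary=False)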

gensim

Data for the library gensim, in KeyedVectors format.

import gensim

vectors = gensim.models.KeyedVectors.load("./chive-1.1-mc90_gensim/chive-1.1-mc90.kv")

"すだち" in vectors # False, because in v1.1 all vocabs are normalized
"酢橘" in vectors # True

vectors["酢橘"]
# array([-5.68204783e-02, -1.26615226e-01,  3.53190415e-02, -3.67305875e-01, ...])

vectors.similarity("酢橘", "徳島")
# 0.3993048

vectors.most_similar("徳島", topn=5)
# [('愛媛', 0.8229734897613525),
# ('徳島県', 0.786933422088623),
# ('高知', 0.7795713543891907),
# ('岡山', 0.7623447179794312),
# ('徳島市', 0.7415297031402588)]

vectors.most_similar(positive=["阿波", "高知"], negative=["徳島"], topn=5)
# [('土佐', 0.620033860206604),
# ('阿波踊り', 0.5988592505455017),
# ('よさこい祭り', 0.5783430337905884),
# ('安芸', 0.564490556716919),
# ('高知県', 0.5591559410095215)]
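Because the vocabulary is normalized in v1.1 and later, a practical pattern is to normalize input with Sudachi before looking it up. A sketch (whether すだち resolves to the in-vocabulary 酢橘 depends on the installed SudachiDict version):

from sudachipy import dictionary, tokenizer
import gensim

tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
vectors = gensim.models.KeyedVectors.load("./chive-1.1-mc90_gensim/chive-1.1-mc90.kv")

def vector_for(surface):
    # map the raw surface form to Sudachi's normalized form, then look it up
    normalized = tokenizer_obj.tokenize(surface, mode)[0].normalized_form()
    return vectors[normalized] if normalized in vectors else None

vector_for("すだち")  # looked up via its normalized form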

Magnitude

Data converted for the library Magnitude, using the default parameters; that is, the files include advanced out-of-vocabulary key support using subword information, but no approximate nearest neighbours index (equivalent to their Medium).

from pymagnitude import Magnitude

vectors = Magnitude("chive1.1-mc90.magnitude")

"すだち" in vectors # False, because in v1.1 all vocabs are normalized
"酢橘" in vectors # True

vectors.query("すだち") # via Magnitude's OOV feature using subword information
# array([ 0.03974148,  0.11290773,  0.01493122, -0.05296252,  0.12616251, ...])

vectors.most_similar("すだち", topn=5)
# [('あだち', 0.5930323079944302),
# ('すだ椎', 0.5872662462335323),
# ('だち', 0.5797546444016177),
# ('ムクノキ', 0.46228053338159725),
# ('椨', 0.4482612387097178)]

vectors.similarity("酢橘", "徳島")
# 0.3993048

vectors.most_similar("徳島", topn=5)
# [('愛媛', 0.8229735),
# ('徳島県', 0.78693324),
# ('高知', 0.7795714),
# ('岡山', 0.7623447),
# ('徳島市', 0.7415296)]

vectors.closer_than("徳島", "徳島市")
# ['愛媛', '徳島県', '高知', '岡山']

vectors.most_similar(positive=["阿波", "高知"], negative=["徳島"], topn=5)
# [('土佐', 0.62003386),
# ('阿波踊り', 0.5988593),
# ('よさこい祭り', 0.578343),
# ('安芸', 0.56449056),
# ('高知県', 0.55915594)]

vectors.most_similar_cosmul(positive=["阿波", "高知"], negative=["徳島"], topn=5)
# [('土佐', 0.83830714),
# ('よさこい祭り', 0.82048166),
# ('阿波踊り', 0.8168015),
# ('安芸', 0.80880433),
# ('伊予', 0.80250806)]
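No official .magnitude files are provided for v1.3 (see the table above). If you need one, Magnitude's converter can build it from the text format with the same default parameters described above (file names are illustrative):

python -m pymagnitude.converter -i chive-1.3-mc90.txt -o chive-1.3-mc90.magnitude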

You can also download, remote load, or remote stream over HTTP.

from pymagnitude import Magnitude, MagnitudeUtils

# Download
vectors = Magnitude(MagnitudeUtils.download_model("chive-1.1-mc90-aunit", remote_path="https://sudachi.s3-ap-northeast-1.amazonaws.com/chive/"))
 # default download dir: `~/.magnitude/`
 # if the file has already been downloaded, it won't be downloaded again
 # you can change the download dir with the `download_dir` argument

# Remote Loading
vectors = Magnitude("https://sudachi.s3-ap-northeast-1.amazonaws.com/chive/chive-1.1-mc90-aunit.magnitude")

# Remote Streaming over HTTP
vectors = Magnitude("https://sudachi.s3-ap-northeast-1.amazonaws.com/chive/chive-1.1-mc90-aunit.magnitude", stream=True)
vectors.query("徳島") # Returns the vector quickly, even with no local file downloaded

Licence

v1.0, v1.1, v1.2

Copyright (c) 2020 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved.

"chiVe" v1.0, v1.1 and v1.2 are distributed by National Institute for Japanese Langauge and Linguistics and Works Applications Co.,Ltd. under Apache License, Version 2.0.

v1.3

Copyright (c) 2024 Works Applications Co., Ltd. All rights reserved.

"chiVe" v1.3 is distributed by Works Applications Co.,Ltd. under Apache License, Version 2.0.

Slack

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

Citing chiVe

We have published the following papers about chiVe.

When citing chiVe in papers, books, or services, please use the following BibTeX entries (in general, please cite the first paper, (Manabe+ 2019)):

@INPROCEEDINGS{manabe2019chive,
    author    = {真鍋陽俊, 岡照晃, 海川祥毅, 髙岡一馬, 内田佳孝, 浅原正幸},
    title     = {複数粒度の分割結果に基づく日本語単語分散表現},
    booktitle = "言語処理学会第25回年次大会(NLP2019)",
    year      = "2019",
    pages     = "NLP2019-P8-5",
    publisher = "言語処理学会",
}
@INPROCEEDINGS{kawamura2020chive,
    author    = {河村宗一郎, 久本空海, 真鍋陽俊, 高岡一馬, 内田佳孝, 岡照晃, 浅原正幸},
    title     = {chiVe 2.0: SudachiとNWJCを用いた実用的な日本語単語ベクトルの実現へ向けて},
    booktitle = "言語処理学会第26回年次大会(NLP2020)",
    year      = "2020",
    pages     = "NLP2020-P6-16",
    publisher = "言語処理学会",
}
@INPROCEEDINGS{hisamoto2020chive,
    author    = {久本空海, 山村崇, 勝田哲弘, 竹林佑斗, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸},
    title     = {chiVe: 製品利用可能な日本語単語ベクトル資源の実現へ向けて},
    booktitle = "第16回テキストアナリティクス・シンポジウム",
    year      = "2020",
    pages     = "IEICE-NLC2020-9",
    publisher = "電子情報通信学会",
}