chiVe: Japanese Word Embedding with Sudachi

日本語 README

Abstract

"chiVe" (Sudachi Vector) is a Japanese pre-trained word embedding resource using large-scale corpus and multi-granular tokenization.

The vectors were trained with word2vec (gensim), using the skip-gram algorithm.

For v1.0-v1.2, we used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages, as the training corpus. For v1.3, we used texts taken from CommonCrawl.

We used Sudachi by Works Applications for tokenization, and trained the word vectors on Sudachi's multi-granular tokenization results of the corpus.
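To illustrate, here is a minimal sketch of this kind of pipeline with SudachiPy and gensim; the actual chiVe training setup is described under training and may differ, and corpus.txt is a hypothetical file name. Sudachi's three split modes (A, B, C) provide the multiple granularities; the sketch trains on a single mode for brevity, whereas chiVe trains on the multi-granular results.

from gensim.models import Word2Vec
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()

# Sudachi's three granularities for the same string
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print([m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)])
# ['国家', '公務', '員'] / ['国家', '公務員'] / ['国家公務員']

# Skip-gram training on tokenized, normalized sentences (one mode here for brevity)
mode = tokenizer.Tokenizer.SplitMode.C
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [[m.normalized_form() for m in tokenizer_obj.tokenize(line.strip(), mode)]
                 for line in f]
model = Word2Vec(sentences, vector_size=300, sg=1, min_count=5)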

Data

Data are generously hosted by AWS through their Open Data Sponsorship Program.

| Version | Min Count | Normalized | Vocab | Text | gensim | Magnitude |
| --- | --- | --- | --- | --- | --- | --- |
| v1.3 mc5 | 5 | o | 2,530,791 | 3.6GB (tar.gz) | 2.9GB (tar.gz) | - |
| v1.3 mc15 | 15 | o | 1,186,019 | 1.7GB (tar.gz) | 1.3GB (tar.gz) | - |
| v1.3 mc30 | 30 | o | 759,011 | 1.1GB (tar.gz) | 0.8GB (tar.gz) | - |
| v1.3 mc90 | 90 | o | 410,533 | 0.6GB (tar.gz) | 0.5GB (tar.gz) | - |
| v1.2 mc5 | 5 | o | 3,197,456 | 9.2GB (tar.gz) | 3.8GB (tar.gz) | 5.5GB (.magnitude) |
| v1.2 mc15 | 15 | o | 1,454,280 | 5.0GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
| v1.2 mc30 | 30 | o | 912,550 | 3.1GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
| v1.2 mc90 | 90 | o | 482,223 | 1.7GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
| v1.1 mc5 | 5 | o | 3,196,481 | 11GB (tar.gz) | 3.6GB (tar.gz) | 5.5GB (.magnitude) |
| v1.1 mc15 | 15 | o | 1,452,205 | 4.7GB (tar.gz) | 1.7GB (tar.gz) | 2.4GB (.magnitude) |
| v1.1 mc30 | 30 | o | 910,424 | 3.0GB (tar.gz) | 1.1GB (tar.gz) | 1.5GB (.magnitude) |
| v1.1 mc90 | 90 | o | 480,443 | 1.6GB (tar.gz) | 0.6GB (tar.gz) | 0.8GB (.magnitude) |
| v1.0 mc5 | 5 | x | 3,644,628 | 12GB (tar.gz) | 4.1GB (tar.gz) | 6.3GB (.magnitude) |

All vectors have 300 dimensions.

"Min Count" indicates the number of minimum appearance count in the training corpus (min_count in gensim).

"Normalized" indicates if the text is normalized using the tokenizer Sudachi. For example, words 空き缶, 空缶, 空き罐, 空罐, 空きカン, 空きかん will all be normalized to 空き缶.

| Version | Sudachi | SudachiDict | Training Corpus |
| --- | --- | --- | --- |
| v1.3 | v0.6.8 | 20240109-core | CommonCrawl (CC-MAIN-2022-40, warc, first 20k files) |
| v1.2 | v0.4.3 | 20200722-core | NWJC |
| v1.1 | v0.3.0 | 20191030-core | NWJC |
| v1.0 | v0.1.1 | 0.1.1-dictionary-full | NWJC |

The training algorithm is the same across all versions. See training for details.

"A Unit Only" Resources

These files contain only the SudachiDict A unit words (no re-training; B unit words, C unit words, and OOV (out-of-vocabulary) words are simply excluded from the original resources above; see the sketch after the table below).

v1.1 mc90 aunit is used for the Japanese models of the natural language processing library spaCy.

| Version | Vocab | Text | gensim | Magnitude |
| --- | --- | --- | --- | --- |
| v1.1 mc5 aunit | 322,094 (10.1%) | 1.1GB (tar.gz) | 0.4GB (tar.gz) | 0.5GB (.magnitude) |
| v1.1 mc15 aunit | 276,866 (19.1%) | 1.0GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
| v1.1 mc30 aunit | 242,658 (26.7%) | 0.8GB (tar.gz) | 0.3GB (tar.gz) | 0.4GB (.magnitude) |
| v1.1 mc90 aunit | 189,775 (39.5%) | 0.7GB (tar.gz) | 0.2GB (tar.gz) | 0.3GB (.magnitude) |
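For reference, here is a hedged sketch of how such a subset could be derived from the original vectors; treating "remains a single in-dictionary token under SplitMode.A" as the A-unit test is an approximation for illustration, not the official criterion, and the file paths are hypothetical.

from gensim.models import KeyedVectors
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()
mode_a = tokenizer.Tokenizer.SplitMode.A

vectors = KeyedVectors.load("./chive-1.1-mc90_gensim/chive-1.1-mc90.kv")

def is_a_unit(word):
    # approximate test: the word stays a single, in-dictionary token in SplitMode.A
    morphemes = tokenizer_obj.tokenize(word, mode_a)
    return len(morphemes) == 1 and not morphemes[0].is_oov()

a_unit_words = [w for w in vectors.index_to_key if is_a_unit(w)]
aunit = KeyedVectors(vector_size=vectors.vector_size)
aunit.add_vectors(a_unit_words, [vectors[w] for w in a_unit_words])
aunit.save("chive-1.1-mc90-aunit.kv")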

Continue Training chiVe

Although chiVe can be used as is, you can also continue training it on your own data to improve performance on your tasks.

A full model is required for further training. See the tutorial for details on how to use it; a minimal sketch follows the table below.

| Version | gensim (full) |
| --- | --- |
| v1.3 mc5 | 5.5GB (tar.gz) |
| v1.3 mc15 | 2.6GB (tar.gz) |
| v1.3 mc30 | 1.7GB (tar.gz) |
| v1.3 mc90 | 0.9GB (tar.gz) |
| v1.2 mc5 | 6.7GB (tar.gz) |
| v1.2 mc15 | 3.0GB (tar.gz) |
| v1.2 mc30 | 1.9GB (tar.gz) |
| v1.2 mc90 | 1.0GB (tar.gz) |
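A minimal sketch of continued training with gensim, assuming the full-model archive contains a loadable Word2Vec model (the file names below are hypothetical; the tutorial has the exact usage):

from gensim.models import Word2Vec

# load the full model (path and file name are hypothetical)
model = Word2Vec.load("./chive-1.2-mc90_gensim-full/chive-1.2-mc90.model")

# your own sentences, tokenized with Sudachi so they match chiVe's vocabulary
sentences = [["徳島", "の", "酢橘", "を", "買う"]]

model.build_vocab(sentences, update=True)  # extend the vocabulary with new words
model.train(sentences, total_examples=len(sentences), epochs=model.epochs)

vectors = model.wv  # the updated KeyedVectors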

Usage

We provide data in 3 formats: Text, gensim, and Magnitude.

Text

Data in plain text (original word2vec C format).

480443 300
の -0.08274004 -0.091033645 -0.08744463 -0.14393683 -0.053159036 ...
、 -0.014216528 -0.1027064 -0.07763326 -0.16008057 -0.16116066 ...
て -0.06049706 -0.15483096 0.052628547 -0.12448246 -0.14404581 ...
...
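The text format can also be loaded with gensim (the file name is hypothetical, after extracting the tar.gz):

from gensim.models import KeyedVectors

# plain-text word2vec C format: the first line is "<vocab size> <dimensions>"
vectors = KeyedVectors.load_word2vec_format("./chive-1.1-mc90.txt", binary=False)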

gensim

Data for the library gensim, in KeyedVectors format.

import gensim

vectors = gensim.models.KeyedVectors.load("./chive-1.1-mc90_gensim/chive-1.1-mc90.kv")

"すだち" in vectors # False, because in v1.1 all vocabs are normalized
"酢橘" in vectors # True

vectors["酢橘"]
# array([-5.68204783e-02, -1.26615226e-01,  3.53190415e-02, -3.67305875e-01, ...])

vectors.similarity("酢橘", "徳島")
# 0.3993048

vectors.most_similar("徳島", topn=5)
# [('愛媛', 0.8229734897613525),
# ('徳島県', 0.786933422088623),
# ('高知', 0.7795713543891907),
# ('岡山', 0.7623447179794312),
# ('徳島市', 0.7415297031402588)]

vectors.most_similar(positive=["阿波", "高知"], negative=["徳島"], topn=5)
# [('土佐', 0.620033860206604),
# ('阿波踊り', 0.5988592505455017),
# ('よさこい祭り', 0.5783430337905884),
# ('安芸', 0.564490556716919),
# ('高知県', 0.5591559410095215)]
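Because the vocabulary is normalized in v1.1 and later, a practical pattern is to normalize input with Sudachi before looking it up. A sketch (whether すだち resolves to the in-vocabulary 酢橘 depends on the installed SudachiDict version):

from sudachipy import dictionary, tokenizer
import gensim

tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
vectors = gensim.models.KeyedVectors.load("./chive-1.1-mc90_gensim/chive-1.1-mc90.kv")

def vector_for(surface):
    # map the raw surface form to Sudachi's normalized form, then look it up
    normalized = tokenizer_obj.tokenize(surface, mode)[0].normalized_form()
    return vectors[normalized] if normalized in vectors else None

vector_for("すだち")  # looked up via its normalized form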

Magnitude

Data converted for the library Magnitude, using the default parameters; that is, the files include advanced out-of-vocabulary key support using subword information, but no approximate nearest neighbours index (equivalent to their Medium).

from pymagnitude import Magnitude

vectors = Magnitude("chive1.1-mc90.magnitude")

"すだち" in vectors # False, because in v1.1 all vocabs are normalized
"酢橘" in vectors # True

vectors.query("すだち") # via Magnitude's OOV feature using subword information
# array([ 0.03974148,  0.11290773,  0.01493122, -0.05296252,  0.12616251, ...])

vectors.most_similar("すだち", topn=5)
# [('あだち', 0.5930323079944302),
# ('すだ椎', 0.5872662462335323),
# ('だち', 0.5797546444016177),
# ('ムクノキ', 0.46228053338159725),
# ('椨', 0.4482612387097178)]

vectors.similarity("酢橘", "徳島")
# 0.3993048

vectors.most_similar("徳島", topn=5)
# [('愛媛', 0.8229735),
# ('徳島県', 0.78693324),
# ('高知', 0.7795714),
# ('岡山', 0.7623447),
# ('徳島市', 0.7415296)]

vectors.closer_than("徳島", "徳島市")
# ['愛媛', '徳島県', '高知', '岡山']

vectors.most_similar(positive=["阿波", "高知"], negative=["徳島"], topn=5)
# [('土佐', 0.62003386),
# ('阿波踊り', 0.5988593),
# ('よさこい祭り', 0.578343),
# ('安芸', 0.56449056),
# ('高知県', 0.55915594)]

vectors.most_similar_cosmul(positive=["阿波", "高知"], negative=["徳島"], topn=5)
# [('土佐', 0.83830714),
# ('よさこい祭り', 0.82048166),
# ('阿波踊り', 0.8168015),
# ('安芸', 0.80880433),
# ('伊予', 0.80250806)]
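No official .magnitude files are provided for v1.3 (see the table above). If you need one, Magnitude's converter can build it from the text format with the same default parameters described above (file names are illustrative):

python -m pymagnitude.converter -i chive-1.3-mc90.txt -o chive-1.3-mc90.magnitude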

You can also download, remote load, or remote stream over HTTP.

from pymagnitude import Magnitude, MagnitudeUtils

# Download
vectors = Magnitude(MagnitudeUtils.download_model("chive-1.1-mc90-aunit", remote_path="https://sudachi.s3-ap-northeast-1.amazonaws.com/chive/"))
 # default download dir: `~/.magnitude/`
 # if the file has already been downloaded, it won't be downloaded again
 # you can change the download dir with the `download_dir` argument

# Remote Loading
vectors = Magnitude("https://sudachi.s3-ap-northeast-1.amazonaws.com/chive/chive-1.1-mc90-aunit.magnitude")

# Remote Streaming over HTTP
vectors = Magnitude("https://sudachi.s3-ap-northeast-1.amazonaws.com/chive/chive-1.1-mc90-aunit.magnitude", stream=True)
vectors.query("徳島") # Returns the vector quickly, even with no local file downloaded

Licence

v1.0, v1.1, v1.2

Copyright (c) 2020 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved.

"chiVe" v1.0, v1.1 and v1.2 are distributed by National Institute for Japanese Langauge and Linguistics and Works Applications Co.,Ltd. under Apache License, Version 2.0.

v1.3

Copyright (c) 2024 Works Applications Co., Ltd. All rights reserved.

"chiVe" v1.3 is distributed by Works Applications Co.,Ltd. under Apache License, Version 2.0.

Slack

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

Citing chiVe

We have published the following papers about chiVe.

When citing chiVe in papers, books, or services, please use the following BibTeX entries (in general, please cite the first paper, (Manabe+ 2019)):

@INPROCEEDINGS{manabe2019chive,
    author    = {真鍋陽俊, 岡照晃, 海川祥毅, 髙岡一馬, 内田佳孝, 浅原正幸},
    title     = {複数粒度の分割結果に基づく日本語単語分散表現},
    booktitle = "言語処理学会第25回年次大会(NLP2019)",
    year      = "2019",
    pages     = "NLP2019-P8-5",
    publisher = "言語処理学会",
}
@INPROCEEDINGS{kawamura2020chive,
    author    = {河村宗一郎, 久本空海, 真鍋陽俊, 高岡一馬, 内田佳孝, 岡照晃, 浅原正幸},
    title     = {chiVe 2.0: SudachiとNWJCを用いた実用的な日本語単語ベクトルの実現へ向けて},
    booktitle = "言語処理学会第26回年次大会(NLP2020)",
    year      = "2020",
    pages     = "NLP2020-P6-16",
    publisher = "言語処理学会",
}
@INPROCEEDINGS{hisamoto2020chive,
    author    = {久本空海, 山村崇, 勝田哲弘, 竹林佑斗, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸},
    title     = {chiVe: 製品利用可能な日本語単語ベクトル資源の実現へ向けて},
    booktitle = "第16回テキストアナリティクス・シンポジウム",
    year      = "2020",
    pages     = "IEICE-NLC2020-9",
    publisher = "電子情報通信学会",
}