Update README.md
hailiang-wang committed Aug 6, 2017
1 parent 9b41a52 commit 7bd35ad
Showing 2 changed files with 8 additions and 101 deletions.
15 changes: 6 additions & 9 deletions README.md
@@ -19,13 +19,8 @@

## Installation

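Because the data package is currently larger than the maximum size pypi.python.org supports, it cannot be distributed through pypi.python.org.
Download link: [Baidu Netdisk](https://pan.baidu.com/s/1i5MM6nb), password: 8u98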

```
tar xzf insuranceqa_data-xxx.tar.gz # xxx is the version
cd insuranceqa_data-xxx
python setup.py install
pip install --upgrade insuranceqa_data
```

## Q&A Corpus
@@ -130,11 +125,13 @@ for x in test_data:
(x['qid'], x['question'], x['utterance'], x['label']))

vocab_data = insuranceqa.load_pairs_vocab()
for x in vocab_data:
print('index %s: %s ++$++ %s' % (x, d[x]['zh'], d[x]['en']))
vocab_data['dict_word_to_id']['UNKNOWN']
vocab_data['dict_id_to_word'][0]
vocab_data['tf']
vocab_data['total']
```

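```vocab_data``` contains ```dict_word_to_id``` (word to id), ```dict_id_to_word``` (id to word), ```tf``` (term-frequency statistics) and ```total``` (total number of words). The out-of-vocabulary token is ```UNKNOWN```, and its id is 0.
```vocab_data``` contains ```dict_word_to_id``` (dict, word to id), ```dict_id_to_word``` (dict, id to word), ```tf``` (dict, term-frequency statistics) and ```total``` (total number of words). The out-of-vocabulary token is ```UNKNOWN```, and its id is 0.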
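
As a minimal sketch of how these fields might be used, the snippet below maps a tokenized sentence to ids and falls back to the ```UNKNOWN``` id for out-of-vocabulary words. The small ```vocab_data``` dict is a hand-made stand-in with the keys described above (not real corpus data), and ```encode``` is a hypothetical helper, not part of ```insuranceqa_data```.

```python
# Hand-made stand-in with the keys described above (assumption: illustrative
# values only). Real data would come from insuranceqa.load_pairs_vocab().
vocab_data = {
    "dict_word_to_id": {"UNKNOWN": 0, "保险": 1, "理赔": 2},
    "dict_id_to_word": {0: "UNKNOWN", 1: "保险", 2: "理赔"},
    "tf": {"保险": 5, "理赔": 3},
    "total": 3,
}


def encode(tokens, vocab):
    """Map tokens to ids; out-of-vocabulary words get the UNKNOWN id."""
    word_to_id = vocab["dict_word_to_id"]
    unknown_id = word_to_id["UNKNOWN"]
    return [word_to_id.get(token, unknown_id) for token in tokens]


# The middle token is not in the vocabulary, so it maps to the UNKNOWN id.
ids = encode(["保险", "未知词", "理赔"], vocab_data)
print(ids)
```

Looking up the ```UNKNOWN``` id from the dict, rather than hard-coding 0, keeps the helper correct even if the vocabulary is rebuilt with a different layout.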

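```train_data```, ```test_data``` and ```valid_data``` share the same data format. ```qid``` is the question id, ```question``` is the question, and ```utterance``` is the reply. ```label``` is ```[1,0]``` when the reply is a correct answer and ```[0,1]``` when it is not, so ```utterance``` covers both positive and negative examples. Each question has 10 negative examples and 1 positive example.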
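
The label convention above can be sketched as follows: group records by ```qid``` and split replies into positives (```[1,0]```) and negatives (```[0,1]```). The records are hand-made placeholders in the described format, and ```is_positive``` and ```group_by_question``` are hypothetical helpers, not functions of ```insuranceqa_data```.

```python
from collections import defaultdict

# Hand-made placeholder records in the format described above (assumption:
# the string values are illustrative and not taken from the corpus).
pairs = [
    {"qid": "q1", "question": "question 1", "utterance": "reply a", "label": [1, 0]},
    {"qid": "q1", "question": "question 1", "utterance": "reply b", "label": [0, 1]},
    {"qid": "q2", "question": "question 2", "utterance": "reply c", "label": [0, 1]},
    {"qid": "q2", "question": "question 2", "utterance": "reply d", "label": [1, 0]},
]


def is_positive(pair):
    """A label of [1, 0] marks the reply as a correct answer."""
    return list(pair["label"]) == [1, 0]


def group_by_question(records):
    """Collect positive and negative replies per question id."""
    grouped = defaultdict(lambda: {"positive": [], "negative": []})
    for record in records:
        key = "positive" if is_positive(record) else "negative"
        grouped[record["qid"]][key].append(record["utterance"])
    return dict(grouped)


grouped = group_by_question(pairs)
for qid, replies in grouped.items():
    print(qid, len(replies["positive"]), len(replies["negative"]))
```

On the real data, each question would be expected to show 1 positive and 10 negative replies, which makes this grouping a quick sanity check after loading.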

94 changes: 2 additions & 92 deletions pypi/setup.py
@@ -23,97 +23,7 @@
Any ideas for further extending this dataset are welcome.

Corpus data
-----------
+------------+-----------+---------+----------------------+
| -          | Questions | Answers | Vocabulary (English) |
+============+===========+=========+======================+
| Train      | 12,889    | 21,325  | 107,889              |
+------------+-----------+---------+----------------------+
| Validation | 2,000     | 3,354   | 16,931               |
+------------+-----------+---------+----------------------+
| Test       | 2,000     | 3,308   | 16,815               |
+------------+-----------+---------+----------------------+
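Each record contains the question in Chinese and in English, its positive answers, and its negative answers. Every question has at least one positive answer, typically *1-5*, all of which are correct. Each question also has *200* negative answers, which are retrieved using the question as the query, so they are related to the question but are not correct answers.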
::

    {
        "INDEX": {
            "zh": "Chinese text",
            "en": "English text",
            "domain": "insurance category",
            "answers": [""],   # list of positive answers
            "negatives": [""]  # list of negative answers
        },
        more ...
    }
- Train: ``corpus/train.json``
- Validation: ``corpus/valid.json``
- Test: ``corpus/test.json``
- Answers: ``corpus/answers.json``, 27,413 answers in total, stored as
  ``json``:

::

    {
        "INDEX": {
            "zh": "Chinese text",
            "en": "English text"
        },
        more ...
    }
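Chinese-English parallel files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Question-answer pairs
^^^^^^^^^^^^^^^^^^^^^

::

    Format: INDEX ++$++ insurance category ++$++ Chinese ++$++ English

``corpus/train.txt``, ``corpus/valid.txt``, ``corpus/test.txt``.

Answers
^^^^^^^

::

    Format: INDEX ++$++ Chinese ++$++ English

``corpus/answers.txt``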
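
The ``++$++``-separated line formats above could be parsed along these lines. This is a sketch under two assumptions: the separator is surrounded by single spaces, and only the last field may itself contain the separator. The sample lines are made up, and ``parse_pair_line`` and ``parse_answer_line`` are hypothetical helpers rather than part of the package.

```python
# Assumption: fields are separated by "++$++" with one space on each side.
SEPARATOR = " ++$++ "


def parse_pair_line(line):
    """Parse 'INDEX ++$++ domain ++$++ zh ++$++ en' into a dict."""
    index, domain, zh, en = line.rstrip("\n").split(SEPARATOR, 3)
    return {"index": index, "domain": domain, "zh": zh, "en": en}


def parse_answer_line(line):
    """Parse 'INDEX ++$++ zh ++$++ en' into a dict."""
    index, zh, en = line.rstrip("\n").split(SEPARATOR, 2)
    return {"index": index, "zh": zh, "en": en}


# Made-up sample lines, not taken from the corpus files.
pair = parse_pair_line("0 ++$++ life-insurance ++$++ 中文问题 ++$++ English question\n")
answer = parse_answer_line("7 ++$++ 中文回答 ++$++ English answer\n")
print(pair["domain"], answer["en"])
```

Passing ``maxsplit`` keeps a stray separator inside the final field from raising an unpacking error.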
Quick start
-----------

Install with pip in a Python environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: python

    # Shell command, run outside the Python interpreter:
    #   pip install --upgrade insuranceqa_data
    import insuranceqa_data as insuranceqa

    train_data = insuranceqa.load_train()
    # The original snippet calls load_train() for all three splits; the
    # test and validation loaders were presumably intended here.
    test_data = insuranceqa.load_train()
    valid_data = insuranceqa.load_train()

    # valid_data, test_data and train_data share the same properties
    for x in train_data:
        print('index %s value: %s ++$++ %s ++$++ %s ++$++ %s' %
              (x, train_data[x]['zh'], train_data[x]['en'],
               train_data[x]['answers'], train_data[x]['negatives']))

    answers_data = insuranceqa.load_answers()
    for x in answers_data:
        print('index %s: %s ++$++ %s' %
              (x, answers_data[x]['zh'], answers_data[x]['en']))
Read the `detailed documentation <https://github.com/Samurais/insuranceqa-corpus-zh>`__

Notice
------
@@ -141,7 +51,7 @@
"""

setup(name='insuranceqa_data',
version='2.0',
version='2.1',
description='Insuranceqa Corpus in Chinese for Machine Learning',
long_description=LONGDOC,
author='Hai Liang Wang',