reduce data size and lazy load data from github
hailiang-wang committed Aug 6, 2017
1 parent 3dbf15c commit 9b41a52
Showing 14 changed files with 432 additions and 4 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -136,7 +136,11 @@ for x in vocab_data:

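 ```vocab_data``` contains ```dict_word_to_id``` (word to id), ```dict_id_to_word``` (id to word), ```tf``` (term frequency counts) and ```total``` (total number of words). Out-of-vocabulary words are marked as ```UNKNOWN``` and have id 0.

-```train_data```, ```test_data``` and ```valid_data``` share the same data format. ```qid``` is the question id, ```question``` is the question, ```utterance``` is the reply, and ```label``` is ```[1,0]``` when the reply is a correct answer and ```[0,1]``` when it is not, so ```utterance``` contains both positive and negative examples. Each question has 200 negative examples and at least 1 positive example, usually about 1-5 positives.
+```train_data```, ```test_data``` and ```valid_data``` share the same data format. ```qid``` is the question id, ```question``` is the question, ```utterance``` is the reply, and ```label``` is ```[1,0]``` when the reply is a correct answer and ```[0,1]``` when it is not, so ```utterance``` contains both positive and negative examples. Each question has 10 negative examples and 1 positive example.
+
+```train_data``` contains 12,889 questions and ```141779``` data entries, positive:negative = 1:10
+```test_data``` contains 2,000 questions and ```22000``` data entries, positive:negative = 1:10
+```valid_data``` contains 2,000 questions and ```22000``` data entries, positive:negative = 1:10

 ## Statement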

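The data format described in the README hunk above can be illustrated with a small sketch. It assumes each record is a dict with the keys `qid`, `question`, `utterance` and `label`, and that `dict_word_to_id` is a plain word-to-id mapping. The helper names `encode` and `label_counts` and the toy records are invented for illustration and are not part of the package.

```python
UNKNOWN_ID = 0  # id of the UNKNOWN token, as stated in the README


def encode(tokens, dict_word_to_id):
    """Map tokens to ids, falling back to UNKNOWN_ID for out-of-vocabulary words."""
    return [dict_word_to_id.get(tok, UNKNOWN_ID) for tok in tokens]


def label_counts(records):
    """Count positive ([1,0]) and negative ([0,1]) records per qid."""
    counts = {}
    for rec in records:
        pos, neg = counts.get(rec["qid"], (0, 0))
        if rec["label"] == [1, 0]:
            pos += 1
        else:
            neg += 1
        counts[rec["qid"]] = (pos, neg)
    return counts


# Toy data, invented for illustration only (not taken from the corpus).
# Assumption: question and utterance are stored as lists of word ids.
toy_vocab = {"UNKNOWN": 0, "insurance": 1, "policy": 2}
toy_records = [
    {"qid": "q1", "question": [1, 2], "utterance": [2], "label": [1, 0]}
] + [
    {"qid": "q1", "question": [1, 2], "utterance": [1], "label": [0, 1]}
    for _ in range(10)
]

print(encode(["insurance", "premium"], toy_vocab))  # "premium" is out of vocabulary
print(label_counts(toy_records))

# The README totals are consistent with 11 records (1 positive + 10 negative) per question.
assert 12889 * 11 == 141779
assert 2000 * 11 == 22000
```

The final assertions only check the arithmetic in the README: 12,889 questions with 11 records each give 141,779 entries, and 2,000 questions give 22,000.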
24 changes: 23 additions & 1 deletion pypi/insuranceqa_data/__init__.py
@@ -21,10 +21,32 @@
 import gzip
 import json
 curdir = os.path.dirname(os.path.abspath(__file__))
-sys.path.append(curdir)
+sys.path.insert(0, curdir)

+import wget

 def load(data_path):
+    if not os.path.exists(data_path):
+        # download all pair data
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.test.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.test.json.gz", out = os.path.join(curdir, 'pairs'))
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.train.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.train.json.gz", out = os.path.join(curdir, 'pairs'))
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.valid.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.valid.json.gz", out = os.path.join(curdir, 'pairs'))
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.vocab.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.vocab.json.gz", out = os.path.join(curdir, 'pairs'))
+
+        # download all pool data
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/answers.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/answers.json.gz", out = os.path.join(curdir, 'pool'))
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/test.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/test.json.gz", out = os.path.join(curdir, 'pairs'))
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/train.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/train.json.gz", out = os.path.join(curdir, 'pairs'))
+        print("\n [insuranceqa_data] downloading data %s ... \n" % "https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/valid.json.gz")
+        wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/blob/release/corpus/pool/valid.json.gz", out = os.path.join(curdir, 'pairs'))
+
     with gzip.open(data_path, 'rb') as f:
         data = json.loads(f.read())
     return data
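Two details of this hunk stand out. The pool files are requested through `/blob/` URLs, which on GitHub normally return the HTML file viewer rather than the file contents. Three of the pool files are also written into the `pairs` directory, although the binaries removed by this commit lived under `pool/`. Neither looks intended. The following is a minimal sketch of the same lazy-download idea, not the author's implementation: it swaps `wget` for the standard library's `urlretrieve`, uses `/raw/` for every file, and derives each target directory from a table. `BASE_URL`, `DATA_FILES`, `download_plan`, the `dest_root` parameter and the injectable `fetch` argument are assumptions introduced here, and whether the URLs still resolve has not been verified.

```python
import gzip
import json
import os
from urllib.request import urlretrieve

# Assumed base URL: the repository path from the diff, using /raw/ throughout.
BASE_URL = "https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus"

# (subdirectory, file name) pairs, taken from the binary files removed in this commit.
DATA_FILES = [
    ("pairs", "iqa.test.json.gz"),
    ("pairs", "iqa.train.json.gz"),
    ("pairs", "iqa.valid.json.gz"),
    ("pairs", "iqa.vocab.json.gz"),
    ("pool", "answers.json.gz"),
    ("pool", "test.json.gz"),
    ("pool", "train.json.gz"),
    ("pool", "valid.json.gz"),
]


def download_plan(dest_root, base_url=BASE_URL):
    """Return a (url, local_path) pair for every data file (hypothetical helper)."""
    return [
        ("%s/%s/%s" % (base_url, subdir, name), os.path.join(dest_root, subdir, name))
        for subdir, name in DATA_FILES
    ]


def load(data_path, dest_root, fetch=urlretrieve):
    """Download missing files on first use, then read the gzipped JSON at data_path."""
    if not os.path.exists(data_path):
        for url, local_path in download_plan(dest_root):
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            if not os.path.exists(local_path):
                print("[insuranceqa_data] downloading %s" % url)
                fetch(url, local_path)
    with gzip.open(data_path, "rb") as f:
        return json.loads(f.read().decode("utf-8"))
```

Passing `fetch` as a parameter keeps the function testable without network access; in the package itself `dest_root` would presumably be `curdir`.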
Empty file.
Binary file removed pypi/insuranceqa_data/pairs/iqa.test.json.gz
Binary file not shown.
Binary file removed pypi/insuranceqa_data/pairs/iqa.train.json.gz
Binary file not shown.
Binary file removed pypi/insuranceqa_data/pairs/iqa.valid.json.gz
Binary file not shown.
Binary file removed pypi/insuranceqa_data/pairs/iqa.vocab.json.gz
Binary file not shown.
Empty file.
Binary file removed pypi/insuranceqa_data/pool/answers.json.gz
Binary file not shown.
Binary file removed pypi/insuranceqa_data/pool/test.json.gz
Binary file not shown.
Binary file removed pypi/insuranceqa_data/pool/train.json.gz
Binary file not shown.
Binary file removed pypi/insuranceqa_data/pool/valid.json.gz
Binary file not shown.