CZ4045 Natural Language Processing Project 1 README
- Contributors:
Chen Hailin
Deng Yue
Liu Hualin
Shi Ziji
- Dependencies:
Python 3.6
BeautifulSoup 4
matplotlib
nltk
numpy
scipy
scikit-learn (imported as sklearn)
sklearn_crfsuite
- Third-party library commands (use pip3 install if the default pip belongs to Python 2.7):
BeautifulSoup 4: pip install beautifulsoup4
matplotlib: pip install matplotlib
nltk: pip install nltk
numpy: pip install numpy
scipy: pip install scipy
scikit-learn: pip install scikit-learn
sklearn_crfsuite: pip install sklearn_crfsuite
OR
pip install -U -r requirements.txt
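To confirm that the installation worked, a short check like the one below may help. It is only a sketch and not part of the project: the helper name missing_modules and the list REQUIRED_MODULES are made up for illustration, and the import names are assumed from the dependency list above (BeautifulSoup 4 imports as bs4, scikit-learn as sklearn).

```python
import importlib.util

# Import names assumed from the dependency list above.
# BeautifulSoup 4 is imported as bs4 and scikit-learn as sklearn.
REQUIRED_MODULES = [
    "bs4", "matplotlib", "nltk", "numpy", "scipy", "sklearn", "sklearn_crfsuite",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

An empty result means every listed module can be located. It does not check versions.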
- Dataset download links (please download the Data folder and put it under the project root):
Data folder: https://drive.google.com/open?id=1Na1gK7uqZkhbiwmi1DWThBBmUhrkzNwH
Dataset post link: https://drive.google.com/open?id=190DqYXS8wDPmAB2UM20vHSOKUiRN0fNW
Dataset answer link: https://drive.google.com/open?id=1CcssLW8sSC-KE_sAflbXk93d6ZbxpYGj
Annotated post link: https://drive.google.com/open?id=0B1rcXBqgX69sbGZpUTZobk5hcDQ
Annotated answer link: https://drive.google.com/open?id=0B1rcXBqgX69sbTB3SFVaVXItWFE
- Installation Guide
1. Install Python 3 and the third-party libraries according to the instructions above.
2. Run the following command to open the Python interpreter (use python3 if python defaults to Python 2.7):
python
Then, run the following commands to download the nltk resources:
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
Finally, press Ctrl+Z then Enter (Windows) or Ctrl+D (Linux/macOS) to exit.
3. Download the datasets from the links given above and put them into the Data/ folder.
4. Navigate to the SourceCode/ folder.
5. Run the following command to tokenize all sentences in the dataset:
python3 tokenizer.py
6. Run the following command and follow the program's instructions to run the stemmer and POS tagger:
python3 nltk_controller.py
7. Run the following command to compute the top 4 keywords across all question posts:
python3 application.py
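The nltk resource download in step 2 can also be run from a script instead of the interactive interpreter. The sketch below is an optional alternative, not part of the project: the function name download_resources and its download parameter are hypothetical, and it relies only on nltk.download(name, quiet=True), which returns True on success.

```python
# Resource names taken from step 2 of the installation guide.
NLTK_RESOURCES = ("stopwords", "averaged_perceptron_tagger")


def download_resources(names=NLTK_RESOURCES, download=None):
    """Download each named resource and return the names that failed.

    `download` is a hypothetical hook: any callable that takes a resource
    name and returns True on success. It defaults to nltk.download.
    """
    if download is None:
        import nltk  # imported lazily so the helper loads even without nltk

        def download(name):
            return nltk.download(name, quiet=True)

    return [name for name in names if not download(name)]


if __name__ == "__main__":
    failed = download_resources()
    print("Failed downloads:", ", ".join(failed) if failed else "none")
```

Running it once before step 5 has the same effect as typing the nltk.download calls by hand.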
- Explanations of data
all_posts_clean.txt: contains all question posts with tags removed
all_answers_clean.txt: contains all answer posts with tags removed
posts_training_clean.txt: contains training data from question posts with tags removed
answers_training_clean.txt: contains training data from answer posts with tags removed
posts_manual_tokenized.txt: contains all annotated training data from question posts
answers_manual_tokenized.txt: contains all annotated training data from answer posts
all_posts_top_4_keywords.txt: contains top 4 keywords of all question posts
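For readers unfamiliar with keyword extraction, the sketch below shows one simple way top-k keywords could be computed for a single post: count the tokens that are not stopwords and keep the k most frequent. This is an assumption for illustration only. This README does not describe the method application.py actually uses, and the function top_keywords, the regular expression, and the small STOPWORDS set are all made up here (the project itself downloads nltk's stopwords corpus).

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "to", "in", "of", "is", "how", "i", "do", "and", "my"}


def top_keywords(text, k=4):
    """Return the k most frequent non-stopword tokens in `text`."""
    # Lowercase, then keep identifier-like tokens (letters, digits, underscores).
    tokens = re.findall(r"[a-z_][a-z0-9_]*", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


if __name__ == "__main__":
    post = "How do I sort a list in Python? Sort the list of list items in python."
    print(top_keywords(post))
```

Plain frequency counting favours common words, so a real pipeline might weight terms differently. The sketch is only meant to make the idea of "top 4 keywords" concrete.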
- Explanations of source code
application.py: main application
nltk_controller.py: uses the nltk package for stemming, POS tagging and section 3.4
tokeniser.py: takes a "clean" version of the dataset and tokenises both code and text
utilities.py: utility functions shared among the scripts
evaluation.py: evaluation helper functions