This serves as partial fulfillment of the course CZ4045: Natural Language Processing at Nanyang Technological University, Singapore.
We developed two tokenizers for Stack Overflow posts: one based on regular expressions, the other on Conditional Random Fields (CRF).
The difficulty in tokenizing Stack Overflow data is that its content is highly unstructured, comprising both English text and code snippets. The tokenizer designed and developed by the team can
1. tokenize code sections into smaller meaningful units
2. identify irregular named entities such as "Albert Einstein"
3. identify file paths like "src/main/resources"
which greatly improves the accuracy of tokenization and thus enhances the performance of further analysis.
In the end, our CRF-based tokenizer achieved an F1 score of 0.9483 under 5-fold cross-validation, and our regex-based tokenizer achieved an F1 score of 0.9653.
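To illustrate the regex approach, the sketch below shows how ordered alternatives (most specific first) can keep file paths and multi-word names intact while still splitting punctuation. The rule set is a hypothetical simplification, not the project's actual pattern:

```python
import re

# Hypothetical rule set: try specific patterns before generic ones.
TOKEN_PATTERN = re.compile(r"""
    [A-Za-z_]\w*(?:/[\w.]+)+         # file paths like src/main/resources
  | [A-Z][a-z]+(?:\s[A-Z][a-z]+)+   # capitalized names like Albert Einstein
  | \w+                             # plain words and identifiers
  | [^\w\s]                         # single punctuation marks
""", re.VERBOSE)

def tokenize(text):
    # findall returns each non-overlapping match as one token
    return TOKEN_PATTERN.findall(text)

print(tokenize('Albert Einstein edited src/main/resources today.'))
# → ['Albert Einstein', 'edited', 'src/main/resources', 'today', '.']
```

Because alternation in a regex is tried left to right, placing the path and name patterns before the generic word pattern is what prevents "src/main/resources" from being split at the slashes.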
Chen Hailin @Chen-Hailin , Deng Yue @spenceryue97 , Liu Hualin @liuhualin333 , Shi Ziji @stevenshi-23
We have tested our program on Python 3.
BeautifulSoup 4: pip install bs4
matplotlib: pip install matplotlib
nltk: pip install nltk
numpy: pip install numpy
scipy: pip install scipy
scikit-learn (imported as sklearn): pip install scikit-learn
sklearn_crfsuite: pip install sklearn_crfsuite
OR
pip install -U -r requirements.txt
- Download the data folder and put it under Root.
- Install Python 3 and the third-party libraries according to the previous instructions.
- Run the following command to open the Python interpreter:
python
Then, run the following commands to download nltk resources:
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
Last, press Ctrl + Z (Windows) or Ctrl + D (Linux/macOS) to exit the interpreter.
- Download the datasets and put them into the Data/ folder according to the link given.
- Navigate to SourceCode/ folder:
- Run the following command to tokenize all sentences in dataset:
python3 tokenizer.py
- Run the following command and follow the program instructions to run the stemmer and POS tagging:
python3 nltk_controller.py
- Run the following command to compute the top 4 keywords across all question posts:
python3 application.py
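Conceptually, a top-keyword computation of this kind reduces to a frequency count over tokens with stop words filtered out. The helper below is a simplified sketch of that idea, not application.py's actual logic:

```python
from collections import Counter

def top_keywords(tokens, stopwords, k=4):
    """Return the k most frequent tokens, ignoring stop words.

    Hypothetical helper: counts are case-insensitive, and ties
    fall back to first-seen order (a property of Counter).
    """
    counts = Counter(t.lower() for t in tokens if t.lower() not in stopwords)
    return [word for word, _ in counts.most_common(k)]

tokens = 'the cat sat on the mat the cat ran'.split()
print(top_keywords(tokens, {'the', 'on'}, k=2))
# → ['cat', 'sat']
```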
all_posts_clean.txt: contains all question posts with tags removed
all_answers_clean.txt: contains all answer posts with tags removed
posts_training_clean.txt: contains training data from question posts with tags removed
answers_training_clean.txt: contains training data from answer posts with tags removed
posts_manual_tokenized.txt: contains all annotated training data from question posts
answers_manual_tokenized.txt: contains all annotated training data from answer posts
all_posts_top_4_keywords.txt: contains the top 4 keywords of all question posts
application.py: main application
nltk_controller.py: uses the nltk package for stemming, POS tagging and section 3.4
tokeniser.py: takes a "clean" version of the dataset and tokenises both code and text
utilities.py: utility functions shared among scripts
evaluation.py: evaluation helper functions
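The CRF tokenizer treats tokenization as sequence labeling: each character is tagged (e.g. as beginning or continuing a token) from features of its local context. The feature extractor below is a hypothetical illustration of that setup, not the project's actual feature set:

```python
def char_features(text, i):
    """Hypothetical per-character features for a CRF tokenizer.

    A CRF would predict a begin/inside-style label for each character
    from features like these, plus the labels of neighboring characters.
    """
    ch = text[i]
    return {
        'char': ch,
        'is_alnum': ch.isalnum(),
        'is_space': ch.isspace(),
        'is_punct': not ch.isalnum() and not ch.isspace(),
        # Sentinel markers for the sequence boundaries
        'prev': text[i - 1] if i > 0 else '<BOS>',
        'next': text[i + 1] if i < len(text) - 1 else '<EOS>',
    }

print(char_features('a/b', 1))
```

Feature dictionaries in this shape are what libraries like sklearn_crfsuite consume, one list of dictionaries per sequence.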
Results on our annotated corpus:
| | precision | recall | f1-score |
|--------------------|-----------|----------|----------|
| Regex tokenizer | 0.9578 | 0.9729 | 0.9653 |
| CRF tokenizer      | 0.9478    | 0.9490   | 0.9483   |
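For reference, token-level scores of this kind can be computed by comparing predicted token spans against gold-standard spans, counting a prediction as correct only on an exact match. This is a hypothetical helper sketching the metric, not the project's evaluation.py:

```python
def span_f1(gold_spans, pred_spans):
    """Precision, recall and F1 over exact-match token spans.

    Each argument is a list of (start, end) character offsets;
    a predicted token is a true positive only if its span exactly
    matches a gold span.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Example: gold keeps "foo.py" as one token; the prediction splits it.
gold = [(0, 6), (7, 11)]          # "foo.py", "runs"
pred = [(0, 3), (3, 6), (7, 11)]  # "foo", ".py" (wrong split), "runs"
print(span_f1(gold, pred))
# → (0.333..., 0.5, 0.4)
```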