Short text dataset for classification and clustering extracted from StackOverflow

Note that:

If you use this short text dataset, please cite our paper:
[1]. 2015NAACL VSM-NLP workshop-"Short Text Clustering via Convolutional Neural Networks"
and acknowledge Kaggle for making the datasets available.
We do not remove any stop words or symbols in the text;
If you run the Classification ACC.m, please run it on 64-bit machine;
Classification is fast, while Clustering is very slow via KMeans on so high-dimensionality text features, about 2 hours once. If you want to run clustering via KMeans, please have a little patience, and we strongly suggest that you directly refer the KMeans results in our paper [1] which reports the average results by running KMeans 500 times;
The demo code can be found at https://github.com/jacoxu/STC2
Please feel free to send me emails (jacoxu@msn.com) if you have any problems in using this package.

./rawText: Raw text, 20,000 titles as short texts
　　-- label_StackOverflow.txt: Each title plus a tag/label at the end;
　　-- title_StackOverflow.txt: Each title on each line;
　　-- vocab_emb_Word2vec_48.vec: Word2vec trained from a large corpus of StackOverflow dataset;
　　-- vocab_emb_Word2vec_48_index.dic: Word2vec index list corresponds with vocab_withIdx.dic;
　　-- vocab_withIdx.dic: Vocabulary index.

./matlab_format: Matlab format of rawText
　　-- StackOverflow.mat: fea is vsm model, and gnd is the label index.

./benchmarks: Contains some benchmarks, such as classfication and clustering
　　-- Classification_ACC.m: Test the classification performance with TF-IDF+SVM, and the ACC is 81.55%
　　-- predict.mexw64: LibSVM libraries;
　　-- svmpredict.mexw64
　　-- svmtrain.mexw64
　　-- train.mexw64
　　-- tf_idf.m: Compute TF-IDF;
　　-- Clustering_ACC_NMI.m: Test the clustering performance with TF-IDF+KMeans, and the ACC is 20.31% and NMI is 15.64% by 500 runs;
　　-- normalize.m: normalize the feature vectors;
　　-- bestMap.m: Permutation mapping function maps each cluster label to the equivalent label from the text data;
　　-- MutualInfo.m: Compute normalized mutual information metric;

20 different labels:
　　1 wordpress
　　2 oracle
　　3 svn
　　4 apache
　　5 excel
　　6 matlab
　　7 visual-studio
　　8 cocoa
　　9 osx
　　10 bash
　　11 spring
　　12 hibernate
　　13 scala
　　14 sharepoint
　　15 ajax
　　16 qt
　　17 drupal
　　18 linq
　　19 haskell
　　20 magento

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Short text dataset for classification and clustering extracted from StackOverflow

Note that:

Files

README.md

Latest commit

History

README.md

File metadata and controls

Short text dataset for classification and clustering extracted from StackOverflow

Note that: