Welcome to the ds4all GitHub repo

Welcome to the ds4all GitHub repo. The easiest way to navigate it is through the associated dynamic website: ds4all.io.

The repo is structured as follows:

```
├── app
├── supervised_learning
├── pipeline
└── README.md
```

[app] Contains the source of the web app for Natural Language Quantification. To run it locally:

```
git clone https://github.com/clemriedel/ds4all
cd ds4all/app
python app.py
```

Then open your browser and go to 0.0.0.0:8000.

Natural Language Quantification (NLQ) finds patterns in text and quantifies an amount generated by those patterns. An example is the money generated by topics in National Science Foundation grants. NLQ can be applied to any activity described by words and quantified by numbers.
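One way to make "the amount generated by the patterns" concrete (my reading; the rule is not stated explicitly here): if topic $k$ has weight $\theta_{d,k}$ in document $d$ and $a_d$ is the document's amount, attribute

$$m_k = \sum_d a_d \, \theta_{d,k}$$

dollars to topic $k$. The sketch at the end of this README computes exactly this quantity.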

Latent Dirichlet Allocation takes a few seconds to run from the command line (with the lda package), and a lifetime to master the power and flexibility of the algorithm. Blei, Ng and Jordan (Journal of Machine Learning Research 3 (2003) 993-1022) wrote the expression of the posterior:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$

It might look daring, but thanks to the plate notation it is really easy to understand. We solve it using collapsed Gibbs sampling.
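As a minimal sketch of this step, using the Python lda package named above (which implements a collapsed Gibbs sampler); the random count matrix is only a stand-in for the real term-document matrix described below:

```python
import numpy as np
import lda  # pip install lda -- collapsed Gibbs sampler for LDA

# Stand-in for the real term-document matrix (n_docs x n_vocab, integer counts)
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 500))

model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)

theta = model.doc_topic_  # per-document topic proportions, theta_d
phi = model.topic_word_   # per-topic word distributions
```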

It is important to understand what the inputs and outputs of the algorithm are.

I have analyzed more than 5,000 NSF grants in the biological sciences directorate. As an example, under Award #1122225 Susan Marqusee received $1.2M to study single-molecule protein folding. My words are the abstracts and my numbers the amounts of money generated by the grants. You can upload any corpus of documents as a simple .csv; first column: numbers, second column: words (as they are; my algorithm, Natural Language Quantification, cleans, removes stop words and does the stemming on the fly).
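The cleaning itself happens inside nlq; here is a sketch of the same steps (the file name grants.csv and the NLTK tools are my assumptions; the two-column layout comes from the description above):

```python
import re
import pandas as pd
from nltk.corpus import stopwords    # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

# First column: numbers (dollar amounts); second column: words (raw abstracts)
df = pd.read_csv("grants.csv", header=None, names=["amount", "abstract"])

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def clean(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return " ".join(stemmer.stem(t) for t in tokens if t not in stops)

df["clean"] = df["abstract"].apply(clean)
```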

Input: the term-document matrix. It is our observed variable, $w_{d,n}$: each word of each document is counted into the matrix.
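Continuing the sketch above, scikit-learn's CountVectorizer builds exactly such a matrix (the use of scikit-learn is my assumption; the repo may build it differently):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Row d, column n holds the count of vocabulary term n in document d: w_{d,n}
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["clean"])  # sparse matrix, shape (n_docs, n_vocab)
vocab = vectorizer.get_feature_names_out()
```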

We run nlq and solve the posterior (still on the fly). We draw $z_{d,n}$, the per-word topic assignments (the lists of topics), and $\theta_d$, the per-document topic proportions (the bar histograms for each document).
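The sampling update is not spelled out here; for reference, the standard collapsed Gibbs conditional for each assignment (the Griffiths and Steyvers 2004 form, with $V$ the vocabulary size and all counts excluding the current position) is:

$$p(z_{d,n} = k \mid \mathbf{z}_{-(d,n)}, \mathbf{w}) \propto \left(n_{d,k} + \alpha\right) \frac{n_{k,w_{d,n}} + \beta}{n_{k,\cdot} + V\beta}$$

where $n_{d,k}$ counts the words of document $d$ assigned to topic $k$ and $n_{k,w}$ counts the assignments of word $w$ to topic $k$.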

Output: the topic table ($\theta_d$) is a simple and small .csv (1.5 MB). I am still impressed by how efficiently LDA extracted the information and reduced the size of the input (the term-document matrix is about 300 MB) by a factor of 200. With this topic table, nlq computes the amount of money generated by each topic and the interactions between topics. It can also find similarities between documents in a way I judge far more effective than other techniques (such as LSA or NMF), because it uncovers the hidden structure from which the whole corpus is built.
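A sketch of that computation under one natural attribution rule (my assumption; nlq may weight things differently): each grant's dollars are split across topics in proportion to $\theta_d$, and document similarity is the cosine between topic-proportion vectors.

```python
import numpy as np

# theta: per-document topic proportions from LDA, shape (n_docs, n_topics)
# amounts: dollars per grant -- toy values for illustration
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
amounts = np.array([1_200_000, 500_000, 800_000])

# Split each grant's money across topics in proportion to theta_d
money_per_topic = amounts @ theta
print(money_per_topic)  # dollars attributed to each topic

# Document similarity in topic space (cosine between rows of theta)
unit = theta / np.linalg.norm(theta, axis=1, keepdims=True)
similarity = unit @ unit.T
```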

