Welcome to the ds4all GitHub repo

Welcome to the ds4all GitHub repo. The easiest way to navigate it is through the associated dynamic website: ds4all.io.

The repo is structured as follows:

```
├── app
├── supervised_learning
├── pipeline
└── README.md
```

[app] Contains the source of the web app for Natural Language Quantification. To run it locally:

```
git clone https://github.com/clemriedel/ds4all
cd ds4all/app
python app.py
```

Then open your browser and go to 0.0.0.0:8000.

Natural Language Quantification (NLQ) finds patterns in text and quantifies an amount generated by those patterns. An example is the money generated by topics in National Science Foundation grants. NLQ can be applied to any activity described by words and quantified by numbers.
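One way to make "the amount generated by the patterns" concrete (my reading; the rule is not stated explicitly here): if topic $k$ has weight $\theta_{d,k}$ in document $d$ and $a_d$ is the document's amount, attribute

$$m_k = \sum_d a_d \, \theta_{d,k}$$

dollars to topic $k$. The sketch at the end of this README computes exactly this quantity.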

Latent Dirichlet Allocation takes a few seconds to run from the command line (with the lda package), and a lifetime to master the power and flexibility of the algorithm. Blei, Ng and Jordan (Journal of Machine Learning Research 3 (2003) 993-1022) wrote the expression of the posterior:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$

It might look daring, but thanks to the plate notation it is really easy to understand. We solve it using collapsed Gibbs sampling.
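As a minimal sketch of this step, using the Python lda package named above (which implements a collapsed Gibbs sampler); the random count matrix is only a stand-in for the real term-document matrix described below:

```python
import numpy as np
import lda  # pip install lda -- collapsed Gibbs sampler for LDA

# Stand-in for the real term-document matrix (n_docs x n_vocab, integer counts)
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 500))

model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)

theta = model.doc_topic_  # per-document topic proportions, theta_d
phi = model.topic_word_   # per-topic word distributions
```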

It is important to understand what the inputs and outputs of the algorithm are.

I have analyzed more than 5,000 NSF grants in the biological sciences directorate. As an example, under Award #1122225 Susan Marqusee received $1.2M to study single-molecule protein folding. My words are the abstracts and my numbers the amounts of money generated by the grants. You can upload any corpus of documents as a simple .csv; first column: numbers, second column: words (as they are; my algorithm, Natural Language Quantification, cleans, removes stop words and does the stemming on the fly).
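The cleaning itself happens inside nlq; here is a sketch of the same steps (the file name grants.csv and the NLTK tools are my assumptions; the two-column layout comes from the description above):

```python
import re
import pandas as pd
from nltk.corpus import stopwords    # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

# First column: numbers (dollar amounts); second column: words (raw abstracts)
df = pd.read_csv("grants.csv", header=None, names=["amount", "abstract"])

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def clean(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return " ".join(stemmer.stem(t) for t in tokens if t not in stops)

df["clean"] = df["abstract"].apply(clean)
```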

Input: the term-document matrix. It is our observed variable, $w_{d,n}$: each word of each document is counted into the matrix.
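Continuing the sketch above, scikit-learn's CountVectorizer builds exactly such a matrix (the use of scikit-learn is my assumption; the repo may build it differently):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Row d, column n holds the count of vocabulary term n in document d: w_{d,n}
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["clean"])  # sparse matrix, shape (n_docs, n_vocab)
vocab = vectorizer.get_feature_names_out()
```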

We run nlq and solve the posterior (still on the fly). We draw $z_{d,n}$, the per-word topic assignments (the lists of topics), and $\theta_d$, the per-document topic proportions (the bar histograms for each document).
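The sampling update is not spelled out here; for reference, the standard collapsed Gibbs conditional for each assignment (the Griffiths and Steyvers 2004 form, with $V$ the vocabulary size and all counts excluding the current position) is:

$$p(z_{d,n} = k \mid \mathbf{z}_{-(d,n)}, \mathbf{w}) \propto \left(n_{d,k} + \alpha\right) \frac{n_{k,w_{d,n}} + \beta}{n_{k,\cdot} + V\beta}$$

where $n_{d,k}$ counts the words of document $d$ assigned to topic $k$ and $n_{k,w}$ counts the assignments of word $w$ to topic $k$.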

Output: the topic table ($\theta_d$) is a simple and small .csv (1.5 MB). I am still impressed by how efficiently LDA extracted the information and reduced the size of the input (the term-document matrix is about 300 MB) by a factor of 200. With this topic table, nlq computes the amount of money generated by each topic and the interactions between topics. It can also find similarities between documents in a way I judge far more effective than other techniques (such as LSA or NMF), because it uncovers the hidden structure from which the whole corpus is built.
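A sketch of that computation under one natural attribution rule (my assumption; nlq may weight things differently): each grant's dollars are split across topics in proportion to $\theta_d$, and document similarity is the cosine between topic-proportion vectors.

```python
import numpy as np

# theta: per-document topic proportions from LDA, shape (n_docs, n_topics)
# amounts: dollars per grant -- toy values for illustration
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
amounts = np.array([1_200_000, 500_000, 800_000])

# Split each grant's money across topics in proportion to theta_d
money_per_topic = amounts @ theta
print(money_per_topic)  # dollars attributed to each topic

# Document similarity in topic space (cosine between rows of theta)
unit = theta / np.linalg.norm(theta, axis=1, keepdims=True)
similarity = unit @ unit.T
```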

