This project was given to me as an assignment in a course during my computer science bachelor's degree. In the course I used AWS and Hadoop to work with large amounts of data.
In this assignment I generated a knowledge base for a Hebrew word-prediction system, based on the Google 3-Gram Hebrew dataset, using Amazon Elastic MapReduce (EMR) and Hadoop. The resulting knowledge base gives the probability of each word trigram found in the corpus.
To estimate these probabilities, I implemented a held-out method called deleted estimation.
Held-out estimators divide the training data (the corpus) into two parts, build initial estimates by counting on one part, and then use the other, held-out part to refine those estimates.
The deleted estimation method, in particular, uses a form of two-way cross-validation, as follows:
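Based on the definitions below, the standard deleted-estimation formula for a trigram occurring r times is (a hedged reconstruction; a small illustrative Java sketch that combines these counts appears after the definitions):

$$
P_{del}(w_1 w_2 w_3) = \frac{T_r^{01} + T_r^{10}}{N \left( N_r^{0} + N_r^{1} \right)}
$$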
Where:
- N is the number of n-grams in the whole corpus.
- Nr0 is the number of n-grams occurring r times in the first part of the corpus.
- Tr01 is the total number of occurrences in the second part of the corpus of those n-grams (the ones counted by Nr0).
- Nr1 is the number of n-grams occurring r times in the second part of the corpus.
- Tr10 is the total number of occurrences in the first part of the corpus of those n-grams (the ones counted by Nr1).
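The following is a minimal sketch of how these counts combine into a probability, assuming N, Nr0, Tr01, Nr1 and Tr10 have already been aggregated (for example by earlier MapReduce steps); the class and method names are illustrative and not taken from this project's code:

```java
/**
 * Minimal illustration of combining deleted-estimation counts.
 * Assumes the five counts were already computed; the names here are
 * hypothetical and do not come from this project's sources.
 */
public final class DeletedEstimationSketch {

    /**
     * @param n    N    - total number of trigrams in the whole corpus
     * @param nr0  Nr0  - number of trigrams occurring r times in the first part
     * @param tr01 Tr01 - occurrences in the second part of the trigrams counted by Nr0
     * @param nr1  Nr1  - number of trigrams occurring r times in the second part
     * @param tr10 Tr10 - occurrences in the first part of the trigrams counted by Nr1
     * @return deleted-estimation probability for a trigram occurring r times
     */
    public static double probability(long n, long nr0, long tr01, long nr1, long tr10) {
        // Use doubles to avoid overflowing N * (Nr0 + Nr1) on a large corpus.
        double denominator = (double) n * (nr0 + nr1);
        if (denominator == 0.0) {
            return 0.0; // no trigrams fell into this r bucket
        }
        return (tr01 + tr10) / denominator;
    }

    public static void main(String[] args) {
        // Toy numbers, purely for demonstration.
        System.out.printf("p = %.10f%n", probability(1_000_000L, 500L, 1_200L, 480L, 1_150L));
    }
}
```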