Skip to content

Latest commit

 

History

History

homework-3

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

Mining Data Streams

The goal of this project is to study and implement TRIÈST, a streaming graph processing algorithm which aims to count the local and global triangles in fully dynamic streams. For the purpose of the assignment, we would focus on implementing both the basic and improved version of the algorithm.

In order to accomplish the task, the following two steps will be performed:

  1. First, implement the reservoir sampling or the Flajolet-Martin technique used in the graph algorithm presented in the paper.
  2. Second, develop the streaming graph algorithm that uses reservoir sampling.

Implementation

This project has been implemented using Python and PySpark, to be able to have a parallelized version of the algorithm when running in a distributed cluster. Additionally, the dependencies NumPy, random, and urllib.request (to download the dataset from remote storage), were used for the implementation.

The selected graph dataset CA-AstroPh, was retrieved from Stanford Network Analysis Platform (SNAP) collection, it contains a collaboration network of papers published on arXiv in the AstroPhysics category.

Results

The results together with an extensive hypertuning using Tensorboard are detailed in the project report.

Authors