Open Data Stack: Data-driven Real Estate Insights in Ho Chi Minh City

An end-to-end open-source data stack for crawling and visualizing real estate data, facilitating insights into market trends.

Preview

Introduction
- Technologies used
Prerequisites
Setup and Run
Architecture
- Purpose
Components of the Data Stack
- Data Crawling: Requests
- Data Transformation: DBT, Apache Spark, Trino
- Data Warehousing and Storage: MinioS3, Iceberg, PostgreSQL
- Data Visualization and Analysis: Metabase, Jupyter Notebook
- Project Orchestration: Dagster
Project Overview
Visualization
Acknowledgements

Introduction

This project is an open-source, end-to-end data stack that collects real estate listings for Ho Chi Minh City and presents them visually, so users can follow current market trends. It covers data acquisition, processing, storage, and visualization, letting users explore market dynamics, track property trends, and spot potential investment opportunities.

Technologies Used

Below is a list of technologies used in this project:

Component	Description	URL
Docker	Containerization
Spark	Big Data processing framework	http://localhost:8061 (Master), http://localhost:8062 (Worker), http://localhost:18080 (History)
Jupyter Notebook	Interactive computing and data analysis	http://localhost:8888
Minio	Object storage service	http://localhost:9001
Iceberg	Table format for large-scale data
Data Build Tool (DBT)	Data transformation and modeling
Dagster	Data orchestrator	http://localhost:3070
Trino	Distributed SQL query engine
PostgreSQL	Relational database for gold-layer data
Metabase	Dashboards and visualization	http://localhost:3030

Prerequisites

Docker installed, with at least 8 GB of RAM available to it.

Setup and Run

Clone the project repository:

git clone https://github.com/Quocc1/OpenStack

Start the Docker engine.
Change to the project directory:

cd OpenStack

Then spin up the containers with docker-compose:

make run

Note: Run make help or refer to the Makefile for details on commands and execution. Use make down to stop the containers.

If you encounter issues running the Makefile on Windows, refer to this Stack Overflow post for potential solutions.
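
Once the containers are running, it can help to confirm that each web interface is reachable. The sketch below is one way to do that with the Python standard library. It uses only the local URLs listed in this README; the function names are illustrative and not part of the project.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Local URLs taken from this README.
SERVICES = {
    "Spark master": "http://localhost:8061",
    "Spark worker": "http://localhost:8062",
    "Spark history": "http://localhost:18080",
    "Jupyter Notebook": "http://localhost:8888",
    "Minio": "http://localhost:9001",
    "Dagster": "http://localhost:3070",
    "Metabase": "http://localhost:3030",
}


def is_up(url: str, opener: Callable = urllib.request.urlopen, timeout: float = 3.0) -> bool:
    """Return True if the service answers with anything other than a 5xx error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        # A 4xx answer (for example, login required) still means the server is listening.
        return error.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure.
        return False


def check_services(services: Dict[str, str] = SERVICES,
                   opener: Callable = urllib.request.urlopen) -> Dict[str, bool]:
    """Check every service and return a name -> reachable mapping."""
    return {name: is_up(url, opener) for name, url in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name:18} {'up' if ok else 'DOWN'}")
```

Services can take a while to start after make run, so a DOWN result shortly after startup does not necessarily indicate a problem.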

Run the end-to-end job in Dagster

Open Dagster at http://localhost:3070 and select the end_to_end job.

Copy all the values in end_to_end.yaml into the Launchpad, then click Launch run.

Architecture

The diagram illustrates the conceptual view of the data pipeline (from bottom to top).

Real estate advertisements are obtained through an API.
The advertisements are then stored in Minio S3, leveraging Apache Iceberg for efficient data management.
The data undergoes transformation through each medallion stage: bronze, silver, and gold, ensuring quality and consistency.
Gold-layer data is loaded into PostgreSQL for persistent storage.
Data is visualized with Metabase for analysis and insights, and Jupyter Notebook is used for machine learning.

The orchestration of these steps is managed by Dagster, while data transformation is handled by DBT.
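
To make the bronze, silver, and gold stages concrete, here is a minimal plain-Python sketch of the idea. In this project the stages are implemented as DBT SQL models, not as Python functions. The field names (ad_id, district, price, area) and the chosen aggregate are assumptions made for illustration only.

```python
from typing import Any, Dict, List


def to_bronze(raw_ads: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Bronze: keep records as received, only tagging where they came from."""
    return [dict(ad, source="api") for ad in raw_ads]


def to_silver(bronze: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Silver: cast types, drop unusable rows, and remove duplicate ads."""
    seen = set()
    silver = []
    for ad in bronze:
        try:
            price = float(ad["price"])
            area = float(ad["area"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric price/area
        ad_id = ad.get("ad_id")
        if price <= 0 or area <= 0 or ad_id in seen:
            continue
        seen.add(ad_id)
        silver.append({
            "ad_id": ad_id,
            "district": ad.get("district", "unknown"),
            "price": price,
            "area": area,
        })
    return silver


def to_gold(silver: List[Dict[str, Any]]) -> Dict[str, float]:
    """Gold: average price per square metre for each district."""
    totals: Dict[str, List[float]] = {}
    for ad in silver:
        bucket = totals.setdefault(ad["district"], [0.0, 0])
        bucket[0] += ad["price"] / ad["area"]
        bucket[1] += 1
    return {district: total / count for district, (total, count) in totals.items()}
```

Each stage consumes only the output of the previous one, which is what allows quality checks to be applied layer by layer.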

Purpose:

The purpose of this project is to offer a comprehensive end-to-end open-source data stack tailored for analyzing real estate trends in Ho Chi Minh City, Vietnam. It aims to seamlessly acquire, process, store, and visualize real estate data specific to the city.

By leveraging this data stack, users can gain valuable insights into the dynamic real estate market of Ho Chi Minh City, enabling informed decision-making, trend analysis, and identification of investment opportunities in the region.

(See details in the Visualization section below)

Components of the Data Stack

Data Crawling

Data crawling represents the preliminary phase in which raw data is gathered from diverse sources. Within our infrastructure, we employ the following technology:

Requests: This Python library streamlines the process of making HTTP requests, thereby enabling seamless retrieval of data from APIs and web pages.

API Endpoint: gateway.chotot.com
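
The following sketch shows what a crawl step built on Requests could look like. This README names only the host gateway.chotot.com, so the path, the query parameter names, and the "ads" response key below are placeholders and assumptions, not the documented API. The helper names are hypothetical as well.

```python
import csv
from typing import Any, Callable, Dict, List, Optional

BASE_URL = "https://gateway.chotot.com"
LISTING_PATH = "/placeholder/listings"  # assumption: replace with the real path


def fetch_page(page: int, page_size: int = 50,
               get: Optional[Callable[..., Any]] = None) -> List[Dict[str, Any]]:
    """Fetch one page of advertisements and return the list of ad records."""
    if get is None:
        import requests  # imported lazily so the rest can be exercised offline
        get = requests.get
    response = get(
        BASE_URL + LISTING_PATH,
        params={"page": page, "limit": page_size},  # hypothetical parameter names
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    payload = response.json()
    # Assumption: the ads are listed under an "ads" key.
    return payload.get("ads", [])


def write_stage_csv(ads: List[Dict[str, Any]], path: str, fields: List[str]) -> None:
    """Write the selected fields of each ad to a CSV file; missing fields become empty."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for ad in ads:
            writer.writerow({field: ad.get(field, "") for field in fields})
```

Passing the HTTP function in as the get argument keeps the parsing and CSV logic testable without network access.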

Here is an example response to a request:

Data Transformation

Data transformation involves processing and refining raw data into a structured format suitable for analysis. We leverage the following technologies for this purpose:

DBT (Data Build Tool): DBT defines the data models and runs the SQL transformations that move data through the bronze, silver, and gold layers.
Apache Spark: As a powerful distributed computing framework, Apache Spark assists in processing large-scale data efficiently, facilitating complex transformations and computations.
Trino (formerly Presto): Trino serves as a distributed SQL query engine, enabling interactive analysis across various data sources.

Representation of Data Flow:

Data Warehousing and Storage

Data warehousing and storage form the foundation for storing and managing processed data. Our data stack incorporates the following technologies:

MinioS3: MinioS3 provides object storage capabilities, offering a scalable and cost-effective solution for storing large volumes of data.
Iceberg: Iceberg is utilized for managing structured data tables in cloud object stores efficiently, providing features like atomic commits and time travel.
PostgreSQL: PostgreSQL serves as our relational database management system, offering robust data storage and querying capabilities.

Connect to PostgreSQL using DBeaver (username: postgres, password: postgres):

Connect to MinioS3 via localhost:9001 (username: admin, password: password):

Data Visualization and Analysis

Data visualization and analysis support data exploration and decision-making. Our tools for this are:

Metabase (Community Edition): Metabase provides a user-friendly interface, facilitating the creation of interactive dashboards and visualizations. This empowers users to effortlessly derive insights from their data.
Jupyter Notebook: Jupyter Notebook is another essential tool for data visualization and analysis. It allows users to create and share documents containing live code, equations, visualizations, and narrative text, providing a versatile environment for data exploration and experimentation.

Examples of machine learning in Jupyter Notebook:

Project Orchestration

Project orchestration involves coordinating and managing the various components and processes within our data pipeline. We employ:

Dagster: Dagster serves as our project orchestration tool, enabling the definition, scheduling, and monitoring of data workflows with a focus on data quality and reliability.

End-to-end pipeline illustration:

Project Overview

OpenStack/
├── assets/
│   └── pictures
├── code/
│   ├── dbt/
│   │   ├── bronze/
│   │   │   └── model/
│   │   │       └── bronze_raw_data.sql
│   │   ├── silver/
│   │   │   └── model/
│   │   │       └── silver_refined_data.sql
│   │   └── gold/
│   │       └── model/
│   │           └── gold_analytics_data.sql
│   └── real_estate_dagster/
│       └── real_estate_dagster/
│           ├── crawl.py
│           ├── database.py
│           ├── dbt.py
│           ├── end_to_end.py
│           └── ...
├── data/
│   ├── spark/
│   │   └── notebook/Predict_Price_Real_Estate.ipynb
│   └── stage/
│       └── houses.csv
├── docker/
│   ├── metabase/
│   ├── spark_iceberg_dagster_dbt/
│   └── trino/
├── docker-compose.yaml
├── Makefile
└── README.md

Overview

real_estate_dagster/
└── real_estate_dagster/
   ├── crawl.py
   ├── database.py
   ├── dbt.py
   ├── end_to_end.py
   └── ...

crawl.py: A Dagster job that retrieves data via an API and writes it to /var/lib/app/stage/houses.csv.

database.py: A Dagster job utilized for initializing databases for Minio, Iceberg, and PostgreSQL.

dbt.py: A Dagster job employed for executing DBT models.

end_to_end.py: This file combines all Dagster jobs, including database.py, crawl.py, and dbt.py, to orchestrate an end-to-end data pipeline.

dbt/
├── bronze/
│   └── model/
│       └── bronze_raw_data.sql
├── silver/
│   └── model/
│       └── silver_refined_data.sql
└── gold/
   └── model/
      └── gold_analytics_data.sql

bronze_raw_data.sql: SQL model defining transformations for raw data in the bronze layer.

silver_refined_data.sql: SQL model defining transformations for refined data in the silver layer.

gold_analytics_data.sql: SQL model defining transformations for analytics-ready data in the gold layer.
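
Because each layer builds on the previous one, the models need to run in the order bronze, silver, gold. The sketch below shows one way to drive that order from Python by calling the dbt CLI. It assumes each layer directory is a separate dbt project, which the tree above suggests but does not confirm, and it is not the project's actual dbt.py.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence

# Layers in dependency order.
LAYERS = ("bronze", "silver", "gold")


def build_dbt_commands(dbt_root: str, layers: Sequence[str] = LAYERS) -> List[List[str]]:
    """Build one `dbt run` command per layer, pointing at that layer's directory."""
    root = Path(dbt_root)
    return [["dbt", "run", "--project-dir", str(root / layer)] for layer in layers]


def run_layers(dbt_root: str, runner: Callable = subprocess.run) -> None:
    """Run the layers one after another, stopping at the first failure."""
    for command in build_dbt_commands(dbt_root):
        # check=True raises CalledProcessError if dbt exits with a non-zero status.
        runner(command, check=True)


if __name__ == "__main__":
    run_layers("code/dbt")
```

Stopping at the first failure prevents a later layer from being built on incomplete upstream data.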

data/
├── spark/
│   └── notebook/Predict_Price_Real_Estate.ipynb
└── stage/
   └── houses.csv

Predict_Price_Real_Estate.ipynb: Jupyter Notebook containing code for predicting real estate prices using Spark.

houses.csv: CSV file containing staged real estate data.
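
Before running the transformations, a quick look at the staged file can catch an empty or malformed crawl. This is a minimal standard-library sketch that makes no assumptions about the column names. The function name is illustrative.

```python
import csv
from typing import Any, Dict, TextIO


def summarize_csv(handle: TextIO) -> Dict[str, Any]:
    """Return the row count, the column names, and the number of empty cells per column."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    missing = {
        column: sum(1 for row in rows if not (row.get(column) or "").strip())
        for column in columns
    }
    return {"rows": len(rows), "columns": columns, "missing": missing}


if __name__ == "__main__":
    # Path as given in this README; it refers to the location inside the container.
    with open("/var/lib/app/stage/houses.csv", newline="", encoding="utf-8") as handle:
        print(summarize_csv(handle))
```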

Visualization

For visualization using Metabase, access localhost:3030 (username: caobinhoh@gmail.com, password: quoc123).

After accessing Metabase with the provided credentials, choose the "HCMC Real Estate Insights" dashboard for viewing.

Acknowledgements

This project draws inspiration and guidance from the following sources:

ngods-stocks for its valuable insights and inspiration.
hcmc-houses-analysis for generously providing code for data crawling.
