An end-to-end open-source data stack for crawling and visualizing real estate data, facilitating insights into market trends.
- Introduction
- Prerequisites
- Setup and Run
- Architecture
- Components of the Data Stack
  - Data Crawling: Requests
  - Data Transformation: DBT, Apache Spark, Trino
  - Data Warehousing and Storage: MinioS3, Iceberg, PostgreSQL
  - Data Visualization and Analysis: Metabase, Jupyter Notebook
  - Project Orchestration: Dagster
- Project Overview
- Visualization
- Acknowledgements
This project is an open-source data stack that gathers real estate data from Ho Chi Minh City and presents it visually, so that users can follow prevailing market trends. With it, users can collect and analyze up-to-date listings from the local real estate market. The stack covers data acquisition, processing, storage, and visualization, which lets users explore market dynamics, track property trends, and spot potential investment opportunities.
Below is a list of technologies used in this project:
Component | Description | URL |
---|---|---|
Docker | Containerization | |
Spark | Big Data processing framework | http://localhost:8061 (Master), http://localhost:8062 (Worker), http://localhost:18080 (History) |
Jupyter Notebook | Interactive computing and data analysis | http://localhost:8888 |
Minio | Object storage service | http://localhost:9001 |
Iceberg | Table format for large-scale data | |
Data Build Tool (DBT) | Data transformation and modeling | |
Dagster | Data orchestrator | http://localhost:3070 |
Trino | Distributed SQL query engine | |
PostgreSQL | Relational database serving gold-layer data | |
Docker must be installed and have at least 8 GB of RAM available.
- Pull the project from the repository:
  git clone https://github.com/Quocc1/OpenStack
- Start the Docker engine.
- Change to the project directory:
  cd OpenStack
- Start the containers with Docker Compose:
  make run
Note: Run make help or refer to the Makefile for details on commands and execution. Use make down to stop the containers.
If you encounter issues running the Makefile on Windows, refer to this Stack Overflow post for potential solutions.
- Run the end-to-end job in Dagster: select the end_to_end job, copy all the values from end_to_end.yaml into the Launchpad, then click Launch Run.
The diagram illustrates the conceptual view of the data pipeline (from bottom to top).
- Real estate advertisements are obtained through an API.
- The advertisements are then stored in Minio S3, leveraging Apache Iceberg for efficient data management.
- The data is transformed through each medallion stage (bronze, silver, and gold) to ensure quality and consistency.
- Gold-layer data is stored in PostgreSQL for persistent storage.
- Data is visualized with Metabase for analysis and insights, and Jupyter Notebook is utilized for machine learning.
The orchestration of these steps is managed by Dagster, while data transformation is handled by DBT.
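To make the three medallion stages concrete, here is a minimal Python sketch of the idea. It is not the project's implementation (the real transformations live in the DBT SQL models). The field names ad_id, district, price, and area_m2, and the function names to_silver and to_gold, are assumptions chosen for illustration only.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def to_silver(bronze_rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Clean raw (bronze) rows: cast types, drop invalid rows, de-duplicate by ad_id."""
    seen = set()
    silver = []
    for row in bronze_rows:
        ad_id = row.get("ad_id")
        if not ad_id or ad_id in seen:
            continue  # skip rows without an id and repeated ads
        try:
            price = float(row["price"])
            area = float(row["area_m2"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows whose numeric fields cannot be parsed
        if price <= 0 or area <= 0:
            continue
        seen.add(ad_id)
        silver.append({
            "ad_id": ad_id,
            "district": (row.get("district") or "unknown").strip(),
            "price": price,
            "area_m2": area,
        })
    return silver


def to_gold(silver_rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Aggregate cleaned rows into an analytics-ready metric: mean price per m2 by district."""
    per_district = defaultdict(list)
    for row in silver_rows:
        per_district[row["district"]].append(row["price"] / row["area_m2"])
    return {district: round(mean(values), 2) for district, values in per_district.items()}
```

Bronze rows are passed in as ingested; each later stage only reads the output of the previous one, which is what keeps the layers independently testable.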
The purpose of this project is to offer a comprehensive end-to-end open-source data stack tailored for analyzing real estate trends in Ho Chi Minh City, Vietnam. It aims to seamlessly acquire, process, store, and visualize real estate data specific to the city.
By leveraging this data stack, users can gain valuable insights into the dynamic real estate market of Ho Chi Minh City, enabling informed decision-making, trend analysis, and identification of investment opportunities in the region.
(See details in the Visualization section below)
Data crawling represents the preliminary phase in which raw data is gathered from diverse sources. Within our infrastructure, we employ the following technology:
- Requests: This Python library streamlines the process of making HTTP requests, thereby enabling seamless retrieval of data from APIs and web pages.
API Endpoint: gateway.chotot.com
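A paginated crawl with Requests might look roughly like the sketch below. Only the host gateway.chotot.com comes from this project; the full listing URL, the offset and limit parameter names, and the ads response key are hypothetical placeholders that should be checked against the real API. The get argument exists only so the helpers can be exercised without network access.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def page_params(page: int, page_size: int = 50) -> Dict[str, int]:
    """Build paging parameters (hypothetical names: 'offset' and 'limit')."""
    return {"offset": page * page_size, "limit": page_size}


def fetch_page(url: str, params: Dict[str, int],
               get: Optional[Callable[..., Any]] = None,
               timeout: float = 10.0) -> Dict[str, Any]:
    """Fetch one page of results and return the decoded JSON body."""
    if get is None:
        # Imported lazily so the helper can be tested offline with a stub.
        import requests
        get = requests.get
    response = get(url, params=params, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def crawl(url: str, max_pages: int, page_size: int = 50,
          get: Optional[Callable[..., Any]] = None) -> Iterator[Dict[str, Any]]:
    """Yield ads page by page, stopping early when a page comes back empty."""
    for page in range(max_pages):
        payload = fetch_page(url, page_params(page, page_size), get=get)
        ads = payload.get("ads", [])  # 'ads' is an assumed response key
        if not ads:
            break
        yield from ads
```

For example, list(crawl(listing_url, max_pages=3)) would collect up to three pages, where listing_url is the full endpoint URL you supply.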
Here is an example response to a request:
Data transformation involves processing and refining raw data into a structured format suitable for analysis. We leverage the following technologies for this purpose:
- DBT (Data Build Tool): DBT defines the data models and runs the SQL transformations.
- Apache Spark: As a distributed computing framework, Apache Spark processes large-scale data efficiently and handles complex transformations and computations.
- Trino (formerly Presto): Trino serves as a distributed SQL query engine, enabling interactive analysis across various data sources.
Representation of Data Flow:
Data warehousing and storage form the foundation for storing and managing processed data. Our data stack incorporates the following technologies:
- MinioS3: MinioS3 provides object storage capabilities, offering a scalable and cost-effective solution for storing large volumes of data.
- Iceberg: Iceberg is utilized for managing structured data tables in cloud object stores efficiently, providing features like atomic commits and time travel.
- PostgreSQL: PostgreSQL serves as our relational database management system, offering robust data storage and querying capabilities.
Connect to PostgreSQL using DBeaver (username: postgres, password: postgres):
Connect to MinioS3 via localhost:9001 (username: admin, password: password):
Data visualization and analysis support data exploration and decision-making. Our preferred tools for visualization and analysis are:
- Metabase (Community Edition): Metabase provides a user-friendly interface, facilitating the creation of interactive dashboards and visualizations. This empowers users to effortlessly derive insights from their data.
- Jupyter Notebook: Jupyter Notebook is another essential tool for data visualization and analysis. It allows users to create and share documents containing live code, equations, visualizations, and narrative text, providing a versatile environment for data exploration and experimentation.
Examples of machine learning in Jupyter Notebook:
Project orchestration involves coordinating and managing the various components and processes within our data pipeline. We employ:
- Dagster: Dagster serves as our project orchestration tool, enabling the definition, scheduling, and monitoring of data workflows with a focus on data quality and reliability.
End-to-end pipeline illustration:
OpenStack/
├── assets/
│   └── pictures
├── code/
│   ├── dbt/
│   │   ├── bronze/
│   │   │   └── model/
│   │   │       └── bronze_raw_data.sql
│   │   ├── silver/
│   │   │   └── model/
│   │   │       └── silver_refined_data.sql
│   │   └── gold/
│   │       └── model/
│   │           └── gold_analytics_data.sql
│   └── real_estate_dagster/
│       └── real_estate_dagster/
│           ├── crawl.py
│           ├── database.py
│           ├── dbt.py
│           ├── end_to_end.py
│           └── ...
├── data/
│   ├── spark/
│   │   └── notebook/Predict_Price_Real_Estate.ipynb
│   └── stage/
│       └── houses.csv
├── docker/
│   ├── metabase/
│   ├── spark_iceberg_dagster_dbt/
│   └── trino/
├── docker-compose.yaml
├── Makefile
└── README.md
real_estate_dagster/
└── real_estate_dagster/
    ├── crawl.py
    ├── database.py
    ├── dbt.py
    ├── end_to_end.py
    └── ...
crawl.py: A Dagster job responsible for retrieving data via an API and storing it into /var/lib/app/stage/houses.csv.
database.py: A Dagster job utilized for initializing databases for Minio, Iceberg, and PostgreSQL.
dbt.py: A Dagster job employed for executing DBT models.
end_to_end.py: This file combines all Dagster jobs, including database.py, crawl.py, and dbt.py, to orchestrate an end-to-end data pipeline.
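The staging step described for crawl.py above (writing crawled records to houses.csv) could be sketched as follows. This is an illustrative helper, not the code in crawl.py: the function name write_stage_csv and the column names in the usage note are assumptions.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List, Union


def write_stage_csv(rows: Iterable[Dict[str, object]],
                    path: Union[str, Path],
                    fieldnames: List[str]) -> int:
    """Write rows to a staging CSV and return how many rows were written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the stage directory if needed
    count = 0
    with path.open("w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not listed in fieldnames;
        # keys missing from a row are written as empty strings.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

A call such as write_stage_csv(ads, "/var/lib/app/stage/houses.csv", ["ad_id", "price"]) would then produce the staged file that the downstream models read, with ads being whatever iterable of records the crawl returns.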
dbt/
├── bronze/
│   └── model/
│       └── bronze_raw_data.sql
├── silver/
│   └── model/
│       └── silver_refined_data.sql
└── gold/
    └── model/
        └── gold_analytics_data.sql
bronze_raw_data.sql: SQL model defining transformations for raw data in the bronze layer.
silver_refined_data.sql: SQL model defining transformations for refined data in the silver layer.
gold_analytics_data.sql: SQL model defining transformations for analytics-ready data in the gold layer.
data/
├── spark/
│   └── notebook/Predict_Price_Real_Estate.ipynb
└── stage/
    └── houses.csv
Predict_Price_Real_Estate.ipynb: Jupyter Notebook containing code for predicting real estate prices using Spark.
houses.csv: CSV file containing staged real estate data.
For visualization using Metabase, access localhost:3030 (username caobinhoh@gmail.com and password quoc123).
After accessing Metabase with the provided credentials, choose the "HCMC Real Estate Insights" dashboard for viewing.
This project draws inspiration and guidance from the following sources:
- ngods-stocks for its valuable insights and inspiration.
- hcmc-houses-analysis for generously providing code for data crawling.