Skip to content

This repository provides a modular and easy-to-extend ETL pipeline that streams data from a PostgreSQL database into a StarRocks data warehouse using RisingWave as the real-time streaming computation layer.

License

Notifications You must be signed in to change notification settings

dwickyferi/etl-postgres-to-starrocks-via-risingwave

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ETL Pipeline: PostgreSQL to StarRocks via RisingWave 🚀

PostgreSQL Docker Python License: MIT

📝 Table of Contents

🎯 About

This repository provides a robust and scalable ETL (Extract, Transform, Load) pipeline that enables real-time data streaming from PostgreSQL to StarRocks data warehouse using RisingWave as the stream processing engine. The solution is designed to be modular, easy to extend, and production-ready.

🏗️ Architecture

PostgreSQL (Source) → RisingWave (Stream Processing) → StarRocks (Data Warehouse)
  • PostgreSQL: Source database with transaction data
  • RisingWave: Real-time stream processing and transformation layer
  • StarRocks: High-performance analytical data warehouse

🛠️ Prerequisites

  • Docker and Docker Compose
  • PostgreSQL 13+
  • RisingWave 1.0+
  • StarRocks 3.0+
  • Minimum 8GB RAM
  • 20GB available disk space

🚀 Quick Start

  1. Clone the repository:
git clone https://github.com/yourusername/etl-postgres-to-starrocks-via-risingwave.git
cd etl-postgres-to-starrocks-via-risingwave
  1. Start the infrastructure:
docker-compose up -d
  1. Initialize the databases:
# Initialize PostgreSQL schema and sample data
psql -h localhost -U postgres -d postgres -f sql/init.sql

# Initialize StarRocks schema
mysql -h localhost -P 9030 -u root -f sql/starrocks.sql

# Configure RisingWave pipeline
psql -h localhost -P 4566 -u root -f sql/risingwave.sql

📖 Detailed Setup Guide

PostgreSQL Configuration

The PostgreSQL database is configured with:

  • Sample tables (customers, products, orders, order_details)
  • CDC (Change Data Capture) enabled
  • Replication slots for RisingWave connectivity
  • Sample data for testing

Refer to sql/init.sql for the complete database schema and initial data setup.

RisingWave Setup

RisingWave is configured to:

  • Capture CDC events from PostgreSQL
  • Transform data through materialized views
  • Stream processed data to StarRocks
  • Handle data type conversions and transformations

Key configurations in sql/risingwave.sql include:

  • PostgreSQL CDC source configuration
  • Materialized views for data transformation
  • StarRocks sink configuration

StarRocks Configuration

StarRocks is set up with:

  • Optimized table schemas for analytical queries
  • Proper data distribution and bucketing
  • Automated data loading from RisingWave

See sql/starrocks.sql for detailed warehouse configuration.

🔄 Data Flow

  1. Data changes in PostgreSQL are captured via CDC
  2. RisingWave processes these changes in real-time
  3. Transformed data is continuously loaded into StarRocks
  4. StarRocks maintains optimized storage for analytical queries

📊 Monitoring

❗ Troubleshooting

Common issues and solutions:

  • CDC replication lag: Check PostgreSQL WAL retention
  • RisingWave memory pressure: Adjust resource allocation
  • StarRocks loading failures: Verify network connectivity

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • PostgreSQL Community
  • RisingWave Team
  • StarRocks Community

📬 Contact

For questions or feedback, please open an issue in the repository.

About

This repository provides a modular and easy-to-extend ETL pipeline that streams data from a PostgreSQL database into a StarRocks data warehouse using RisingWave as the real-time streaming computation layer.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published