- About
- Architecture
- Prerequisites
- Quick Start
- Detailed Setup Guide
- Data Flow
- Monitoring
- Troubleshooting
- Contributing
- License
This repository provides a robust and scalable ETL (Extract, Transform, Load) pipeline that streams data in real time from PostgreSQL to a StarRocks data warehouse, using RisingWave as the stream processing engine. The solution is designed to be modular, easy to extend, and production-ready.
PostgreSQL (Source) → RisingWave (Stream Processing) → StarRocks (Data Warehouse)
- PostgreSQL: Source database with transaction data
- RisingWave: Real-time stream processing and transformation layer
- StarRocks: High-performance analytical data warehouse
- Docker and Docker Compose
- PostgreSQL 13+
- RisingWave 1.0+
- StarRocks 3.0+
- Minimum 8GB RAM
- 20GB available disk space
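The requirements above can be checked before starting the stack. The following is a minimal sketch, assuming a Linux host (it reads `/proc/meminfo`) and that the current directory sits on the disk Docker will use. The function names are hypothetical and not part of this repository.

```shell
# Hypothetical preflight helpers; not part of this repository.

# Succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Prints total RAM rounded to whole GiB (Linux only; /proc/meminfo is in kB).
mem_total_gb() {
    awk '/^MemTotal:/ { printf "%.0f\n", $2 / 1048576 }' /proc/meminfo
}

# Prints free space, in whole GiB, on the filesystem holding the given path.
disk_free_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# Prints one line per unmet requirement; returns non-zero if any is unmet.
preflight() {
    status=0
    for cmd in docker psql mysql; do
        have_cmd "$cmd" || { echo "missing command: $cmd"; status=1; }
    done
    [ "$(mem_total_gb)" -ge 8 ] || { echo "less than 8GB RAM"; status=1; }
    [ "$(disk_free_gb .)" -ge 20 ] || { echo "less than 20GB free disk"; status=1; }
    return "$status"
}
```

Running `preflight` prints nothing and returns zero when every check passes. It does not verify the PostgreSQL, RisingWave, or StarRocks versions, which depend on the images used.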
- Clone the repository:
git clone https://github.com/yourusername/etl-postgres-to-starrocks-via-risingwave.git
cd etl-postgres-to-starrocks-via-risingwave
- Start the infrastructure:
docker-compose up -d
- Initialize the databases:
# Initialize PostgreSQL schema and sample data
psql -h localhost -U postgres -d postgres -f sql/init.sql
# Initialize StarRocks schema (127.0.0.1 forces TCP; the mysql client treats "localhost" as a Unix socket)
mysql -h 127.0.0.1 -P 9030 -u root < sql/starrocks.sql
# Configure RisingWave pipeline (connects to RisingWave's default database, dev)
psql -h localhost -p 4566 -U root -d dev -f sql/risingwave.sql
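Containers started in the background can take some time before they accept connections, so the initialization commands can fail if run immediately. One way to handle this is a small retry wrapper. This is a sketch only: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (not executed here): wait up to about 60 seconds for RisingWave
# before applying the pipeline definition.
#   retry 30 2 psql -h localhost -p 4566 -U root -d dev -c 'SELECT 1' \
#       && psql -h localhost -p 4566 -U root -d dev -f sql/risingwave.sql
```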
The PostgreSQL database is configured with:
- Sample tables (customers, products, orders, order_details)
- CDC (Change Data Capture) enabled
- Replication slots for RisingWave connectivity
- Sample data for testing
Refer to sql/init.sql for the complete database schema and initial data setup.
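CDC through logical decoding requires PostgreSQL to run with `wal_level = logical`, and each consumer holds a replication slot. A quick way to confirm both is sketched below. The helper names are hypothetical, and the connection flags in the usage comment are copied from the Quick Start.

```shell
# Succeeds only if the given wal_level value allows logical decoding.
wal_level_ok() {
    [ "$1" = "logical" ]
}

# Prints SQL that inspects CDC readiness using standard PostgreSQL catalogs.
cdc_status_sql() {
    cat <<'EOF'
SHOW wal_level;
SELECT slot_name, plugin, slot_type, active FROM pg_replication_slots;
EOF
}

# Example (not executed here):
#   wal_level_ok "$(psql -h localhost -U postgres -d postgres -Atc 'SHOW wal_level')" \
#       || echo "wal_level is not logical; CDC will not work"
#   cdc_status_sql | psql -h localhost -U postgres -d postgres
```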
RisingWave is configured to:
- Capture CDC events from PostgreSQL
- Transform data through materialized views
- Stream processed data to StarRocks
- Handle data type conversions and transformations
Key configurations in sql/risingwave.sql include:
- PostgreSQL CDC source configuration
- Materialized views for data transformation
- StarRocks sink configuration
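To make the three pieces listed above concrete, here is a rough sketch of what such a pipeline definition can look like. It is not the content of sql/risingwave.sql. The column names, view name, slot name, `analytics` database, host names, and password are all hypothetical. The connector parameter names should be checked against the RisingWave documentation for the version in use; older releases configure `postgres-cdc` directly on `CREATE TABLE` rather than through a shared source.

```shell
# Hypothetical connection settings; override through the environment.
PG_HOST=${PG_HOST:-postgres}
PG_PASSWORD=${PG_PASSWORD:-postgres}
SR_HOST=${SR_HOST:-starrocks}

# Prints an illustrative RisingWave pipeline definition to stdout.
risingwave_pipeline_sql() {
    cat <<EOF
-- 1. Shared CDC source reading the PostgreSQL change stream.
CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = '${PG_HOST}',
    port = '5432',
    username = 'postgres',
    password = '${PG_PASSWORD}',
    database.name = 'postgres',
    slot.name = 'risingwave_slot'
);

-- 2. Table mirroring public.orders (columns are hypothetical).
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date TIMESTAMP
) FROM pg_source TABLE 'public.orders';

-- 3. Transformation that is kept incrementally up to date.
CREATE MATERIALIZED VIEW customer_order_counts AS
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;

-- 4. Sink that pushes the view's changes to an existing StarRocks table.
CREATE SINK customer_order_counts_sink
FROM customer_order_counts
WITH (
    connector = 'starrocks',
    type = 'upsert',
    primary_key = 'customer_id',
    starrocks.host = '${SR_HOST}',
    starrocks.mysqlport = '9030',
    starrocks.httpport = '8030',
    starrocks.user = 'root',
    starrocks.password = '',
    starrocks.database = 'analytics',
    starrocks.table = 'customer_order_counts'
);
EOF
}
```

The output can be piped into `psql` against RisingWave, for example `risingwave_pipeline_sql | psql -h localhost -p 4566 -U root -d dev`. Generating the SQL from a function keeps host names and credentials out of the file itself.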
StarRocks is set up with:
- Optimized table schemas for analytical queries
- Proper data distribution and bucketing
- Automated data loading from RisingWave
See sql/starrocks.sql for detailed warehouse configuration.
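As an illustration of the distribution and bucketing mentioned above, a StarRocks target table might be declared as follows. This is a sketch, not the content of sql/starrocks.sql. The `analytics` database, the table and column names, the bucket count, and the single-replica setting (suitable only for a one-node development cluster) are assumptions.

```shell
# Prints an illustrative StarRocks table definition to stdout.
starrocks_table_sql() {
    cat <<'EOF'
CREATE DATABASE IF NOT EXISTS analytics;

-- Primary Key table: rows with the same key are upserted, not appended.
CREATE TABLE IF NOT EXISTS analytics.customer_order_counts (
    customer_id INT NOT NULL,
    order_count BIGINT
)
PRIMARY KEY (customer_id)
DISTRIBUTED BY HASH (customer_id) BUCKETS 4
PROPERTIES ("replication_num" = "1");
EOF
}

# Example (not executed here):
#   starrocks_table_sql | mysql -h 127.0.0.1 -P 9030 -u root
```

Hashing on the primary key spreads rows evenly across buckets and keeps all versions of a row in the same tablet, which is what an upsert-style load needs.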
- Data changes in PostgreSQL are captured via CDC
- RisingWave processes these changes in real time
- Transformed data is continuously loaded into StarRocks
- StarRocks maintains optimized storage for analytical queries
- PostgreSQL metrics via pg_stat_statements
- RisingWave dashboard at http://localhost:5691
- StarRocks FE web UI at http://localhost:8030
- Prometheus metrics at http://localhost:9090
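The HTTP endpoints above can be probed from a script to confirm the stack is up. The sketch below assumes `curl` is installed and that each service answers on its root URL. The function names are hypothetical.

```shell
# Prints "UP <name> <url>" or "DOWN <name> <url>"; returns non-zero when down.
probe() {
    if curl -fsS -o /dev/null --max-time 5 "$2" 2>/dev/null; then
        echo "UP $1 $2"
    else
        echo "DOWN $1 $2"
        return 1
    fi
}

# Probes each HTTP endpoint listed above; returns non-zero if any is down.
probe_all() {
    rc=0
    probe risingwave-dashboard http://localhost:5691 || rc=1
    probe starrocks-fe http://localhost:8030 || rc=1
    probe prometheus http://localhost:9090 || rc=1
    return "$rc"
}
```

`probe_all` only shows that something is listening and answering HTTP; it says nothing about whether data is flowing through the pipeline.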
Common issues and solutions:
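- CDC replication lag: Check that the replication slot is active and how much WAL it is holding back (see pg_replication_slots); an inactive slot forces PostgreSQL to retain WAL
- RisingWave memory pressure: Adjust resource allocation, for example the memory limits of the RisingWave containers
- StarRocks loading failures: Verify that RisingWave can reach StarRocks over the network and that the target database and table exist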
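For the replication lag case, the amount of WAL that each logical slot has not yet consumed can be read from PostgreSQL's standard catalog. The query below is a sketch for PostgreSQL 13+ (the `wal_status` column does not exist in earlier versions). The helper name is hypothetical.

```shell
# Prints SQL reporting, per logical slot, how much WAL is still unconsumed.
replication_lag_sql() {
    cat <<'EOF'
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS unconsumed_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';
EOF
}

# Example (not executed here):
#   replication_lag_sql | psql -h localhost -U postgres -d postgres
```

A value that keeps growing while `active` is false usually means the consumer has stopped reading from the slot.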
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- PostgreSQL Community
- RisingWave Team
- StarRocks Community
For questions or feedback, please open an issue in the repository.