Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

research: compatibility with citus #577

Open
VoVAllen opened this issue Aug 30, 2024 · 5 comments · May be fixed by tensorchord/pgvecto.rs-docs#90
Open

research: compatibility with citus #577

VoVAllen opened this issue Aug 30, 2024 · 5 comments · May be fixed by tensorchord/pgvecto.rs-docs#90
Assignees

Comments

@VoVAllen
Copy link
Member

VoVAllen commented Aug 30, 2024

Research if we could support citus

@gaocegege
Copy link
Member

cc @cho-thinkfree-com

@shetao2015
Copy link

+1

@cutecutecat
Copy link
Member

cutecutecat commented Sep 3, 2024

Stream/Physical replication

Current situation

The main problem is at building the index, when index log is replayed inside PG of standby instance:

postgres=# select * from pg_indexes where tablename = 't';
 schemaname | tablename | indexname | tablespace |                                               indexdef                                                
------------+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------
 public     | t         | t_val_idx |            | CREATE INDEX t_val_idx ON public.t USING vectors (val vector_l2_ops) WITH (options='[indexing.hnsw]')
(1 row)

but the backend hook ambuild is not triggered at WAL replay of standby instance. So they are no longer in sync, which lead to this error at any vector query:

postgres=# SELECT idx_status, idx_indexing FROM pg_vector_index_stat;
ERROR:  pgvecto.rs: The index is not existing in the background worker.
ADVICE: Drop or rebuild the index.

Implement

We need to write a custom WAL manager like this PR.

These operations should be logged:

  • Index create
  • Index edit
  • Index drop
graph LR
A[ambuild<br>main] --> B[XLogBeginInsert<br>main]
B[XLogBeginInsert<br>main] -->|auto| C[Write custom WAL<br>main]
C[Write custom WAL<br>main] -->|publish| D[standby]
E[standby] -->|hook| F[decode & replay<br>standby]
F[decode & replay<br>standby] --> G[rpc.create<br>standby]
Loading

ref:

Limitation

  • Write-ahead log (WAL) is supported only on Postgres 15+ due to the fact that custom WAL managers were introduced in Postgres 15

@cutecutecat
Copy link
Member

cutecutecat commented Sep 3, 2024

Citus

🎉Citus is basically compatible with PGVecto.rs at the verification.

graph LR
1[items] --> 2[items_102016]
1[items] --> 3[items_102017]
1[items] --> 4[items_102018]
1[items] --> 5[items_102019]
2[items_102016] --> 6[items_embedding_idx_102016]
3[items_102017] --> 7[items_embedding_idx_102017]
4[items_102018] --> 8[items_embedding_idx_102018]
5[items_102019] --> 9[items_embedding_idx_102019]
6[items_embedding_idx_102016] --> 10[items_embedding_idx]
7[items_embedding_idx_102017] --> 10[items_embedding_idx]
8[items_embedding_idx_102018] --> 10[items_embedding_idx]
9[items_embedding_idx_102019] --> 10[items_embedding_idx]
Loading

Checklist

  • Index create and query of vector are succeeded
  • Indexes are distributed to all subtables(EXPLAIN query plan)
postgres=# EXPLAIN SELECT id FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 1;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=500.00..500.00 rows=1 width=12)
   ->  Sort  (cost=500.00..750.00 rows=100000 width=12)
         Sort Key: remote_scan.worker_column_2
         ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=12)
               Task Count: 4
               Tasks Shown: All
               ->  Task
                     Node: host=localhost port=5432 dbname=postgres
                     ->  Limit  (cost=0.00..0.02 rows=1 width=12)
                           ->  Index Scan using items_embedding_idx_102016 on items_102016 items  (cost=0.00..502.74 rows=23979 width=12)
                                 Order By: (embedding <-> '[1, 2, 3]'::vector)
               ->  Task
                     Node: host=localhost port=5432 dbname=postgres
                     ->  Limit  (cost=0.00..0.02 rows=1 width=12)
                           ->  Index Scan using items_embedding_idx_102017 on items_102017 items  (cost=0.00..546.92 rows=26074 width=12)
                                 Order By: (embedding <-> '[1, 2, 3]'::vector)
               ->  Task
                     Node: host=localhost port=5432 dbname=postgres
                     ->  Limit  (cost=0.00..0.02 rows=1 width=12)
                           ->  Index Scan using items_embedding_idx_102018 on items_102018 items  (cost=0.00..462.53 rows=22042 width=12)
                                 Order By: (embedding <-> '[1, 2, 3]'::vector)
               ->  Task
                     Node: host=localhost port=5432 dbname=postgres
                     ->  Limit  (cost=0.00..0.02 rows=1 width=12)
                           ->  Index Scan using items_embedding_idx_102019 on items_102019 items  (cost=0.00..584.81 rows=27905 width=12)
                                 Order By: (embedding <-> '[1, 2, 3]'::vector)
(26 rows)
  • Indexes correspond with vectors of each subtable
root@7b2d6e048a03:/# cat /var/lib/postgresql/data/pg_vectors/startup/0 ; echo
{"indexes":[{"tenant_id":0,"cluster_id":7410314701760323622,"database_id":5,"index_id":17744},{"tenant_id":0,"cluster_id":7410314701760323622,"database_id":5,"index_id":17746},{"tenant_id":0,"cluster_id":7410314701760323622,"database_id":5,"index_id":17745},{"tenant_id":0,"cluster_id":7410314701760323622,"database_id":5,"index_id":17742},{"tenant_id":0,"cluster_id":7410314701760323622,"database_id":5,"index_id":17743}]}

root@7b2d6e048a03:/# cat /var/lib/postgresql/data/pg_vectors/indexes/0000000000000000000000000000000066d6b9f5b04760260000000500004552/sealed_segments/1/storage/len ; echo
27899

root@7b2d6e048a03:/# cat /var/lib/postgresql/data/pg_vectors/indexes/0000000000000000000000000000000066d6b9f5b04760260000000500004551/sealed_segments/1/storage/len ; echo
22061
  • No loss in vector query accuracy
    recall = 0.8504 at glove+cos+hnsw regardless of citus on/off

  • Citus feature: single node

  • Citus feature: multi node

postgres=# EXPLAIN SELECT id FROM items ORDER BY embedding <=> '[1,1,1]' LIMIT 100;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3821.93..3822.18 rows=100 width=12)
   ->  Sort  (cost=3821.93..4071.93 rows=100000 width=12)
         Sort Key: remote_scan.worker_column_2
         ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=12)
               Task Count: 4
               Tasks Shown: All
               ->  Task
                     Node: host=172.18.0.1 port=5430 dbname=postgres
                     ->  Limit  (cost=0.00..2.10 rows=100 width=12)
                           ->  Index Scan using items_embedding_idx_102028 on items_102028 items  (cost=0.00..501.98 rows=23918 width=12)
                                 Order By: (embedding <=> '[1, 1, 1]'::vector)
               ->  Task
                     Node: host=172.18.0.1 port=5431 dbname=postgres
                     ->  Limit  (cost=0.00..2.10 rows=100 width=12)
                           ->  Index Scan using items_embedding_idx_102029 on items_102029 items  (cost=0.00..545.11 rows=26009 width=12)
                                 Order By: (embedding <=> '[1, 1, 1]'::vector)
               ->  Task
                     Node: host=172.18.0.1 port=5430 dbname=postgres
                     ->  Limit  (cost=0.00..2.10 rows=100 width=12)
                           ->  Index Scan using items_embedding_idx_102030 on items_102030 items  (cost=0.00..464.84 rows=22147 width=12)
                                 Order By: (embedding <=> '[1, 1, 1]'::vector)
               ->  Task
                     Node: host=172.18.0.1 port=5431 dbname=postgres
                     ->  Limit  (cost=0.00..2.10 rows=100 width=12)
                           ->  Index Scan using items_embedding_idx_102031 on items_102031 items  (cost=0.00..585.08 rows=27926 width=12)
                                 Order By: (embedding <=> '[1, 1, 1]'::vector)
(26 rows)

Limitation

  • The result of SELECT * FROM pg_vector_index_stat; is not correct
 tablerelid | indexrelid | tablename |      indexname      | idx_status | idx_indexing | idx_tuples | idx_sealed | idx_growing | idx_write | idx_size |                                                                  idx_options                                                                   
------------+------------+-----------+---------------------+------------+--------------+------------+------------+-------------+-----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------
      17652 |      17689 | items     | items_embedding_idx | NORMAL     | f            |          0 | {}         | {}          |         0 |        0 | {"vector":{"dimensions":3,"vector":"Vecf32","distance":"L2"},"indexing":{"hnsw":{"m":12,"ef_construction":300,"quantization":{"trivial":{}}}}}
(1 row)

Without citus, it shoule be:

 tablerelid | indexrelid | tablename |      indexname      | idx_status | idx_indexing | idx_tuples | idx_sealed | idx_growing | idx_write | idx_size |                                                                  idx_options                                                                   
------------+------------+-----------+---------------------+------------+--------------+------------+------------+-------------+-----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------
      17695 |      17703 | items     | items_embedding_idx | NORMAL     | f            |     100000 | {100000}   | {}          |         0 | 22909429 | {"vector":{"dimensions":3,"vector":"Vecf32","distance":"L2"},"indexing":{"hnsw":{"m":12,"ef_construction":300,"quantization":{"trivial":{}}}}}
(1 row)

@gaocegege
Copy link
Member

Thanks for your research!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants