-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: bson timestamp type casting #2593
Conversation
statement ok | ||
copy timestamps to '${TMP}/timestamp_test_out.bson'; | ||
|
||
# bson's date format is an int64, of milliseconds since unix epoch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't there the '\x09'
"UTC datetime - The int64 is UTC milliseconds since the Unix epoch."
which isn't part of the internal mongodb spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, exactly, that's what we're using. We just convert all arrow Date
and Timestamp
values into these int64 (millis since epoch), but when we read bson data back into GlareDB all of those values become Date64
; not back to timestamps.
> create table timestamps (t timestamp);
Table created
> insert into timestamps values (arrow_cast(946684860000000, 'Timestamp(Microsecond, None)'));
Inserted 1 row
>
>
> select * from timestamps;
┌─────────────────────┐
│ t │
│ ── │
│ Timestamp<µs, UTC> │
╞═════════════════════╡
│ 2000-01-01T00:01:00 │
└─────────────────────┘
> copy timestamps to 'timestamps.bson';
Copy success
> select * from timestamps;
┌─────────────────────┐
│ t │
│ ── │
│ Timestamp<µs, UTC> │
╞═════════════════════╡
│ 2000-01-01T00:01:00 │
└─────────────────────┘
> select * from 'timestamps.bson';
┌─────────────────────┐
│ t │
│ ── │
│ Date64 │
╞═════════════════════╡
│ 2000-01-01T00:01:00 │
└─────────────────────┘
It works as intended but sqllogictest
renders Date64
as int64
s.
When we add support for declaring a schema rather than inferring it, we'll be able to cast these bson dates to whatever we need (with only the expected loss of fidelity and resolution), and the code mostly supports this today.
which isn't part of the internal mongodb spec
The internal one is \x11
which they call Timestamp
(it's an int32 of seconds-since-epoch, and a int32 increment, but it's used as a logical/monotonic clock, but a sort of weird one to keep operations in the replication log (oplog
) in order.) BSON libraries implement but no one should use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Id like to keep #2590 open as we still technically can't roundtrip those types. Since we aren't notifying the user, or have it documented anywhere that you can't roundtrip these types, I think the issue is still valid.
I understand the desire for precision, but the same issue impacts JSON and CSV export (only in even more cases, as they have fewer types,) and more don't round trip. Parquet, even, has a less sophisticated type system than arrow (and therefore glaredb) and would probably lose some type fidelity in the round trip. We can keep a general issue or epic open for "round trip types" as a placeholder for these kinds of issues, but the solutions are all a bit imperfect, so I doubt we'd ever be able to truly resolve it. Having a placeholder might be useful, or it might just confuse things even more. There's also nothing actionable in the glaredb code base to do for this, about BSON right now: we can't create new types in the upstream format. We could document this, but we should track that in a documentation issue. This would be a good page to have at some point. We could consider implementing or providing a "strict" mode (or more likely various strict-modes), with regards to type conversions where we could propagate errors for any non-roundtrip-able types, but that's a different issue, and would impact most data sources. I think that's a different issue. Being able to declare the schema rather than inferring it from the external type-system, provides us the opportunity to up-cast values, and perhaps change some of the down-casting semantics to support something that more closely approximates round tripping. This is something that we'd like to do and can work on, in the near term, and I think has the greatest effort-to-impact ratio, but it's also a different issue. |
Closes #2590
Closes #2585