Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major refactor: extract datamodel package. #228

Merged
merged 8 commits into from
Aug 19, 2021
Merged

Major refactor: extract datamodel package. #228

merged 8 commits into from
Aug 19, 2021

Conversation

warpfork
Copy link
Collaborator

@warpfork warpfork commented Aug 18, 2021

Goal: make the root package of this repo into a place where we can put
helper methods and powerful features (which depend on other packages!),
rather than having it hamstrung by being the very bottom of the package
dependency graph, and thus forcing all the interesting features and
useful synthesis functions into difficult-to-discover subpackages.

This in turn should make a lot of UX improvements possible.

Also, extracting a bunch of the key interfaces to a package called datamodel has a result that just frankly reads well. I was surprised how instantly I liked it. Probably a good sign.


First of all, I want to apologize for the monster diff.

Second of all, I want you to know, reader, that all this came from the desire for a ToString method.

Yes, I wanted to printf-debug. And yes... it lead me to do... this.

I jest. But also tell the truth. But also, honestly think this has become a critically important refactor
in order to unblock new features, unblock fixing some nasty UX bugs, radically increase usability and discoverability,
and ultimately, actually make the whole system easier to gracefully evolve with fewer breaking changes in the long run.

How did I get here? How will this really cause so many victories? Read on.


ToString

I'm not kidding about the ToString thing. This example will actually be instructive.

I was toodling along trying to write some code using this library.
As is not uncommon, I wanted to "printf" some data, because in the heat of hammering out new code,
I just wanted to see what the data was doing.

Ow. This is actually not a feature we have a good function for right now.
One of the best things we can do is probably pipe the data in question into a json codec, and print that (which totally works, but... you know, uufdah).
And I actually wanted more than that: I was working with schema'd data, and the number-one, most key thing I wanted in my printf debug was... the type info.

I think to myself: okay, this is not the first time I've had this want.
It's certainly not going to be the last time.
Other people using IPLD and these libraries are going to want this too.
And when we do write a function for this, it really ought to be darned easy to discover.

So I start trying to write that function.

I think maybe I'll try to make it a package-scope function, and put it in the root package.
This seems right, right? It should be applicable on any Node (we want it to be able to do a debug render of anything, not have to hope node implementations provide a good printf implementation of themselves), so that means it's a function, not a method.
And putting it in the root package seems desirable, because, well, discoverability.

And then I stub my toes, repeatedly, instantly.

  • I wanted to check if the node implements schema.TypedNode, so I can simply print its type name.
    • D'oh. Right. Can't: the root ipld package can't import schema because A) it's not supposed to, for logical scope reasons, and B) it'd be an import cycle.
    • Okay, well, skip that... It's already a primary objective failure, but whatever...
  • Can I check if it's a representation node for some typed data?
    • D'oh. No. Same reasons.
    • Okay, forget everything about the schema layer.
  • Some of this data, it's got strings. I'm going to need to quote it.
    • Okay, cool, we've got functions for that already. Say, in the json package.
    • D'oh. No. Can't: the root ipld package can't import a codec. Cycles! Also, good lord, the bloat. Having Node depend on json is a critical fail; absolutely not.

Okay shoot. So, the package graph here has rapidly squeezed me into:

  • Not being able to provide the features I want, period;
  • and making me probably re-implement some fairly subtle stuff;
  • not would re-implementing stuff be a pain and a probable bug-farm, it'd make us ironically fail at having a nicely trimmed package tree size anyway;
  • and dang, I am just not happy about it.

If this was the only time this had happened, I'd say maybe "meh" and let's hack out a kludge for this case, and move on.
It is not the only time this has happened. We have repeatedly hit toe-destroyers like this when wanting to provide basic features.
We sometimes dodge it by adding the feature in another deeper package somewhere, to break the cycle problems,
but even then, it's often a weakened solution, and other users of the library don't discover it quickly, either, because it's tucked away somewhere deep.

So what happens if we slide all the key interfaces currently in the root package over to a new package, for datamodel...
and then add features like ToString to the now-vacated root package, and let them go wild importing datamodel, and schema, maybe even some string escaping utils from json for the joy of handy debugging tools?
Does that work? Does it help?

It does, and it does.


Another example feature: Copy

I'd like to offer a Copy operation. (Yes, nevermind that most of this library is oriented around immutability, so that sounds pointless. Somehow, it still comes up. In particular, when you want to copy some datamodel content into a different node implementation, for memory layout reasons. But we digress; that's unnecessary detail.)

This doesn't hit exactly the same problem that ToString did -- it doesn't end up wanting to refer to any features beyond essential data model stuff.
It hits on another problem instead: more of a semantic confusion one.

  • I want to offer a Copy operation that does feature detection: if it sees a node has some fastpath for this, and use it if so, and fall back to a data-model copy walk if not. (The fastpath here is basically NodeAssembler.AssignNode already, but nevermind that.)
  • I also want to offer a Copy operation that's just the fallback: it only does the dumb walk-and-copy, and never uses AssignNode. Why? Because there's a lot of AssignNode implementations out there that have to do this fallback. (It's in codegen outputs a lot.) We should have that function somewhere reusable.

So this isn't a hard blocker. We could just name them different things.

But what do we call them?
And what are the odds of a user making a mistake, if they're right next to each other?
And how bad are the consequences of that mistake? (Spoiler: kinda bad. You probably get an infinite recursion bug if you pick the wrong one as your util to use inside your AssignNode implementation.)

Now imagine how we need Copy(dest NodePrototype, src Node) (Node, error) and CopyIntoAssembler(dest NodeAssembler, src Node) error and other variants like that,
all adding more clutter to the namespace too, for the sake of offering all the needed performance-conscious vs convenience options. (Not imaginary: we need these.)
Now what do you suppose are the odds of a user either making a mistake, or becoming just plain exhausted, trying to pick the "right" function?

Splitting out the datamodel package offered a good solution to this too.
The dumb walk-and-copy Copy function goes in the datamodel package;
the smart feature-detecty Copy function goes in the root ipld package, where we'll be adding all our other new smart feature-detecty functions.
Neat. Solved. Clear. Consistent.

(More notes coming about the feature-detecty stuff coming up in a few sections, too.)


Making things discoverable

People frequently ask for "helper" functions, or facades, a something do just "quick do x".

This usually goes the same way as the ToString story above:

  • I have to say "no", because something-something dependency cycle.
  • Or I have to say "no", because something-something appropriate transitive dependency tree sprawl.
  • Or I have to say "no", because node implementors would have to do too much work to support that, and it makes combinatoric explosions of methods, yada-yada.
  • Or at best, I get to say "sure, but it has to go in a package off to one side over here". And then we do it, but a week later, a new user probably asks me how to "quick do x" again anyway, because they didn't find that new helper method we just wrote, because it's tucked off in a corner somewhere, just to sate the package import graph gods. Sigh!

So let's stop having that problem.

The root of all of these problems is because we put the ipld.Node interface in our root package, and because everything depends on it, it's a huge albatross for putting anything else in the root package.

If we just move that to a subpackage -- here, now to the datamodel package -- everything goes peachy.

Now we can add whatever facades and helpers we want into the root package, and we can assume people will find them.
Imported the root package will have a big transitive dependency sprawl -- but we can advise people to use the specific subpackages if they care.
And also, we can pull all this off with minimal breakage to existing users, because we can spam aliases.

This seems like a win-win-win to me.


Evolving

Now about that biggest claim: that ultimately, in the long run, this is actually going to make
the whole system become easier to evolve, with fewer breaking changes.

Yep.

Okay, the magic for that isn't done in this commit. But I'm clearing the way for it.

The future is feature detection. The ToString function (again, key: function, not method) is an example of what I'd like to do more of.

Right now, we lean way too heavily on the Node interface.
It's a fairly monolithic interface.
It's a pain in the butt to implement all its methods, if you want to make a new one.
(And while that's supposed to be rare(ish), it certainly happens. Also: holy smokes, it just makes codegen emit huge hulks of stubs.)
But most problematically, we basically can't ever change it again at this point.
If we add anything, or change anything, it's a huge ecosystem-breaking change.
Anyone who's made an ADL? Broken. Anyone who's used codegen? Broken.
There's enough Node implementations in the wild already that we're completely ossified at this point.

(Fortunately, we seem to have hit pretty close to the mark in designing the Node interface, before ossification set in.
Still: nothing's perfect; and yes, we do still want to add features sometimes.)

So what should we do about this? Feature detection.

We will want to start writing more operations as package-scope functions, which take their subject Node as a parameter.
Inside those functions, we can do interface casts to see if the Node supports the operation, or, if it has any fastpaths for that operation (ooh!),
or, maybe it's something we can fall back to synthesizing out of one of the Node's other supported operations.

This will rock in several ways:

  • We can confidently and easily add new features. They're package scope functions. No work required from downstreams (aka, the Node interface doesn't grow in ways that break things, and the work for a new Node implementer isn't growing either).
  • We can use more and more feature detection to support "fastpaths". For example: we could add a KeysIterator function, and it can peek at the Node, and see if it has explicit support for that feature; if it does, we use that; if it doesn't, we use the general MapIterator, and it's probably slower (it might be allocating data we're just throwing away), but it'll work.
  • We can actually shrink the work required by Node implementations. You know how there's four LookupBy* methods on Node right now, for various performance reasons? What if we could offer all those, but also, if you were implementing node, you could just bother with one, and let our operations do feature-detection and synthesize the rest? We can do that now. We might actually be able to take methods out of the Node contract in the future.

I think this pattern of feature detecting operation functions will unlock a ton of usability improvements in short order.
And while we didn't technically have to reorganize the package graph to unlock this... honestly, I think it really helps.
I'm not sure I would've been smart enough to see this outcome without having embarked on the reorganization, and I suspect future users would've also had just as many discoverability hurdles. Fixing that is worth a lot.


Frankly, it just reads nice.

I was really happy reading datamodel.Node and datamodel.Kind all over the place.

It's what we say when we're describing these things to other humans in conversation, and often even in written prose.
It's good that the code now says the same things.


Some other targets-of-opportunity

errors

I moved a few error types around slightly during this work.
A couple things moved from datamodel to schema, because 100% of their references were over there.
A couple things disappeared, because it turned out there weren't even being used.
Oh, and ErrHashMismatch moved to linking, because that's the only place that used it.
These just happened to be especially easy to notice since I was already fixing package graphs.

I don't think any of these changes will be particularly disruptive;
frankly, our error types haven't been consistent enough yet that I think anyone in their right mind could be relying on them.

linking

I moved the LinkSystem type to the linking package.
This just seems like common sense.
I think a lot of package imports at the tops of files started immediately making sense when I did this.

There's still an alias to LinkSystem in the root package, so this shouldn't be disruptive.

codecs

I finally got rid of those dang Marshal and Unmarshal functions that exposed refmt API types from the codec package.

Turned out only one or two tests were using them, and they weren't that hard to update to use the modern Encode and Decode functions.
I don't think anyone else outside the repo has been using these for a good while already, so this shouldn't be disruptive.

basicnode

I moved node/basic to node/basicnode. Same naming convention as bindnode users. Means last segment of import path and actual package name are the same, which is a happier thing.

This is probably actually a pretty huge breakage from the POV of many downstreams.
I should either revert this bit, or we should add a whole facade package.
(Or we could decide to bite the bullet and power through it. But I don't see that as a likely choice.)


Phew

That's it, phew.

This proposal may seem a bit sudden. It is. I was going to do a spike to prototype it, and then ended up thinking "what the heck, let's just Do The Thing and see if it works" (frankly expecting to fail), and lo and behold, it did work. So now here with are with a cool thing, and the choices are: run with it, or, throw it away. (Maintaining it as a branch is not on the table.)

I suspect this might be a bit controversial, just because of the sheer diff size. That said: please don't trust your first instincts from that. Golang's alias features really give us a lot of power to make a change this big, and yet make it smoothly.

And yet, because it's big, I'd also like to try to land it fairly decisively. Almost any other work in this repo will generate merge conflicts with this now.

I'm pretty confident this is a good direction forward. All the tests are green already. Codegen's even updated. Let's go?

README.md Show resolved Hide resolved
codec/api.go Show resolved Hide resolved
codec/api.go Show resolved Hide resolved
codec/dagcbor/unmarshal.go Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Aug 19, 2021

Yeah, that basicnode change is going to hurt, but I like the clarity here, it's nice not having to rename basic and the removal of some __ is always nice.

Back to the top though:

Goal: make the root package of this repo into a place where we can put helper methods and powerful features

Can you justify this goal? Maybe there's some example code you could write, or point to that uses just the root package, or uses it for the majority of work that would make this a worthy goal? I know it's an internal view looking at the diff here but it really seems that the clarity from this change comes from the package separation, but now you also have many of those pieces hoisted up to the root, kind of undoing that clarity. You also have the mix of root vs specific package usage as I pointed out inline which suggests that maybe the distinction isn't even that clear or worthwhile? What would you lose if you removed the root package hoisting entirely? Are they just there now to fill in a blank space left after refactoring?

@warpfork
Copy link
Collaborator Author

The aliasing process is working extremely well.

I've added aliasing and stubs to make the node/basic->node/basicnode rename utterly seamless.

I've also added a few more aliases back into the root package, for error types, and some of the storage function types (namely, BlockWriteCommitter gets referred to by type name, even if the others are usually matched implicitly -- interesting, that), as I've found them needed if we want to make this a zero-impact change.

I'm testing that this is effective by attempting to run go-ipld-adl-hamt against this new version of the library, and it's... it's working. I've actually got go-ipld-adl-hamt tests passing with this refactored library, without any changes, and that's even before re-running its codegen(!).

Aliases are very very cool.

I think I might actually be able to land this, and tag it as a v0.12.0, and say "there are no breaking changes in this release".

@warpfork
Copy link
Collaborator Author

warpfork commented Aug 19, 2021

Goal: make the root package of this repo into a place where we can put helper methods and powerful features

Can you justify this goal?

No. :)

I'm just stating that as pretty much axiom.

(Oh, okay, fine...)

I've seen a lot of cases lately where someone asks for a feature that we already have, it's just hidden in a sub-package, and the discovery process just... doesn't happen, or gets lost in the maze. Or, that someone mistakes an internal API for one they should be using directly. Or someone want to add something to a very core API, but really what they're wanting is a convenience function, and it's not obvious that putting that in the most obvious place would be bad 🙃 I want to reduce the event frequency of all three of those things to near zero. I think "put the functions people want to use, and should see first, in the root package" is the way to do it.

but now you also have many of those pieces hoisted up to the root, kind of undoing that clarity.

True, for some of these values, it will be temporary.

For example: I'm going to be including aliases to a bunch of error types in this PR. They will be transitional. I think I may remove those in a near future release. (As you say, having something like those symbols able to be referred to as more than one name... seems actively bad for clarity.)

In some cases I think the alias can stay, and it's good docs. Node and LinkSystem are probably going to be examples of this. (We can debate these individually, later, too. So even getting that conversation enabled is kind of a win.)

Are [some of those aliases] just there now to fill in a blank space left after refactoring?

So, yes :)

I haven't filled in anything new in the root package yet. I'm going to hold off on that until proving this can fully land. But I want to, very very soon. I have a bunch of uncommitted work already that wants to go into this new space.

Goal: make the root package of this repo into a place where we can put
helper methods and powerful features (which depend on other packages!),
rather than having it hamstrung by being the very bottom of the package
dependency graph, and thus forcing all the interesting features and
useful synthesis functions into difficult-to-discover subpackages.

Also, the result just frankly reads well.
Evidently, it fits there without issue.

(The name stutter here is nonzero, but just calling it "Context" alone
seems... worse, given how much meaning that word has in golang.)
Many of these will be transitional only, I think,
but we'll see what the appropriate timescale is for that.
It may be fine to leave them for quite a while.

With these aliases, I've been able to substitute this branch in
to... approximately all of our downstreams I know about,
and all their tests still pass.  That includes packages that
have codegen outputs committed already (!).  That includes go-ipfs,
and thus includes all of the transitives between us and it.

I think that means we're truly able to claim this will be a
"no breaking changes" scenario when we merge this and
it next comes time to tag a release.

(Golang's alias feature is kind of incredible.  I'm really happy.)
We can do all our work without it, necessarily.  Let's do so.

This could have happened opportunistically, as it was forced, too --
in fact, that's what the prior commits were all doing --
and we'd just expect that to happen more and more as we add new
features and helper methods and glue into the root package.

However, it's just as easy (in fact, probably easier) to do it
proactively, all at once, right now.
@warpfork
Copy link
Collaborator Author

I chased the remaining uses of the root package within our repo down. You were right @rvagg , it instantly made things better.

Old graph:
graph-bad

New graph:
graph-good

The new one is definitely smoother.

These are based on manual review of the change in the godoc's list
of exported symbols.
(I haven't actually seen them used anywhere in downstreams in a way
that their omission is a breaking change, but there's always a chance.)

DeepEqual actually completely belongs here.

ADL... I'm dubious on.  I don't really think that should be in the
datamodel package, either, actually.  But let's roll with it.

ErrHashMismatch... well, all those other errors already coming along
for "a transitionary period", so we might as well apply that pattern
consistently, and include this one too.
Also, fix up its docs.  Phew, those had old and badly misleading links.
@warpfork
Copy link
Collaborator Author

I've started other work further downstream of this which is now starting to add new helper methods and easy-use stuff to the root package, and I'm still really happy with this. (Using those, in turn, in further downstream projects, has already taken a bunch of stuff that was previous five lines of annoying glue, and turned it into one-liners that read something like ipld.Unmarshal(reader, json.Decode, &thingy, typeSystem.TypeByName("Thingy")), etc, and its... yes. Yes this is how I want to feel when I use these things.)

And I'm also still confident this is going to not be a breaking change for downstreams. (Almost unbelievably.)

Gonna merge and keep rolling forward with this, and adding the new facades of my dreams. 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants