Add ndraws argument to tidy_draws #144
or maybe simpler is to create a function for this ( |
Closing this by deferring to the addition of |
Hi Matthew, it looks like I'm chasing down something similar to what you were considering here, but I'm a little confused by your notes about how best to build what I want. What I was trying to do: I have a Stan model that I want to (ideally) do something like It looks like you have already implemented Happy to clarify if I'm not making sense. Thanks! |
Yeah, good question. I would reconsider adding an This also goes along with #82, which would allow you to easily implement the equivalent of There's a (much) longer term idea here in having the tidybayes functions do some kind of internal query optimization so users don't have to deal with these things, but I only have so much time, so exposing an |
I'm a little unclear on the original impetus for #82 so let me see if I'm understanding. Since query optimization is (rightly) off the table, a next-best place to intervene on the pipeline to spread only a sample of draws (in the near-term) would be to pass a trimmed down MCMC matrix to |
Right, so if you wrote a really simple function like this:
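A minimal sketch of such a pass-through function, assuming `tidy_draws()` is an S3 generic that takes the fit as its first argument. The method body is an assumption for illustration, not necessarily the exact definition used in tidybayes, and recent tidybayes releases ship their own data frame method:

```r
library(tidybayes)

# Assumed pass-through method: a data frame is treated as already-tidy draws
# and returned unchanged (no checks on columns or types).
tidy_draws.data.frame <- function(model, ...) model

# Hypothetical toy data frame in the one-row-per-draw layout
draws <- data.frame(
  .chain = 1L,
  .iteration = 1:3,
  .draw = 1:3,
  mu = c(0.1, 0.2, 0.3)
)

# With the method above defined in the calling environment,
# tidy_draws() hands the data frame straight back.
result <- tidy_draws(draws)
```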
That would allow you to pass arbitrary data frames directly into The plan for #82 is to basically just write a better version of the above function that (at the very least) verifies the appropriate columns are present and of the right types. I might not do any other verification unless I can keep it fast. |
Update: I verified that the above definition will allow you to pass tidy data frames of draws directly into |
Update: I've added an implementation of If that doesn't work, I'll probably need to update |
Neat! This definitely makes the interface easier. Unfortunately though I am not experiencing much speed-up now since I guess a lot of the bottleneck is probably happening during |
Sounds good! :) I'm curious how slow just calling |
Incidentally, I just rewrote some pieces of the internals of Otherwise, if you let me know what sort of |
This sounds really interesting! I can test drive this tomorrow and give you an example of the kind of thing I'm doing. Thanks! |
OK, sorry for the delay. Here's some info about what's going on. I have an IRT-type model: 7,500 draws total of just over 9,000 parameters/transformed parameters/generated quantities.
Curious to know what you glean out of this! |
Ah great, that's really helpful! The fact that spread_draws and tidy_draws take a similar amount of time (and this is independent of sample size) tells me the issue is likely due to an operation whose time is proportional to the number of parameters and not the number of samples, which is really useful to know. Also, given the model structure you've described I should be able to put together a benchmark case that I can use to replicate the slowdown, which I can then use to profile the code and see where specifically the problem is, then fix it. Thanks so much for getting back to me on this, this is really helpful information! |
Followup question: how many unique values of And secondary followup question: how fast is |
Sorry I meant to specify: 40 items. Using |
Followup: I think I've narrowed this problem to
Based on my testing, I would expect (2) to be slow but (3) not to be so long as you have tidied the draws first, as above. Is that the case for you? The problem is that the code for converting samples from the format used by the model into a tidy data frame is slow. The dev version of tidybayes is able to avoid doing that if you've already done the conversion (as in step 2), so if you do it once up front you shouldn't have to do it again in any subsequent spread_draws calls. Thus the subsequent calls will be fast. Fortunately I should be able to make step (2) fast as well, so eventually this all shouldn't matter, but I wanted to make sure I've correctly found the problem. |
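A rough sketch of the tidy-once-then-reuse workflow described above. The synthetic `draws` tibble and the parameter name `theta` are stand-ins (assumptions) for the output of `tidy_draws()` on a real fit, and the `ndraws` argument name of `sample_draws()` assumes a recent tidybayes release (older versions called it `n`):

```r
library(dplyr)
library(tidybayes)

# Synthetic stand-in for tidied draws: one row per draw, one column per
# parameter. With a real model this would be the slow, one-time step:
#   draws <- tidy_draws(fit)
set.seed(1)
draws <- tibble(
  .chain = 1L,
  .iteration = 1:100,
  .draw = 1:100,
  `theta[1]` = rnorm(100),
  `theta[2]` = rnorm(100)
)

# Later calls reuse the already-tidied draws instead of converting again
theta_long <- draws %>%
  spread_draws(theta[i])

# Optionally thin to a subset of draws before spreading
theta_small <- draws %>%
  sample_draws(ndraws = 10, seed = 123) %>%
  spread_draws(theta[i])
```

Saving the tidied draws once means the expensive conversion is paid a single time, and every subsequent `spread_draws()` call works on the data frame directly.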
Belatedly: yes, I can confirm that spreading is way faster with the tidy draws saved ahead of time, even when I take a large number of samples. If the tidying step is something I should have been doing this whole time, even before these changes, I apologize for the oversight on my part! It seems like a game changer going forward though, so thanks a lot!
Great! Thanks for confirming this.
Not at all. In fact, until I did some optimization based on your feedback in this issue, it would not have worked at all, so thank you for raising the problem!
Add ndraws, seed to: tidy_draws %>% sample_draws() anyway).