Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ndraws argument to tidy_draws #144

Open
2 of 3 tasks
mjskay opened this issue Dec 21, 2018 · 19 comments
Open
2 of 3 tasks

Add ndraws argument to tidy_draws #144

mjskay opened this issue Dec 21, 2018 · 19 comments

Comments

@mjskay
Copy link
Owner

mjskay commented Dec 21, 2018

Add ndraws, seed to:

  • tidy_draws (a bit of a pain because for it to be useful I would want an implementation specific to each model type; without optimizing it it isn't worth it since you can just do tidy_draws %>% sample_draws() anyway).
  • spread_draws
  • gather_draws
@mjskay
Copy link
Owner Author

mjskay commented Dec 25, 2018

or maybe simpler is to create a function for this (sample_draws and/or thin_draws)

@mjskay
Copy link
Owner Author

mjskay commented Mar 9, 2019

Closing this by deferring to the addition of sample_draws as the right approach. Now that I've implemented it, it seems clear that the add-a-function-to-pipeline approach is right one here. That is much more in keeping with modularity/composability, and prevents the introduction of other annoying API issues in the future.

@mjskay mjskay closed this as completed Mar 9, 2019
@mikedecr
Copy link

Hi Matthew, it looks like I'm chasing down something similar to what you were considering here, but I'm a little confused by your notes about how best to build what I want.

What I was trying to do: I have a Stan model that I want to (ideally) do something like spread_draws(stanfit_object, ..., n = 100) to make the process of spreading less time-consuming (since it's a big model).

It looks like you have already implemented sample_draws but it takes an already-tidy data frame of samples as data. Judging from other investigation of the codebase and from your notes here, is it correct that if I want to do this, I should build my own pipeline of that goes tidy_draws, and then sampling however many draws I want, and then spread_draws? I had this initial hunch that tidying the whole model would be expensive, but reading the src for these functions makes it look like spread_draws calls tidy_draws anyway and so it isn't any more expensive than if there were an n argument in spread_draws? Am I understanding this right?

Happy to clarify if I'm not making sense. Thanks!

@mjskay
Copy link
Owner Author

mjskay commented Sep 25, 2019

Yeah, good question. I would reconsider adding an n argument to spread_draws for exactly this kind of situation --- doing it later in the pipeline could be more costly for large models. If tidy_draws accepted an n argument it could do the subset before the tidying and this might make spread_draws faster.

This also goes along with #82, which would allow you to easily implement the equivalent of tidy_draws however you see fit (including using a subset of the draws) and then use spread_draws on the resulting data frame directly. In the meantime a hackish solution would be to give your own implementation of tidy_draws.data.frame() that just returns the data frame you pass it. That would be enough to "solve" #82 (without having additional checks for correctness one would want in a full solution). Does that make sense?

There's a (much) longer term idea here in having the tidybayes functions do some kind of internal query optimization so users don't have to deal with these things, but I only have so much time, so exposing an n argument might be the better short term solution :). That being the case I'll reopen this.

@mjskay mjskay reopened this Sep 25, 2019
@mikedecr
Copy link

I'm a little unclear on the original impetus for #82 so let me see if I'm understanding. Since query optimization is (rightly) off the table, a next-best place to intervene on the pipeline to spread only a sample of draws (in the near-term) would be to pass a trimmed down MCMC matrix to tidy_draws, but that won't work as currently written because tidy_draws wants to check metadata about the original model that would clash with a trimmed-down MCMC matrix? So one could just write a home-use version of it that does less checking against the original model object?

@mjskay
Copy link
Owner Author

mjskay commented Sep 27, 2019

Right, so if you wrote a really simple function like this:

tidy_draws.data.frame = function(model) model

That would allow you to pass arbitrary data frames directly into spread_draws. However, it won't verify that the data frame is in the correct format (which a good implementation would actually do), and if it isn't, spread_draws() will almost definitely choke somewhere. But if you constructed your own tidy data frame of draws that had a .chain, .iteration, and .draw column where each row is one draw (i.e. no draws are spread over multiple rows) and every other column was a parameter from the model, I think you should then be able to pass that tidy data frame directly into spread_draws() as long as you've defined the above function. Note: I haven't tested this.

The plan for #82 is to basically just write a better version of the above function that (at the very least) verifies the appropriate columns are present and of the right types. I might not do any other verification unless I can keep it fast.

@mjskay
Copy link
Owner Author

mjskay commented Sep 27, 2019

Update: I verified that the above definition will allow you to pass tidy data frames of draws directly into spread_draws() (again with the caveat that it can break if you don't make sure the input format is correct).

@mjskay
Copy link
Owner Author

mjskay commented Sep 27, 2019

Update: I've added an implementation of tidy_draws() for data frames (so you don't have to do it yourself) and (probably more useful) I've added n and seed arguments to spread_draws() and gather_draws(). Let me know if that gives the speedup you need.

If that doesn't work, I'll probably need to update tidy_draws() to support n and seed, which will be a bit more work because it will need model-type-specific implementations to really be useful.

@mikedecr
Copy link

Neat! This definitely makes the interface easier. Unfortunately though I am not experiencing much speed-up now since I guess a lot of the bottleneck is probably happening during tidy_draws, which appears to be tidying all of the MCMC draws still. I wish I knew more of the underlying OOP going on here with the original tidy_draws so I could be more helpful on the dev side. In the meantime I'll put Advanced R back on the to-do list.

@mjskay
Copy link
Owner Author

mjskay commented Sep 27, 2019

Sounds good! :)

I'm curious how slow just calling tidy_draws() directly is for you compared to calling spread_draws() on (say) one parameter? That would help me know if the slowdown is in tidy_draws() or spread_draws(). Depending on the model and the complexity of the spec in spread_draws() it could actually be due to either function.

@mjskay
Copy link
Owner Author

mjskay commented Sep 30, 2019

Incidentally, I just rewrote some pieces of the internals of spread_draws(), and it should be much faster on typical specs now than it used to be. @mikedecr if you have a chance to try the latest github version (via devtools::install_github("mjskay/tidybayes") I'd love to know if that helped at all.

Otherwise, if you let me know what sort of spread_draws() spec you are writing (even just the arguments you are passing so I have an idea of the dimensions of the parameters involved) it would make it so that I could profile the code and better identify where the slowdown is.

@mikedecr
Copy link

mikedecr commented Oct 1, 2019

This sounds really interesting! I can test drive this tomorrow and give you an example of the kind of thing I'm doing. Thanks!

@mikedecr
Copy link

mikedecr commented Oct 2, 2019

ok sorry for the delay. Here's some info about what's going on. I have an IRT-type model 7,500 draws total of just over about 9,000 parameters/transformed params/generated q's in total.

  • tidy_draws on the whole model takes about 25 seconds.
  • Spreading just the item parameters on the full model (for instance, spread_draws(stanfit, cutpoint[item], dispersion[item]) takes about 25 seconds also.
  • Spreading the item parameters with the argument n = 1 takes about 22 seconds. The 3-second decline is pretty consistent after trying it a few different times.

Curious to know what you glean out of this!

@mjskay
Copy link
Owner Author

mjskay commented Oct 2, 2019

Ah great, that's really helpful! The fact that spread_draws and tidy_draws take a similar amount of time (and this is independent of sample size) tells me the issue is likely due to an operation whose time is proportional to the number of parameters and not the number of samples, which is really useful to know. Also, given the model structure you've described I should be able to put together a benchmark case that I can use to replicate the slowdown, which I can then use to profile the code and see where specifically the problem is, then fix it.

Thanks so much for getting back to me on this, this is really helpful information!

@mjskay
Copy link
Owner Author

mjskay commented Oct 2, 2019

Followup question: how many unique values of item are there here: spread_draws(stanfit, cutpoint[item], dispersion[item])?

And secondary followup question: how fast is spread_draws(stanfit, cbind(cutpoint, dispersion)[item])?

@mikedecr
Copy link

mikedecr commented Oct 2, 2019

Sorry I meant to specify: 40 items. Using cbind doesn't really make a difference for me here, as far as I can tell.

@mjskay
Copy link
Owner Author

mjskay commented Oct 3, 2019

Followup: I think I've narrowed this problem to tidy_draws(), but if you have time for a quick check on your end it might help me be sure (and could give you an okay workaround in the meantime). Can you try:

  1. Install dev version of tidybayes:

    devtools::install_github("mjskay/tidybayes")
  2. In a fresh R session, extract the draws first:

    draws = tidy_draws(stanfit)
  3. Then spread using the extracted draws, not the original stanfit object:

    spread_draws(draws, cutpoint[item], dispersion[item])

Based on my testing, I would expect (2) to be slow but (3) not to be so long as you have tidied the draws first, as above. Is that the case for you?

The problem is that the code for converting samples from the format used by the model into a tidy data frame is slow. The dev version of tidybayes is able to avoid doing that if you've already done the conversion (as in step 2), so if you do it once up front you shouldn't have to do it again in any subsequent spread_draws calls. Thus the subsequent calls will be fast.

Fortunately I should be able to make step (2) fast as well, so eventually this all shouldn't matter, but I wanted to make sure I've correctly found the problem.

@mikedecr
Copy link

Belatedly, yes I can confirm that spreading is waaay faster with the tidy draws saved ahead of time at this, even when I take a large number of samples. If the tidying step is something I should have been doing this whole time even before these changes, I apologize for the oversight on my part! It seems like it's a game changer going forward though, so thanks a lot!

@mjskay
Copy link
Owner Author

mjskay commented Oct 10, 2019

Belatedly, yes I can confirm that spreading is waaay faster with the tidy draws saved ahead of time at this, even when I take a large number of samples.

Great! Thanks for confirming this.

If the tidying step is something I should have been doing this whole time even before these changes, I apologize for the oversight on my part!

Not at all --- in fact until I did some optimization based on your feedback in this issue it would not have worked at all, so thank you for raising the problem!

@mjskay mjskay changed the title Add n argument to tidy_draws Add ndraws argument to tidy_draws Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants