[ENH] inspectable set-valued domains for distributions #244

Open
fkiraly opened this issue Apr 8, 2024 · 25 comments
Labels: API design, feature request

Comments

@fkiraly
Collaborator

fkiraly commented Apr 8, 2024

It would be useful if distributions had some degree of inspectability with respect to their domains (sets), use cases include:

  • approximation defaults, e.g., integer domain distributions know their domain, so expectations can be computed as sums
  • typing for distributions, e.g., discrete, continuous, mixed
  • inspecting the support of the discrete (deltas) part for mixed distributions, related to representing continuous and discrete parts separately: [ENH] design discussion - pdf and pmf in distributions, discrete, continuous, and mixed #229
  • generating useful defaults in plotting routines
  • testing distributions; sets may also be useful in specifying parameter ranges in estimators
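A minimal sketch of the first use case, assuming a distribution that knows its domain is the non-negative integers: the expectation can then be approximated by a truncated sum rather than by numerical integration. The names `poisson_pmf` and `expectation_by_sum` are illustrative only and are not part of skpro.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """Probability mass of a Poisson(mu) variable at the integer k >= 0."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def expectation_by_sum(pmf, start=0, tol=1e-12, max_terms=10_000):
    """Approximate E[X] = sum_k k * pmf(k) over the integers k >= start.

    Stops as soon as the accumulated probability mass is within `tol` of 1.
    """
    total, mass = 0.0, 0.0
    for k in range(start, start + max_terms):
        p = pmf(k)
        total += k * p
        mass += p
        if 1.0 - mass < tol:
            break
    return total


# The domain (integers from 0 upwards) tells us where to start summing.
mean = expectation_by_sum(lambda k: poisson_pmf(k, mu=3.0))
```

The stopping rule only works because the domain is known to be countable and ordered; for a continuous domain the same default would have to fall back to integration.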

Some discussion has already taken place here: VascoSch92/sequentium#46
also regarding possible ways to implement this.

Options discussed:

  • using sympy Sets, possibly also stats
  • de-novo implementation, following BaseObject
  • something similar to sklearn.utils parameter checking

Some issues from skpro architecture which may not be obvious how to cover:

  • distributions are "tabular" (matrix distributions with pandas-like row and column index). Domains may vary over entries of the table.
  • we may need parametric sets, though that is not certain. Afaik only BaseObject supports parametric objects? Composites are supported by all three options above.
@fkiraly
Collaborator Author

fkiraly commented Apr 8, 2024

FYI @VascoSch92

@VascoSch92

I will start to work on a first version of a module for symbolic representation of sets.

The idea is to extend from BaseObject.

I still don't fully understand what the application in skpro.distributions would be, but I will try to mimic the API of set6.

I will open a draft PR as soon as I have something interesting. In this way we can discuss the code.

@fkiraly
Collaborator Author

fkiraly commented Apr 13, 2024

Great!

So you think it's better to inherit from BaseObject than using the existing logic in sympy?

If I may ask, what are your pros/cons and weighting? Just curious.

@VascoSch92

Actually I was playing a little bit with the set implementation of sympy and I have to admit that it is pretty nice.

One could use that and extend it to implement the measure of a set and integral computations.

However, adding sympy to the dependencies may be overkill. Do we need so much power for the purpose of the project? Can we just install the specific module which takes care of sets?

Perhaps clarifying the exact API the project needs could lead to a decision. From what I understand, the main purpose of this module is to compute the pdf and pmf for a distribution. To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?

@fkiraly
Collaborator Author

fkiraly commented Apr 13, 2024

To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?

Basically, yes - that's the key requirement.
Also, finite/discrete sets for distributions of arbitrary support, but that's basically already python set.

Yes, the "weighty dependency" argument is convincing. I'd agree it outweighs the "do not reinvent the wheel" one, as it's going to be a small wheel (for now).

@VascoSch92

Why don't we try to solve the problem directly using the integration provided by scipy?

@fkiraly
Collaborator Author

fkiraly commented Apr 15, 2024

I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.

@fkiraly
Collaborator Author

fkiraly commented Apr 18, 2024

Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?

@VascoSch92

I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.

yes but we cannot separate the two parts?

Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?

Yes

@fkiraly
Collaborator Author

fkiraly commented Apr 18, 2024

yes but we cannot separate the two parts?

What do you mean by that? Do you mean this as a suggestion, or a statement?

@VascoSch92

Let's say you have a mixed distribution. Can we separate the two parts (dense and discrete), compute integrals on both and then put them together?

@fkiraly
Collaborator Author

fkiraly commented Apr 19, 2024

yes, that is exactly my thinking. But for that, you'd need to represent pmf and pdf separately. For both, you'd need some representation of domain to set up the integration, which brings us to the topic of this issue.

@VascoSch92

Can you give a concrete example of a mixed distribution you would like to implement?

@fkiraly
Collaborator Author

fkiraly commented Apr 19, 2024

Sure, here are two:

  • clipped normal, i.e., a random variable of form $\max (c, X)$ for a normal random variable $X$ and constant $c$.
  • mixture of empirical and normal, this can occur when applying Mixture to Normal and Empirical.

@VascoSch92

Sorry, but I still have trouble understanding the clipped normal example.

We have the max of two continuous functions, so shouldn't we have a continuous support for the pdf and cdf?

@fkiraly
Collaborator Author

fkiraly commented Apr 22, 2024

We have the max of two continuous functions, so shouldn't we have a continuous support for the pdf and cdf?

Yes, the full support is continuous, but the distribution is mixed, so by the Lebesgue decomposition theorem we can decompose it into a non-trivial absolutely continuous part, a pure point part, and there's no singular part because those aren't distributions that we want to look at (😁)

Taking $c = 0$ and $X$ with mean 0 as the concrete case, the clipped normal therefore has two supports:

  • the absolutely continuous part has support $[0, \infty)$, the measure of the absolutely continuous part on this is 1/2
  • the pure point part has support $\{ 0 \}$, the mass (measure) of the pure point part on 0 is 1/2

Some confusion can be coming from the word "continuous", which is overloaded, as it could be used as a property or qualifier for

  • distributions/measures, as shorthand for "absolutely continuous" (wrt Lebesgue measure on R)
  • distribution defining functions, such as the cdf or pdf. These being continuous bears no relevance here - but due to the overloading it is frequently confused.
  • sets, when a support or domain - sometimes used for Borel open sets or their closure, to differentiate from countable unions of points
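The decomposition can be checked numerically. The following is a sketch under stated assumptions: $c = 0$, $X$ standard normal, and the half-line $[0, \infty)$ truncated at 10 for integration. The helper names (`simpson`, `expectation`, `point_masses`, `cont_support`) are hypothetical and not skpro API; the point is only that each part needs its own explicit domain.

```python
import math


def std_normal_pdf(x: float) -> float:
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3.0


# Pure point part: support {0}, stored as a mapping point -> mass.
point_masses = {0.0: 0.5}
# Absolutely continuous part: density of X restricted to [0, inf),
# truncated at 10 for numerical integration (an assumption of this sketch).
cont_support = (0.0, 10.0)


def expectation(g):
    """E[g(max(0, X))], computed as a sum over atoms plus an integral."""
    discrete = sum(mass * g(x) for x, mass in point_masses.items())
    continuous = simpson(lambda x: g(x) * std_normal_pdf(x), *cont_support)
    return discrete + continuous


continuous_mass = simpson(std_normal_pdf, *cont_support)  # density alone
total_mass = expectation(lambda x: 1.0)                   # both parts
mean = expectation(lambda x: x)
```

Integrating the density alone recovers only half of the probability mass, which is why integration by itself does not cover mixed distributions.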

@VascoSch92

Ah ok, now it is much clearer. Sorry, I'm not an expert in probability theory :-(

In practice, you want to compose the normal distribution and the mass measure at 0 to have the clipped normal. In the instance of the normal distribution you give the support and same for the mass measure, right? from that you can compute what you need.

ok 👍 now it is clear... i can start to try coding something and see if it fit the needs of the package

@fkiraly
Collaborator Author

fkiraly commented Apr 23, 2024

In the instance of the normal distribution you give the support and same for the mass measure, right? from that you can compute what you need.

Yes, exactly.

i can start to try coding something and see if it fit the needs of the package

What's your design, if I may ask?

I'd go with modifying BaseDistribution. You may also be interested in this refactor PR: #265

@VascoSch92

VascoSch92 commented Apr 26, 2024

Basic idea: a parent class Set, which extends BaseObject and is more or less an interface. Then two subclasses: one for intervals (Interval) and one for discrete sets (Discrete (?)), as these are the two kinds of sets we need the most.

Then we have the following questions:

  • do we need union? direct product? intersection?
  • which methods/properties do we need? boundary? mass/measure? interior?

We can expose in the BaseDistribution class the domain of the probability.

I will try to open a draft/sketch PR for feedback and guidance as soon as possible :-)
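To make the proposed class layout concrete, here is a rough sketch. It uses a plain abstract base class as a stand-in for BaseObject so that it runs without extra dependencies; the method name `contains` and the constructor arguments are assumptions, not a settled API.

```python
import math
from abc import ABC, abstractmethod


class Set(ABC):
    """Interface for subsets of the real line (stand-in for a BaseObject child)."""

    @abstractmethod
    def contains(self, x: float) -> bool:
        """Return True if x is an element of the set."""

    def __contains__(self, x: float) -> bool:
        # Allows the idiomatic `x in some_set` syntax.
        return self.contains(x)


class Interval(Set):
    """Interval with endpoints lower <= upper, each endpoint open or closed."""

    def __init__(self, lower, upper, closed_lower=True, closed_upper=True):
        self.lower = lower
        self.upper = upper
        self.closed_lower = closed_lower
        self.closed_upper = closed_upper

    def contains(self, x: float) -> bool:
        above = x >= self.lower if self.closed_lower else x > self.lower
        below = x <= self.upper if self.closed_upper else x < self.upper
        return above and below


class Discrete(Set):
    """Finite set of points, backed by a frozenset."""

    def __init__(self, points):
        self.points = frozenset(points)

    def contains(self, x: float) -> bool:
        return x in self.points


half_line = Interval(0.0, math.inf, closed_upper=False)  # [0, inf)
atom = Discrete({0.0})                                   # {0}
```

Union, intersection, and direct product are deliberately left out here, matching the open questions in the list above.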

@fkiraly
Collaborator Author

fkiraly commented Apr 26, 2024

We can expose in the BaseDistribution class the domain of the probability.

Agreed - we may have to distinguish domains for the discrete and the continuous part as well.

Then we have the following questions:

Regarding requirements:

  • direct product: direct product in the form of array distributions, possibly - let me explain
  • the skpro distributions are "array distributions", we probably need a comparable concept for sets. That makes it unusual, perhaps. E.g., take the Normal or Uniform from the example (Normal.create_test_instance), it has a 2D range. The Normal is supported over the reals, but the Uniform may have different support per entry. Or is this too much for the start, and overdesign?
  • union/intersection: I do not see where they would appear, but perhaps we want to think how to keep the design upwards compatible for this.
  • methods: the measure is given by the probability distribution whose domain is the set, so the set itself may not need to have a measure attached to it. I am thinking where this could be useful, or other things like boundary and interior, but can't see a clear use case.
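A small illustration of the "array distribution" point, using plain dictionaries keyed by (row, column) labels in place of a pandas-like index. The parameter values and the helper `in_support` are made up for this sketch; it only shows that the support of a Uniform array is itself a table of intervals.

```python
# Hypothetical 2x2 array of Uniform(lower, upper) parameters,
# keyed by (row label, column label) to mimic a pandas-like index.
lower = {(0, "a"): 0.0, (0, "b"): -1.0, (1, "a"): 2.0, (1, "b"): 0.5}
upper = {(0, "a"): 1.0, (0, "b"): 1.0, (1, "a"): 5.0, (1, "b"): 0.75}

# The support is itself "tabular": one closed interval per entry.
support = {key: (lower[key], upper[key]) for key in lower}


def in_support(values, support):
    """Entry-wise check of whether each value lies in the support of its entry."""
    return {key: lo <= values[key] <= hi for key, (lo, hi) in support.items()}


# The same value 0.5 is inside some entries' supports and outside others.
values = {key: 0.5 for key in support}
mask = in_support(values, support)
```

For a Normal array the same table would hold the whole real line in every entry, so a single shared set would do; the per-entry table is only needed when the support depends on parameters.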

@VascoSch92

After a very first draft for the module domain (see branch #326), we are ready to start introducing domains for distributions.

We will work on the branch #326 until a stable API is found. After that, we will merge into main.

To find a valid and stable API, domains should be introduced for at least one of the following distributions:

  • discrete distributions with finite support
  • discrete distributions with infinite support
  • absolutely continuous distributions supported on a bounded interval
  • absolutely continuous distributions supported on intervals of length 2π (directional distributions)
  • absolutely continuous distributions supported on semi-infinite intervals ( e.g., [0,∞) )
  • absolutely continuous distributions supported on the whole real line
  • absolutely continuous distributions with variable support
  • mixed discrete/continuous distributions

With domains, we also want to introduce two new methods:

  • pdf - probability density function
  • pmf - probability mass function

Questions:

  1. What is the expected API for pdf and pmf?
  2. Which new _tags should we introduce?
  3. Do we have at least one example already coded for each of the families of distributions listed above?

@fkiraly
Collaborator Author

fkiraly commented May 15, 2024

  1. What is the expected API for pdf and pmf?

The API is already specified - it has been introduced since 2.2.2, after you branched off. If you update from main, you should see the current specs in the docstring, in the BaseDistribution object.

  2. Which new _tags should we introduce?

That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.

  3. Do we have at least one example already coded for each of the families of distributions listed above?

Will reply with a list in the next post.

@fkiraly
Collaborator Author

fkiraly commented May 16, 2024

discrete distributions with finite support

Delta

discrete distributions with infinite support

Poisson

absolutely continuous distributions supported on a bounded interval

Beta, QPD_B or Uniform

absolutely continuous distributions supported on intervals of length 2π (directional distributions)

don't have that

absolutely continuous distributions supported on semi-infinite intervals ( e.g., [0,∞) )

LogNormal, Exponential

absolutely continuous distributions supported on the whole real line

Normal

absolutely continuous distributions with variable support

Uniform, QPD_S, and QPD_B have a support that depends on parameters - entries in an array distribution can have different support.

mixed discrete/continuous distributions

no "atomic" distribution of this type currently, but you can construct one using Mixture of a discrete and a continuous distribution - hope that actually works as expected...
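One way a list like this could feed the "testing distributions" use case from the opening post: if each distribution declares its support, a generic test can check that samples fall inside it. The sketch below uses samplers from the Python standard library as stand-ins for the skpro distributions named above; the `SUPPORTS` and `SAMPLERS` tables, the parameter values, and `samples_in_support` are all hypothetical.

```python
import math
import random

# Declared supports as closed intervals (lower, upper); stand-ins for
# whatever set objects the distributions would eventually expose.
SUPPORTS = {
    "Normal": (-math.inf, math.inf),
    "Exponential": (0.0, math.inf),
    "LogNormal": (0.0, math.inf),
    "Beta": (0.0, 1.0),
    "Uniform(2, 5)": (2.0, 5.0),
}

# Standard-library samplers with arbitrary example parameters.
SAMPLERS = {
    "Normal": lambda rng: rng.gauss(0.0, 1.0),
    "Exponential": lambda rng: rng.expovariate(1.0),
    "LogNormal": lambda rng: rng.lognormvariate(0.0, 1.0),
    "Beta": lambda rng: rng.betavariate(2.0, 3.0),
    "Uniform(2, 5)": lambda rng: rng.uniform(2.0, 5.0),
}


def samples_in_support(name, n=1000, seed=0):
    """Check that n seeded samples all lie inside the declared support."""
    rng = random.Random(seed)
    lo, hi = SUPPORTS[name]
    return all(lo <= SAMPLERS[name](rng) <= hi for _ in range(n))


results = {name: samples_in_support(name) for name in SUPPORTS}
```

Discrete and mixed cases would need a set-valued support rather than a single interval, which is the gap this issue is about.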

@VascoSch92

That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.

Another question is whether we are interested in the domain of a distribution or in its support. I think the second one is more interesting right?

@fkiraly
Collaborator Author

fkiraly commented May 16, 2024

I think the second one is more interesting right?

Yes, for the moment it is, given that all distributions - even discrete ones - have a support that embeds canonically into the reals. With a distinction on continuous and discrete part.
