
Release v0.2.0rc1 #695

Merged
merged 68 commits into master from release-v0.2.0rc1 on Nov 24, 2021

Conversation

bouthilx (Member)

The algorithm interface for suggest() and observe() has been changed to use trials instead of lists of points. This provides more flexibility when designing new algorithms. There is no impact on users, except that algorithm plugins written for previous versions of Oríon will not be backward compatible.
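
A minimal sketch of what the trial-based exchange looks like; `SimpleTrial` and `RandomLike` are hypothetical stand-ins, not Oríon classes:

```python
import random


class SimpleTrial:
    """Hypothetical trial: parameters plus room for a result."""

    def __init__(self, params):
        self.params = params        # dict of parameter values
        self.objective = None       # filled in once the trial completes


class RandomLike:
    """Toy algorithm showing the trial-based suggest()/observe() shape."""

    def suggest(self, num):
        # Return Trial objects instead of bare parameter tuples.
        return [SimpleTrial({"x": random.uniform(-5, 5)}) for _ in range(num)]

    def observe(self, trials):
        # Observe whole trials; results travel with their parameters.
        self.completed = [t for t in trials if t.objective is not None]
```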

The factory abstract class used to create databases, storage backends, algorithms, adapters, and parallel strategies has been redesigned to better support external implementations and to be easier to debug. This redesign also has no direct impact on users, except for the same backward-compatibility break for algorithm plugins written for previous versions of Oríon.

🏗 Enhancements

🐛 Bug Fixes

bouthilx and others added 30 commits September 14, 2021 18:53
Merge back master in develop after release
If TPE keeps sampling points that were already suggested, the suggest loop runs
infinitely. This adds a `max_retry` to limit the number of iterations of
the loop.
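
A hedged sketch of the bounded loop; `max_retry` is the knob this commit adds, everything else here is illustrative rather than TPE's actual internals:

```python
import random


def suggest(num, seen, max_retry=100):
    samples = []
    retries = 0
    while len(samples) < num and retries < max_retry:
        candidate = round(random.uniform(0, 1), 2)  # coarse grid: duplicates likely
        if candidate in seen:
            retries += 1        # duplicate suggestion: count it and resample
            continue
        seen.add(candidate)
        samples.append(candidate)
    return samples              # possibly fewer than ``num`` if retries ran out


print(suggest(5, seen=set()))
```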
Why:

The meta-class factory was confusing and sometimes hard to debug. The
factory should be a separate object, distinct from the constructors of the
classes: when calling a class, we expect to receive an instance of that
class, not of a child class. A separate factory also makes it simpler to add
new features such as fetching class objects based on strings or grouping
singleton instances per factory.

How:

Each class requiring the factory will use ``Factory(base_class)``
as a factory to instantiate child classes. The explicit method
``factory.create()`` will be used instead of the confusing
``MyFactory()``.
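
A toy stand-in for the redesigned factory; Oríon's actual class lives elsewhere, but the shape of the calls matches the description above, and `Database`/`PickledDB` here are illustrative:

```python
class Database:
    """Base class whose subclasses the factory instantiates."""


class PickledDB(Database):
    def __init__(self, host=":memory:"):
        self.host = host


class Factory:
    def __init__(self, base):
        self.base = base

    def create(self, of_type, **kwargs):
        # Explicit lookup of a direct subclass by name: easy to debug.
        for subclass in self.base.__subclasses__():
            if subclass.__name__.lower() == of_type.lower():
                return subclass(**kwargs)
        raise TypeError(f"No subclass of {self.base.__name__} named {of_type!r}")


factory = Factory(Database)
db = factory.create("pickleddb", host="db.pkl")
assert isinstance(db, PickledDB)        # an instance of the child class
```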

Co-authored-by: François Corneau-Tremblay <corneau90@gmail.com>
Improve Getting started example command
Why:

The algorithms will work with trial objects instead of tuples of
values. This means the space needs to sample trials, and space
transformations should be applied on trials as well instead of on tuples of
values.

How:

For simplicity, only the interfaces of the space classes (TransformedSpace
and ReshapedSpace) work with trials. The per-dimension transformations
are still applied on tuples of values so that, in particular,
reshaping operations remain straightforward.

To facilitate debugging, transformed trials are wrapped so that the
original trial remains accessible. This proves handy in
algorithms when we need access to the original trial objects, and also because
the ID of the transformed trial should always be based on the original
parameters (otherwise the ID becomes inconsistent with the database).
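
A simplified stand-in for that wrapping (hypothetical names, not Oríon's actual TransformedTrial):

```python
class TransformedTrial:
    def __init__(self, original, transformed_params):
        self.original = original            # original trial stays reachable
        self.params = transformed_params    # values in the transformed space

    @property
    def id(self):
        # The ID is always derived from the *original* parameters so it
        # stays consistent with what the database stores.
        return self.original.id
```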
The use of `__getattr__` to copy the TransformedTrial was causing an
infinite recursion error.
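
The pitfall generalizes beyond Oríon; a minimal reproduction with a common guard, assuming only standard-library behaviour (the class names are illustrative):

```python
import copy


class Wrapped:
    """Toy reproduction of the pitfall, not Oríon's actual code."""

    def __init__(self, trial):
        self._trial = trial

    def __getattr__(self, name):
        # copy creates a bare instance without calling __init__, then
        # probes for attributes like __setstate__; ``self._trial`` does
        # not exist yet, so this lookup re-enters __getattr__ forever.
        return getattr(self._trial, name)


# copy.copy(Wrapped(object()))  # would raise RecursionError


class SafeWrapped(Wrapped):
    def __getattr__(self, name):
        if name == "_trial":            # guard breaks the recursion
            raise AttributeError(name)
        return getattr(self._trial, name)


copy.copy(SafeWrapped(object()))        # copies fine
```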
Why:

If the names of the dimensions share the same prefix, but some have a
shape, the transformed space will have a different ordering of dimensions.
For example, the following dimensions will have their names sorted
differently:

  dim (shape 2), dim1 (no shape) -->  dim1, dim[0], dim[1]

This causes an issue when we try to restore the shape of the
transformed dimension, because the names end up swapped.

How:

When restoring shape, keep track of the original keys and their order,
and reassign the restored dimensions to the correct index (correct dim
name).
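
The reordering is easy to reproduce in plain Python: ASCII '1' (0x31) sorts before '[' (0x5B), so the flattened entries of ``dim`` land after ``dim1``:

```python
print(sorted(["dim[0]", "dim[1]", "dim1"]))
# ['dim1', 'dim[0]', 'dim[1]']
```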
Why:

When flattening the space, dims of shape (1,) should be flattened
as well; otherwise the parameters will be a list of one element.
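
With illustrative values, the difference looks like this:

```python
# Without flattening, a shape-(1,) dimension reaches the user as a
# one-element list instead of a scalar.
params_unflattened = {"x": [0.5]}   # shape (1,) kept as-is
params_flattened = {"x": 0.5}       # shape (1,) flattened away
```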
Why:

The deepcopy is failing on github-actions with error
`RuntimeError: dictionary changed size during iteration`. I have been
unable to reproduce the issue locally with either Python 3.6 or 3.7, but it
does fail on 3.7 on github-actions. Taking a copy of the dictionary before
the deep copy should fix the issue, unless a dictionary inside some
trials is the source of the problem. The stack trace seems to hint towards
trials_info as the culprit, however.

```
 tests/unittests/benchmark/test_benchmark_client.py:345:
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  src/orion/benchmark/__init__.py:90: in process
      study.execute(n_workers)
  src/orion/benchmark/__init__.py:341: in execute
      experiment.workon(self.task, n_workers=n_workers, max_trials=max_trials)
  src/orion/client/experiment.py:767: in workon
      for _ in range(n_workers)
  src/orion/executor/joblib_backend.py:32: in wait
      return joblib.Parallel(n_jobs=self.n_workers)(futures)
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:1056: in __call__
      self.retrieve()
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:935: in retrieve
      self._output.extend(job.get(timeout=self.timeout))
  /opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/multiprocessing/pool.py:657: in get
      raise self._value
  /opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/multiprocessing/pool.py:121: in worker
      result = (True, func(*args, **kwds))
  .tox/py/lib/python3.7/site-packages/joblib/_parallel_backends.py:595: in __call__
      return self.func(*args, **kwargs)
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:263: in __call__
      for func, args, kwargs in self.items]
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:263: in <listcomp>
      for func, args, kwargs in self.items]
  src/orion/client/experiment.py:781: in _optimize
      with self.suggest(pool_size=pool_size) as trial:
  src/orion/client/experiment.py:560: in suggest
      trial = reserve_trial(self._experiment, self._producer, pool_size)
  src/orion/client/experiment.py:54: in reserve_trial
      producer.produce(pool_size)
  src/orion/core/worker/producer.py:115: in produce
      self.algorithm.set_state(self.naive_algorithm.state_dict)
  src/orion/core/worker/primary_algo.py:47: in state_dict
      return self.algorithm.state_dict
  src/orion/algo/tpe.py:265: in state_dict
      _state_dict = super(TPE, self).state_dict
  src/orion/algo/base.py:132: in state_dict
      return {"_trials_info": copy.deepcopy(self._trials_info)}
  /opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/copy.py:150: in deepcopy
      y = copier(x, memo)
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

  x = {'0ef99dad51485ac9518b49c19b43f4ec': (Trial(experiment=2, status='completed', params=x[0]:0.5497,x[1]:2.397,x[2]:-4.29...params=x[0]:-0.8816,x[1]:2.087,x[2]:1.176), {'constraint': [], 'gradient': None, 'objective': 1187.240632192948}), ...}
  memo = {140575410255504: datetime.datetime(2021, 11, 11, 22, 57, 32, 364584), 140575419437584: datetime.datetime(2021, 11, 11....datetime(2021, 11, 11, 22, 57, 32, 177829), 140575419438016: datetime.datetime(2021, 11, 11, 22, 57, 32, 241378), ...}
  deepcopy = <function deepcopy at 0x7fdac6dfb680>

      def _deepcopy_dict(x, memo, deepcopy=deepcopy):
          y = {}
          memo[id(x)] = y
  >       for key, value in x.items():
  E       RuntimeError: dictionary changed size during iteration
```
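
The workaround, as an excerpt-style sketch of the `state_dict` line shown in the trace above: shallow-copying the dict first means deepcopy iterates over a frozen snapshot of the keys, so a concurrent insertion can no longer change the dict's size mid-iteration.

```python
import copy


@property
def state_dict(self):
    # Excerpt-style sketch, not the full class: snapshot, then deepcopy.
    return {"_trials_info": copy.deepcopy(dict(self._trials_info))}
```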
Refactor algorithms observe/suggest to work on trial objects directly
Why:

Currently only one promotion can be done for each suggest call. This means
that with a pool-size greater than 1, ASHA would always end up suggesting many
new trials while doing only one promotion at a time.
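
A hedged sketch of the fix's shape: drain all available promotions up to `num` before sampling new trials. All names here are illustrative, not ASHA's actual methods:

```python
def suggest(self, num):
    trials = []
    while len(trials) < num:
        promoted = self._get_promotion()    # next promotable trial, or None
        if promoted is None:
            break
        trials.append(promoted)
    # Fill any remaining slots with freshly sampled trials.
    trials.extend(self._sample(num - len(trials)))
    return trials
```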
Why:

The deep copy required to save the algorithm's state needed to deepcopy all
brackets, which are already copied through another attribute. We should
instead keep indices to the corresponding brackets so that the dictionary is
more lightweight during the copy.
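
A toy illustration of the lighter state: map trial IDs to bracket *indices* so the per-trial dict no longer drags whole brackets into every deepcopy. Names and values are illustrative:

```python
class Bracket:
    """Stand-in for an ASHA/Hyperband bracket."""


brackets = [Bracket(), Bracket(), Bracket()]
trial_to_bracket = {"f1a2b3": 1}                 # store the index, not the object
bracket = brackets[trial_to_bracket["f1a2b3"]]   # resolve the index on lookup
```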
Why:

We had two levels of patience when reserving a trial. There was the
customizable `max_idle_time` used in producer.produce() to limit the
time spent trying to generate new trials, and there was `_max_depth` in
`reserve_trial` limiting the number of times a reservation would be
attempted and producer.produce would be called. This led to misleading
error messages. For instance, with many workers it could happen that a
worker was never able to reserve a trial because, each time it
executed producer.produce(), all the other workers reserved the trials
before the current worker had time to reserve one. In such a scenario the
error message stated that the algorithm was unable to sample new points
and was waiting for trials to complete, which was not true. We should instead state
the number of trials that were generated during these reservation
attempts and recommend increasing the pool-size and timeout.

How:

Producer.produce now attempts producing `pool-size` trials only once (calling
algo.suggest only once) and returns the number of successfully
produced trials. All the patience is moved to `reserve_trial`, which
attempts reserving and producing until it reaches the timeout, in
which case a helpful error message is raised.
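
A hedged sketch of the reworked loop; the names, signatures, and error type are illustrative, not Oríon's exact implementation:

```python
import time


def reserve_trial(experiment, producer, pool_size, timeout=60):
    start = time.perf_counter()
    produced = 0
    while time.perf_counter() - start < timeout:
        trial = experiment.reserve()        # hypothetical reservation call
        if trial is not None:
            return trial
        # One production attempt per iteration: a single algo.suggest().
        produced += producer.produce(pool_size)
    raise RuntimeError(
        f"Could not reserve a trial within {timeout}s even though "
        f"{produced} trials were produced; consider increasing the "
        "pool-size or the timeout."
    )
```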
Improve resiliency of trial reservation
@bouthilx bouthilx added this to the v0.2 milestone Nov 24, 2021
@bouthilx bouthilx merged commit 6ee3d63 into master Nov 24, 2021
@bouthilx bouthilx deleted the release-v0.2.0rc1 branch November 24, 2021 02:28