
Release v0.2.0rc1 #695

Merged
merged 68 commits into master from release-v0.2.0rc1 on Nov 24, 2021

Conversation

bouthilx (Member)

The algorithm interface for suggest() and observe() has been changed to use trials instead of lists of points. This provides more flexibility when designing new algorithms. There is no impact on users, except that algorithm plugins written for previous versions of Oríon will not be backward compatible.
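
A minimal sketch of what the trial-based exchange looks like; `SimpleTrial` and `RandomLike` are hypothetical stand-ins, not Oríon classes:

```python
import random


class SimpleTrial:
    """Hypothetical trial: parameters plus room for a result."""

    def __init__(self, params):
        self.params = params        # dict of parameter values
        self.objective = None       # filled in once the trial completes


class RandomLike:
    """Toy algorithm showing the trial-based suggest()/observe() shape."""

    def suggest(self, num):
        # Return Trial objects instead of bare parameter tuples.
        return [SimpleTrial({"x": random.uniform(-5, 5)}) for _ in range(num)]

    def observe(self, trials):
        # Observe whole trials; results travel with their parameters.
        self.completed = [t for t in trials if t.objective is not None]
```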

The factory abstract class used to create databases, storage backends, algorithms, adapters, and parallel strategies has been redesigned to better support external implementations and to be easier to debug. This redesign also has no direct impact on users, except for the same backward-compatibility break for algorithm plugins written for previous versions of Oríon.

🏗 Enhancements

🐛 Bug Fixes

bouthilx and others added 30 commits September 14, 2021 18:53
Merge back master in develop after release
If TPE keeps sampling points that were already suggested, the suggest loop runs
infinitely. This adds a `max_retry` to limit the number of iterations of
the loop.
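
A hedged sketch of the bounded loop; `max_retry` is the knob this commit adds, everything else here is illustrative rather than TPE's actual internals:

```python
import random


def suggest(num, seen, max_retry=100):
    samples = []
    retries = 0
    while len(samples) < num and retries < max_retry:
        candidate = round(random.uniform(0, 1), 2)  # coarse grid: duplicates likely
        if candidate in seen:
            retries += 1        # duplicate suggestion: count it and resample
            continue
        seen.add(candidate)
        samples.append(candidate)
    return samples              # possibly fewer than ``num`` if retries ran out


print(suggest(5, seen=set()))
```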
Why:

The meta-class factory was confusing and sometimes hard to debug. The
factory should be a separate object, distinct from the constructors of the
classes: when calling a class, we expect to receive an instance of that
class, not of a child class. A separate factory also makes it simpler to add
new features such as fetching class objects based on strings or grouping
singleton instances per factory.

How:

Each class requiring the factory will use ``Factory(base_class)``
as a factory to instantiate child classes. The explicit method
``factory.create()`` will be used instead of the confusing
``MyFactory()``.
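
A toy stand-in for the redesigned factory; Oríon's actual class lives elsewhere, but the shape of the calls matches the description above, and `Database`/`PickledDB` here are illustrative:

```python
class Database:
    """Base class whose subclasses the factory instantiates."""


class PickledDB(Database):
    def __init__(self, host=":memory:"):
        self.host = host


class Factory:
    def __init__(self, base):
        self.base = base

    def create(self, of_type, **kwargs):
        # Explicit lookup of a direct subclass by name: easy to debug.
        for subclass in self.base.__subclasses__():
            if subclass.__name__.lower() == of_type.lower():
                return subclass(**kwargs)
        raise TypeError(f"No subclass of {self.base.__name__} named {of_type!r}")


factory = Factory(Database)
db = factory.create("pickleddb", host="db.pkl")
assert isinstance(db, PickledDB)        # an instance of the child class
```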

Co-authored-by: François Corneau-Tremblay <corneau90@gmail.com>
Improve Getting started example command
Why:

The algorithms will work with trial objects instead of tuples of
values. This means the space needs to sample trials, and space
transformations should be applied on trials as well instead of on tuples of
values.

How:

For simplicity, only the interfaces of the space classes (TransformedSpace
and ReshapedSpace) work with trials. The per-dimension transformations
are still applied on tuples of values so that, in particular,
reshaping operations remain straightforward.

To facilitate debugging, transformed trials are wrapped so that the
original trial remains accessible. This proves handy in
algorithms when we need access to the original trial objects, and also because
the ID of the transformed trial should always be based on the original
parameters (otherwise the ID becomes inconsistent with the database).
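
A simplified stand-in for that wrapping (hypothetical names, not Oríon's actual TransformedTrial):

```python
class TransformedTrial:
    def __init__(self, original, transformed_params):
        self.original = original            # original trial stays reachable
        self.params = transformed_params    # values in the transformed space

    @property
    def id(self):
        # The ID is always derived from the *original* parameters so it
        # stays consistent with what the database stores.
        return self.original.id
```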
The use of `__getattr__` to copy the TransformedTrial was causing an
infinite recursion error.
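
The pitfall generalizes beyond Oríon; a minimal reproduction with a common guard, assuming only standard-library behaviour (the class names are illustrative):

```python
import copy


class Wrapped:
    """Toy reproduction of the pitfall, not Oríon's actual code."""

    def __init__(self, trial):
        self._trial = trial

    def __getattr__(self, name):
        # copy creates a bare instance without calling __init__, then
        # probes for attributes like __setstate__; ``self._trial`` does
        # not exist yet, so this lookup re-enters __getattr__ forever.
        return getattr(self._trial, name)


# copy.copy(Wrapped(object()))  # would raise RecursionError


class SafeWrapped(Wrapped):
    def __getattr__(self, name):
        if name == "_trial":            # guard breaks the recursion
            raise AttributeError(name)
        return getattr(self._trial, name)


copy.copy(SafeWrapped(object()))        # copies fine
```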
Why:

If the names of the dimensions share the same prefix, but some have a
shape, the transformed space will have a different ordering of dimensions.
For example, the following dimensions will have their names sorted
differently:

  dim (shape 2), dim1 (no shape) -->  dim1, dim[0], dim[1]

This causes an issue when we try to restore the shape of the
transformed dimension, because the names end up swapped.

How:

When restoring shape, keep track of the original keys and their order,
and reassign the restored dimensions to the correct index (correct dim
name).
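
The reordering is easy to reproduce in plain Python: ASCII '1' (0x31) sorts before '[' (0x5B), so the flattened entries of ``dim`` land after ``dim1``:

```python
print(sorted(["dim[0]", "dim[1]", "dim1"]))
# ['dim1', 'dim[0]', 'dim[1]']
```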
Why:

When flattening the space, dims of shape (1,) should be flattened
as well; otherwise the parameters will be a list of one element.
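
With illustrative values, the difference looks like this:

```python
# Without flattening, a shape-(1,) dimension reaches the user as a
# one-element list instead of a scalar.
params_unflattened = {"x": [0.5]}   # shape (1,) kept as-is
params_flattened = {"x": 0.5}       # shape (1,) flattened away
```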
Why:

The deepcopy is failing on github-actions with error
`RuntimeError: dictionary changed size during iteration`. I have been
unable to reproduce the issue locally with either Python 3.6 or 3.7, but it
does fail on 3.7 on github-actions. Taking a copy of the dictionary before
the deep copy should fix the issue, unless a dictionary inside some
trials is the source of the problem. The stack trace seems to hint towards
trials_info as the culprit, however.

```
 tests/unittests/benchmark/test_benchmark_client.py:345:
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  src/orion/benchmark/__init__.py:90: in process
      study.execute(n_workers)
  src/orion/benchmark/__init__.py:341: in execute
      experiment.workon(self.task, n_workers=n_workers, max_trials=max_trials)
  src/orion/client/experiment.py:767: in workon
      for _ in range(n_workers)
  src/orion/executor/joblib_backend.py:32: in wait
      return joblib.Parallel(n_jobs=self.n_workers)(futures)
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:1056: in __call__
      self.retrieve()
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:935: in retrieve
      self._output.extend(job.get(timeout=self.timeout))
  /opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/multiprocessing/pool.py:657: in get
      raise self._value
  /opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/multiprocessing/pool.py:121: in worker
      result = (True, func(*args, **kwds))
  .tox/py/lib/python3.7/site-packages/joblib/_parallel_backends.py:595: in __call__
      return self.func(*args, **kwargs)
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:263: in __call__
      for func, args, kwargs in self.items]
  .tox/py/lib/python3.7/site-packages/joblib/parallel.py:263: in <listcomp>
      for func, args, kwargs in self.items]
  src/orion/client/experiment.py:781: in _optimize
      with self.suggest(pool_size=pool_size) as trial:
  src/orion/client/experiment.py:560: in suggest
      trial = reserve_trial(self._experiment, self._producer, pool_size)
  src/orion/client/experiment.py:54: in reserve_trial
      producer.produce(pool_size)
  src/orion/core/worker/producer.py:115: in produce
      self.algorithm.set_state(self.naive_algorithm.state_dict)
  src/orion/core/worker/primary_algo.py:47: in state_dict
      return self.algorithm.state_dict
  src/orion/algo/tpe.py:265: in state_dict
      _state_dict = super(TPE, self).state_dict
  src/orion/algo/base.py:132: in state_dict
      return {"_trials_info": copy.deepcopy(self._trials_info)}
  /opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/copy.py:150: in deepcopy
      y = copier(x, memo)
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

  x = {'0ef99dad51485ac9518b49c19b43f4ec': (Trial(experiment=2, status='completed', params=x[0]:0.5497,x[1]:2.397,x[2]:-4.29...params=x[0]:-0.8816,x[1]:2.087,x[2]:1.176), {'constraint': [], 'gradient': None, 'objective': 1187.240632192948}), ...}
  memo = {140575410255504: datetime.datetime(2021, 11, 11, 22, 57, 32, 364584), 140575419437584: datetime.datetime(2021, 11, 11....datetime(2021, 11, 11, 22, 57, 32, 177829), 140575419438016: datetime.datetime(2021, 11, 11, 22, 57, 32, 241378), ...}
  deepcopy = <function deepcopy at 0x7fdac6dfb680>

      def _deepcopy_dict(x, memo, deepcopy=deepcopy):
          y = {}
          memo[id(x)] = y
  >       for key, value in x.items():
  E       RuntimeError: dictionary changed size during iteration
```
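
The workaround, as an excerpt-style sketch of the `state_dict` line shown in the trace above: shallow-copying the dict first means deepcopy iterates over a frozen snapshot of the keys, so a concurrent insertion can no longer change the dict's size mid-iteration.

```python
import copy


@property
def state_dict(self):
    # Excerpt-style sketch, not the full class: snapshot, then deepcopy.
    return {"_trials_info": copy.deepcopy(dict(self._trials_info))}
```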
Refactor algorithms observe/suggest to work on trial objects directly
Why:

Currently only one promotion can be done for each suggest call. This means
that with a pool-size greater than 1, ASHA would always end up suggesting many
new trials while doing only one promotion at a time.
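
A hedged sketch of the fix's shape: drain all available promotions up to `num` before sampling new trials. All names here are illustrative, not ASHA's actual methods:

```python
def suggest(self, num):
    trials = []
    while len(trials) < num:
        promoted = self._get_promotion()    # next promotable trial, or None
        if promoted is None:
            break
        trials.append(promoted)
    # Fill any remaining slots with freshly sampled trials.
    trials.extend(self._sample(num - len(trials)))
    return trials
```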
Why:

The deep copy required to save the algorithm's state needed to deepcopy all
brackets, which are already copied through another attribute. We should
instead keep indices to the corresponding brackets so that the dictionary is
more lightweight during the copy.
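
A toy illustration of the lighter state: map trial IDs to bracket *indices* so the per-trial dict no longer drags whole brackets into every deepcopy. Names and values are illustrative:

```python
class Bracket:
    """Stand-in for an ASHA/Hyperband bracket."""


brackets = [Bracket(), Bracket(), Bracket()]
trial_to_bracket = {"f1a2b3": 1}                 # store the index, not the object
bracket = brackets[trial_to_bracket["f1a2b3"]]   # resolve the index on lookup
```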
Why:

We had two levels of patience when reserving a trial. There was the
customizable `max_idle_time` used in producer.produce() to limit the
time spent trying to generate new trials, and there was `_max_depth` in
`reserve_trial` limiting the number of times a reservation would be
attempted and producer.produce would be called. This led to misleading
error messages. For instance, with many workers it could happen that a
worker was never able to reserve a trial because, each time it
executed producer.produce(), all the other workers reserved the trials
before the current worker had time to reserve one. In such a scenario the
error message stated that the algorithm was unable to sample new points
and was waiting for trials to complete, which was not true. We should instead state
the number of trials that were generated during these reservation
attempts and recommend increasing the pool-size and timeout.

How:

Producer.produce now attempts producing `pool-size` trials only once (calling
algo.suggest only once) and returns the number of successfully
produced trials. All the patience is moved to `reserve_trial`, which
attempts reserving and producing until it reaches the timeout, in
which case a helpful error message is raised.
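
A hedged sketch of the reworked loop; the names, signatures, and error type are illustrative, not Oríon's exact implementation:

```python
import time


def reserve_trial(experiment, producer, pool_size, timeout=60):
    start = time.perf_counter()
    produced = 0
    while time.perf_counter() - start < timeout:
        trial = experiment.reserve()        # hypothetical reservation call
        if trial is not None:
            return trial
        # One production attempt per iteration: a single algo.suggest().
        produced += producer.produce(pool_size)
    raise RuntimeError(
        f"Could not reserve a trial within {timeout}s even though "
        f"{produced} trials were produced; consider increasing the "
        "pool-size or the timeout."
    )
```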
Improve resiliency of trial reservation
@bouthilx bouthilx added this to the v0.2 milestone Nov 24, 2021
@bouthilx bouthilx merged commit 6ee3d63 into master Nov 24, 2021
@bouthilx bouthilx deleted the release-v0.2.0rc1 branch November 24, 2021 02:28