-
Notifications
You must be signed in to change notification settings - Fork 0
/
hyper.tex
195 lines (172 loc) · 8.62 KB
/
hyper.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
\section{Hyper-optimization}
\label{sec:hyper-opt}
We want to address the problem of finding the minimum of an {\em unknown}\ function $f$,
\begin{equation}
\label{eq:FDef}
f: \Omega \to \mathbb{R}\, ,
\end{equation}
where $\Omega$ is a compact subset of $\Rd$. Although $f$ is unknown, we assume
that we can perform {\em noisy}\ estimates of $y=f(x)$ for given values of $x$,
such that
\begin{equation}
\label{eq:UnbiasedEval}
\Ebb\left[y|f(x)\right] = f(x)\, ,
\end{equation}
where the average here is over the fluctuations of $y$. Our knowledge of the function $f$ is encoded in the probability distribution function of $f(x)$ for each value of $x$. At the beginning of the process we need to choose a prior distribution for these values. The prior distribution is then updated according to the value of noisy measurements of $f$ at given values of $x$.
Bayesian Optimization is an algorithm that performs a {\em sequential} search by
selecting at every step a position $x$ in $\Omega$ according to some acquisition
function, and computing the corresponding value of $y$. After $\nopt$ iterations
the algorithm yields its current best estimate of the position of the minimum.
The procedure is summarised in Fig.~\ref{fig:BOpt}, taken from
Ref.~\cite{shahriari2015taking}.
\begin{figure}[ht!]
\label{fig:BOpt}
\centering
\includegraphics[scale=0.5]{BayesianOptAlg.png}
\caption{\small Summary of the Bayesian Optimization procedure.}
\end{figure}
\paragraph[]{Summary} The important ingredients in BO are the following.
%
\begin{enumerate}
\item[$\bullet$] Choice of the space of parameters, which we denoted $\Omega$ above.
\item[$\bullet$] Choice of the loss function $f$, and methodology to evaluate it.
\item[$\bullet$] Update of the prior distribution for $f(x)$.
\item[$\bullet$] Choice of the acquisition function $\alpha$.
\item[$\bullet$] Choice of the value of $x$ for a new measurement, given the acquisition function.
\end{enumerate}
%
We need to specify each of these steps in our procedure.
\subsection{Parametric Models}
\label{sec:BaysParMod}
There are interesting results on parametric models, which are useful to
understand the idea behind BO. In parametric models the function $f$ is
specified by a set of parameters $w$. The current knowledge of $f$ is then
encoded in the probability distribution of these parameters.
\paragraph[]{Explicit Examples}
\subsection{Non-parametric Models}
\label{sec:BaysNonParMod}
A Gaussian Process $\mathrm{GP}(\mu, k)$ is fully characterised by its mean
function $\mu: \Omega\to \Rbb$, and its kernel function $k: \Omega \times \Omega
\to \Rbb$. We are going to use a Gaussian Process to define the prior for the
unknown function $f$. For a collection of points $\{x_i, i=1,\ldots,n\}$, we
have a set of values $\{f_i=f(x_i), i=1,\ldots,n\}$ and their noisy estimates
$\{y_i, i=1\ldots, n\}$. Remember that $x_i\in\Omega$ is a $d$-dimensional real
vector for every value of $i$. Their respective distributions will be:
\begin{align}
\label{eq:GPone}
\mathbf{f}|\mathbf{x} &\sim
\mathcal{N}\left(\mathbf{m},\mathbf{K}\right)\, ,\\
\mathbf{y}|\mathbf{f},\sigma &\sim
\mathcal{N}\left(\mathbf{f},\sigma^2\right)\, ,
\end{align}
where $m_i=\mu(x_i)$ and $K_{ij}=k(x_i,x_j)$. In Eq.~\eqref{eq:GPone} we denote
by boldface fonts vectors and matrices in the $n$-dimensional space of
measurements that have been performed.
After $n$ observations $\mathcal{D}_n=\left\{\left(x_i,y_i\right)\right\}$, the random variable $f(x)$ is Gaussianly distributed,
\begin{equation}
\label{eq:PostGaussF}
f(x)\sim \mathcal{N}\left(\mu_n(x),\sigma_n(x)\right)\, ,
\end{equation}
with
\begin{align}
\label{eq:PostParams}
\mu(x) &= \mu(x) + \mathbf{k}(x)^T \left(\mathbf{K} + \sigma^2 \mathbf{I}\right)^{-1} \left(\mathbf{y}-\mathbf{m}\right)\, , \\
\sigma(x)^2 &= k(x,x) - \mathbf{k}(x)^T \left(\mathbf{K} + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{k}(\mathbf{x})\, .
\end{align}
This posterior distribution encodes our knowledge of the function $f$ at a
generic point $x\in\Omega$. It is then used to select the next point in the
search.
\subsection{Acquisition Function}
\label{sec:AcqFun}
In BO the choice of the next point $x$ where the function $f$ is going to be
evaluated plays a crucial role. Choosing a random point would be extremely
inefficient, especially if the dimensionality of the space is increased.
An informed choice of the point $x$ can be made from the knowledge of the
posterior distribution of $f$, and a utility function $U(x,v)$, where $v=f(x)$.
The utility function describes in some sense how much information is provided by
a measurement at $x$. Since $f$ is unknow, we can marginalise with respect to
$v$ using the probability distribution discussed above and define the {\em
expected utility} of a query point:
\begin{equation}
\label{eq:ExpUtilDef}
\alpha(x,\mathcal{D}_n) = \Ebb_{v|\mathcal{D}_n}\left[U\left(x,v\right)\right]\, .
\end{equation}
\paragraph[]{Improvement-based policies}
The acquisition function favours points that are likely to exceed some target value $\tau$; this strategy is called {\em probability of improvement}\ (PI) and for the case of a Gaussian distribution for $f$ we obtain
\begin{equation}
\label{eq:AlphaPI}
\alpha_\PI\left(x;\mathcal{D}_n\right) =
\Pbb\left[v>\tau\right] =
\Phi\left(
\frac{\mu_n(x)-t}{\sigma_n(x)}
\right)\, ,
\end{equation}
where $\Phi$ is the normal cumulative distribution function, \ie\ $\Phi(\zeta)$
is the integral of a Gaussian distribution with unit variance from $-\infty$ to
$\zeta$. This choice corresponds to the utility function
$U(x,v)=\Ibb\left[v>\tau\right]$ being marginalised over $v$.
A similar acquisiton function can be defined trying to quantify the amount of
improvement. In this case the utility function is
\begin{equation}
\label{eq:EIUtil}
U(x,v) = (v-\tau) \Ibb\left[v>\tau\right]\, ,
\end{equation}
and the acquisition function is defined as above by taking the expectation value with respect to $v$:
\begin{equation}
\label{eq:EIAcq}
\alpha_\EI \left(x;\mathcal{D}_n\right) =
\Ebb_{v|\mathcal{D}_n}\left[U(v,x)\right]\, .
\end{equation}
For the case of the GP, the expectation value above can be computed explicitly
\begin{equation}
\label{eq:EIAcqGP}
\alpha_\EI \left(x;\mathcal{D}_n\right) =
\left(\mu_n(x)-\tau\right) \,
\Phi\left(\frac{\mu_n(x)-t}{\sigma_n(x)}\right)
+ \sigma_n(x) \phi\left(\frac{\mu_n(x)-t}{\sigma_n(x)}\right)\, ,
\end{equation}
where $\phi$ is the standard normal probability density. The target value $\tau$
can be adjusted adaptively during the minimization process.
\paragraph[]{Optimistic policies}
The Gaussian process upper confidence bound (GP-UCB) is defined by the
acquisiton function
\begin{equation}
\label{eq:GB-UCB}
\alpha_\mathrm{UCB}\left(x;\mathcal{D}_n\right) =
\mu_n(x) + \beta_n \sigma_n(x)
\end{equation}
where $\beta_n$ is a parameter that can be tuned, just like $\tau$ in the case
of improvement-based algorithms. See \eg\ Ref.~\cite{Srinivas_2012} for a
detailed discussion.
\paragraph[]{Information-based policies}
These policies are based on some knowledge of the posterior distribution of the minimizer $x^*$.
{\em Thompson sampling}\ is an example of information-based policies. The
algorithm samples the reward function $f$ using the posterior determined by the
data, and selects the value of $x$ that maximises the simulated reward. In other words
\begin{equation}
\label{eq:TSAcq}
\alpha_\mathrm{TS}\left(x;\mathcal{D}_n\right) =
f^{(n)}(x)\, ,
\end{equation}
where $f^{(n)}\sim \mathrm{GP}\left(\mu,k|\mathcal{D}_n\right)$.
{\em Entropy search}\ algorithms choose the next point $x$ by maximising the
reduction in entropy of the distribution $p_*(x|\mathcal{D}_n)$. The utility
function is
\begin{equation}
\label{eq:ESUtil}
U(x,v) = H\left(x^*|\mathcal{D}_n\right) -
H\left(x^*|\mathcal{D}_n \cup \left\{x,v\right\}\right)\, .
\end{equation}
Here we need to clarify the definition in Ref.~\cite{shahriari2015taking}. See
Ref.~\cite{hernandez2014predictive} for details. The acquisition function is
obtained by taking the average over $v$:
\begin{equation}
\label{eq:ESAcq}
\alpha_\mathrm{ES}\left(x;\mathcal{D}_n\right) =
H\left(x^*|\mathcal{D}_n\right) -
\Ebb_{v|\mathcal{D}_n}\left[H\left(x^*|\mathcal{D}_n \cup \left\{x,v\right\}\right)\right]\, ,
\end{equation}
where $H\left(x^*|\mathcal{D}_n\right)$ is the differential entropy of the
posterior distribution $p_*(x|\mathcal{D}_n)$. See
Refs.~\cite{hennig2012entropy,villemonteix2009informational} for details, and
other variations like {\em predictive entropy}\ search.