diff --git a/_posts/2023-04-21-denoising-diffusion-impl/denoising-diffusion-impl.Rmd b/_posts/2023-04-21-denoising-diffusion-impl/denoising-diffusion-impl.Rmd
new file mode 100644
index 00000000..f0622511
--- /dev/null
+++ b/_posts/2023-04-21-denoising-diffusion-impl/denoising-diffusion-impl.Rmd
@@ -0,0 +1,793 @@
+---
+title: "Implementing Denoising Diffusion with torch and luz"
+description: >
+  In this blog post we walk you through the implementation of denoising diffusion
+  probabilistic models using the torch and luz packages.
+author:
+  - name: Daniel Falbel
+    affiliation: Posit
+    affiliation_url: https://www.posit.co/
+slug: falbeldenoisingdiff
+date: 2023-04-21
+categories:
+  - Torch
+  - R
+  - Meta
+  - Concepts
+  - Packages/Releases
+output:
+  distill::distill_article:
+    self_contained: false
+    toc: true
+preview: images/overview.png
+bibliography: references.bib
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, eval = FALSE, fig.width = 6, fig.height = 6)
+```
+
+If you read the [previous blog post](https://blogs.rstudio.com/ai/posts/2023-04-13-denoising-diffusion/) you might be curious about how denoising diffusion
+models can be implemented in torch. You might have also read the [README](https://github.com/dfalbel/denoising-diffusion/blob/main/README.md) in the denoising
+diffusion repository, but it doesn't go into much detail on how the model is implemented. In this
+blog post we will try to explain the code and how it relates to the formulas in the
+reference papers.
+
+We are going to use `luz` to train the denoising diffusion model. Using luz allows
+us to avoid much of the torch boilerplate code related to supporting training on different
+devices such as the CPU or the GPU, logging metrics, disabling autograd during evaluation, etc.
+Our implementation is heavily inspired by the [Denoising Diffusion Implicit Models](https://keras.io/examples/generative/ddim/) example in the Keras gallery.
+
+However, since we are not building a standard supervised model, we will need to
+implement a custom `step` method.
+
+## Denoising diffusion training overview
+
+We will start with a very high-level overview of how denoising diffusion models
+are trained. The theoretical explanations will come later, as we detail each step.
+In very practical terms, denoising diffusion models work in the following way:
+
+```{r fig.cap="Overview of denoising diffusion training procedure", echo=FALSE, eval=TRUE, out.width="100%"}
+knitr::include_graphics("images/overview.png")
+```
+
+The general idea is to take real images and add Gaussian noise to them. The amount of
+noise that is added depends on the step of the diffusion process, but during
+training this step is sampled at random.
+
+A neural network, taking as input the noisy image and the **noise rate**, is trained
+to predict the amount of noise for each pixel in the noisy image by minimizing the
+loss (usually the MSE) between the actual sampled noise and the predicted noise.
+
+## Implementing the nn_module
+
+In order to use luz, we need to define a torch `nn_module()`, which contains the
+definition of the model we want to train. `nn_module`s in torch require the user
+to implement the `initialize` and `forward` methods.
+
+In `initialize` we declare the components of the model: its parameters,
+layers, etc. In `forward` we describe how those components relate to each other.
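+
+If you have never written an `nn_module()` before, here is a minimal, generic example
+(unrelated to diffusion models) just to illustrate the `initialize()`/`forward()` contract:
+
+```{r}
+library(torch)
+
+# a model with a single linear layer
+linear_model <- nn_module(
+  initialize = function(in_features, out_features) {
+    # declare the components (and thus the parameters) of the model
+    self$fc <- nn_linear(in_features, out_features)
+  },
+  forward = function(x) {
+    # describe how the components are applied to the input
+    self$fc(x)
+  }
+)
+
+m <- linear_model(in_features = 10, out_features = 1)
+m(torch_randn(2, 10)) # a batch of 2 observations with 10 features each
+```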
+
+We'll start by defining the `initialize` method. This method will initialize the following
+attributes of the model:
+
+* `diffusion`: the neural network that is used to predict the noise for each pixel
+  given the noisy image. To initialize this neural net we need to specify the image
+  size and other hyper-parameters such as the embedding size and the widths and depths of
+  the neural net blocks. More details on this neural net later. For now, it's sufficient
+  to know that it takes a noisy image and the noise rate as input and tries to predict
+  the noise for each pixel.
+
+* `schedule`: the schedule determines the amount of noise that should be added to
+  the real image depending on the diffusion step we are in. The schedule has a few
+  configurations related to how it changes the amount of noise for each step.
+
+* `loss`: the loss function used to train the model. The `diffusion` network will
+  be trained to minimize this loss.
+
+* `loss_on`: in the diagram from the first section we showed that the neural network
+  learns to predict the amount of noise for each pixel, and indeed this seems to lead
+  to better results. However, the network could instead be trained to predict the real image,
+  in which case the loss would be computed between the real image and the predicted
+  image. This parameter allows us to switch between these two approaches.
+
+The code for defining the initialize method is shown below. Note that this is not the
+full module code; we are just showing a small portion so it's easier to navigate.
+
+```{r}
+diffusion_model <- nn_module(
+  initialize = function(image_size,
+                        embedding_dim = 32,
+                        widths = c(32, 64, 96, 128),
+                        block_depth = 2,
+                        schedule = NULL,
+                        loss = NULL,
+                        loss_on = "noise") {
+
+    self$diffusion <- diffusion(image_size, embedding_dim, widths, block_depth)
+
+    # fall back to the default cosine schedule if none is provided
+    if (is.null(schedule))
+      schedule <- diffusion_schedule_config("cosine", 0.02, 0.95)
+
+    self$diffusion_schedule <- do.call(diffusion_schedule, schedule)
+    self$image_size <- image_size
+    self$normalize <- normalize(image_size[1])
+
+    self$loss <- if (is.null(loss)) nnf_l1_loss else loss
+    self$loss_on <- loss_on
+  }
+  # ... implementation of the other methods
+)
+```
+
+We will now define the forward method, which defines how the components initialized
+above relate to each other. We start by defining the `denoise` method; the
+`forward` method will be just a wrapper around it, but for readability later on it's
+nice to have a method called `denoise`.
+
+The `denoise` method takes as input a batch of noisy images (the `images` parameter) and
+a batch of rates (the amount of noise/signal in each image). It then computes the
+predicted real image and the predicted amount of noise. The way it computes them depends
+on the `loss_on` configuration, i.e. whether the neural network is trained to predict the
+noise or the image. Given the prediction for one of them, the rates and the noisy images,
+it's possible to compute the other.
+
+In simpler terms, the `denoise` method is responsible for separating the noise and the signal
+in noisy images. And since `forward` is just a wrapper around `denoise`, this is what
+happens when we call `model(images, rates)`.
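+
+To make the algebra concrete: a noisy image is a weighted sum of the clean image and
+Gaussian noise, so given the rates and a prediction for one of the two components we
+can solve for the other:
+
+$$x_\text{noisy} = \text{signal rate} \cdot x_\text{image} + \text{noise rate} \cdot \epsilon
+\quad \Longrightarrow \quad
+x_\text{image} = \frac{x_\text{noisy} - \text{noise rate} \cdot \epsilon}{\text{signal rate}}$$
+
+This is exactly the arithmetic performed below.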
+```{r}
+diffusion_model <- nn_module(
+  # ... other methods
+  denoise = function(images, rates) {
+    if (self$loss_on == "noise") {
+      # the network predicts the noise; recover the image from it
+      pred_noises <- self$diffusion(images, rates$noise^2)
+      pred_images <- (images - rates$noise * pred_noises) / rates$signal
+    } else {
+      # the network predicts the image; recover the noise from it
+      pred_images <- self$diffusion(images, rates$signal^2)
+      pred_noises <- (images - rates$signal * pred_images) / rates$noise
+    }
+
+    list(
+      pred_noises = pred_noises,
+      pred_images = pred_images
+    )
+  },
+  forward = function(images, rates) {
+    self$denoise(images, rates)
+  }
+  # ... other methods
+)
+```
+
+We'll now get into how this model is trained, but first a quick introduction to
+customizing the training loop with luz.
+
+## Quick intro to luz custom steps
+
+luz provides different levels of abstraction over the training loop, and depending on
+the type of model you are implementing, you might want to use one or another.
+
+At the highest level, and the simplest models to implement with luz, are supervised models.
+In this setup, for each batch during training, luz will execute its **default step**, which looks like the following:
+
+```{r}
+loss <- ctx$model$loss(ctx$model(ctx$input), ctx$target)
+loss$backward()
+ctx$opt$step()
+ctx$opt$zero_grad()
+```
+
+The `ctx` object in luz holds the context of that training step, which includes
+the model that is being trained, the input and target batches and the optimizer.
+You can see more information about the context [here](https://mlverse.github.io/luz/reference/ctx.html).
+
+Users can override the **default step** by implementing the `step` method in their
+`nn_module()`. This is useful when the model at hand needs a different training procedure
+than the default supervised one. More details on customizing the
+`step` can be found [here](https://mlverse.github.io/luz/articles/custom-loop.html#fully-flexible-step).
+
+## Implementing the step
+
+We now implement the custom training step for our module. We want it to take a batch
+of input images, successively add noise to them, and then train the neural net to
+be good at denoising those noisy images by minimizing a loss and applying gradient descent.
+
+Here's the implementation of `step()`:
+
+```{r}
+diffusion_model <- nn_module(
+  # ... other methods
+  step = function() {
+    ctx$input <- images <- ctx$model$normalize(ctx$input)
+
+    # sample a diffusion time for each image and get the corresponding rates
+    diffusion_times <- torch_rand(images$shape[1], 1, 1, 1, device = images$device)
+    rates <- self$diffusion_schedule(diffusion_times)
+
+    # forward diffusion: combine the real images with sampled Gaussian noise
+    ctx$buffers$noises <- noises <- torch_randn_like(images)
+    noisy_images <- rates$signal * images + rates$noise * noises
+
+    ctx$pred <- ctx$model(noisy_images, rates)
+
+    loss <- if (self$loss_on == "noise") {
+      self$loss(noises, ctx$pred$pred_noises)
+    } else if (self$loss_on == "image") {
+      self$loss(images, ctx$pred$pred_images)
+    }
+
+    # only update the weights during training, not during validation
+    if (ctx$training) {
+      ctx$opt$zero_grad()
+      loss$backward()
+      ctx$opt$step()
+    }
+
+    ctx$loss[[ctx$opt_name]] <- loss$detach()
+  }
+  # ... other methods
+)
+```
+
+Note the first few lines of code (repeated below), where we add noise to the images.
+This is the so-called **forward diffusion process**.
+
+The first technical detail here is that luz makes the data available to
+`step()` through the context; for example, `ctx$input` contains the batch of
+input images. We first normalize the images, so they have zero mean and a standard
+deviation of 1, channel-wise.
+
+We then sample `diffusion_times` uniformly between 0 and 1. The diffusion time
+represents the current step of the diffusion process.
+A value of 0 means we are at the beginning of the diffusion process and a value of 1 means we are at its end.
+During training we randomly select diffusion times so the model sees many different
+parts of the diffusion process. Later, when we sample images, we will start with pure
+noise (i.e. `diffusion_times = 1`) and slowly reduce it until it reaches 0, which
+gives us an image with no noise added.
+
+The diffusion times are passed to the denoising schedule, which regulates the amount
+of noise depending on the diffusion time. We expect the schedule to give higher noise
+rates the later we are in the diffusion process, i.e. the closer the diffusion time is to
+one.
+
+We then sample noises from the standard normal distribution and combine them with
+the original images to create the noisy images, which will be the input to the
+neural net. Another technical detail is that we also save the noises to
+`ctx$buffers$noises`; this is only to allow us to access the noise values in other
+methods, for example when using a luz callback.
+
+```{r}
+ctx$input <- images <- ctx$model$normalize(ctx$input)
+
+diffusion_times <- torch_rand(images$shape[1], 1, 1, 1, device = images$device)
+rates <- self$diffusion_schedule(diffusion_times)
+
+ctx$buffers$noises <- noises <- torch_randn_like(images)
+noisy_images <- rates$signal * images + rates$noise * noises
+```
+
+One thing that might seem strange at this point is that we said that we successively
+add noise to the image over a few diffusion steps, but here we just add noise once.
+The next section gives the theoretical background that allows us to do it in
+just one computation.
+
+## Forward diffusion
+
+As we mentioned in the previous blog post, this process consists of successively
+transforming an input image into noise. In mathematical terms,
+we define $x_0$ to be the true image. We then define the distribution of images at
+**diffusion step** $t$ as:
+
+$$q(x_{t} \mid x_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I)$$
+
+with $\beta_t \in (0,1)$, the diffusion rate. During training we have to be able
+to sample from $q(x_t \mid x_0)$. Using the above formula we would need
+a `for` loop that successively adds noise, in pseudo code:
+
+```{r}
+for (t in 1:n_steps) {
+  images[[t]] <- sqrt(1 - beta[t]) * images[[t - 1]] + sqrt(beta[t]) * torch_randn_like(images[[t - 1]])
+}
+```
+
+To make computations easier, we reparametrize this as:
+
+$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$$
+
+with $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$. We define $\bar{\alpha}_t$ with a **schedule** that
+is a strictly decreasing function of $t$ - and also usually defined in terms of $T$, the
+maximum number of diffusion steps - so that $\bar{\alpha}_T$ is very close to 0.
+
+This means that, thanks to the reparametrization trick, we can simplify the code: we don't
+need the `for` loop, which would also require us to use discrete diffusion times,
+complicating things further. In the code shown earlier, `rates$signal` plays the role of
+$\sqrt{\bar{\alpha}_t}$ and `rates$noise` the role of $\sqrt{1-\bar{\alpha}_t}$.
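+
+To convince ourselves that the two formulations match, here is a small, self-contained
+sanity check (not part of the model code) using a constant diffusion rate and a toy,
+one-pixel "image" fixed at the value 2:
+
+```{r}
+library(torch)
+
+beta <- rep(0.02, 50)
+alpha_bar <- cumprod(1 - beta) # \bar{\alpha}_t = \prod_{s <= t} (1 - \beta_s)
+
+x0 <- torch_ones(10000) * 2 # 10,000 copies of the same one-pixel "image"
+
+# iterate the forward process step by step
+xt <- x0
+for (t in seq_along(beta)) {
+  xt <- sqrt(1 - beta[t]) * xt + sqrt(beta[t]) * torch_randn_like(xt)
+}
+
+# sample x_t at t = 50 directly with the closed-form expression
+xt_direct <- sqrt(alpha_bar[50]) * x0 + sqrt(1 - alpha_bar[50]) * torch_randn_like(x0)
+
+# both should have mean ~ sqrt(alpha_bar[50]) * 2 and sd ~ sqrt(1 - alpha_bar[50])
+round(c(xt$mean()$item(), xt_direct$mean()$item()), 2)
+round(c(xt$std()$item(), xt_direct$std()$item()), 2)
+```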
+
+## step() continued
+
+We now explain the last few lines of the `step()` method, which execute the
+training procedure. For easier navigation, we repeat them below.
+
+As we saw earlier, `noisy_images` contains the images created by adding noise
+to the real ones, and `rates` is a measure of how much noise/signal there is in each image.
+
+We now use `ctx$model()`, which in turn calls `forward` and consequently `denoise`
+on the noisy images. This applies the neural network and generates the predictions
+for both the real image and the noise.
+
+Next we compute the loss; depending on the `loss_on` configuration, we compute
+it on images or on noise. `step()` in luz is called during both the training and
+the validation loop, thus we only update the weights of the
+neural net when `ctx$training` is true, using the usual torch `zero_grad(); backward(); step()` procedure.
+
+Finally we log the loss in the `ctx$loss` context attribute, which is used by luz
+to display the loss in the progress bar and save it in the training history.
+
+```{r}
+ctx$pred <- ctx$model(noisy_images, rates)
+
+loss <- if (self$loss_on == "noise") {
+  self$loss(noises, ctx$pred$pred_noises)
+} else if (self$loss_on == "image") {
+  self$loss(images, ctx$pred$pred_images)
+}
+
+if (ctx$training) {
+  ctx$opt$zero_grad()
+  loss$backward()
+  ctx$opt$step()
+}
+
+ctx$loss[[ctx$opt_name]] <- loss$detach()
+```
+
+We have now finished the first pass through the diffusion model, and with this overview
+we could start training it. But before that, let's dive a little deeper into the
+architecture of the neural net that separates noise from signal in the
+noisy images.
+
+## Neural net architecture
+
+This neural net is basically a [U-Net](https://en.wikipedia.org/wiki/U-Net) (see also this previous [blog post](https://blogs.rstudio.com/ai/posts/2019-08-23-unet/)) with a very small change
+in how it deals with inputs in order to accommodate the `rates` input.
+
+We chose a U-Net here because we are trying to learn the distribution of an image
+dataset. If a different kind of data were to be used, e.g. text data or tabular data,
+we would need a different combination of layers that makes more sense for that data domain.
+
+### Unet
+
+A U-Net is a convolutional neural network that successively downsamples the image resolution while increasing its depth. After a few downsampling blocks, it starts upsampling the representation and decreasing the channel depth. The main idea in U-Net is that, much like in Residual Networks, the upsampling blocks take as input both the representation from the previous upsampling block and the representation from a previous downsampling block.
+
+```{r fig.cap="U-Net model", eval = TRUE, echo=FALSE, out.width="50%"}
+knitr::include_graphics("https://blogs.rstudio.com/ai/posts/2019-08-23-unet/images/unet.png")
+```
+
+Unlike the original U-Net implementation, we use ResNet blocks in the downsampling and upsampling blocks of the U-Net. Each down- or upsampling block contains *block_depth* of those ResNet blocks. We also use the [Swish activation](https://en.wikipedia.org/wiki/Swish_function) function. We report these
+choices even though we don't think they are essential for the model's performance.
+
+### Sinusoidal embeddings
+
+A sinusoidal embedding [@vaswani2017] is used to encode the diffusion times (or
+the noise variance) into the model. The visualization below shows how diffusion
+times are mapped to the embedding, assuming an embedding dimension of 32. Each row
+is the embedding vector for a given diffusion time. Sinusoidal embeddings have nice
+properties, like preserving relative distances [@kazemnejad2019:pencoding].
+
+```{r fig.cap="Sinusoidal embeddings vector", eval=TRUE, echo=FALSE, out.width="50%"}
+knitr::include_graphics("images/sinusoidal-1.png")
+```
+
+The noise variance (or the weight of the noise in the resulting noisy image) is
+encoded into a vector, which is then upsampled to the same spatial size as the input image.
+The embedding and the image are then concatenated along the channels dimension and passed to the U-Net.
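+
+As a rough illustration, here is a sketch of what such an embedding module could look like,
+loosely following the Keras DDIM example. The frequency range (`min_freq`, `max_freq`) and the
+exact shapes are assumptions made here; the `sinusoidal_embedding()` used in the repository may
+differ in its details.
+
+```{r}
+sinusoidal_embedding_sketch <- nn_module(
+  initialize = function(embedding_dim = 32, min_freq = 1, max_freq = 1000) {
+    # log-spaced frequencies: half of them feed the sines, the other half the cosines
+    freqs <- torch_exp(torch_linspace(log(min_freq), log(max_freq), embedding_dim %/% 2))
+    self$angular_speeds <- nn_buffer((2 * pi * freqs)$view(c(1, -1, 1, 1)))
+  },
+  forward = function(x) {
+    # x has shape (batch_size, 1, 1, 1); the output has shape (batch_size, embedding_dim, 1, 1)
+    torch_cat(list(
+      torch_sin(self$angular_speeds * x),
+      torch_cos(self$angular_speeds * x)
+    ), dim = 2)
+  }
+)
+```
+
+In the `diffusion` module shown next, the output of this embedding is upsampled to the spatial
+size of the image before being concatenated with it.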
+
+### Implementing the architecture
+
+Below we show the code that does this work. This is the neural net module that is
+initialized in the `diffusion_model` module in the second section of this article.
+
+We could go into more detail on `unet` and `sinusoidal_embedding`, but we
+will leave it up to the reader to look at the code in the [GitHub repository](https://github.com/dfalbel/denoising-diffusion).
+
+```{r}
+diffusion <- nn_module(
+  initialize = function(image_size, embedding_dim = 32,
+                        widths = c(32, 64, 96, 128), block_depth = 2) {
+    self$unet <- unet(2 * embedding_dim, embedding_dim, widths = widths, block_depth = block_depth)
+    self$embedding <- sinusoidal_embedding(embedding_dim = embedding_dim)
+
+    self$conv <- nn_conv2d(image_size[1], embedding_dim, kernel_size = 1)
+    self$upsample <- nn_upsample(size = image_size[2:3])
+
+    # the output layer is zero-initialized so the initial predictions are zero
+    self$conv_out <- nn_conv2d(embedding_dim, image_size[1], kernel_size = 1)
+    purrr::walk(self$conv_out$parameters, nn_init_zeros_)
+  },
+  forward = function(noisy_images, noise_variances) {
+    # embed the noise variances and upsample them to the spatial size of the images
+    embedded_variance <- noise_variances |>
+      self$embedding() |>
+      self$upsample()
+
+    embedded_image <- noisy_images |>
+      self$conv()
+
+    # concatenate along the channels dimension and run through the U-Net
+    unet_input <- torch_cat(list(embedded_image, embedded_variance), dim = 2)
+    unet_input |>
+      self$unet() |>
+      self$conv_out()
+  }
+)
+```
+
+## Schedule
+
+As mentioned earlier, the schedule determines the amount of noise that should be
+added to the real image depending on the diffusion step we are in. For instance,
+let's consider the 'cosine' schedule that we use by default.
+
+We implement it as an `nn_module()` even though it doesn't have any trainable
+'weights'. Technically, at `diffusion_time = 0` the signal rate should be one
+and at `diffusion_time = 1` it should be zero; however, this leads to problems when
+fitting the model, thus we use `min_signal_rate` and `max_signal_rate`
+to keep the schedule away from those extreme values.
+
+```{r}
+cosine_schedule <- nn_module(
+  initialize = function(min_signal_rate = 0.02, max_signal_rate = 0.98) {
+    self$start_angle <- nn_buffer(torch_acos(max_signal_rate))
+    self$end_angle <- nn_buffer(torch_acos(min_signal_rate))
+  },
+  forward = function(diffusion_times) {
+    angles <- self$start_angle + diffusion_times*(self$end_angle - self$start_angle)
+
+    list(
+      signal = torch_cos(angles),
+      noise = torch_sin(angles)
+    )
+  }
+)
+```
+
+Note that, because $\cos^2(x) + \sin^2(x) = 1$, the signal and noise rates always satisfy
+$\text{signal}^2 + \text{noise}^2 = 1$, matching the reparametrization from the forward
+diffusion section.
+
+The graph below shows the values of the signal and noise rates as we progress
+in the diffusion process:
+
+```{r, echo=FALSE}
+# script to get the plot below
+library(torch)
+library(tidyverse)
+cosine_schedule <- nn_module(
+  initialize = function(min_signal_rate = 0.02, max_signal_rate = 0.98) {
+    self$start_angle <- nn_buffer(torch_acos(max_signal_rate))
+    self$end_angle <- nn_buffer(torch_acos(min_signal_rate))
+  },
+  forward = function(diffusion_times) {
+    angles <- self$start_angle + diffusion_times*(self$end_angle - self$start_angle)
+
+    list(
+      signal = torch_cos(angles),
+      noise = torch_sin(angles)
+    )
+  }
+)
+
+x <- cosine_schedule()
+result <- x(seq(0, 1, length.out = 100))
+
+df <- as.data.frame(lapply(result, as.numeric))
+df$diffusion_time <- seq(0, 1, length.out = 100)
+df %>%
+  pivot_longer(cols = c(signal, noise), "type") %>%
+  ggplot(aes(x = diffusion_time, y = value, color = type)) +
+  geom_line() +
+  labs(
+    title = "Signal and noise rates depending on the diffusion time",
+    x = "Diffusion time",
+    y = "Rate",
+    color = "Rate of:"
+  ) +
+  theme_minimal()
+```
+
+```{r, eval=TRUE, fig.cap="Noise/Signal rates per diffusion time", echo=FALSE, out.width="100%"}
+knitr::include_graphics("images/schedules.png")
+```
+
+The image below illustrates what it means to apply an *amount of noise* to the
+images. We took a real image and added noise to it according to the cosine schedule
+(first row) and the linear schedule (second row) for linearly spaced diffusion times
+ranging from 0 to 1.
+
+```{r eval=TRUE, layout="l-body-outset", fig.cap="Images transformed by the forward diffusion process at different diffusion times", echo=FALSE, out.height="130px"}
+knitr::include_graphics("images/schedules-1.png")
+```
+
+## Training the model
+
+We have defined the model training loop in `step()` and the neural net architecture
+that will be used to make predictions. We can now train the model using this procedure.
+
+Before training, though, we need training data. In our experiments we trained the model
+with two different datasets: the [Oxford Pets](https://www.robots.ox.ac.uk/~vgg/data/pets/)
+dataset and the [Oxford Flowers](https://www.robots.ox.ac.uk/~vgg/data/flowers/) dataset.
+
+### Dataset
+
+Fortunately those datasets are readily available in the torchdatasets package, so
+we don't need to figure out how to download and read the images ourselves. We only had
+to implement a transform that center crops and resizes the images so they all
+have the same size.
+
+We created a wrapper dataset that takes either the pets dataset or the flowers dataset
+and applies the transforms to the input data. Center cropping the image before resizing
+is a good idea for generative models, as it preserves the original scale of the image.
+
+```{r}
+diffusion_dataset <- dataset(
+  "DiffusionDataset",
+  initialize = function(dataset, image_size, ...) {
+
+    self$image_size <- image_size
+
+    self$transform <- function(x) {
+      img <- x |>
+        transform_to_tensor()
+
+      c(ch, height, width) %<-% img$size()
+      crop_size <- min(height, width)
+
+      img |>
+        transform_center_crop(c(crop_size, crop_size)) |>
+        transform_resize(self$image_size)
+    }
+
+    self$data <- dataset(
+      dir, # directory where the dataset is stored
+      transform = self$transform,
+      ...
+    )
+  },
+  .getitem = function(i) {
+    self$data[i]
+  },
+  .length = function() {
+    length(self$data)
+  }
+)
+```
+
+For instance, this is called with the pets dataset like so:
+
+```{r}
+train_dataset <- diffusion_dataset(
+  torchdatasets::oxford_pet_dataset,
+  target_type = "species",
+  image_size,
+  split = "train",
+  download = TRUE
+)
+```
+
+The flowers and pets datasets return, for each item, a `list()` with an image
+and its target. We will simply ignore the target, as we don't use it when training
+our model. We don't show further details about the datasets here; feel free to look
+at the `datasets.R` file for more.
+
+### Fitting
+
+A simplified version of the code from `train.R` is shown below. Here we set up the
+optimizer and set the values for the model and optimizer hyper-parameters.
+
+```{r}
+model <- diffusion_model |>
+  setup(
+    optimizer = torchopt::optim_adamw
+  ) |>
+  set_hparams(
+    image_size = c(3, 64, 64),
+    block_depth = 2,
+    loss = torch::nn_l1_loss(), # mean absolute error
+    widths = c(32, 64, 96, 128),
+    embedding_dim = 32,
+    schedule = diffusion_schedule_config("cosine", 0.02, 0.95),
+    loss_on = "noise"
+  ) |>
+  set_opt_hparams(lr = 1e-4, weight_decay = 1e-4)
+```
+
+We then use `fit` to train the model.
+
+```{r}
+train_dataset <- diffusion_dataset(
+  torchdatasets::oxford_pet_dataset,
+  target_type = "species",
+  image_size,
+  split = "train",
+  download = TRUE
+)
+
+fitted <- model |>
+  fit(
+    train_dataset,
+    epochs = 50,
+    dataloader_options = list(batch_size = batch_size, num_workers = num_workers)
+  )
+```
+
+Training on the pets dataset takes ~40 minutes using a modern GPU. The model
+is trained for 50 epochs.
+
+## Sampling images
+
+Once the model has been trained we can sample images by reversing the
+diffusion process. In slightly more formal terms, with our training procedure
+the model learned $p(x_0 \mid x_t, \bar{\alpha}_t)$, i.e. it learned to predict the
+initial image $x_0$ given $x_t$ (a noisy image) and $\bar{\alpha}_t$ (the 'amount of noise' in the noisy image).
+
+Now suppose we have an image $x_T$, an image at time step $T$, which is equivalent
+to `diffusion_time = 1`. At `diffusion_time = 1` the noise rate is much higher than
+the signal rate, so the image is indistinguishable from pure noise, meaning that
+to sample from the distribution of $x_T$ we can just sample random noise.
+
+Given an instance $x_T$, and knowing that $\bar{\alpha}_T$ is very close to 0, we can use our model
+to predict $x_0$ by denoising $x_T$. That prediction of $x_0$ might be reasonable,
+but it works better if, instead of denoising in a single model pass, we denoise
+it slowly, just like the forward diffusion process adds noise slowly to the images.
+Luckily, given a prediction of $x_0$ and the noise in $x_T$, it's easy to sample
+$x_{T-1}$: we just need to re-add the noise with the intensity given by
+the schedule for that step.
+
+Below we show the code for the reverse diffusion process, which is implemented
+as another method in the `diffusion_model` module. The `diffusion_steps` parameter
+is equivalent to $T$ in the equations above - it represents the number of steps in
+the diffusion process.
+The diffusion step is mapped to a diffusion time by simply
+computing $\frac{\text{step}}{T}$.
+
+```{r}
+diffusion_model <- nn_module(
+  # ...
+  reverse_diffusion = function(initial_noise, diffusion_steps) {
+    noisy_images <- initial_noise
+
+    # start at diffusion time 1, i.e. pure noise
+    diffusion_times <- torch_ones(c(initial_noise$shape[1], 1, 1, 1), device = initial_noise$device)
+    rates <- self$diffusion_schedule(diffusion_times)
+
+    # we want to combine with the 'next' value in mind, thus we remove the first
+    # value here
+    for (step in seq(1, 0, length.out = diffusion_steps)[-1]) {
+      c(pred_noises, pred_images) %<-% self$denoise(noisy_images, rates)
+
+      # remix the predicted components using the next signal and noise rates
+      diffusion_times <- torch_ones_like(diffusion_times)*step
+      rates <- self$diffusion_schedule(diffusion_times)
+
+      noisy_images <- rates$signal * pred_images + rates$noise * pred_noises
+    }
+
+    self$normalize$denormalize(pred_images)
+  }
+  # ...
+)
+```
+
+In conclusion, the sampling process does the following:
+
+- Starts with a noisy image that is just random Gaussian noise.
+- Repeats the following for the number of diffusion steps:
+
+  - Uses the model to 'denoise' the noisy image.
+  - Re-creates a noisy image from the denoised image and a reduced amount of noise, as specified by the schedule.
+
+The image below illustrates the forward diffusion process (first row)
+and then, starting from random noise, the reverse diffusion process (second row).
+
+```{r fig.cap="Forward (first row) and reverse (second row) diffusion process for 10 diffusion steps", echo=FALSE, eval=TRUE, layout="l-page", out.height="130px"}
+knitr::include_graphics("https://github.com/dfalbel/denoising-diffusion/raw/main/README_files/figure-commonmark/reverse-1.png")
+```
+
+The `generate` method is implemented as a simple wrapper around `reverse_diffusion`:
+
+```{r}
+diffusion_model <- nn_module(
+  # ...
+  generate = function(num_images, diffusion_steps = 20) {
+    initial_noise <- torch_randn(c(num_images, self$image_size), device=self$device)
+    self$reverse_diffusion(initial_noise, diffusion_steps = diffusion_steps)
+  }
+  # ...
+)
+```
+
+Images can be generated using `model$generate(num_images = 20)`.
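+
+For instance, a minimal sketch of how one might look at the samples, assuming `fitted` is the
+object returned by `fit()` in the previous section and that the denormalized values lie roughly
+in $[0, 1]$:
+
+```{r}
+imgs <- fitted$model$generate(num_images = 4, diffusion_steps = 20)
+
+# take the first image, move it to the CPU and rearrange it channels-last for plotting
+img <- imgs[1, , , ]$cpu()$clamp(0, 1)$permute(c(2, 3, 1))
+plot(as.raster(as.array(img)))
+```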
+
+## Measuring performance
+
+Measuring the performance of image generation models is not a simple process. In general
+we want the images to be similar to the data distribution the model was trained on, but
+they still need to have some novelty.
+
+There are several metrics that can be used to that end. In this article we used the
+Kernel Inception Distance (KID), which is relatively easy to compute and faster than the
+Fréchet Inception Distance, another commonly used metric.
+
+The KID metric compares feature representations of the generated and real images.
+It uses the Inception neural network to extract deep features from both sets of
+images and computes the [Kernel Maximum Mean Discrepancy (MMD)](https://stats.stackexchange.com/a/276618)
+between the feature representations.
+
+This metric is not implemented in luz by default, but we can implement a custom luz
+metric to compute the KID. We don't go into the details of how KID works; the full
+implementation can be found in the `kid.R` file in the GitHub repository. We have chosen
+to focus on the parts that require technical knowledge of luz.
+
+### luz custom metrics 101
+
+Luz can also be extended by implementing custom metrics. Custom metrics are created using
+the `luz_metric` class constructor. Implementing a custom luz metric requires implementing:
+
+- `initialize()`: called at the start of each epoch. It's usually used to initialize
+  buffers, and possibly other state that might be required for the metric computation.
+- `update(preds, target)`: called once per batch iteration. It takes a batch of predictions
+  and targets and is in general used to update buffers for the metric.
+- `compute()`: called at the end of each epoch. Uses the internal state to compute metric values.
+  This function is called whenever we need to obtain the current metric value. E.g., it's called at every
+  training step for metrics displayed in the progress bar, but only once per epoch to record its
+  value when the progress bar is not displayed.
+
+See the [luz metric documentation](https://mlverse.github.io/luz/reference/luz_metric.html) for more information.
+
+### Implementing KID
+
+Let's assume that `metric_kid_base` is a luz metric that takes a batch of generated
+images and a batch of real images to compute the KID.
+
+Note that neither `preds` nor `target` is used in the update method. Instead, we
+generate new images using the `generate` method and compare them with the original input
+images, which we stored in the context in our custom `step()` method.
+
+```{r}
+metric_kid <- luz_metric(
+  "kid_metric",
+  inherit = metric_kid_base,
+  initialize = function(diffusion_steps = 5) {
+    super$initialize()
+    self$diffusion_steps <- diffusion_steps
+  },
+  update = function(preds, target) {
+    ctx$model$eval()
+    with_no_grad({
+      images <- ctx$model$normalize$denormalize(ctx$input)
+      generated_images <- ctx$model$generate(
+        num_images = images$shape[1],
+        diffusion_steps = self$diffusion_steps
+      )
+      super$update(images, generated_images)
+    })
+    ctx$model$train()
+  }
+)
+```
+
+Once a custom metric is implemented, it can be passed to the `setup()` function,
+just like any other built-in metric. For example, the setup call shown in the 'Fitting' section
+can be replaced by:
+
+```{r}
+# ...
+model <- diffusion_model |>
+  setup(
+    optimizer = torchopt::optim_adamw,
+    metrics = luz_metric_set(
+      metrics = list(metric_kid(), image_loss(), noise_loss())
+    )
+  )
+# ...
+```
+
+## Conclusion
+
+This blog post is hopefully a good way to get started with diffusion models using torch.
+We tried to describe as much as possible of the code available in [this repository](https://github.com/dfalbel/denoising-diffusion), and we hope that you
+can now see how the components connect to each other.
+
+The code repository also contains the results of experiments conducted using the GuildAI for
+R integration. Please feel free to contact us on [GitHub](https://github.com/dfalbel/denoising-diffusion)
+if you have questions.
+
+Thank you for reading!
\ No newline at end of file diff --git a/_posts/2023-04-21-denoising-diffusion-impl/images/book.jpg b/_posts/2023-04-21-denoising-diffusion-impl/images/book.jpg new file mode 100644 index 00000000..0b3d970e Binary files /dev/null and b/_posts/2023-04-21-denoising-diffusion-impl/images/book.jpg differ diff --git a/_posts/2023-04-21-denoising-diffusion-impl/images/overview.png b/_posts/2023-04-21-denoising-diffusion-impl/images/overview.png new file mode 100644 index 00000000..048fbdb1 Binary files /dev/null and b/_posts/2023-04-21-denoising-diffusion-impl/images/overview.png differ diff --git a/_posts/2023-04-21-denoising-diffusion-impl/images/reverse-diffusion.png b/_posts/2023-04-21-denoising-diffusion-impl/images/reverse-diffusion.png new file mode 100644 index 00000000..bb104626 Binary files /dev/null and b/_posts/2023-04-21-denoising-diffusion-impl/images/reverse-diffusion.png differ diff --git a/_posts/2023-04-21-denoising-diffusion-impl/images/schedules-1.png b/_posts/2023-04-21-denoising-diffusion-impl/images/schedules-1.png new file mode 100644 index 00000000..36b7aad9 Binary files /dev/null and b/_posts/2023-04-21-denoising-diffusion-impl/images/schedules-1.png differ diff --git a/_posts/2023-04-21-denoising-diffusion-impl/images/schedules.png b/_posts/2023-04-21-denoising-diffusion-impl/images/schedules.png new file mode 100644 index 00000000..d7571ab9 Binary files /dev/null and b/_posts/2023-04-21-denoising-diffusion-impl/images/schedules.png differ diff --git a/_posts/2023-04-21-denoising-diffusion-impl/images/sinusoidal-1.png b/_posts/2023-04-21-denoising-diffusion-impl/images/sinusoidal-1.png new file mode 100644 index 00000000..e73d0939 Binary files /dev/null and b/_posts/2023-04-21-denoising-diffusion-impl/images/sinusoidal-1.png differ diff --git a/_posts/2023-04-21-denoising-diffusion-impl/images/squirrel.png b/_posts/2023-04-21-denoising-diffusion-impl/images/squirrel.png new file mode 100644 index 00000000..882786fe Binary files /dev/null and b/_posts/2023-04-21-denoising-diffusion-impl/images/squirrel.png differ diff --git a/_posts/2023-04-21-denoising-diffusion-impl/references.bib b/_posts/2023-04-21-denoising-diffusion-impl/references.bib new file mode 100644 index 00000000..60557a1e --- /dev/null +++ b/_posts/2023-04-21-denoising-diffusion-impl/references.bib @@ -0,0 +1,114 @@ + +@misc{kerasDDIM, + title = {Denoising Diffusion Implicit Models}, + url = {https://keras.io/examples/generative/ddim/}, + author = {Béres, András}, + year = {2022} +} + +@article{song2020, + title = {Denoising Diffusion Implicit Models}, + author = {Song, Jiaming and Meng, Chenlin and Ermon, Stefano}, + year = {2020}, + date = {2020}, + doi = {10.48550/ARXIV.2010.02502}, + url = {https://arxiv.org/abs/2010.02502} +} + +@article{nichol2021, + title = {Improved Denoising Diffusion Probabilistic Models}, + author = {Nichol, Alex and Dhariwal, Prafulla}, + year = {2021}, + date = {2021}, + doi = {10.48550/ARXIV.2102.09672}, + url = {https://arxiv.org/abs/2102.09672} +} + +@article{karras2022, + title = {Elucidating the Design Space of Diffusion-Based Generative Models}, + author = {Karras, Tero and Aittala, Miika and Aila, Timo and Laine, Samuli}, + year = {2022}, + date = {2022}, + doi = {10.48550/ARXIV.2206.00364}, + url = {https://arxiv.org/abs/2206.00364} +} + +@article{kazemnejad2019:pencoding, + title = "Transformer Architecture: The Positional Encoding", + author = "Kazemnejad, Amirhossein", + journal = "kazemnejad.com", + year = "2019", + url = 
"https://kazemnejad.com/blog/transformer_architecture_positional_encoding/" +} + +@article{sohl-dickstein2015, + title = {Deep Unsupervised Learning using Nonequilibrium Thermodynamics}, + author = {Sohl-Dickstein, Jascha and Weiss, Eric A. and Maheswaranathan, Niru and Ganguli, Surya}, + year = {2015}, + date = {2015}, + doi = {10.48550/ARXIV.1503.03585}, + url = {https://arxiv.org/abs/1503.03585} +} + +@article{weng2021diffusion, + title = "What are diffusion models?", + author = "Weng, Lilian", + journal = "lilianweng.github.io", + year = "2021", + month = "Jul", + url = "https://lilianweng.github.io/posts/2021-07-11-diffusion-models/" +} + +@article{bansal2022, + title = {Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise}, + author = {Bansal, Arpit and Borgnia, Eitan and Chu, Hong-Min and Li, Jie S. and Kazemi, Hamid and Huang, Furong and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom}, + year = {2022}, + date = {2022}, + doi = {10.48550/ARXIV.2208.09392}, + url = {https://arxiv.org/abs/2208.09392} +} + +@article{ho2020, + title = {Denoising Diffusion Probabilistic Models}, + author = {Ho, Jonathan and Jain, Ajay and Abbeel, Pieter}, + year = {2020}, + date = {2020}, + doi = {10.48550/ARXIV.2006.11239}, + url = {https://arxiv.org/abs/2006.11239} +} + +@article{ronneberger2015, + title = {U-Net: Convolutional Networks for Biomedical Image Segmentation}, + author = {Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas}, + year = {2015}, + date = {2015}, + doi = {10.48550/ARXIV.1505.04597}, + url = {https://arxiv.org/abs/1505.04597} +} + +@article{he2015, + title = {Deep Residual Learning for Image Recognition}, + author = {He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian}, + year = {2015}, + date = {2015}, + doi = {10.48550/ARXIV.1512.03385}, + url = {https://arxiv.org/abs/1512.03385} +} + +@article{vaswani2017, + title = {Attention Is All You Need}, + author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia}, + year = {2017}, + date = {2017}, + doi = {10.48550/ARXIV.1706.03762}, + url = {https://arxiv.org/abs/1706.03762} +} + +@article{binkowski2018, + title = {Demystifying MMD GANs}, + author = {{Bi{\'{n}}kowski}, {Miko{\l}aj} and Sutherland, Danica J. and Arbel, Michael and Gretton, Arthur}, + year = {2018}, + date = {2018}, + doi = {10.48550/ARXIV.1801.01401}, + url = {https://arxiv.org/abs/1801.01401} +}