Introduction to Regularized Maximum Likelihood Imaging#

This document is an attempt to provide a whirlwind introduction to what Regularized Maximum Likelihood (RML) imaging is, and why you might want to use this MPoL package to perform it with your interferometric dataset. Of course, the field is rich and varied, and this short introduction couldn't possibly cover the topic in depth. We recommend that you check out many of the links and suggestions in this document for further reading and understanding.

Introduction to Likelihood functions#

Typically, when astronomers fit a model to some dataset, such as a line \(y = m x + b\) to a collection of \(\boldsymbol{X} = \{x_1, x_2, \ldots\, x_N\}\) and \(\boldsymbol{Y} = \{y_1, y_2, \ldots\, y_N\}\) points, we require a likelihood function. Simply put, the likelihood function specifies the probability of the data, given a model, and encapsulates our assumptions about the data and noise generating processes.

For most real-world datasets, we don’t measure the “true” \(y\) value of the line (i.e., \(mx + b\)), but rather make a measurement which has been partially corrupted by some “noise.” In that case, we say that each \(y_i\) data point is actually generated by

\[ y_i = m x_i + b + \epsilon \]

where \(\epsilon\) is a noise realization from a normal (Gaussian) distribution with mean zero and standard deviation \(\sigma\), i.e.,

\[ \epsilon \sim \mathcal{N}(0, \sigma). \]

This information about the data and noise generating process means that we can write down a likelihood function to calculate the probability of the data, given a set of model parameters. The likelihood function is \(p(\boldsymbol{Y} |\,\boldsymbol{\theta})\). Sometimes it is written as \(\mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y})\) to emphasize that it is in fact a function (note, not probability) of the model parameters, and that it takes the data as an argument. Frequently, when employed in computation, we’ll use the logarithm of the likelihood function, or “log-likelihood,” \(\ln \mathcal{L}\) to avoid numerical under/overflow issues. Let’s call \(\boldsymbol{\theta} = \{m, b\}\) and \(M(x_i |\, \boldsymbol{\theta}) = m x_i + b\) to emphasize that we can use the model \(M\) to calculate a response at input \(x_i\) conditional on model parameters \(\boldsymbol{\theta}\). The likelihood function for this line problem is

\[ \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y}) = \prod_i^N \frac{1}{\sqrt{2 \pi} \sigma} \exp \left [ - \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{2 \sigma^2}\right ] \]

The logarithm of the likelihood function is

\[ \ln \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y}) = -N \ln(\sqrt{2 \pi} \sigma) - \frac{1}{2} \sum_i^N \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{\sigma^2} \]

You may recognize that the right-hand term looks similar to the \(\chi^2\) metric,

\[ \chi^2(\boldsymbol{\theta}; \boldsymbol{Y}) = \sum_i^N \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{\sigma^2} \]

Assuming that the uncertainty (\(\sigma\)) on each data point is known (and remains constant), the first term in the log likelihood expression remains constant, and we have

\[ \ln \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y}) = - \frac{1}{2} \chi^2 (\boldsymbol{\theta}; \boldsymbol{Y}) + C \]

where \(C\) is a constant with respect to the model parameters. It is common to use shorthand to say that “the likelihood function is \(\chi^2\)” to indicate situations where the data uncertainties are Gaussian. Very often, we (or others) are interested in the parameter values \(\boldsymbol{\theta}_\mathrm{MLE}\) which maximize the likelihood function. Unsurprisingly, these parameters are called the maximum likelihood estimate (or MLE), and usually they represent something like a “best-fit” model. [1]
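As a concrete (and purely illustrative) example, here is a short NumPy sketch that evaluates this Gaussian log-likelihood for a mock line dataset and recovers the maximum likelihood slope and intercept via an ordinary least-squares solve. All of the names here (`m_true`, `log_likelihood`, etc.) are ours, invented for this sketch, and are not part of any MPoL API.

```python
import numpy as np

rng = np.random.default_rng(42)

# mock dataset drawn from a "true" line, y_i = m x_i + b + epsilon
m_true, b_true, sigma = 2.0, -1.0, 0.5
x = np.linspace(0, 10, 50)
y = m_true * x + b_true + rng.normal(0.0, sigma, size=x.size)

def log_likelihood(theta, x, y, sigma):
    """Gaussian log-likelihood of the line model M(x | theta) = m x + b."""
    m, b = theta
    model = m * x + b
    chi2 = np.sum((y - model) ** 2 / sigma**2)
    return -x.size * np.log(np.sqrt(2 * np.pi) * sigma) - 0.5 * chi2

# for a linear model, the MLE is the ordinary least-squares solution
A = np.vstack([x, np.ones_like(x)]).T
theta_mle, *_ = np.linalg.lstsq(A, y, rcond=None)
print("MLE (m, b):", theta_mle)
print("ln L at MLE:", log_likelihood(theta_mle, x, y, sigma))
```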

When it comes time to do parameter inference, however, it’s important to keep in mind

  1. the simplifying assumptions we made about the noise uncertainties being constant with respect to the model parameters. If we were to “fit for the noise” in a hierarchical model, for example, we would need to use the full form of the log-likelihood function, including the \(-N \ln \left (\sqrt{2 \pi} \sigma \right)\) term.

  2. that in order to maximize the likelihood function we want to minimize the \(\chi^2\) function.

  3. that constants of proportionality (e.g., the \(1/2\) in front of the \(\chi^2\)) can matter when combining likelihood functions with prior distributions for Bayesian parameter inference. We’ll have more to say on this in a second when we talk about regularizers and their strengths.

To be specific, \(\chi^2\) is not the end of the story when we’d like to perform Bayesian parameter inference. To do so, we need the posterior probability distribution of the model parameters given the dataset, \(p(\boldsymbol{\theta}|\,\boldsymbol{Y})\). We can calculate this quantity using Bayes rule

\[ p(\boldsymbol{\theta}|\,\boldsymbol{Y}) = \frac{p(\boldsymbol{Y}|\,\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\boldsymbol{Y})} \]

The denominator is a constant so long as the model specification remains the same, leaving

\[ p(\boldsymbol{\theta}|\,\boldsymbol{Y}) \propto p(\boldsymbol{Y}|\,\boldsymbol{\theta})\, p(\boldsymbol{\theta}). \]

So we need a prior probability distribution \(p(\boldsymbol{\theta})\) in addition to the likelihood function to calculate the posterior probability distribution of the model parameters. Analogous to the maximum likelihood estimate, there is also the maximum a posteriori estimate (or MAP), which includes the effect of the prior probability distribution.
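Continuing the line-fit sketch from above, here is a minimal illustration of how a prior turns the MLE into a MAP estimate: we simply add a log-prior to the log-likelihood and optimize the sum. The Gaussian prior on the slope is an arbitrary choice made purely for demonstration, and `log_likelihood`, `x`, `y`, and `sigma` are reused from the previous sketch.

```python
from scipy.optimize import minimize

def log_prior(theta):
    """Illustrative prior: Gaussian on the slope, flat (improper) on the intercept."""
    m, b = theta
    return -0.5 * (m - 1.5) ** 2 / 0.1**2

def neg_log_posterior(theta, x, y, sigma):
    # maximizing the posterior is the same as minimizing the negative log-posterior
    return -(log_likelihood(theta, x, y, sigma) + log_prior(theta))

result = minimize(neg_log_posterior, x0=[1.0, 0.0], args=(x, y, sigma))
theta_map = result.x
print("MAP (m, b):", theta_map)
```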

See also

Useful resources on Bayesian inference include

Data in the Fourier domain#

MPoL is a package to make images from interferometric data. Currently, we are most focused on modeling datasets from radio interferometers like the Atacama Large Millimeter Array (ALMA), so the following introduction will have a radio astronomy flavor to it. But the concept of forward modeling interferometric data is quite general, and with a few additions the MPoL package could be applied to imaging problems involving Fourier data from optical and infrared telescopes (if this describes your dataset, please get in touch).

As astronomers, we are most often interested in characterizing what an astrophysical source looks like. In other words, we want to know how its surface brightness \(I\) changes as a function of sky position. However, interferometers acquire samples of data in the Fourier domain, also called the visibility domain. The visibility domain is the Fourier transform of the image sky brightness

\[ {\cal V}(u,v) = \iint I(l,m) \exp \left \{- 2 \pi i (ul + vm) \right \} \, \mathrm{d}l\,\mathrm{d}m, \]

where \(l\) and \(m\) are direction cosines (roughly equivalent to R.A. and Dec) which parameterize the surface brightness distribution of the image \(I(l,m)\), and \(u\) and \(v\) are spatial frequencies which parameterize the visibility function \(\cal{V}(u,v)\). For more information on the meaning of these units, see Units and Conventions.

The visibility function is complex-valued, and each measurement of it (denoted by \(V_i\)) is made in the presence of noise

\[ V_i = \mathcal{V}(u_i, v_i) + \epsilon. \]

Here \(\epsilon\) represents a noise realization from a complex normal (Gaussian) distribution. Thankfully, most interferometric datasets do not exhibit significant covariance between the real and imaginary noise components, so we could equivalently say that the real and imaginary components of the noise are separately generated by draws from normal distributions characterized by standard deviation \(\sigma\)

\[\begin{split} \epsilon_\Re \sim \mathcal{N}(0, \sigma) \\ \epsilon_\Im \sim \mathcal{N}(0, \sigma) \end{split}\]

and

\[ \epsilon = \epsilon_\Re + i \epsilon_\Im \]

Radio interferometers will commonly represent the uncertainty on each visibility measurement by a “weight” \(w_i\), where

\[ w_i = \frac{1}{\sigma_i^2} \]
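As a small illustration (not MPoL-specific code), weights exported from a reduction package can be converted back to a per-visibility standard deviation, which in turn can be used to simulate the complex noise draws described above. The array values and names here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

weight = np.array([2.5e4, 1.1e4, 3.0e4])   # example visibility weights, 1 / Jy^2
sigma = 1.0 / np.sqrt(weight)               # per-visibility standard deviation, Jy

# one complex noise realization per visibility: independent real and imaginary draws
epsilon = rng.normal(0.0, sigma) + 1j * rng.normal(0.0, sigma)
```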

A full interferometric dataset is a collection of visibility measurements, which we represent by

\[ \boldsymbol{V} = \{V_i\}_{i=1}^N \]

For reference, a typical ALMA dataset might contain a half-million individual visibility samples, acquired over a range of spatial frequencies.

See also

A full introduction to Fourier transforms, radio astronomy, and interferometry is beyond the scope of this introduction. However, here are some additional resources that we recommend checking out.

Likelihood functions for Fourier data#

Now that we’ve introduced likelihood functions in general and the specifics of Fourier data, let’s talk about likelihood functions for inference with Fourier data. As before, our statement about the data generating process

\[ V_i = \mathcal{V}(u_i, v_i) + \epsilon \]

leads us to the formulation of the likelihood function.

First, let’s assume we have some model that we’d like to fit to our dataset. To be a forward model, it should be able to predict the value of the visibility function for any spatial frequency, i.e., we need to be able to calculate \(\mathcal{V}(u, v) = M_\mathcal{V}(u, v |\, \boldsymbol{\theta})\).

It’s difficult to reason about all but the simplest models directly in the Fourier plane, so usually models are constructed in the image plane \(M_I(l,m |\,\boldsymbol{\theta})\) and then Fourier transformed (either analytically, or via the FFT) to construct visibility models \(M_\mathcal{V}(u, v |\, \boldsymbol{\theta}) \leftrightharpoons M_I(l,m |\,\boldsymbol{\theta})\). For example, these models could be channel maps of carbon monoxide emission from a rotating protoplanetary disk (as in Czekala et al. 2015, where \(\boldsymbol{\theta}\) contains parameters setting the structure of the disk), or rings of continuum emission from a protoplanetary disk (as in Guzmán et al. 2018, where \(\boldsymbol{\theta}\) contains parameters setting the sizes and locations of the rings).
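Schematically, and glossing over details like sign conventions, phase centering, primary beam effects, and interpolating onto the exact \((u, v)\) points of the data, an image-plane model can be turned into model visibilities with an FFT. The NumPy sketch below is illustrative only and is not how MPoL implements its forward model.

```python
import numpy as np

npix = 256
cell_size = 0.01                                   # pixel size in arcsec (illustrative)
cell_rad = cell_size * np.pi / (180.0 * 3600.0)    # pixel size in radians

# toy image-plane model M_I(l, m | theta): a centered 2D Gaussian blob
l = (np.arange(npix) - npix // 2) * cell_rad
ll, mm = np.meshgrid(l, l)
image = np.exp(-(ll**2 + mm**2) / (2.0 * (20 * cell_rad) ** 2))

# FFT to the visibility plane; the pixel-area factor approximates the integral
vis_model = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(image))) * cell_rad**2

# the corresponding spatial frequency axis, in units of lambda
u = np.fft.fftshift(np.fft.fftfreq(npix, d=cell_rad))
```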

Following the discussion of how the complex noise realization \(\epsilon\) is generated, the log-likelihood function is

\[ \ln \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{V}) = - \frac{1}{2} \chi^2(\boldsymbol{\theta}; \boldsymbol{V}) + C \]

Because the data and model are complex-valued, \(\chi^2\) is evaluated as

\[ \chi^2(\boldsymbol{\theta}; \boldsymbol{V}) = \sum_i^N \frac{|V_i - M_\mathcal{V}(u_i, v_i |\,\boldsymbol{\theta})|^2}{\sigma_i^2} \]

where \(|\cdot|\) denotes the complex modulus. Equivalently, the calculation can be broken up into sums over the real (\(\Re\)) and imaginary (\(\Im\)) components of the visibility data and model

\[ \chi^2(\boldsymbol{\theta}; \boldsymbol{V}) = \sum_i^N \frac{(V_{\Re,i} - M_\mathcal{V,\Re}(u_i, v_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2} + \sum_i^N \frac{(V_{\Im,i} - M_\mathcal{V,\Im}(u_i, v_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2} \]
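In code, this complex \(\chi^2\) (and hence the log-likelihood, up to the constant \(C\)) is only a few lines. The NumPy sketch below assumes you already have arrays of data visibilities, model visibilities, and weights; the function names are ours and do not correspond to actual MPoL functions.

```python
import numpy as np

def chi_squared(data_vis, model_vis, weight):
    """Complex chi^2: weighted squared residuals, summing real and imaginary parts."""
    residual = data_vis - model_vis
    # |residual|^2 = residual.real**2 + residual.imag**2, and weight = 1 / sigma^2
    return np.sum(weight * np.abs(residual) ** 2)

def log_likelihood(data_vis, model_vis, weight):
    """Log-likelihood, up to the constant term C."""
    return -0.5 * chi_squared(data_vis, model_vis, weight)
```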

Now with the likelihood function specified, we can add prior probability distributions \(p(\boldsymbol{\theta})\), and calculate and explore the posterior probability distribution of the model parameters using algorithms like Markov Chain Monte Carlo. In this type of Bayesian inference, we’re usually using forward models constructed with a small to medium number of parameters (e.g., 10 - 30), like in the protoplanetary disk examples of Czekala et al. 2015 or Guzmán et al. 2018.

Note

Even though we would say that “traditional” Bayesian parameter inference is not the main focus of RML algorithms, it is entirely possible with the MPoL package. In fact, the gradient-based nature of the MPoL package (discussed in a moment) can make posterior exploration very fast using efficient Hamiltonian Monte Carlo samplers.

Note

The \(\chi^2\) likelihood function as formulated above is appropriate for visibilities with minimal spectral covariance. When modeling spectral line datasets, in particular those that have not been channel-averaged and retain the spectral response function from their Hann windowing, this covariance must be taken into account in the likelihood function. More information on how to derive these covariance matrices is provided in the appendices of Loomis et al. 2018 and will be detailed in forthcoming tutorials.

RML images as non-parametric models#

Now that we’ve introduced what it means to forward-model a dataset and how to calculate a likelihood function, let’s talk about non-parametric models.

Say that our \(\boldsymbol{X} = \{x_1, x_2, \ldots\, x_N\}\) and \(\boldsymbol{Y} = \{y_1, y_2, \ldots\, y_N\}\) dataset looked a bit more structured than a simple \(y = mx + b\) relationship. We could expand the model by adding more parameters, for example, by adding quadratic and cubic terms, e.g., \(y = a_0 + a_1 x + a_2 x^2 + a_3 x^3\). This would be a reasonable approach, especially if the parameters \(a_2\), \(a_3\), etc. had physical meaning. But if all that we’re interested in is modeling the relationship \(y = f(x)\) in order to make predictions, we could just as easily use a non-parametric model, like a spline or a Gaussian process.

With RML imaging, we’re trying to come up with a model that will fit the dataset. But rather than using a parametric model like a protoplanetary disk structure model or a series of Gaussian rings, we’re using a non-parametric model of the image itself. This could be as simple as parameterizing the image using the intensity values of the pixels themselves, i.e.,

\[ \boldsymbol{\theta} = \{I_1, I_2, \ldots, I_{N^2} \} \]

assuming we have an \(N \times N\) image.
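In PyTorch terms, the simplest pixel-basis image model is just a 2D tensor of intensities registered as a learnable parameter. The class below is a bare-bones sketch of this idea, not MPoL's actual image model, which adds details like coordinate handling and flux conventions; the softplus mapping is just one possible way to enforce non-negative intensities.

```python
import torch

class PixelImage(torch.nn.Module):
    """Toy non-parametric image model: theta is the set of N x N pixel intensities."""

    def __init__(self, npix=256):
        super().__init__()
        # every pixel value is a free parameter to be optimized
        self.pixels = torch.nn.Parameter(torch.zeros(npix, npix))

    def forward(self):
        # softplus keeps intensities non-negative (one common choice, not the only one)
        return torch.nn.functional.softplus(self.pixels)
```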

Note

RML imaging is different from CLEAN imaging, which operates as a deconvolution procedure in the image plane. At least at sub-mm and radio wavelengths, CLEAN is by far the dominant algorithm used to synthesize images from interferometric data. Therefore, if you’re interested in RML imaging, it’s worth first understanding the basics of the CLEAN algorithm.

Here are some useful resources on the CLEAN algorithm.

A flexible image model for RML imaging is mostly analogous to using a spline or Gaussian process to fit a series of \(\boldsymbol{X} = \{x_1, x_2, \ldots\, x_N\}\) and \(\boldsymbol{Y} = \{y_1, y_2, \ldots\, y_N\}\) points—the model will nearly always have enough flexibility to capture the structure that exists in the dataset. The most straightforward formulation of a non-parametric image model is the pixel basis set, but we could also use more sophisticated basis sets like a set of wavelet coefficients, or even more exotic basis sets constructed from trained neural networks. These may have some serious advantages when it comes to the “regularizing” part of “regularized maximum likelihood” imaging. But first, let’s talk about the “maximum likelihood” part.

Given some image parameterization (e.g., a pixel basis set of \(N \times N\) pixels, with each pixel cell_size in width), we would like to find the maximum likelihood image \(\boldsymbol{\theta}_\mathrm{MLE}\). Fortunately, because the Fourier transform is a linear operation, we can analytically calculate the maximum likelihood solution (the same way we might find the best-fit slope and intercept for the line example). This maximum likelihood solution is called (in the radio astronomy world) the dirty image, and its associated point spread function is called the dirty beam.

In the construction of the dirty image, all unsampled spatial frequencies are set to zero power. This means that the dirty image will only contain spatial frequencies about which we have at least some data. This assumption, however, rarely translates into good image fidelity, especially if there are many unsampled spatial frequencies which carry significant power. It’s also important to recognize that the dirty image is only one of many images that maximize the likelihood function. From the perspective of the likelihood calculation, we could modify the unsampled spatial frequencies of the dirty image to whatever power we might like, and, because they are unsampled, the value of the likelihood calculation won’t change, i.e., it will still remain maximal.

When synthesis imaging is described as an “ill-posed inverse problem,” this is what is meant. There is a (potentially infinite) range of images that could exactly fit the dataset, and without additional information we have no way of discriminating which is best. As you might suspect, this is now where the “regularization” part of “regularized maximum likelihood” imaging comes in.

There are a number of different ways to talk about regularization. If one wants to be Bayesian about it, one would talk about specifying priors, i.e., we introduce terms like \(p(\boldsymbol{\theta})\) such that we might calculate the maximum a posteriori (MAP) image \(\boldsymbol{\theta}_\mathrm{MAP}\) using the posterior probability distribution

\[ p(\boldsymbol{\theta} |\, \boldsymbol{V}) \propto \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{V}) \, p(\boldsymbol{\theta}). \]

For computational reasons related to numerical over/underflow, we would most likely use the logarithm of the posterior probability distribution

\[ \ln p(\boldsymbol{\theta} |\, \boldsymbol{V}) \propto \ln \mathcal{L}( \boldsymbol{\theta}; \boldsymbol{V}) + \ln p(\boldsymbol{\theta}). \]

One could accomplish the same goal without necessarily invoking the Bayesian language by simply talking about which parameters \(\boldsymbol{\theta}\) optimize some objective function.

We’ll adopt the perspective that we have some objective “cost” function that we’d like to minimize to obtain the optimal parameters \(\hat{\boldsymbol{\theta}}\). The machine learning community calls this a “loss” function \(L(\boldsymbol{\theta})\), and so we’ll borrow that terminology here. For an unregularized fit, an acceptable loss function is just the negative log likelihood (“nll”) term,

\[ L(\boldsymbol{\theta}) = L_\mathrm{nll}(\boldsymbol{\theta}) = - \ln \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{V}) = \frac{1}{2} \chi^2(\boldsymbol{\theta}; \boldsymbol{V}) \]

If we’re only interested in \(\hat{\boldsymbol{\theta}}\), it doesn’t matter whether we include the \(1/2\) prefactor in front of \(\chi^2\), the loss function will still have the same optimum. However, when it comes time to add additional terms to the loss function, these prefactors matter in controlling the relative strength of each term.

When phrased in the terminology of function optimization, additional terms can be described as regularization penalties. To be specific, let’s add a term that regularizes the sparsity of an image.

\[ L_\mathrm{sparsity}(\boldsymbol{\theta}) = \sum_i |I_i| \]

This prior is described in more detail in the API documentation. In short, the L1 norm promotes sparse solutions (solutions where many pixel values are zero). The combination of these two terms leads to a new loss function

\[ L(\boldsymbol{\theta}) = L_\mathrm{nll}(\boldsymbol{\theta}) + \lambda_\mathrm{sparsity} L_\mathrm{sparsity}(\boldsymbol{\theta}) \]

where we control the relative “strength” of the regularization via the scalar prefactor \(\lambda_\mathrm{sparsity}\). If \(\lambda_\mathrm{sparsity} = 0\), no sparsity regularization is applied. Non-zero values of \(\lambda_\mathrm{sparsity}\) will add in regularization that penalizes non-sparse \(\boldsymbol{\theta}\) values. How strong this penalization is depends on its strength relative to the other terms in the loss calculation. [2]
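Putting the pieces together, a regularized optimization loop might look like the PyTorch sketch below. It reuses the toy `PixelImage` model from earlier, and stands in for real data with random gridded visibilities and weights (zero weight in "unsampled" cells); none of this reflects MPoL's actual data format or forward-model layers, it only illustrates how the negative log-likelihood and a sparsity penalty are combined and optimized.

```python
import torch

torch.manual_seed(0)
npix = 64

model = PixelImage(npix=npix)                  # pixel-basis image model sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
lam_sparsity = 1e-4                            # regularization strength, tuned by hand

# stand-ins for real data: gridded visibilities and weights on the full FFT grid,
# with zero weight in unsampled cells (purely illustrative, not an MPoL data format)
data_vis = torch.randn(npix, npix, dtype=torch.complex64)
weight = (torch.rand(npix, npix) > 0.9).float() * 1.0e4

for iteration in range(500):
    optimizer.zero_grad()
    image = model()

    # forward model: the FFT of the image gives model visibilities on the grid
    vis_model = torch.fft.fft2(image)

    residual = data_vis - vis_model
    loss_nll = 0.5 * torch.sum(weight * torch.abs(residual) ** 2)
    loss_sparsity = torch.sum(torch.abs(image))

    loss = loss_nll + lam_sparsity * loss_sparsity
    loss.backward()
    optimizer.step()
```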

We can equivalently specify this using Bayesian terminology, such that

\[ p(\boldsymbol{\theta} |\,\boldsymbol{V}) \propto \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{V}) \, p(\boldsymbol{\theta}) \]

where

\[ p(\boldsymbol{\theta}) = C \exp \left (-\lambda_\mathrm{sparsity} \sum_i | I_i| \right) \]

and \(C\) is a normalization factor. When working with the logarithm of the posterior, this constant term is irrelevant.

See also

That’s RML imaging in a nutshell, but we’ve barely scratched the surface. We highly recommend checking out the following excellent resources.

The MPoL package for Regularized Maximum Likelihood imaging#

Million Points of Light, or “MPoL,” is a Python package for performing regularized maximum likelihood imaging. By that we mean that the package provides the building blocks to create flexible image models and optimize them to fit interferometric datasets. The package is developed completely in the open on GitHub.

We strive to

  • create an open, welcoming, and supportive community for new users and contributors (see our code of conduct and developer documentation)

  • support well-tested and stable releases (i.e., pip install mpol) that run on all currently-supported Python versions, on Linux, MacOS, and Windows

  • maintain up-to-date API documentation

  • cultivate tutorials covering real-world applications

See also

We also recommend checking out several other excellent packages for RML imaging:

There are a few things about MPoL that we believe make it an appealing platform for RML modeling.

Built on PyTorch: Many of MPoL’s exciting features stem from the fact that it is built on top of a rich computational library that supports autodifferentiation and construction of complex neural networks. Autodifferentiation libraries like Theano/Aesara, Tensorflow, PyTorch, and JAX have revolutionized the way we compute and optimize functions. For now, PyTorch is the library that best satisfies our needs, but we’re keeping a close eye on the Python autodifferentiation ecosystem should a more suitable framework arrive. If you are familiar with scientific computing with Python but haven’t yet tried any of these frameworks, don’t worry, the syntax is easy to pick up and quite similar to working with numpy arrays. For example, check out our tutorial introduction to PyTorch.

Autodifferentiation: PyTorch gives MPoL the capacity to autodifferentiate through a model. The gradient of the objective function is exceptionally useful for finding the “downhill” direction in a large parameter space (such as the set of image pixels). Traditionally, these gradients would have needed to be calculated analytically (by hand) or via finite-difference methods, which can be noisy in high dimensions. Leveraging autodifferentiation allows us to rapidly formulate and implement complex prior distributions which would otherwise be difficult to differentiate by hand.
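For a feel of what autodifferentiation buys you, here is a tiny standalone example (not MPoL code): PyTorch computes the gradient of a \(\chi^2\)-like scalar with respect to every pixel of an image, with no hand-derived expressions.

```python
import torch

image = torch.zeros(128, 128, requires_grad=True)   # every pixel is a free parameter
target = torch.randn(128, 128)

loss = 0.5 * torch.sum((image - target) ** 2)        # a chi^2-like scalar loss
loss.backward()                                      # autodiff fills in d(loss)/d(pixel)

print(image.grad.shape)   # torch.Size([128, 128]): one gradient value per pixel
```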

Optimization: PyTorch provides a full-featured suite of research-grade optimizers designed to train deep neural networks. These same optimizers can be employed to quickly find the optimum RML image.

GPU acceleration: PyTorch wraps CUDA libraries, making it seamless to take advantage of (multi-)GPU acceleration to optimize images. No need to use a single line of CUDA.
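In practice, moving an optimization like the one sketched earlier onto a GPU amounts to a few `.to(device)` calls, assuming a CUDA-capable GPU is available. The tensor and model names below are reused from the earlier sketches and are illustrative only.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = PixelImage(npix=256).to(device)   # move the model's parameters to the GPU
data_vis = data_vis.to(device)            # ...and the data tensors along with it
weight = weight.to(device)
```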

Model composability: Rather than being a monolithic program for single-click RML imaging, MPoL strives to be a flexible, composable, RML imaging library that provides primitives that can be used to easily solve your particular imaging challenge. One way we do this is by mimicking the PyTorch ecosystem and writing the RML imaging workflow using PyTorch modules. This makes it easy to mix and match modules to construct arbitrarily complex imaging workflows. We’re working on tutorials that describe these ideas in depth, but one example would be the ability to use a single latent space image model to simultaneously fit single dish and interferometric data.

A bridge to the machine learning/neural network community: MPoL will happily calculate RML images for you using “traditional” image priors, in case you are the kind of person who turns their nose up at the words “machine learning” or “neural network.” However, if you are the kind of person who sees opportunity in these tools, because MPoL is built on PyTorch, it is straightforward to take advantage of them for RML imaging. For example, if one were to train a variational autoencoder on protoplanetary disk emission morphologies, the latent space + decoder architecture could be easily plugged into MPoL and serve as an imaging basis set.

To get started with MPoL, we recommend installing the package and reading through the tutorial series. If you have any questions about the package, we invite you to join us on our Github discussions page.

Footnotes