Losses#
We have separated the loss functions into two categories: those involving data (derived from likelihood functions) and those acting as regularizers (or priors). We briefly review likelihood functions and their application to Fourier data.
Review#
Reference Literature
If you are new to Bayesian inference or its notation, we recommend reviewing the following excellent resources as a prerequisite to this discussion.
Eadie et al. 2023: Practical Guidance for Bayesian Inference in Astronomy
Data Analysis: A Bayesian Tutorial by Sivia and Skilling
Data analysis recipes: Fitting a model to data by Hogg, Bovy, and Lang
Data analysis recipes: Probability calculus for inference by Hogg.
Fitting a line example#
Typically, when astronomers fit a model to some dataset, such as a line \(y = m x + b\) to a collection of points \(\boldsymbol{X} = \{x_1, x_2, \ldots, x_N\}\) and \(\boldsymbol{Y} = \{y_1, y_2, \ldots, y_N\}\), we require a likelihood function. The likelihood function specifies the probability of the data, given a model, and encapsulates our assumptions about the data and noise generating processes.
For most real-world datasets, we don’t measure the “true” \(y\) value of the line (i.e., \(mx + b\)), but rather make a measurement which has been partially corrupted by some “noise.” We say that each \(y_i\) data point is actually generated by

\[y_i = m x_i + b + \epsilon_i\]

where \(\epsilon_i\) is a noise realization drawn from a normal (Gaussian) distribution with zero mean and standard deviation \(\sigma_i\), i.e.,

\[\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)\]
This information allows us to write down a likelihood function to calculate the probability of the data, given a set of model parameters \(p(\boldsymbol{Y} |\,\boldsymbol{\theta})\). Sometimes it is written as \(\mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y})\) to emphasize that it is in fact a function (note, not probability) of the model parameters, and that it takes the data as an argument. Frequently, when employed in computation, we’ll use the logarithm of the likelihood function, or “log-likelihood,” \(\ln \mathcal{L}\) to avoid numerical under/overflow issues. Let’s call \(\boldsymbol{\theta} = \{m, b\}\) and \(M(x_i |\, \boldsymbol{\theta}) = m x_i + b\) to emphasize that we can use the model \(M\) to calculate a response at input \(x_i\) conditional on model parameters \(\boldsymbol{\theta}\). The likelihood function for this line problem is

\[\mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{Y}) = \prod_i^N \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp \left( - \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{2 \sigma_i^2} \right)\]
The logarithm of the likelihood function is

\[\ln \mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{Y}) = -\frac{N}{2} \ln (2 \pi) - \sum_i^N \ln \sigma_i - \frac{1}{2} \sum_i^N \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2}\]
You may recognize that the last term contains the \(\chi^2\) metric,

\[\chi^2(\boldsymbol{\theta};\,\boldsymbol{Y}) = \sum_i^N \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2}\]
Assuming that the uncertainty (\(\sigma_i\)) on each data point is known (and remains constant), the first two terms remain constant and we have

\[\ln \mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{Y}) = -\frac{1}{2} \chi^2(\boldsymbol{\theta};\,\boldsymbol{Y}) + C\]
where \(C\) is a constant with respect to the model parameters. It is common to use shorthand to say that “the likelihood function is \(\chi^2\)” to indicate situations where the data uncertainties are Gaussian. Very often, we are interested in the parameter values \(\boldsymbol{\theta}_\mathrm{MLE}\) which maximize the likelihood function, called the maximum likelihood estimate (or MLE).
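To make the correspondence between \(\ln \mathcal{L}\), \(\chi^2\), and the MLE concrete, here is a minimal sketch in Python. The array names (`x`, `y`, `sigma`) and the helper function are illustrative, not part of MPoL:

```python
import numpy as np

def log_likelihood_line(theta, x, y, sigma):
    """Gaussian log-likelihood of the line model M(x | theta) = m * x + b."""
    m, b = theta
    model = m * x + b
    chi2 = np.sum((y - model) ** 2 / sigma**2)
    # constant terms from the Gaussian normalization
    C = -0.5 * len(y) * np.log(2 * np.pi) - np.sum(np.log(sigma))
    return -0.5 * chi2 + C
```

For fixed \(\sigma_i\), maximizing this function over `theta` is equivalent to minimizing \(\chi^2\).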
\(\chi^2\) is not the end of the story, however, if we want to perform Bayesian parameter inference. For that, we need the posterior probability distribution of the model parameters given the dataset, \(p(\boldsymbol{\theta}|\,\boldsymbol{Y})\). We can calculate this quantity using Bayes’ rule

\[p(\boldsymbol{\theta}|\,\boldsymbol{Y}) = \frac{p(\boldsymbol{Y}|\,\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\boldsymbol{Y})}\]
The denominator is a constant so long as the model specification remains the same, leaving

\[p(\boldsymbol{\theta}|\,\boldsymbol{Y}) \propto p(\boldsymbol{Y}|\,\boldsymbol{\theta})\, p(\boldsymbol{\theta})\]
So we need a prior probability distribution \(p(\boldsymbol{\theta})\) in addition to the likelihood function to calculate the posterior probability distribution of the model parameters. Analogous to the maximum likelihood estimate, there is also the maximum a posteriori estimate (or MAP), which includes the effect of the prior probability distribution.
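Continuing the sketch above, the log posterior is simply the sum of the log-likelihood and a log prior (up to a constant). The Gaussian prior on the slope here is purely hypothetical:

```python
def log_prior(theta):
    """Hypothetical prior: m ~ Normal(0, 1), b uniform (improper)."""
    m, b = theta
    return -0.5 * m**2  # up to an additive constant

def log_posterior(theta, x, y, sigma):
    # Bayes' rule in log form: ln p(theta | Y) = ln L(theta; Y) + ln p(theta) + C
    return log_likelihood_line(theta, x, y, sigma) + log_prior(theta)
```

The MAP estimate is the value of `theta` that maximizes `log_posterior`.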
Fourier data#
Reference Literature
A full introduction to Fourier transforms, radio astronomy, and interferometry is beyond the scope of MPoL documentation. We recommend reviewing these resources as a prerequisite to the discussion.
Essential radio astronomy textbook by James Condon and Scott Ransom, and in particular, Chapter 3.7 on Radio Interferometry.
NRAO’s 17th Synthesis Imaging Workshop, with recorded lectures and slides available
Interferometry and Synthesis in Radio Astronomy by Thompson, Moran, and Swenson. An excellent and comprehensive reference on all things interferometry.
NJIT’s online course materials for Radio Astronomy
Ian Czekala’s lecture notes on Radio Interferometry and Imaging
Interferometers acquire samples of data in the Fourier domain, also called the visibility domain. The visibility domain is the Fourier transform of the image sky brightness

\[\mathcal{V}(u,v) = \iint I(l,m) \exp \left\{- 2 \pi i (ul + vm)\right\} \, \mathrm{d}l\,\mathrm{d}m\]
where \(l\) and \(m\) are direction cosines (roughly equivalent to R.A. and Dec) which parameterize the surface brightness distribution of the image \(I(l,m)\), and \(u\) and \(v\) are spatial frequencies which parameterize the visibility function \(\cal{V}(u,v)\). For more information on the meaning of these units, see Units and Conventions.
The visibility function is complex-valued, and each measurement of it (denoted by \(V_i\)) is made in the presence of noise

\[V_i = \mathcal{V}(u_i, v_i) + \epsilon_i\]
where \(\epsilon_i\) represents a noise realization from a complex normal (Gaussian) distribution. Thankfully, most interferometric datasets do not exhibit significant covariance between the real and imaginary noise components, so we could equivalently say that the real and imaginary components of the noise are separately generated by draws from normal distributions characterized by standard deviation \(\sigma_i\)

\[\Re\{\epsilon_i\} \sim \mathcal{N}(0, \sigma_i^2)\]

and

\[\Im\{\epsilon_i\} \sim \mathcal{N}(0, \sigma_i^2)\]
Radio interferometers will commonly represent the uncertainty on each visibility measurement by a “weight” \(w_i\), where

\[w_i = \frac{1}{\sigma_i^2}\]
A full interferometric dataset is a collection of visibility measurements, which we represent by

\[\boldsymbol{V} = \{V_1, V_2, \ldots, V_N\}\]

with each \(V_i\) sampled at spatial frequency \((u_i, v_i)\) and carrying weight \(w_i\).
A typical ALMA dataset might contain a half-million individual visibility samples, acquired over a range of spatial frequencies.
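The weight-to-noise relationship is easy to demonstrate with mock data. The following sketch (all values invented) draws independent real and imaginary noise realizations consistent with a given set of weights:

```python
import numpy as np

rng = np.random.default_rng(42)

# invented "true" visibilities [Jy] and weights [1/Jy^2]
V_true = np.array([1.0 + 0.5j, 0.8 - 0.1j, 0.3 + 0.0j])
weight = np.array([4.0, 2.0, 1.0])

sigma = 1.0 / np.sqrt(weight)  # since w_i = 1 / sigma_i^2
epsilon = rng.normal(0.0, sigma) + 1j * rng.normal(0.0, sigma)
V_data = V_true + epsilon  # noisy measurements V_i
```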
Assume we have some forward model that can predict the value of the visibility function for any spatial frequency, \(\mathcal{V}(u, v) = M_\mathcal{V}(u, v \,|\, \boldsymbol{\theta})\). It’s difficult to reason about all but the simplest models directly in the Fourier plane, so usually models are constructed in the image plane \(M_I(l, m \,|\, \boldsymbol{\theta})\) and then Fourier transformed (either analytically, or via the FFT) to construct visibility models \(M_\mathcal{V}(u, v \,|\, \boldsymbol{\theta}) \leftrightharpoons M_I(l, m \,|\, \boldsymbol{\theta})\).
As with the line example, our statement about the data generating process

\[V_i = M_\mathcal{V}(u_i, v_i \,|\, \boldsymbol{\theta}) + \epsilon_i\]

leads to the formulation of the likelihood function

\[\ln \mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{V}) = - N \ln 2 \pi + \sum_i^N \ln w_i - \frac{1}{2} \chi^2(\boldsymbol{\theta};\,\boldsymbol{V})\]
Because the data and model are complex-valued, \(\chi^2\) is evaluated as

\[\chi^2(\boldsymbol{\theta};\,\boldsymbol{V}) = \sum_i^N w_i \, |V_i - M_\mathcal{V}(u_i, v_i \,|\, \boldsymbol{\theta})|^2\]

where \(|\cdot|^2\) denotes the modulus squared. Equivalently, the calculation can be broken up into sums over the real (\(\Re\)) and imaginary (\(\Im\)) components of the visibility data and model

\[\chi^2(\boldsymbol{\theta};\,\boldsymbol{V}) = \sum_i^N w_i \left(\Re\{V_i\} - \Re\{M_\mathcal{V}(u_i, v_i \,|\, \boldsymbol{\theta})\}\right)^2 + \sum_i^N w_i \left(\Im\{V_i\} - \Im\{M_\mathcal{V}(u_i, v_i \,|\, \boldsymbol{\theta})\}\right)^2\]
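Both forms of the visibility \(\chi^2\) are straightforward to evaluate with PyTorch tensors. A sketch is below; the function names are illustrative (MPoL performs this calculation internally via the private mpol.losses._chi_squared()):

```python
import torch

def chi_squared_modulus(model_vis, data_vis, weight):
    """chi^2 via the modulus squared of the complex residuals."""
    return torch.sum(weight * torch.abs(data_vis - model_vis) ** 2)

def chi_squared_split(model_vis, data_vis, weight):
    """Equivalent chi^2, summing real and imaginary residuals separately."""
    resid = data_vis - model_vis
    return torch.sum(weight * resid.real**2) + torch.sum(weight * resid.imag**2)
```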
Spectral covariance
The \(\chi^2\) likelihood function as formulated above is appropriate for visibilities with minimal spectral covariance. When modeling spectral line datasets, in particular those that have not been channel-averaged and retain the spectral response function from their Hann windowing, this covariance must be taken into account in the likelihood function. More information on how to derive these covariance matrices is provided in the appendices of Loomis et al. 2018 and will be detailed in forthcoming tutorials.
Hermitian visibilities
Because the sky brightness \(I_\nu\) is real, the visibility function \(\mathcal{V}\) is Hermitian, meaning that

\[\mathcal{V}(-u, -v) = \mathcal{V}^*(u, v)\]
Most datasets (e.g., those extracted from CASA) will only record one visibility measurement per baseline and not include the duplicate Hermitian pair (to save storage space). We recommend that you evaluate all data loss functions without the Hermitian pair.
For the likelihood function calculation to be accurate with Hermitian pairs, the simple \(\chi^2\) sum over the data − model residuals would need to be replaced with a multivariate Gaussian likelihood that includes a perfect covariance between each data point and its Hermitian pair. This complicates the calculation unnecessarily.
The one case where Hermitian pairs do need to be included is when using the inverse FFT, as in mpol.gridding.DirtyImager(); this is handled internally. You can check whether your dataset already includes Hermitian pairs using mpol.gridding.verify_no_hermitian_pairs().
Averaged loss functions#
In an optimization workflow, we usually minimize a loss function \(L\) rather than maximize the log likelihood function. If we are just optimizing (instead of sampling), we only care about the value of \(\hat{\boldsymbol{\theta}}\) that minimizes the function \(L\). The normalization of \(L\) does not matter; we only care that \(L(\hat{\boldsymbol{\theta}}) < L(\hat{\boldsymbol{\theta}} + \varepsilon)\), not by how much. In these applications, we recommend using an averaged data loss function, whose value remains approximately constant as the size of the dataset varies.
In the most common scenario, where you are keeping the data and weights fixed, we recommend using a reduced \(\chi^2_R\) data loss function, available as either mpol.losses.r_chi_squared() or mpol.losses.r_chi_squared_gridded(). The hope is that, for many applications, the reduced \(\chi^2_R\) loss function will have a minimum value of \(\chi^2_R(\hat{\boldsymbol{\theta}}) \approx 1\) for a well-fit model (regardless of the number of data points).
In the situation where you may be modifying the data and weights (as in a self-calibration workflow), we recommend using the mpol.losses.neg_log_likelihood_avg() loss function.
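For example, with toy tensors (values invented), the two recommended loss functions are called the same way:

```python
import torch
from mpol import losses

# toy model visibilities, data visibilities, and weights
model_vis = torch.tensor([1.0 + 0.4j, 0.7 - 0.2j])
data_vis = torch.tensor([1.1 + 0.5j, 0.8 - 0.1j])
weight = torch.tensor([4.0, 2.0])

# data and weights held fixed: aim for a minimum near 1
loss = losses.r_chi_squared(model_vis, data_vis, weight)

# data or weights changing during the fit (e.g., self-calibration)
loss = losses.neg_log_likelihood_avg(model_vis, data_vis, weight)
```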
Regularization#
With RML imaging, we’re trying to come up with a model that will fit the dataset. But rather than using a parametric model like a series of Gaussian rings for a protoplanetary disk, we’re using a non-parametric model of the image itself. This could be as simple as parameterizing the image using the intensity values of the pixels themselves, i.e.,

\[\boldsymbol{\theta} = \{I_{1,1}, I_{1,2}, \ldots, I_{N,N}\}\]
assuming we have an \(N \times N\) image.
This flexible image model is analogous to using a spline or Gaussian process to fit a series of points \(\boldsymbol{Y} = \{y_1, y_2, \ldots, y_N\}\): the model will nearly always have enough flexibility to capture the structure that exists in the dataset. The pixel basis set is the most straightforward non-parametric image model, but we could also use more sophisticated basis sets, like a set of wavelet coefficients, or even more exotic basis sets constructed from trained neural networks.
Because the Fourier transform is a linear operation with respect to the pixel basis, the maximum likelihood model image (called the dirty image) can be calculated analytically by the inverse Fourier transform. The point spread function of the dirty image is called the dirty beam. By construction, all unsampled spatial frequencies are set to zero power, which means that the dirty image will only contain spatial frequencies about which we have at least some data. This assumption, however, rarely translates into good image fidelity, especially if there are many unsampled spatial frequencies which carry significant power. It’s also important to recognize that the dirty image is only one out of the set of many images that could maximize the likelihood function. From the perspective of the likelihood calculation, we could modify the unsampled spatial frequencies of the dirty image to whatever power we might like and, because they are unsampled, the value of the likelihood calculation won’t change, i.e., it will still remain maximal.
When synthesis imaging is described as an “ill-posed inverse problem,” this is what is meant. There is a (potentially infinite) range of images that could exactly fit the dataset, and without additional information we have no way of discriminating which is best. This is where “regularization” comes in.
One can talk about regularization from a Bayesian perspective using priors, i.e., we introduce terms like \(p(\boldsymbol{\theta})\) such that we might calculate the maximum a posteriori (MAP) image \(\boldsymbol{\theta}_\mathrm{MAP}\) using the posterior probability distribution

\[p(\boldsymbol{\theta} |\, \boldsymbol{V}) \propto \mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{V}) \, p(\boldsymbol{\theta})\]

For computational reasons related to numerical over/underflow, we would most likely use the logarithm of the posterior probability distribution

\[\ln p(\boldsymbol{\theta} |\, \boldsymbol{V}) = \ln \mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{V}) + \ln p(\boldsymbol{\theta}) + C\]
One could also describe the optimization process without Bayesian terminology as minimizing an objective loss function comprising data loss functions and regularization penalties, e.g.,

\[L(\boldsymbol{\theta}) = L_\mathrm{data}(\boldsymbol{\theta}) + \lambda_1 L_1(\boldsymbol{\theta}) + \lambda_2 L_2(\boldsymbol{\theta}) + \ldots\]
The relative “strength” of the regularization is controlled via a scalar prefactor \(\lambda\), internal to each loss function. This is one situation where a normalized data loss function is useful, because it preserves the relative strength of the regularizers even if the dataset (or mini-batches of it, in a stochastic gradient descent setting) changes size.
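As a sketch of how such an objective might be assembled from the loss functions documented below (the \(\lambda\) values here are invented and would need tuning for a real dataset):

```python
from mpol import losses

lam_entropy = 1e-3   # illustrative regularization strengths
lam_sparsity = 1e-4

def objective(model_vis, data_vis, weight, img_cube, prior_intensity):
    # averaged data loss keeps the lambda prefactors meaningful
    loss = losses.r_chi_squared(model_vis, data_vis, weight)
    # img_cube must have strictly positive pixel values for the entropy term
    loss = loss + lam_entropy * losses.entropy(img_cube, prior_intensity)
    loss = loss + lam_sparsity * losses.sparsity(img_cube)
    return loss
```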
Data loss function API#
- mpol.losses.r_chi_squared(model_vis: Tensor, data_vis: Tensor, weight: Tensor) Tensor [source]#
Calculate the reduced \(\chi^2_\mathrm{R}\) between the complex data \(\boldsymbol{V}\) and model \(M\) visibilities using
\[\chi^2_\mathrm{R} = \frac{1}{2 N} \chi^2(\boldsymbol{\theta};\,\boldsymbol{V})\]

where \(\chi^2\) is evaluated using the private function mpol.losses._chi_squared(). Data and model visibilities may be any shape as long as all tensors (including weight) have the same shape. Following EHT-IV 2019, we apply a prefactor \(1/(2N)\), where \(N\) is the number of visibilities. The factor of 2 arises because we must count both the real and imaginary components in the \(\chi^2\) sum. This loss function will have a minimum value of \(\chi^2_\mathrm{R}(\hat{\boldsymbol{\theta}};\,\boldsymbol{V}) \approx 1\) for a well-fit model (regardless of the number of data points), making it easier to set the prefactor strengths of other regularizers relative to this value.

Note that this function should only be used in an optimization or point estimate situation, and only where you are not adjusting the weight or the amplitudes of the data values. If it is used in any situation where uncertainties on parameter values are determined (such as Markov Chain Monte Carlo), it will return the wrong answer, because the relative scaling of \(\chi^2_\mathrm{R}\) with respect to parameter value is incorrect. For those applications, use mpol.losses.log_likelihood().

- Parameters:
model_vis (torch.Tensor of torch.complex) – array of the model values representing \(M\)
data_vis (torch.Tensor of torch.complex) – array of the data values representing \(\boldsymbol{V}\)
weight (torch.Tensor) – array of weight values representing \(w_i\)
- Returns:
the \(\chi^2_\mathrm{R}\), summed over all dimensions of input array.
- Return type:
torch.Tensor
- mpol.losses.r_chi_squared_gridded(modelVisibilityCube: Tensor, griddedDataset: GriddedDataset) Tensor [source]#
Calculate the reduced \(\chi^2_\mathrm{R}\) between the complex data \(\boldsymbol{V}\) and model \(M\) visibilities using gridded quantities. Function will return the same value regardless of whether Hermitian pairs are included.
- Parameters:
modelVisibilityCube (torch.Tensor of torch.complex) – torch tensor with shape (nchan, npix, npix) to be indexed by the mask from GriddedDataset. Assumes tensor is “pre-packed,” as in output from mpol.fourier.FourierCube.forward().
griddedDataset (GriddedDataset object) – the gridded dataset, most likely produced from mpol.gridding.DataAverager.to_pytorch_dataset()
- Returns:
the \(\chi^2_\mathrm{R}\) value summed over all input dimensions
- Return type:
torch.Tensor
- mpol.losses.log_likelihood(model_vis: Tensor, data_vis: Tensor, weight: Tensor) Tensor [source]#
Compute the log likelihood function \(\ln\mathcal{L}\) between the complex data \(\boldsymbol{V}\) and model \(M\) visibilities using
\[\ln \mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{V}) = - N \ln 2 \pi + \sum_i^N \ln w_i - \frac{1}{2} \chi^2(\boldsymbol{\theta};\,\boldsymbol{V})\]

where \(N\) is the number of complex visibilities and \(\chi^2\) is evaluated internally using mpol.losses._chi_squared(). Note that this expression has factors of 2 in different places compared to the multivariate normal you might be used to seeing, because the visibilities are complex-valued. We could alternatively write

\[\mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{V}) = \mathcal{L}(\boldsymbol{\theta};\,\Re\{\boldsymbol{V}\}) \times \mathcal{L}(\boldsymbol{\theta};\,\Im\{\boldsymbol{V}\})\]

where \(\mathcal{L}(\boldsymbol{\theta};\,\Re\{\boldsymbol{V}\})\) and \(\mathcal{L}(\boldsymbol{\theta};\,\Im\{\boldsymbol{V}\})\) are each the familiar multivariate normal for real-valued data.
This function is agnostic as to whether the sum should include the Hermitian conjugate visibilities, but be aware that the normalization of the answer returned will be different between the two cases. Inference of the parameter values should be unaffected. We recommend not including the Hermitian conjugates.
- Parameters:
model_vis (torch.Tensor of torch.complex128) – array of the model values representing \(M\)
data_vis (torch.Tensor of torch.complex128) – array of the data values representing \(\boldsymbol{V}\)
weight (torch.Tensor) – array of weight values representing \(w_i\)
- Returns:
the \(\ln\mathcal{L}\) log likelihood, summed over all dimensions of input array.
- Return type:
torch.Tensor
- mpol.losses.log_likelihood_gridded(modelVisibilityCube: Tensor, griddedDataset: GriddedDataset) Tensor [source]#
Calculate \(\ln\mathcal{L}\) (corresponding to log_likelihood()) using gridded quantities.

- Parameters:
modelVisibilityCube (torch.Tensor of torch.complex) – torch tensor with shape (nchan, npix, npix) to be indexed by the mask from GriddedDataset. Assumes tensor is “pre-packed,” as in output from mpol.fourier.FourierCube.forward().
griddedDataset (GriddedDataset object) – the gridded dataset, most likely produced from mpol.gridding.DataAverager.to_pytorch_dataset()
- Returns:
the \(\ln\mathcal{L}\) value, summed over all dimensions of input data.
- Return type:
torch.Tensor
- mpol.losses.neg_log_likelihood_avg(model_vis: Tensor, data_vis: Tensor, weight: Tensor) Tensor [source]#
Calculate the average value of the negative log likelihood
\[L = - \frac{1}{2 N} \ln \mathcal{L}(\boldsymbol{\theta};\,\boldsymbol{V})\]

where \(N\) is the number of complex visibilities. This loss function is most useful in an optimization or point estimate situation where you may be adjusting the weight or the amplitudes of the data values, perhaps via a self-calibration operation.
If you are in any situation where uncertainties on parameter values are determined (such as Markov Chain Monte Carlo), you should use mpol.losses.log_likelihood().

- Parameters:
model_vis (torch.Tensor of torch.complex) – array of the model values representing \(M\)
data_vis (torch.Tensor of torch.complex) – array of the data values representing \(\boldsymbol{V}\)
weight (torch.Tensor) – array of weight values representing \(w_i\)
- Returns:
the average of the negative log likelihood, summed over all dimensions of input array.
- Return type:
torch.Tensor
Regularizer loss function API#
- mpol.losses.entropy(cube: Tensor, prior_intensity: Tensor, tot_flux: float = 10) Tensor [source]#
Calculate the entropy loss of a set of pixels following the definition in EHT-IV 2019.
\[L = \frac{1}{\zeta} \sum_i I_i \; \ln \frac{I_i}{p_i}\]

where \(\zeta\) is the fixed normalization factor set by tot_flux.

- Parameters:
cube (torch.Tensor) – pixel values; must be positive \(I_i > 0\) for all \(i\)
prior_intensity (torch.Tensor) – the prior value \(p\) to calculate entropy against. Tensors of any shape are allowed so long as they will broadcast to the shape of the cube under division (/).
tot_flux (float) – a fixed normalization factor; the user-defined target total flux density, in units of Jy.
- Returns:
entropy loss
- Return type:
torch.Tensor
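A minimal usage sketch with an invented image cube (the flat prior image is just one possible choice):

```python
import torch
from mpol import losses

# toy single-channel image cube with strictly positive pixel values
cube = torch.rand(1, 64, 64) + 1e-6
prior_intensity = torch.full_like(cube, 1.0 / cube.numel())  # flat prior image

loss = losses.entropy(cube, prior_intensity, tot_flux=1.0)
```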
- mpol.losses.TV_image(sky_cube: Tensor, epsilon: float = 1e-10) Tensor [source]#
Calculate the total variation (TV) loss in the image dimension (R.A. and DEC), following the definition in EHT-IV 2019. This loss promotes the image to be piecewise smooth and the gradient of the image to be sparse.
\[L = \sum_{l,m,v} \sqrt{(I_{l + 1, m, v} - I_{l,m,v})^2 + (I_{l, m+1, v} - I_{l, m, v})^2 + \epsilon}\]

- Parameters:
sky_cube (3D torch.Tensor) – the image cube array \(I_{lmv}\), where \(l\) is R.A. in \(ndim=3\), \(m\) is DEC in \(ndim=2\), and \(v\) is the channel (velocity or frequency) dimension in \(ndim=1\). Should be in sky format representation.
epsilon (float) – a softening parameter in units of [\(\mathrm{Jy}/\mathrm{arcsec}^2\)]. Any pixel-to-pixel variations within each image North-South or East-West slice greater than this parameter will incur a significant penalty.
- Returns:
total variation loss
- Return type:
torch.Tensor
- mpol.losses.TV_channel(cube: Tensor, epsilon: float = 1e-10) Tensor [source]#
Calculate the total variation (TV) loss in the channel (first) dimension. Following the definition in EHT-IV 2019, calculate
\[L = \sum_{l,m,v} \sqrt{(I_{l, m, v + 1} - I_{l,m,v})^2 + \epsilon}\]

- Parameters:
cube (torch.Tensor) – the image cube array \(I_{lmv}\)
epsilon (float) – a softening parameter in units of [\(\mathrm{Jy}/\mathrm{arcsec}^2\)]. Any channel-to-channel pixel variations greater than this parameter will incur a significant penalty.
- Returns:
total variation loss
- Return type:
torch.Tensor
- mpol.losses.TSV(sky_cube: Tensor) Tensor [source]#
Calculate the total squared variation (TSV) loss in the image dimension (R.A. and DEC), following the definition in EHT-IV 2019. This loss promotes the image to be edge-smoothed, which may be a better representation of the truth image (K. Kuramochi et al. 2018).
\[L = \sum_{l,m,v} (I_{l + 1, m, v} - I_{l,m,v})^2 + (I_{l, m+1, v} - I_{l, m, v})^2\]

- Parameters:
sky_cube (torch.Tensor) – the image cube array \(I_{lmv}\), where \(l\) is R.A. in \(ndim=3\), \(m\) is DEC in \(ndim=2\), and \(v\) is the channel (velocity or frequency) dimension in \(ndim=1\). Should be in sky format representation.
- Returns:
total square variation loss
- Return type:
torch.Tensor
- mpol.losses.sparsity(cube: Tensor, mask: Tensor | None = None) Tensor [source]#
Enforce a sparsity prior on the image cube using the \(L_1\) norm. Optionally provide a boolean mask to apply the prior to only the True locations. For example, you might want this mask to be True for background regions.

The sparsity loss is calculated as

\[L = \sum_i | I_i |\]

- Parameters:
cube (torch.Tensor) – the image cube array \(I_{lmv}\)
mask (torch.Tensor of torch.bool) – tensor array the same shape as cube. The sparsity prior will be applied to those pixels where the mask is True. Default is to apply the prior to all pixels.
- Returns:
sparsity loss calculated where mask == True
- Return type:
torch.Tensor
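A usage sketch with an invented cube and mask (the threshold defining “background” here is arbitrary):

```python
import torch
from mpol import losses

cube = torch.rand(1, 64, 64)
mask = cube < 0.1  # hypothetical: treat faint pixels as background

loss_all = losses.sparsity(cube)        # prior applied to every pixel
loss_bg = losses.sparsity(cube, mask)   # prior applied only where mask is True
```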
- mpol.losses.UV_sparsity(vis: Tensor, qs: Tensor, q_max: Tensor) Tensor [source]#
Enforce a sparsity prior for all \(q = \sqrt{u^2 + v^2}\) points larger than \(q_\mathrm{max}\).
- Parameters:
vis (torch.Tensor of torch.complex128) – visibility cube of shape (nchan, npix, npix//2 + 1, 2)
qs (torch.Tensor of torch.float64) – array corresponding to visibility coordinates. Dimensionality of (npix, npix//2)
q_max (float) – maximum radial baseline
- Returns:
UV sparsity loss above \(q_\mathrm{max}\)
- Return type:
torch.Tensor
- mpol.losses.PSD(qs: Tensor, psd: Tensor, l: Tensor) Tensor [source]#
Apply a loss function corresponding to the power spectral density using a Gaussian process kernel.
Assumes an image plane kernel of
\[k(r) = \exp\left(-\frac{r^2}{2 \ell^2}\right)\]

The corresponding power spectral density is

\[P(q) = (2 \pi \ell^2) \exp(- 2 \pi^2 \ell^2 q^2)\]

- Parameters:
qs (torch.Tensor) – the radial UV coordinate (in \(\lambda\))
psd (torch.Tensor) – the power spectral density cube
l (torch.Tensor) – the correlation length \(\ell\) in the image plane (in arcsec)
- Returns:
the loss calculated using the power spectral density
- Return type:
torch.Tensor