Marie-Hélène Burle
November 24, 2021
LR: low resolution
HR: high resolution
SR: super-resolution = reconstruction of HR images from LR images
SISR: single-image super-resolution = SR using a single input image
A rather slow history with various interpolation algorithms of increasing complexity before deep neural networks
An incredibly fast evolution since the advent of deep learning (DL)
Pixel-wise interpolation prior to DL
Various methods ranging from simple (e.g. nearest-neighbour, bicubic) to complex (e.g. Gaussian process regression, iterative FIR Wiener filter) algorithms
Simplest method of interpolation
Simply uses the value of the nearest pixel
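As an illustration (a minimal sketch for integer scale factors, not a production implementation), nearest-neighbour upscaling just copies the closest source pixel:

```python
import numpy as np

def nearest_neighbour_upscale(img: np.ndarray, scale: int) -> np.ndarray:
    """Upscale a (H, W) or (H, W, C) image by an integer factor,
    copying the value of the nearest source pixel."""
    h, w = img.shape[:2]
    rows = np.arange(h * scale) // scale  # each output row maps back to a source row
    cols = np.arange(w * scale) // scale
    return img[rows][:, cols]

# A 2x2 image upscaled 2x: each pixel becomes a 2x2 block
small = np.array([[0, 1], [2, 3]])
print(nearest_neighbour_upscale(small, 2))
# [[0 0 1 1]
#  [0 0 1 1]
#  [2 2 3 3]
#  [2 2 3 3]]
```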
Consists of determining the 16 coefficients \(a_{ij}\) in:
\[p(x, y) = \sum_{i=0}^3\sum_{j=0}^3 a_{ij}\, x^i y^j\]
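In practice one rarely solves for the 16 coefficients by hand; image libraries implement bicubic interpolation directly. A sketch using Pillow (the synthetic input and the 4x factor are arbitrary choices for illustration):

```python
import numpy as np
from PIL import Image

# A tiny synthetic 8x8 grayscale image as a stand-in for a real LR input
lr = Image.fromarray(np.arange(64, dtype=np.uint8).reshape(8, 8))

# Bicubic upscaling by a factor of 4
hr = lr.resize((lr.width * 4, lr.height * 4), resample=Image.BICUBIC)
print(hr.size)  # (32, 32)
```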
Deep learning has seen a fast evolution marked by the successive emergence of various frameworks and architectures over the past 10 years
Some key network architectures and frameworks:
These have all been applied to SR
Given a low-resolution image Y, the first convolutional layer of the SRCNN extracts a set of feature maps. The second layer maps these feature maps nonlinearly to high-resolution patch representations. The last layer combines the predictions within a spatial neighbourhood to produce the final high-resolution image F(Y)
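The three layers described above can be sketched in PyTorch (the 64/32 channel counts and 9-1-5 kernel sizes follow the configuration reported in the SRCNN paper; treat this as an illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN: feature extraction -> nonlinear mapping -> reconstruction."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)      # feature maps
        self.map = nn.Conv2d(64, 32, kernel_size=1)                           # nonlinear mapping
        self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)  # HR output F(Y)
        self.relu = nn.ReLU()

    def forward(self, y):
        # y is the (typically bicubic-upscaled) low-resolution input Y
        f = self.relu(self.extract(y))
        f = self.relu(self.map(f))
        return self.reconstruct(f)

x = torch.randn(1, 1, 33, 33)
print(SRCNN()(x).shape)  # torch.Size([1, 1, 33, 33])
```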
Can use sparse-coding-based methods
Do not provide the best PSNR, but can give more realistic results by providing more texture (less smoothing)
Followed by the ESRGAN and many other flavours of SRGANs
Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204-2212)
(cited 2769 times)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008)
(cited 30999 times…)
Initially used in NLP to replace RNNs, as they allow parallelization
Now entering the domain of vision and others
Very performant with relatively few parameters
The Swin Transformer improved the application of transformers to the vision domain
Swin = Shifted WINdows
Swin transformer (left) vs transformer as initially applied to vision (right):
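The windowed attention can be illustrated with a toy partition: instead of attending over all H×W positions at once, Swin splits the feature map into non-overlapping windows and, in alternating blocks, shifts the map by half a window before partitioning. A sketch of those two operations (not the actual Swin code):

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows * B, ws, ws, C) windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, c)

x = torch.randn(1, 8, 8, 96)                           # toy 8x8 feature map, 96 channels
windows = window_partition(x, 4)                       # 4 non-overlapping 4x4 windows
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))  # half-window shift for the next block
print(windows.shape)  # torch.Size([4, 4, 4, 96])
```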
DIV2K, Flickr2K, and other datasets
3 metrics commonly used:
\(\frac{\text{Maximum possible power of signal}}{\text{Power of noise (calculated as the mean squared error)}}\)
Calculated at the pixel level
Prediction of perceived image quality based on a “perfect” reference image
Mean of subjective quality ratings
\[PSNR = 10\,\cdot\,\log_{10}\,\left(\frac{MAX_I^2}{MSE}\right)\]
\[SSIM(x,y) = \frac{(2\mu_x\mu_y + c_1)(2 \sigma_{xy} + c_2)} {(\mu_x^2 + \mu_y^2+c_1) (\sigma_x^2 + \sigma_y^2+c_2)}\]
\[MOS = \frac{\sum_{n=1}^N R_n}{N}\]
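The PSNR formula can be checked with a few lines of NumPy (assuming float images in [0, 1], so \(MAX_I = 1\)):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB, following the formula above."""
    mse = np.mean((x - y) ** 2)  # noise power as the mean squared error
    return 10 * np.log10(max_val ** 2 / mse)

# Two images differing by a constant 0.1 -> MSE = 0.01 -> PSNR = 20 dB
a = np.zeros((4, 4))
b = np.full((4, 4), 0.1)
print(psnr(a, b))  # 20.0
```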
import kornia
psnr_value = kornia.metrics.psnr(input, target, max_val)
ssim_value = kornia.metrics.ssim(img1, img2, window_size, max_val=1.0, eps=1e-12)
See the Kornia documentation for more info on kornia.metrics.psnr & kornia.metrics.ssim
A dataset consisting of 5 images that has been used for at least 18 years to assess SR methods
From the HuggingFace Datasets Hub with the HuggingFace datasets package:
A 2012 review of interpolation methods for SR gives the metrics for a series of interpolation methods (using other datasets)
The Papers with Code website lists available benchmarks on Set5
# Get the model
git clone git@github.com:JingyunLiang/SwinIR.git
cd SwinIR
# Copy our test images in the repo
cp -r <some/path>/my_tests testsets/my_tests
# Run the model on our images
python main_test_swinir.py --tile 400 --task real_sr --scale 4 --large_model --model_path model_zoo/swinir/003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth --folder_lq testsets/my_tests
Ran in 9 min on my machine with one GPU and 32GB of RAM
We could use the PSNR and SSIM implementations from SwinIR, but let’s try the Kornia functions we mentioned earlier:
Let’s load the libraries we need:
Then, we load one pair of images (LR and HR):
berlin1_lr = Image.open("<some/path>/lr/berlin_1945_1.jpg")
berlin1_hr = Image.open("<some/path>/hr/berlin_1945_1.png")
We can display these images with:
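For instance, side by side with Matplotlib (a sketch; in a Jupyter notebook, simply evaluating `berlin1_lr` also renders it). The synthetic images below are stand-ins for the pair loaded above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# Stand-ins for the berlin1_lr / berlin1_hr images loaded above
berlin1_lr = Image.fromarray(np.zeros((67, 64), dtype=np.uint8))
berlin1_hr = Image.fromarray(np.zeros((267, 256), dtype=np.uint8))

fig, (ax_lr, ax_hr) = plt.subplots(1, 2)
ax_lr.imshow(berlin1_lr, cmap="gray"); ax_lr.set_title("LR"); ax_lr.axis("off")
ax_hr.imshow(berlin1_hr, cmap="gray"); ax_hr.set_title("HR"); ax_hr.axis("off")
plt.show()
```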
Now, we need to resize them so that they have identical dimensions and turn them into tensors:
torch.Size([3, 267, 256])
torch.Size([3, 267, 256])
We now have tensors with 3 dimensions:
As data processing is done in batch in ML, we need to add a 4th dimension: the batch size
(It will be equal to 1, since our batch contains a single image)
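With PyTorch this is a one-liner per tensor: `unsqueeze(0)` adds the batch dimension in front. The zero tensors below stand in for the image tensors created above:

```python
import torch

# Stand-ins for the (3, 267, 256) image tensors
berlin1_lr_t = torch.zeros(3, 267, 256)
berlin1_hr_t = torch.zeros(3, 267, 256)

# Add a leading batch dimension of size 1
batch_berlin1_lr_t = berlin1_lr_t.unsqueeze(0)
batch_berlin1_hr_t = berlin1_hr_t.unsqueeze(0)
print(batch_berlin1_lr_t.shape)  # torch.Size([1, 3, 267, 256])
```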
Our new tensors are now ready:
torch.Size([1, 3, 267, 256])
torch.Size([1, 3, 267, 256])
psnr_value = kornia.metrics.psnr(batch_berlin1_lr_t, batch_berlin1_hr_t, max_val=1.0)
psnr_value.item()
33.379642486572266
ssim_map = kornia.metrics.ssim(
batch_berlin1_lr_t, batch_berlin1_hr_t, window_size=5, max_val=1.0, eps=1e-12)
ssim_map.mean().item()
0.9868119359016418