Abstract

This work aims to transfer a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts into the encoder and task-specific prompts into the decoder. Extensive experiments show that our proposed method is capable of transferring the base codec to various machine tasks and significantly outperforms the competing methods. To the best of our knowledge, this work is the first attempt to utilize prompting on the low-level image compression task.
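The core idea of prompt injection, following visual prompt tuning, is to concatenate a small set of learned prompt tokens with the image tokens entering each frozen Transformer block, so the codec's attention layers can be steered without updating their weights. The sketch below illustrates only this token-level mechanism; the function name and dimensions are ours for illustration, not the paper's exact implementation.

```python
import numpy as np

def inject_prompts(tokens, prompts):
    """Prepend prompt tokens to a Transformer block's token sequence.

    tokens  : (N, D) image (patch) tokens entering a frozen block
    prompts : (P, D) prompt tokens (instance-specific on the encoder
              side, task-specific on the decoder side)
    returns : (P + N, D) augmented sequence fed to frozen attention,
              which can now attend to the prompts
    """
    return np.concatenate([prompts, tokens], axis=0)

# Example: 196 patch tokens of dimension 64, plus 8 prompt tokens.
tokens = np.zeros((196, 64))
prompts = np.zeros((8, 64))
augmented = inject_prompts(tokens, prompts)
print(augmented.shape)  # (204, 64)
```

Because only the prompts (and the prompt generator) are trained, the base codec's weights stay untouched, which is what makes the transfer possible without fine-tuning.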

Paper

Rate-distortion Results

Rate-accuracy plots for the competing methods, evaluated on three machine tasks: classification, object detection, and instance segmentation. For classification, we use ImageNet-val as the test set and a pre-trained ResNet50 as the downstream recognition network. For object detection and instance segmentation, we test the competing methods on COCO2017-val, using a pre-trained Faster R-CNN and Mask R-CNN as the downstream recognition networks, respectively.
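Each point on a rate-accuracy curve pairs an average bit-rate with the downstream task's accuracy at that rate. As a minimal sketch of the two quantities involved (function names are ours, not the paper's): rate is conventionally reported in bits per pixel over the original image resolution, and classification accuracy is the fraction of correctly recognized decoded images.

```python
def bits_per_pixel(num_bits, height, width):
    # Rate in bits per pixel (bpp), normalized by the original image size.
    return num_bits / (height * width)

def top1_accuracy(predictions, labels):
    # Fraction of decoded images whose predicted class matches the label.
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Example: a 512x768 image coded with 120,000 bits.
bpp = bits_per_pixel(120_000, 512, 768)
print(round(bpp, 3))  # 0.305
```

Sweeping the codec's quality levels and re-running the frozen recognition network on each batch of decoded images traces out one curve per method.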


We also compare TransTIC with the methods recently submitted to the call-for-proposals (CFP) competition of the MPEG VCM standard, following their test protocol. The results of these competing methods are taken from the CFP test report (m61010). As shown in Fig. B1, TransTIC achieves rate-accuracy performance comparable to that of the top performers. However, our base codec operates under the additional constraint of being optimized for human perception, whereas the top performers (e.g. p12, p6, p7) optimize the entire codec end-to-end for machine tasks. This underscores the potential of TransTIC.

Qualitative Comparison

Decoded images and bit allocation maps produced by the competing methods. As shown, TIC, the codec optimized for human perception, tends to allocate more bits to complex regions, even when those regions (e.g. the background) are less relevant to the downstream recognition tasks. In contrast, the other methods, which target machine perception, shift coding bits from the background regions to the foreground objects.