Abstract

This work aims to transfer a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts into the encoder and task-specific prompts into the decoder. Extensive experiments show that our proposed method is capable of transferring the base codec to various machine tasks and significantly outperforms the competing methods. To the best of our knowledge, this work is the first attempt to utilize prompting on the low-level image compression task.
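The core idea of prompt injection, following visual prompt tuning, is to concatenate a small set of learned prompt tokens with the image tokens entering each frozen Transformer block, so the codec's attention layers can be steered without updating their weights. The sketch below illustrates only this token-level mechanism; the function name and dimensions are ours for illustration, not the paper's exact implementation.

```python
import numpy as np

def inject_prompts(tokens, prompts):
    """Prepend prompt tokens to a Transformer block's token sequence.

    tokens  : (N, D) image (patch) tokens entering a frozen block
    prompts : (P, D) prompt tokens (instance-specific on the encoder
              side, task-specific on the decoder side)
    returns : (P + N, D) augmented sequence fed to frozen attention,
              which can now attend to the prompts
    """
    return np.concatenate([prompts, tokens], axis=0)

# Example: 196 patch tokens of dimension 64, plus 8 prompt tokens.
tokens = np.zeros((196, 64))
prompts = np.zeros((8, 64))
augmented = inject_prompts(tokens, prompts)
print(augmented.shape)  # (204, 64)
```

Because only the prompts (and the prompt generator) are trained, the base codec's weights stay untouched, which is what makes the transfer possible without fine-tuning.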

Paper

Rate-distortion Results

Rate-accuracy plots for the competing methods, evaluated on three machine tasks: classification, object detection, and instance segmentation. For classification, we use ImageNet-val as the test set and a pre-trained ResNet50 as the downstream recognition network. For object detection and instance segmentation, we test the competing methods on COCO2017-val, using a pre-trained Faster R-CNN and Mask R-CNN as the downstream recognition networks, respectively.
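Each point on a rate-accuracy curve pairs an average bit-rate with the downstream task's accuracy at that rate. As a minimal sketch of the two quantities involved (function names are ours, not the paper's): rate is conventionally reported in bits per pixel over the original image resolution, and classification accuracy is the fraction of correctly recognized decoded images.

```python
def bits_per_pixel(num_bits, height, width):
    # Rate in bits per pixel (bpp), normalized by the original image size.
    return num_bits / (height * width)

def top1_accuracy(predictions, labels):
    # Fraction of decoded images whose predicted class matches the label.
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Example: a 512x768 image coded with 120,000 bits.
bpp = bits_per_pixel(120_000, 512, 768)
print(round(bpp, 3))  # 0.305
```

Sweeping the codec's quality levels and re-running the frozen recognition network on each batch of decoded images traces out one curve per method.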


We also compare TransTIC with the methods recently submitted to the call-for-proposals (CFP) competition of the MPEG VCM standard, following their test protocol. The results of these competing methods are taken from the CFP test report (m61010). As shown in Fig. B1, TransTIC achieves rate-accuracy performance comparable to that of the top performers. However, our base codec operates under the additional constraint of being optimized for human perception, whereas the top performers (e.g. p12, p6, p7) optimize the entire codec end-to-end for machine tasks. This underscores the potential of TransTIC.

Qualitative Comparison

Decoded images and bit allocation maps produced by the competing methods. As shown, TIC, the codec optimized for human perception, tends to allocate more bits to complex regions, even when those regions (e.g. the background) are less relevant to the downstream recognition tasks. In contrast, the other methods, which target machine perception, shift coding bits from the background regions to the foreground objects.