[Image: style transfer results; first caption: 32-bit float (no quantization)]
I recently implemented pastiche—discussed in a prior post—for applying neural style transfer. I encountered a size limit when uploading the library to PyPI, as a package cannot exceed 60MB. The 32-bit floating point weights for the underlying VGG model [1] were contained in an 80MB file. My request for a size limit increase that could accommodate the VGG weights as-is was subsequently approved, but I was still interested in compressing the model.
Various techniques have been proposed for compressing neural networks—including distillation [2] and quantization [3,4]—which have been shown to work well for classification. My problem was style transfer, so I was not sure how model compression would impact the results.
I decided to experiment with weight quantization, using a scheme where I could store the quantized weights on disk and then decompress them to full 32-bit floats at runtime. This quantization scheme would allow me to continue using my existing code once the model is loaded. I was not targeting environments where memory is constrained, so I was not particularly interested in approaches that would also reduce the model's footprint at runtime. I used kmeans1d—discussed in a prior post—to quantize each layer's weights.
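As a rough sketch of the scheme: the kmeans1d.cluster call below is the library's actual interface, but the helper functions and their names are just for illustration, not pastiche's API.

```python
import kmeans1d
import torch

def quantize_layer(weights, num_bits=8):
    """Cluster a layer's weights into 2**num_bits values.

    Returns uint8 codes (valid for widths up to 8 bits) along with
    the float32 centroid table.
    """
    clusters, centroids = kmeans1d.cluster(
        weights.flatten().tolist(), 2 ** num_bits)
    codes = torch.tensor(clusters, dtype=torch.uint8).reshape(weights.shape)
    return codes, torch.tensor(centroids, dtype=torch.float32)

def dequantize_layer(codes, centroids):
    """Look up each code's centroid, recovering full 32-bit float weights."""
    return centroids[codes.long()]
```

Since dequantization restores ordinary float32 tensors, the rest of the style transfer pipeline can run unchanged.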
Before implementing support for loading a quantized VGG model, I ran experiments to see how different levels of compression would impact style transfer. These experiments were not extensive—just a few style transfers at different compression levels.
quantize.py creates updated VGG models with simulated quantization, and quantized_pastiche.sh runs style transfer using those models. Both scripts are in a separate branch I created for the experiments.
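By simulated quantization, I mean something along the lines of the sketch below: each weight is snapped to its nearest centroid but kept as a 32-bit float, so the saved model is a drop-in replacement for the original. A torchvision VGG stands in for pastiche's model here, the 6-bit width and filename are illustrative, and only convolutional weights are touched for brevity; this is not the code from the branch.

```python
import kmeans1d
import torch
import torchvision

NUM_BITS = 6

# Stand-in for pastiche's VGG model (assumption).
vgg = torchvision.models.vgg19(pretrained=True)

with torch.no_grad():
    for module in vgg.modules():
        if isinstance(module, torch.nn.Conv2d):
            w = module.weight
            clusters, centroids = kmeans1d.cluster(
                w.flatten().tolist(), 2 ** NUM_BITS)
            centroids = torch.tensor(centroids, dtype=torch.float32)
            codes = torch.tensor(clusters, dtype=torch.long).reshape(w.shape)
            # Weights remain float32 but take on only 2**6 = 64 distinct values.
            w.copy_(centroids[codes])

torch.save(vgg.state_dict(), 'vgg_quantized_6bit.pth')  # illustrative filename
```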
The images at the top of this post were generated with Edvard Munch’s The Scream and a photo I took at the Pittsburgh Zoo in 2017. The images below were generated with Vincent van Gogh’s The Starry Night and a photo I took in Boston in 2015. The image captions indicate the compression rate of the VGG model used for the corresponding style transfer.
[Image: style transfer results; first caption: 32-bit float (no quantization)]
I originally decided to compress the model using 6-bit weights, and ran a few additional style transfers to check the quality at that compression level. I modified the code to generate and load VGG models with weights quantized to arbitrary bit widths. Unfortunately, my implementation noticeably increased model loading latency: loading a model with weights compressed to 2 bits took almost twenty seconds (I didn't test other bit widths, but larger ones would presumably take longer).
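To give a sense of the extra work arbitrary widths involve: sub-byte codes have to be unpacked into whole integers before the centroid lookup can happen. The sketch below shows one possible unpacking routine, assuming codes are packed MSB-first into a byte array; it is not the implementation from the branch, whose cost profile may differ.

```python
import numpy as np

def unpack_codes(packed: np.ndarray, num_bits: int, count: int) -> np.ndarray:
    """Unpack `count` codes of width `num_bits` from a uint8 byte array."""
    # Expand every byte to its 8 bits, keep only the bits we need, and
    # recombine each group of `num_bits` bits into an integer code.
    bits = np.unpackbits(packed)[:count * num_bits].reshape(count, num_bits)
    place_values = 1 << np.arange(num_bits - 1, -1, -1)
    return bits @ place_values
```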
I subsequently decided to quantize the weights to 8 bits instead of 6 bits, since this allowed for fast processing using PyTorch's built-in uint8 type. The VGG file size decreased from 80MB to 20MB, well within the 60MB PyPI limit that I originally encountered. Loading the quantized model takes less than 1 second.
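A sketch of what the 8-bit load path could look like, assuming each layer is stored as a uint8 code tensor plus its 256-entry float32 centroid table (the file layout and key names here are illustrative, not pastiche's actual format):

```python
import torch

# Illustrative file layout: {layer_name: {'codes': uint8 tensor,
#                                         'centroids': float32 tensor}}
payload = torch.load('vgg_quantized.pth')

state_dict = {}
for name, entry in payload.items():
    # One indexing operation per layer rebuilds the float32 weights,
    # with no bit unpacking required for the native uint8 dtype.
    state_dict[name] = entry['centroids'][entry['codes'].long()]
```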
[1] Simonyan, Karen, and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." arXiv:1409.1556 [cs], September 4, 2014. http://arxiv.org/abs/1409.1556.
[2] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network." arXiv:1503.02531 [cs, stat], March 9, 2015. http://arxiv.org/abs/1503.02531.
[3] Vanhoucke, Vincent, Andrew Senior, and Mark Z. Mao. "Improving the Speed of Neural Networks on CPUs." In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
[4] Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv:1510.00149 [cs], October 1, 2015. http://arxiv.org/abs/1510.00149.