Deep Learning Model Optimization: Why and How?

Why do we need model optimization?

1. Foundational Models

These days there are many pre-trained foundational models available in both the vision and text domains. Models like Segment Anything and DINOv2 are pretrained on millions of images (roughly 11 million and 142 million, respectively). In the text domain, we have Llama 2, Vicuna, and so on. Fine-tuning them on new tasks usually gives better results than training anything from scratch. However, these foundational models are generally large vision/language models. The idea is to take advantage of their large-scale pretraining.

One problem with foundational models is that they require significant resources during training, even if we optimize inference. Recently, papers like LoRA (Low-Rank Adaptation of Large Language Models) and QLoRA (Efficient Finetuning of Quantized LLMs) have reduced the computational requirements of fine-tuning these large foundational models. These papers fall into the category of parameter-efficient fine-tuning (PEFT) methods, which are getting a lot of attention because of these foundational models.
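As a rough illustration, here is a minimal sketch of LoRA-style parameter-efficient fine-tuning using the Hugging Face peft library. The checkpoint name and the target_modules list are illustrative assumptions; they depend on the base model you actually use.

```python
# Minimal LoRA fine-tuning sketch with the Hugging Face `peft` library.
# The checkpoint and target_modules below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the LoRA update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```

Only the small low-rank adapter matrices receive gradients, which is what cuts the memory and compute needed for fine-tuning.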

However, because of their large size, these models cannot always be deployed in production as-is, so we need optimization methods to make inference faster.

On a side note, vision-language models (like CLIP, BLIP-2, and so on) can also be used as pretrained encoders, depending on the task.

2. Train large then compress is better than training a compressed model

There are two aspects to consider when comparing training a large model and then compressing it against training an already-compressed model. A compressed model normally completes a training iteration much faster than a large model, so do training time and compute requirements increase with the large model? And after compressing the large model, how much performance is lost?

The paper Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers from UC Berkeley shows that wider and deeper models converge much faster than smaller models, and are also easier and more robust to compress.

Wider models converge faster than narrower models as a function of both gradient steps (left plot) and wall-clock time (right plot).

Increasing Transformer model size results in lower validation error as a function of wall-clock time and better test-time accuracy for a given inference budget. The right plot is from an experiment that takes RoBERTa checkpoints pretrained for the same amount of wall-clock time and fine-tunes them on a downstream dataset (MNLI). The models are then compressed to fit an inference memory budget; the best models for a given memory budget are the ones that are trained large and then heavily compressed.

Combined with parameter-efficient fine-tuning and the finding that large models converge faster, training time and compute requirements need not be a blocker.

Also, the Validation Accuracy vs. Number of Parameters plot makes it clear that, for a given parameter count, large and heavily compressed models work better than smaller, less compressed models.

3. Training large models and then optimizing makes more sense for business

More often than not, models need to be deployed in multiple environments, such as the cloud and on-premise hardware. For cloud deployment we can use larger models, but for on-premise deployment we might need to optimize based on client requirements. Training and maintaining separate models is more hassle than training one large, well-performing model and optimizing it according to each requirement.

Different model optimization techniques

Two of the most common model optimization methods are Quantization and Pruning.


Quantization stores the model weights in a lower-precision format (fp32 to fp16/bfloat16/int8) to accelerate matrix operations on hardware with reduced-precision support and to reduce the overall memory footprint. In practice, we often quantize only part of the weights and activations instead of quantizing everything.

Post-training quantization is a technique to quantize (clip, scale, round, and so on) the model weights after the model has been trained.
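As a minimal sketch (assuming PyTorch), dynamic post-training quantization converts the weights of selected layer types to int8 after training, with no retraining needed; the toy model below is just a stand-in for an already-trained network.

```python
import torch
import torch.nn as nn

# Toy fp32 model standing in for an already-trained network.
model_fp32 = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Dynamic post-training quantization: nn.Linear weights are stored as int8,
# activations are quantized on the fly at inference time.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model_int8(x).shape)  # same interface, smaller weights
```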

Quantization-aware training is a training method that involves simulating the quantization effects during the training phase itself. Instead of training the model with high-precision floating-point numbers, the model is trained with lower-precision representations, emulating the quantization that will be applied later during deployment. This helps the model adapt to the reduced precision and minimizes the loss of accuracy caused by quantization. Quantization-aware training is often better than post-training quantization. Also, don’t confuse this with mixed-precision training.
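Here is a minimal sketch of eager-mode quantization-aware training in PyTorch, assuming a small toy network: fake-quantization modules are inserted during training, and the model is converted to a real int8 model afterwards.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()        # marks where fp32 -> int8 happens
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 2)
        self.dequant = tq.DeQuantStub()    # marks where int8 -> fp32 happens

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = ToyNet()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
model_prepared = tq.prepare_qat(model.train())  # insert fake-quant observers

# Ordinary training loop; the fake-quant modules simulate int8 rounding.
optimizer = torch.optim.SGD(model_prepared.parameters(), lr=1e-2)
for _ in range(10):
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(model_prepared(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, convert to a genuinely quantized int8 model for inference.
model_int8 = tq.convert(model_prepared.eval())
```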

Model pruning sets a subset of the network weights to zero, which directly reduces the number of effective operations and can also reduce the memory footprint.

Pruning algorithms can drop individual weights or groups of weights; these are called unstructured and structured pruning, respectively.

Unstructured vs. structured pruning.
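As a minimal sketch (assuming PyTorch), torch.nn.utils.prune provides both variants: l1_unstructured zeroes individual low-magnitude weights, while ln_structured removes whole rows or columns.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured pruning: zero the 30% of individual weights with the smallest L1 magnitude.
layer = nn.Linear(64, 64)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: zero whole output rows (dim=0) chosen by their L2 norm.
layer_structured = nn.Linear(64, 64)
prune.ln_structured(layer_structured, name="weight", amount=0.25, n=2, dim=0)

# Fold the pruning mask into the weight tensor so the pruning becomes permanent.
prune.remove(layer, "weight")
print((layer.weight == 0).float().mean())  # roughly 0.3 of the weights are now zero
```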

Note that model inference can also be optimized in other ways, such as running inference in parallel or converting the model to a static computational graph.
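As one example of the latter, a model can be traced into a static TorchScript graph. This is a minimal sketch assuming a standard torchvision ResNet-18 stands in for your model.

```python
import torch
import torchvision

# Trace the model with an example input to record a static computational graph.
model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

traced = torch.jit.trace(model, example_input)
traced.save("resnet18_traced.pt")  # the serialized graph can be loaded from C++ or other serving runtimes
```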

We will go over how the different quantization and pruning techniques are applied to a model in separate blog posts.

Resources

  1. LoRA Paper
  2. QLoRA Paper
  3. Train Large, Then Compress Paper
  4. Pruning Neuralmagic

Connect with me


Feel free to drop me a message, or:

  1. Connect and reach out on LinkedIn
  2. Follow me on Medium or GitHub

Have a nice day ❤️
