A Step-by-Step Tutorial for Creating an Inference API and Deploying It on Colab
Image by storyset on Freepik
If you’re passionate about LLM technology, chances are you’ve already built a few applications to assist your work or daily life using commercial APIs such as the GPT-4 API. Meanwhile, with their remarkable improvements in performance, open-source language models such as Llama 2 are bound to catch your attention, inviting you to experiment with and evaluate them.
Unfortunately, most solo developers can’t afford an expensive GPU to host open models locally and aren’t ready to commit to a dedicated cloud instance with high usage costs. In such cases, a platform like Google Colab becomes essential: it provides the infrastructure for experimenting with and evaluating open-source language models for free, or at low cost billed by runtime. The notebook environment and its bundled resources are quite helpful; however, it’s hard to build an application with a decent user interface in Colab, and even harder to share access to your runtime with others.
That’s where the idea of exposing open language models through free RESTful APIs came to mind.
1. Project Overview
In this project, we are going to deploy the open-source language model dolly-v2-3b on Colab with a free T4 GPU and expose its inference through a RESTful API for online access.
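The rest of this tutorial walks through these steps in detail. As a quick preview, loading the model in a Colab notebook looks roughly like the minimal sketch below. It assumes the databricks/dolly-v2-3b checkpoint on the Hugging Face Hub and uses float16, since the free T4 GPU has limited bfloat16 support; treat it as an illustration rather than the final setup.

```python
# Minimal sketch of loading dolly-v2-3b in Colab
# (assumes `pip install transformers accelerate` has already been run).
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-3b",  # checkpoint on the Hugging Face Hub
    torch_dtype=torch.float16,       # fits comfortably in the T4's 16 GB of VRAM
    trust_remote_code=True,          # Dolly ships a custom instruction-following pipeline
    device_map="auto",               # place the model on the GPU automatically
)

print(generate_text("Explain what a RESTful API is in one sentence."))
```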
a) API Definition
The desired API definition is as follows:
API Endpoint:
POST /chatbot
Description: This API endpoint allows you to interact with the chatbot model to generate text responses based on a given prompt.
Request:
- Method: POST
- Endpoint: https://da7c-34-127-37-191.ngrok.io/chatbot
- Content-Type: application/json
Request Body:
{
"llm": str,
"temperature": float,
"top_k": int,
"prompt": str
}
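Before going through each field, here is a hedged example of calling this endpoint from Python. The URL is the ngrok address shown above (yours will differ each time the runtime restarts), and the model name, prompt, and sampling values are arbitrary illustrations, not fixed requirements.

```python
import requests

# Replace with the ngrok URL printed by your own Colab runtime.
url = "https://da7c-34-127-37-191.ngrok.io/chatbot"

payload = {
    "llm": "dolly-v2-3b",  # example model name; see the parameter descriptions below
    "temperature": 0.7,
    "top_k": 50,
    "prompt": "Explain what a RESTful API is in one sentence.",
}

response = requests.post(url, json=payload, timeout=120)
print(response.status_code)
print(response.json())
```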
Parameters:
llm (string, required): The model name for the chatbot