A Step By Step Guide to Selecting and Running Your Own Generative Model

The past months have seen a drastic reduction in the parameter counts of generative models, such as the new model Mistral AI just released. This reduction in size opens the door to running your own personal AI assistant directly on your local computer. Local inference of this kind is very tempting when you need confidential computation on your data. With all these new developments, deploying and managing an AI workload looks different than it did six months ago and is constantly evolving. So how do you use one of these models, whether to play around with it or to host it on your company's infrastructure?
Before reaching for an API model hosted by someone else, I think it is worth experimenting with different types of models to get a sense of how the various model families perform. So let's assume you are not using an API model right away. How do you pull down a model and use it?
There are two types of models here: proprietary and open-access. Proprietary models come from providers such as OpenAI and Cohere, each with their own API. Open-access models can be fully open or partly restricted by their license: commercial use, non-commercial use, research purposes only …
The best place to find these models is on HuggingFace. On the models page, you can see that they have over 350,000 models available across a very diverse set of tasks. So you do have a few to choose from!

Something to keep in mind is that not all of these models are, or ever will be, actively used. Some of them are just someone's afternoon experiment that was never updated again. One of the key signals for finding useful models is how many people have downloaded and liked them. For example, you can filter by the type of task you are interested in, such as Text Classification, and from there see the most downloaded and trending models, further filtered by license and so on. This gives you a good overview of the model landscape for the task you have in mind and where to find them.
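If you prefer to do this programmatically, the huggingface_hub library exposes the same filters as the website. Here is a minimal sketch that lists the five most-downloaded models for a task; the exact sort keys and result field names may differ between library versions, so check the huggingface_hub documentation.

```python
from huggingface_hub import HfApi

api = HfApi()

# List the five most-downloaded models tagged with a given task.
# "text-classification" is just an example task tag.
models = api.list_models(
    filter="text-classification",
    sort="downloads",
    direction=-1,
    limit=5,
)

for m in models:
    print(m.modelId, m.downloads, m.likes)
```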
Going to the Text Generation task, which is the main topic of conversation in generative models at the moment, you can see that the trending model is the new Mistral model (https://mistral.ai/news/announcing-mistral-7b) with 7 billion parameters. Now, with all of these models available on HuggingFace, how do you know which one is appropriate for your task?
The first thing to note, as you can see in the screenshot below, is that clicking on a model takes you to its model card, and most of them already include a hosted interactive widget where you can get a feel for the model's output. Below this hosted widget you will also find what are called Spaces: applications hosted on HuggingFace in which people have integrated the model you are looking at. All of these interfaces are really handy for getting a sense of what these models actually do.

Selecting and Running a Model
Of course, the constraints on model selection will depend on the infrastructure and hardware available to you. A good rule of thumb is that going above 7 billion parameters for transformer models makes it hard to run them well on standard consumer GPUs. That said, people have created optimized versions of such models for consumer hardware, for example by reducing the model size or the computing precision, and those may be perfectly appropriate for your specific task while allowing the model to run on your hardware, as the back-of-the-envelope calculation below suggests.
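To make that rule of thumb concrete, the weight memory of a model is roughly its parameter count times the bytes per parameter, ignoring activations and framework overhead. A quick estimate for a 7-billion-parameter model:

```python
# Rough weight-memory estimate for a 7B-parameter model,
# ignoring activations, KV cache and framework overhead.
params = 7e9
bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: ~{params * nbytes / 1e9:.1f} GB of weights")
```

So a 7B model already sits near the limit of a 16 GB consumer GPU in half precision, which is why lowering the precision to 8 or 4 bits is usually the first optimization to try.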
In any case, a good approach is to start with smaller problems and then build up to something that solves your task. Once you have figured out which model and what customization you need, you can look at what that means for your hardware requirements and whether it is feasible. For example, let's say that for the text generation task I have in mind, the Mistral 7B model fits my needs in terms of hardware requirements. The model card on HuggingFace then gives you instructions on how to download and run it, along the lines of the sketch below. I usually use Google Colab (https://colab.research.google.com/?utm_source=scs-index) to get an idea of the running time and resource usage, but you could use any other platform, such as Kaggle.
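As a reference point, the instructions on such model cards generally boil down to the standard transformers loading pattern. A minimal sketch, assuming the mistralai/Mistral-7B-v0.1 checkpoint, a GPU runtime, and the accelerate package installed for device placement:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in half precision and let accelerate place them on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "The first step in choosing a generative model is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```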

In the case of Mistral 7B, the model runs on the basic Google Colab runtime with 12 GB of memory. That gives you an idea of the resources needed for simple inference, and tells you whether you already have what you need or whether you will have to optimize the model to run on fewer resources.
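Before committing to a runtime, it is also easy to check what your environment actually offers. A small sketch using psutil and torch, both of which are preinstalled on Colab:

```python
import psutil
import torch

# System RAM available to the runtime.
ram_gb = psutil.virtual_memory().total / 1e9
print(f"System RAM: {ram_gb:.1f} GB")

# GPU memory, if a CUDA device is attached to the runtime.
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)
    print(f"GPU: {gpu.name}, {gpu.total_memory / 1e9:.1f} GB")
else:
    print("No GPU attached")
```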
After this preliminary work of selecting the model and getting a sense of the resources needed, you might find yourself constrained by your current hardware, leaving you no choice but to optimize your model. Lucky for you, there is a GitHub repo (https://github.com/intel-analytics/BigDL) that covers the kinds of optimizations large deep learning models require; it allows you, for example, to run a LLaMA-style language model on a standard computer. There you can check whether the model you are interested in has been optimized for CPU (or at least single-GPU) inference and whether it fits your requirements.
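At the time of writing, the BigDL-LLM part of that repo presents itself as a drop-in replacement for the transformers loading call, with low-bit quantization applied at load time. The pattern in its README looks roughly like the sketch below; treat it as illustrative and check the repo for the current API and supported models.

```python
# Sketch based on the BigDL-LLM README; the exact import path and
# arguments may have changed, so treat this as illustrative only.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"

# load_in_4bit=True quantizes the weights to 4 bits while loading,
# so the model can run on a CPU or a small GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```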
In this blog post, I described the first steps in selecting and running a model for a machine learning task. If you are interested in learning more about adapting models to a specific task and deploying them, HuggingFace has a great course (https://huggingface.co/learn/nlp-course/chapter1/1) available for free.
If you enjoyed this story, don't hesitate to show your appreciation by clapping or leaving a comment! Follow me on Medium for more content about Data Science! You can also connect with me on LinkedIn.