As LLM technologies gain more mainstream adoption and the ecosystem begins to mature, organizations are beginning to recognize the limits and costs of using LLM technologies. Many enterprises, originally enthusiastic about the application of LLM technologies, have abandoned centralized initiatives, pursuing instead a strategy of encouraging decentralized efforts to incorporate services like ChatGPT and Claude into their workflows.
There are several reasons for this phenomenon. Lack of LLM expertise, MLOps requirements, and reliance on specialized GPU infrastructure are all barriers to implementing large-scale AI initiatives. Of these, however, the thorniest issue is the reliance on GPUs.
In this article, we will discuss the specific difficulties posed by GPU reliance, explore a potential solution, and look at an exciting example from one of the trailblazing companies working in this area.
GPU Availability as a Limitation of LLMs
Most publicly available, high-performing models, such as GPT-4, Llama 2, and Claude, rely on highly specialized GPU infrastructure. GPT-4, one of the largest models commercially available, reportedly runs on clusters of 8 A100 GPUs. Llama 2’s 70B model, which is much smaller, still requires at least an A40 GPU to run at a reasonable speed.
This level of GPU requirement practically forecloses the possibility of running these models locally - an A100 GPU, assuming you can find a seller, costs close to $25,000. Once you obtain the GPUs, you need specialized skills to set up and maintain the servers. Very few organizations would be willing to incur such an outlay to experiment with LLM technologies.
To solve this problem, several startups and cloud providers have developed extensive PaaS offerings. Some services like Replicate, which I have used in past articles and projects, allow users to rent GPU servers and pay for compute time used. Other providers, like OpenAI and Anthropic, offer their models as a per-token API, further abstracting away infrastructural complexities. However, these services require data to be sent to an external network, which makes using these services a non-starter for privacy-conscious organizations. Additionally, many of these services suffer from shortages during demand spikes as GPU usage overtakes availability, making them unreliable options for production-critical workloads.
Additionally, GPU time, no matter how it is charged for, is expensive for large compute tasks - the companies that own and operate these GPUs need a return on their investment, after all. While these costs are nearly negligible for experimental use cases, commercial use cases often require embedding large contexts, fine-tuning, or multi-shot examples. These costs represent a significant barrier to adoption, especially for organizations with large datasets or those that lack the financial resources of large U.S. firms.
In a previous article, we explored parameter compression as one strategy to reduce dependence on GPUs. In today’s article, we will explore another exciting technique called quantization.
Before we dive into the exploration, however, we might want to learn a little bit about quantization first.
Quantization (Optional Reading)
In this section, we will briefly go over the basics of quantization. However, if you’re simply looking for a way to run powerful LLMs locally on your computer, you can feel free to skip this section for now and come back later. LLMWare, the company whose technology we will be using today, has built some amazing tools that let you get started with quantized models without having to get into the nitty-gritty C/C++ implementations behind it all.
What is Quantization?
Quantization is a technique that seeks to reduce the computational and memory requirements of running an LLM by using lower-precision numerical types. Many popular open-source models, such as Llama, Falcon, and Alpaca, use PyTorch as the underlying framework. By default, PyTorch models use 32-bit floating-point numbers, meaning a single parameter takes up 32 bits of GPU memory. Quantization aims to replace these parameters with 16-bit floating-point numbers, 8-bit integers, or even 4-bit integers. Successful quantization leads to dramatic improvements in computational speed and reductions in memory usage, meaning large models become runnable on lower-end GPUs, integrated graphics chips, or even CPUs. This idea has been around for a while - PyTorch itself has added support for 16-bit floating-point numbers and model compilation as the technology matured, but progress has been slow due to early design decisions in the PyTorch framework.
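To make the memory savings concrete, here is a back-of-the-envelope sketch. It only multiplies the parameter count by the bits per parameter, so it ignores activations, caches, and any bookkeeping a real quantization format adds; the function name and the 7B example are ours, chosen for illustration.

```python
# Rough weight-memory estimate: parameters x bits per parameter.
# Real usage is higher because activations and format metadata are ignored here.

BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    bits = BITS_PER_PARAM[dtype]
    return num_params * bits / 8 / 1e9


num_params = 7_000_000_000  # a 7B-parameter model, used purely as an example
for dtype in BITS_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions, a 7B model drops from roughly 28 GB of weights at 32 bits to roughly 3.5 GB at 4 bits, which is the difference between needing a data-center GPU and fitting in a laptop's memory.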
Does Quantization Degrade Performance?
At this point, it’s natural to wonder: wouldn’t this severely degrade the accuracy of the model? The short answer is yes, but only if you do it carelessly. Every optimization comes with inherent trade-offs, but with some specialized techniques, researchers have been able to squeeze remarkably stable performance out of highly quantized models. While we won’t go into extreme technical detail, let’s go over the broad strokes of the most common strategies being used right now. If you want to learn more, a guide from Hugging Face covers the topic in more depth.
The first strategy is Calibrated Quantization. During the quantization process, a calibration dataset is run through the model. The values observed inside the model are recorded, and their range is used to determine how the parameters get quantized. Assuming the calibration dataset is representative of the inputs that the model will encounter, this results in improved accuracy of the resulting model.
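The idea of using an observed range to pick a quantization mapping can be sketched in a few lines. This is a simplified, pure-Python illustration of affine int8 quantization, not the procedure used by any particular library; the function names and sample values are made up for the example.

```python
def calibrate(samples):
    """Derive an affine int8 mapping (scale, zero_point) from observed values."""
    lo, hi = min(samples), max(samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / 255 or 1.0       # fall back to 1.0 if the range is empty
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to an int8 code, clamping values outside the calibrated range."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q, scale, zero_point):
    """Map an int8 code back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical values recorded while running a calibration dataset.
observed = [-1.0, -0.2, 0.5, 1.7, 3.0]
scale, zero_point = calibrate(observed)

for x in observed:
    q = quantize(x, scale, zero_point)
    print(f"{x:+.3f} -> {q:4d} -> {dequantize(q, scale, zero_point):+.3f}")

# A value far outside the calibrated range is clamped and loses accuracy.
print(dequantize(quantize(10.0, scale, zero_point), scale, zero_point))
```

The last line shows why the calibration data has to be representative: an input of 10.0 was never seen during calibration, so it is clamped to the top of the range and comes back as roughly 3.0.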
Whereas Calibrated Quantization happens after training, Quantization-Aware Training tries to optimize the model during training. While the model is training, the activations are put through a “fake quantization,” simulating errors that will likely be introduced by the quantization process. The model is then able to adapt to the errors, resulting in a more robust model that can specifically adapt to the potential distortions.
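A "fake quantization" step can be sketched as a quantize-then-dequantize round trip that keeps values in floating point while injecting the rounding error the real quantized model would see. The snippet below is a minimal illustration with a symmetric scheme and made-up activation values; real training frameworks also need a way to pass gradients through the rounding step (commonly a straight-through estimator), which is omitted here.

```python
def fake_quantize(x, scale, num_bits=8):
    """Quantize and immediately dequantize, so the result stays a float."""
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


# Hypothetical activations and a scale covering roughly [-2, 2] with 8 bits.
activations = [0.013, -0.42, 0.77, 1.95]
scale = 2.0 / 127

simulated = [fake_quantize(a, scale) for a in activations]
errors = [s - a for s, a in zip(simulated, activations)]

for a, s, e in zip(activations, simulated, errors):
    print(f"{a:+.4f} -> {s:+.4f} (error {e:+.5f})")
```

During training, the layers downstream of this step see the slightly distorted values, which is what gives the model a chance to adapt to them.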
Llama.cpp and GGUF
While PyTorch quantization and optimizations have long been blocked by framework design, two recent open-source technologies broke through these barriers and made quantization technologies much more accessible to the general public. Let’s briefly cover them below.
Llama.cpp was a project by Georgi Gerganov to port the Llama model into C/C++. This got rid of the complexity introduced by PyTorch, and the native implementation allowed quantization to be implemented directly. Therefore, the resulting model could run with quantization as aggressive as 4-bit integers, allowing high-parameter-count Llama models to be run without a specialized GPU.
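To give a feel for what 4-bit quantization means in practice, here is a simplified sketch of block-wise quantization: weights are grouped into small blocks, each block shares one scale, and every weight is stored as a 4-bit integer. This is an illustration of the general idea only. Llama.cpp's actual formats differ in their details, and the block size, helper names, and sample weights below are assumptions made for the example.

```python
BLOCK_SIZE = 32  # assumed block size for this illustration


def quantize_block(block):
    """Map floats to integers in [-8, 7] that share a single per-block scale."""
    max_abs = max(abs(v) for v in block)
    scale = max_abs / 7 if max_abs else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the 4-bit codes."""
    return [scale * v for v in q]


def pack_nibbles(q):
    """Store two 4-bit codes per byte (shifted by 8 so they are unsigned)."""
    assert len(q) % 2 == 0
    return bytes((a + 8) | ((b + 8) << 4) for a, b in zip(q[0::2], q[1::2]))


# Made-up weights for one block, spread between -1.0 and just under 1.0.
weights = [(i - 16) / 16 for i in range(BLOCK_SIZE)]

scale, codes = quantize_block(weights)
packed = pack_nibbles(codes)
restored = dequantize_block(scale, codes)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(weights)} weights -> {len(packed)} bytes plus one scale")
print(f"largest reconstruction error: {max_error:.4f}")
```

In this sketch, 32 weights that would occupy 128 bytes as 32-bit floats fit into 16 bytes plus a single scale value, at the cost of a small reconstruction error per weight.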
The project has since been extended by the community to include a roster of open-source models, including popular ones like Falcon and Mistral.
GGUF is Llama.cpp’s file format for storing and transferring model information. Quantized models are stored in this format so that they can be loaded and run by the end-user. GGUF is the successor format to GGML and aims to improve on GGML by providing more extensibility, backward compatibility, and stability while still allowing for rapid development.
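For the curious, a GGUF file starts with a small fixed header that can be inspected without loading the model. The sketch below assumes the little-endian layout described in the GGUF specification for version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts); treat the exact layout as an assumption to verify against the specification. The function name is ours, and the demo writes a tiny synthetic header rather than reading a real model.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Read the fixed-size header fields at the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic was {magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        # Assumption: 64-bit counts, as in GGUF version 2 and later.
        tensor_count, metadata_kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Demo on a synthetic header so the example runs without downloading a model.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3) + struct.pack("<QQ", 291, 19))
    demo_path = tmp.name

header = read_gguf_header(demo_path)
print(header)
os.remove(demo_path)
```

Pointing the same function at a downloaded model file should report how many tensors and metadata entries it contains.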
The development of a universal file format opened the door for the open-source community to extend Llama.cpp to optimize other models, and innovators like TheBloke and LLMWare have been working over the last several months to miniaturize popular open-source models.
LLMWare’s Quantized Dragon Model
In today’s example, we will be using open-source libraries and quantized models provided by LLMWare, which provides convenient tools for quickly building specialized RAG workflows.
Who is LLMWare?
LLMWare, a generative AI company specializing in the legal and financial industries, has been actively involved in the quantization community. As I’ve written before, their focus on privacy-conscious sectors makes them a natural candidate for experimenting with and innovating in miniaturization technologies.
Previously, I wrote about their RAG-optimized BLING models that squeeze incredible performance out of 1 to 3 billion parameter models for specialized tasks like contract review and financial analysis. While most open-source models with such parameter counts tend to be only useful for toy problems, LLMWare is able to generate production-ready performance out of these models by training them for narrowly targeted tasks. These miniaturized models are then able to run without an external GPU, allowing for increased privacy and scalability.
What is Dragon?
Dragon is a collection of LLMs that can be thought of as more powerful versions of their BLING cousins. The original intent of Dragon was to train a higher parameter model using the same instruction fine-tuning techniques, providing an option to users that need more performance and have access to lower-end GPUs.
The added parameter count resulted in more powerful models that could leverage larger context windows and generate more complex outputs, but required the user to have more specialized hardware, such as a GPU-embedded laptop or a cloud compute container with a GPU attached. However, they still represented an improvement over the extremely large models, which would require waiting for access to scarce A40 or A100 GPUs.
Quantized Dragon, the Best of Both Worlds
Given the above, it is easy to see why quantization gave a significant boost to LLMWare’s suite of AI tools. With quantization, a user could run Dragon-tier models in the same environment as BLING models, allowing for much more powerful analysis on commodity computers.
Over the course of the last month, LLMWare has published quantized versions of several Dragon models. Today, we will evaluate LLMWare’s Dragon model built on top of Llama with a legal analysis RAG problem and compare it with a similar BLING model. Interested readers can also explore other models - a Mistral-based model and a Yi-based model are available from LLMWare at the time of this writing. Additionally, LLMWare has made running inferences on Llama.cpp models a breeze with their tight integrations with the ctransformers library, which allows GGUF models to be swapped seamlessly with PyTorch-based models.
We will use a MacBook Air with an M1 chip for this experiment, meaning we will only be using widely available hardware for this exercise.
Testing Out Quantized Dragon
Remember that in my previous article, we built a RAG application focused on legislation search. We used vector search to quickly search through several large pieces of legislation, found sections relevant to our question about Qualified Opportunity Zone Partnership Interest, and ran the question through a BLING model. In today’s article, we will run the same question through LLMWare’s quantized Dragon model and determine if it performs better than BLING models.
In order to focus on model comparison and to reduce the amount of prior knowledge required, we will do a lot of the PDF parsing and vector search manually. This has the added benefit of making the problem artificially harder for the model - LLMWare’s default embedding search chunks the source material to about 1000 tokens, but handling the parsing manually allows us to bump the context up to around 3000 tokens. This will help us clearly demonstrate the difference between the Dragon and BLING models.
However, you should be able to easily integrate with the rest of LLMWare’s ecosystem if you want to leverage their tools by following the setup steps from my last article on LLMWare. In fact, if you simply replace the name of the BLING models with the quantized Dragon model from this article, everything should run seamlessly.
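To make the chunk-size trade-off mentioned above concrete, here is a rough sketch of splitting text by an approximate token budget. It uses whitespace-separated words as a stand-in for tokens, which is only an approximation, and it is not LLMWare's chunking logic; the function name and sample text are made up for illustration.

```python
def chunk_words(text, max_tokens):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# A stand-in document of 6,000 "tokens".
document = " ".join(f"word{i}" for i in range(6000))

small_chunks = chunk_words(document, 1000)
large_chunks = chunk_words(document, 3000)

print(len(small_chunks), "chunks of about 1000 tokens")
print(len(large_chunks), "chunks of about 3000 tokens")
```

Larger chunks mean fewer pieces to search, but each piece handed to the model is about three times longer, which is what makes the task harder for smaller models.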
Without further ado, let’s get started!
First, let’s import the required dependencies:
import sklearn.metrics # for cosine similarity
from llmware.prompts import Prompt
from openai import OpenAI
from PyPDF2 import PdfReader
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
We can now load the PDF. In the previous example, we loaded several large pieces of legislation, but for today, we will only focus on the PDF version of the Tax Cuts and Jobs Act of 2017.
reader = PdfReader("[path to PDF of tax cuts and jobs act]")  # replace the placeholder with your local file path
Now we can generate the embeddings for each page:
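embeddings = []
for pg in reader.pages:
    text = pg.extract_text()
    # assumption: the model name was lost from the original snippet; "text-embedding-ada-002" is a stand-in
    resp = client.embeddings.create(input=text, model="text-embedding-ada-002")
    embeddings.append(resp.data[0].embedding)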
Let’s also generate the embeddings for the question we’re going to ask:
question = 'What is a qualified opportunity zone partnership interest?'
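# the model name is an assumption (the original arguments were lost); it must match the model used for the page embeddings
q_embed = client.embeddings.create(input=question, model="text-embedding-ada-002").data[0].embedding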
With the embedding in hand, we can perform a vector search. Because our search space is small, we can just do this manually.
cos_sim = [(idx, sklearn.metrics.pairwise.cosine_similarity([e], [q_embed])[0][0]) for idx, e in enumerate(embeddings)]
Now we can take the most relevant page (which is index 132 or page 133 if you want to verify the results):
most_relevant = sorted(cos_sim, key=lambda x: x[1], reverse=True)  # sort by similarity score, best match first
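If you would like to see what this manual vector search does without calling any external service, the ranking step can be sketched in pure Python. The tiny two-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the helper names are ours.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pages(page_vectors, query_vector):
    """Return (page_index, score) pairs sorted from most to least similar."""
    scores = [(idx, cosine(vec, query_vector)) for idx, vec in enumerate(page_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy "page embeddings" and a toy "question embedding".
pages = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.0, 1.0]

ranking = rank_pages(pages, query)
best_index, best_score = ranking[0]
print(f"most relevant page index: {best_index} (score {best_score:.2f})")
```

Sorting on the score (the second element of each pair) rather than on the page index is the important detail; the first entry of the sorted list is then the page to hand to the model.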
And with that, we have come to the most crucial step. We will instantiate an LLMWare Prompt object with the quantized Llama Dragon model. The Prompt class is key here because it handles the prompt engineering for us and makes sure that our prompt is consistent with the structure of Dragon’s training data. The Prompt class also automatically handles the llama.cpp binding, so you can use the quantized Dragon model exactly like other models.
model_name = "llmware/dragon-llama-7b-gguf"
prompter = Prompt().load_model(model_name)
response = prompter.prompt_main(question, context='\n\n'.join([reader.pages[most_relevant[0][0]].extract_text()]))
Wait a little while, and you should see the function call return. Now print the results:
print(response["llm_response"])
And you should see something like the following:
• A capital or profits interest acquired by the qualified opportunity fund after December 31, 2017, from the partnership solely in exchange for cash;
• As of the time such interest was acquired, the partnership was a qualified opportunity zone business (or, in the case of a new partnership, it was being organized for purposes of being a qualified opportunity zone business);
• During substantially all of the qualified opportunity fund's holding period for such interest, the partnership qualified as a qualified opportunity zone business.
This is quite a good answer!
For comparison, let’s see how a BLING model would perform on the same problem. One of the issues we can expect is that the large context size can “overwhelm” a lower-parameter model and lead to a less informative answer. In my previous experiments, the sheared llama 2.7b was one of the best performers for this problem, so I decided to use that as the representative of the BLING models.
model_name_2 = "llmware/bling-sheared-llama-2.7b-0.1"
prompter2 = Prompt().load_model(model_name_2)
response = prompter2.prompt_main(question, context='\n\n'.join([reader.pages[most_relevant[0][0]].extract_text()]))
print(response["llm_response"])
After some processing, you should see something like this.
A qualified opportunity zone partnership interest is a capital or profits interest in a domestic partnership if such interest is acquired by the qualified opportunity fund after December 31, 2017, from the partnership solely in exchange for cash.
The response is still good but misses some of the details captured by the Dragon model. Specifically, the answer misses the holding period requirement and the new business case. This conforms to our expectations about lower-parameter models’ difficulty with processing larger contexts. Interested readers can extend this experiment by using even lower-parameter models or increasing the size of the given context. You should see the effect become increasingly pronounced, until the model eventually gives only a short, garbled answer.
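One simple way to make this comparison repeatable is to check each answer for the elements we expect to see. The sketch below does a plain substring check against the two answers quoted above. The key phrases are hand-picked for this one question, so treat it as an illustration and not as a general evaluation method.

```python
# Elements of the definition we hope an answer mentions (hand-picked for this question).
KEY_PHRASES = ["solely in exchange for cash", "new partnership", "holding period"]


def coverage(answer, phrases=KEY_PHRASES):
    """Return {phrase: bool} showing which phrases appear in the answer."""
    lowered = answer.lower()
    return {phrase: phrase in lowered for phrase in phrases}


dragon_answer = (
    "A capital or profits interest acquired by the qualified opportunity fund "
    "after December 31, 2017, from the partnership solely in exchange for cash; "
    "As of the time such interest was acquired, the partnership was a qualified "
    "opportunity zone business (or, in the case of a new partnership, it was being "
    "organized for purposes of being a qualified opportunity zone business); "
    "During substantially all of the qualified opportunity fund's holding period "
    "for such interest, the partnership qualified as a qualified opportunity zone business."
)

bling_answer = (
    "A qualified opportunity zone partnership interest is a capital or profits "
    "interest in a domestic partnership if such interest is acquired by the "
    "qualified opportunity fund after December 31, 2017, from the partnership "
    "solely in exchange for cash."
)

for name, answer in [("Dragon", dragon_answer), ("BLING", bling_answer)]:
    found = coverage(answer)
    print(name, sum(found.values()), "of", len(found), "key phrases:", found)
```

With these phrases, the Dragon answer covers all three elements while the BLING answer covers only the cash-exchange requirement, matching the reading above.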
This experiment suggests that quantized Dragon models are able to outperform lower-parameter models for their intended use cases, with no noticeable compromise in accuracy from quantization.
And with that, we have used a quantized model to solve a real-world use case and learned about its performance characteristics in the process!
Today, we explored the exciting field of LLM quantization and looked at how companies like LLMWare are taking advantage of these developments to enhance their specialized language models. As I’ve argued before, miniaturization represents one of the most promising paths to the widespread adoption of AI technologies. By combining specialization, fine-tuning, and quantization, innovators in the AI space can create scalable and performant models that solve real-world problems.
By the way, I am working on an exciting project that seeks to use language AI and miniaturization to revolutionize education in the developing world. We’re working with incredible activists and educators worldwide, and we’re working to bridge the global digital divide. If you’d like to learn more about my project or simply want to talk about exciting developments in the LLM space, please do not hesitate to reach out to me on either GitHub or LinkedIn.