Wed, Nov 22, 2023
In the ever-evolving landscape of AI, the demand for powerful Large Language Models (LLMs) has surged. This has led to an unrelenting thirst for GPUs and a shortage that causes headaches for many organizations. But what if there was a way to use LLMs efficiently without relying solely on scarce GPUs?
Numenta, a Bay Area-based company with nearly two decades of neuroscience research behind it (perhaps best known for the influential book A Thousand Brains: A New Theory of Intelligence, written by its co-founder), has recently launched its first commercial product: NuPIC.
The name stands for “Numenta Platform for Intelligent Computing,” and it's an LLM acceleration platform that applies neuroscience principles to achieve remarkable speedups on CPU-based LLM inference, promising even to surpass GPU performance in many scenarios.
Through a partnership with Numenta, we got early access to NuPIC, and in this blog post we will explore the platform's architecture and then delve into its capabilities, showing you how to solve a classic semantic search use case. To test the CPU vs GPU claims, we will run comparative benchmarks against Hugging Face's BERT on CPU and GPU, setting the stage for a discussion of its pros, cons, and potential applications.
Join us on this exploration as we examine NuPIC, the claims surrounding its performance, and its potential role in LLM inference.
Tryolabs has partnered with Numenta to support the development of NuPIC. Our team worked closely with Numenta engineers to design and implement key infrastructure for NuPIC's modules, including the Inference Server and Training Module.
This hands-on involvement has provided us with valuable technical expertise and experience using NuPIC. While we aim to provide an objective view of the platform, our collaboration with Numenta is relevant to disclose upfront.
NuPIC aims to accelerate LLM inference on CPUs. But what does that look like in practice? It currently consists of two Docker containers, which allow easy integration into any AI inference infrastructure:
This server runs on a CPU and presents a simple interface for LLM inference on different NLP tasks, such as classification, Q&A, or embeddings.
Key features:
NuPIC is delivered with a set of optimized BERT and BERT Large models.
This module runs on GPUs and allows fine-tuning the optimized models on your own data.
Key features:
The NuPIC Training Module includes a Weights & Biases integration for experiment tracking.
With NuPIC's architecture in mind, let's see it in action for a semantic search use case.
Imagine you're a developer building a content-rich website for a travel agency. Your goal is to empower users to find their dream destinations effortlessly, making the most out of the intent in their words.
In the world of travel, words carry a world of meaning. You want users to type in their travel desires, whether it's "beach paradises," "cultural escapades," or "adventurous getaways," and receive tailored recommendations that capture the essence of what they're looking for. Enter semantic search: the art of understanding not just keywords but the intent behind them. But the question is: how do you perform this intent understanding?
Traditional keyword-based search engines often fall short when capturing semantic nuances; that's why an AI approach is more suitable here, especially given the recent rise of LLMs and their interpretive power.
A BERT-based model is appropriate for this task, as such models are specifically trained to represent text with embeddings: mathematical representations that encapsulate the contextual meaning of the text. In particular, Sentence-BERT, or SBERT for short, is a powerful language model specifically trained to understand whole sentences far better than its predecessors, making it doubly appropriate for retrieving meaningful embeddings and paving the way for a more intelligent and intuitive search experience.
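To make the idea concrete, here is a minimal, generic sketch of embedding-based similarity using the open-source sentence-transformers library (not NuPIC's optimized model); the checkpoint name and example sentences are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative open-source SBERT model; NuPIC ships its own optimized variant.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "adventurous getaways"
destinations = [
    "Hiking and white-water rafting through Patagonia.",
    "Relaxing spa weekends in quiet countryside hotels.",
    "Museum tours through Europe's historic capitals.",
]

# Encode the query and the candidate descriptions into dense vectors.
query_emb = model.encode(query, convert_to_tensor=True)
dest_embs = model.encode(destinations, convert_to_tensor=True)

# Cosine similarity ranks candidates by semantic closeness, not keyword overlap.
scores = util.cos_sim(query_emb, dest_embs)[0]
best = int(scores.argmax())
print(destinations[best])  # expected: the Patagonia adventure description
```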
Now, we'll guide you through implementing your search functionality using NuPIC's SBERT model. From infrastructure setup to real-time embeddings and personalized recommendations, let's explore the journey to enriching user interactions and understanding their intent.
As of September 2023, NuPIC is available, and you can contact Numenta to request a demo.
To get started, you'll need a Linux machine equipped with an Intel CPU that supports the AMX instruction set, at least 6 GB of RAM, and Docker installed.
You can check the AWS EC2 instance types with AVX-512 and AMX support, or the GCP instance types with AVX-512 and AMX support.
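As a quick sanity check before installing, you can verify that the CPU exposes the AMX extensions. On Linux the kernel typically reports them as amx_tile, amx_bf16, and amx_int8 flags in /proc/cpuinfo (flag names assumed from standard Linux CPU feature reporting):

```python
# Quick check that the host CPU advertises the AMX extensions on Linux.
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()

amx_flags = [flag for flag in ("amx_tile", "amx_bf16", "amx_int8") if flag in cpuinfo]
if amx_flags:
    print("AMX support detected:", ", ".join(amx_flags))
else:
    print("No AMX flags found; this machine may not meet NuPIC's requirements.")
```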
Then, you must download NuPIC's Inference Server container using the Numenta-provided `download.sh` script and run it with the `run_inference_server.sh` script.
This server will play a vital role in generating embeddings and facilitating semantic searches.
Vector databases, such as Chroma or Pinecone, are a great solution to store, manage, and query embeddings. You must configure your vector database and store all the travel guide summaries in it as embeddings.
You'll need to convert the travel guide summaries into embeddings using NuPIC's SBERT model to achieve this. Thankfully, it's very straightforward: the example below shows how to generate embeddings with the NuPIC client in Python for a fixed set of sentences.
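This is a minimal sketch of that flow rather than Numenta's official example: the client class, import path, and method names (NuPICClient, get_embeddings) and the server address are assumptions, while the `numenta-sbert-2-v2` model name is the one used in the benchmark section later in this post.

```python
import numpy as np

from nupic_client import NuPICClient  # hypothetical import path; see Numenta's client docs

# Point the client at the running NuPIC Inference Server (address and port assumed).
client = NuPICClient(url="localhost:8000")

travel_summaries = [
    "White-sand beaches and coral reefs perfect for snorkeling.",
    "A cultural escapade through temples, markets, and street food.",
    "Adventurous getaways: trekking, rafting, and mountain camping.",
]

# Request SBERT embeddings from the Inference Server (method and model names assumed).
embeddings = np.asarray(client.get_embeddings(travel_summaries, model="numenta-sbert-2-v2"))
print(embeddings.shape)  # (number of summaries, embedding dimensionality)
```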
The generated embeddings capture the contextual meaning of the text and lay the foundation for your semantic search functionality.
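As a sketch of the storage step, here is how those summaries and embeddings could be loaded into Chroma, reusing the variables from the previous snippet; the collection name and metadata are illustrative only:

```python
import chromadb

# Local, in-process Chroma instance; a managed vector database works the same way conceptually.
chroma = chromadb.Client()
collection = chroma.create_collection(name="travel_guides")

# Store each travel guide summary alongside its precomputed NuPIC embedding.
collection.add(
    ids=[f"guide-{i}" for i in range(len(travel_summaries))],
    documents=travel_summaries,
    embeddings=embeddings.tolist(),
    metadatas=[{"source": "travel-guide"} for _ in travel_summaries],
)
```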
Enable your backend system to instantly convert user input into embeddings. For this purpose, take advantage of NuPIC's Python client, which simplifies interaction with the NuPIC Inference Server.
By making a single API call using this client, your system can swiftly generate embeddings in real-time in the same way as in the previous step.
With the user input transformed into an embedding, query the vector database with it to retrieve the content closest to the input embedding. This content aligns with the user's intent and lets you pull the corresponding travel guide information. Now, present these tailored recommendations in an appealing and user-friendly manner on your website. By following these steps, you can harness the capabilities of LLMs to craft a semantic search experience that's both intuitive and intelligent. This approach transcends traditional keyword-based search engines, allowing you to offer your users personalized interactions that truly grasp the essence of their requests.
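To make the retrieval step concrete, here is a sketch of the query path, reusing the hypothetical NuPIC client and the Chroma collection from the earlier snippets:

```python
# 1. Embed the user's query with the same model used to index the travel guides.
user_query = "adventurous getaways"
query_embedding = client.get_embeddings([user_query], model="numenta-sbert-2-v2")[0]  # method name assumed

# 2. Retrieve the semantically closest travel guides from the vector database.
results = collection.query(query_embeddings=[list(query_embedding)], n_results=3)

# 3. Use the matched documents to build the recommendations shown on the website.
for doc_id, doc in zip(results["ids"][0], results["documents"][0]):
    print(doc_id, "->", doc)
```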
This semantic search example is just one simple use case of NuPIC. You can improve this same example by merging duplicate requests or prioritizing them by importance. You can also extend the approach to other applications, such as Question Answering (Q&A), to speed up the retrieval of important information in contracts or legal documents. The BERT models can also power techniques like Retrieval-Augmented Generation (RAG), adding domain-specific knowledge and reducing hallucinations in a generative LLM.
Now, let's put NuPIC to the test by comparing its performance against Hugging Face's BERT on CPU and GPU. We'll conduct a straightforward experiment to test the efficiency and speed of NuPIC.
You can find all the code and instructions in this repo.
We'll use the Financial Sentiment Analysis dataset and measure the time required to generate embeddings for each model.
All CPU experiments ran on a 4th Generation Intel Xeon from an `m7i.2xlarge` AWS EC2 instance, while the GPU experiment ran on an NVIDIA A100.
First, we need to import and set up the NuPIC client to connect to our running NuPIC Inference Server. Then, we iterate over the sentences and use NuPIC's SBERT to get the embeddings. Finally, we measure the time this takes (the full script is in the repo linked above).
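A minimal sketch of that script follows; the dataset identifier and the NuPIC client calls (NuPICClient, get_embeddings) are assumptions consistent with the earlier snippets:

```python
import time

from datasets import load_dataset  # Hugging Face datasets library
from nupic_client import NuPICClient  # hypothetical import path, as before

# Load the financial sentiment sentences (dataset identifier assumed).
sentences = load_dataset("financial_phrasebank", "sentences_allagree", split="train")["sentence"]

client = NuPICClient(url="localhost:8000")  # running NuPIC Inference Server (address assumed)

# Send one sentence per request, mirroring the batch-size-1 setup of the other benchmarks.
start = time.perf_counter()
for sentence in sentences:
    _ = client.get_embeddings([sentence], model="numenta-sbert-2-v2")  # method name assumed
elapsed = time.perf_counter() - start

print(f"Average inference time: {1000 * elapsed / len(sentences):.1f} ms per sentence")
```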
To make a fair comparison to NuPIC, we will use Hugging Face's BERT Large model, which has almost the same architecture as the `numenta-sbert-2-v2` model we used before. We will only use one thread for inference, as the NuPIC Inference Server assigns one thread to each request.
First, let's import the necessary libraries. Next, load the required BERT model and tokenizer. Finally, measure the time it takes to run inference, from text to tokens to embeddings, over the whole dataset.
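Here is a sketch of what that benchmark can look like with the transformers library; the checkpoint (bert-large-uncased) and the mean-pooling step are assumptions standing in for the exact script in the repo:

```python
import time

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

device = "cpu"  # switch to "cuda" for the GPU experiment
torch.set_num_threads(1)  # a single inference thread, matching the NuPIC Inference Server setup

# Load the financial sentiment sentences (dataset identifier assumed).
sentences = load_dataset("financial_phrasebank", "sentences_allagree", split="train")["sentence"]

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")  # checkpoint assumed
model = AutoModel.from_pretrained("bert-large-uncased").to(device).eval()

start = time.perf_counter()
with torch.no_grad():
    for sentence in sentences:  # batch size of 1
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True).to(device)
        outputs = model(**inputs)
        # Mean-pool the token representations into a single sentence embedding.
        embedding = outputs.last_hidden_state.mean(dim=1)
elapsed = time.perf_counter() - start

print(f"Average inference time: {1000 * elapsed / len(sentences):.1f} ms per sentence")
```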
Lastly, let's measure the inference time for Hugging Face's BERT on GPU by simply replacing `"cpu"` with `"cuda"` at the beginning of the previous script. Remember that this experiment also uses a batch size of 1 for inference.
Now that we've measured the inference time for NuPIC's SBERT and Hugging Face's BERT on both CPU and GPU, we can compare the results, which show the efficiency of NuPIC's SBERT for CPU-based NLP tasks.
| | NuPIC | HF (CPU) | HF (GPU) |
|---|---|---|---|
| Execution time | 21.2 ms | 302.9 ms | 23.6 ms |
| Max hourly throughput* | ~1,358,000 | ~95,000 | ~1,220,000 |
| Cost per 1M requests (USD)** | 1.18 | 16.84 | 26.80 |
The CPU experiments were run on an AWS EC2 `m7i.2xlarge` instance, and the GPU experiment was run on Google Colab with an NVIDIA A100.
*With a maximum of 8 concurrent requests or a batch size of 8.
**Using AWS as a reference, with a `p4d.24xlarge` as the GPU instance.
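For reference, the throughput figures above line up with a simple back-of-the-envelope calculation from the measured latencies and the 8 concurrent requests mentioned in the footnote (our own check, not Numenta's methodology):

```python
# Hourly throughput ≈ concurrent requests × seconds per hour / per-request latency.
latencies_ms = {"NuPIC": 21.2, "HF (CPU)": 302.9, "HF (GPU)": 23.6}
concurrency = 8

for name, latency in latencies_ms.items():
    throughput = concurrency * 3600 / (latency / 1000)
    print(f"{name}: ~{throughput:,.0f} requests/hour")
# NuPIC: ~1,358,491   HF (CPU): ~95,081   HF (GPU): ~1,220,339
```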
In our experiments, NuPIC's SBERT demonstrates a remarkable ~15x speedup over Hugging Face's BERT on CPU, a ~10% speed improvement over the GPU, and a 14x to 22x cost reduction, making it a compelling choice for CPU-based LLMs, especially in scenarios where GPU resources may be limited or costly to deploy.
An important point to note is that NuPIC handles each request independently and does not require any additional batching to make efficient use of the computational resources, as a GPU would. Another important point is that NuPIC can handle several models simultaneously while keeping the same throughput, whereas GPUs suffer significant throughput drops whenever you need to switch to a different model. This means that in practice, the throughput gains of NuPIC can be much higher than what this benchmark shows.
We tested this to ensure NuPIC's scalability using infrastructure as code.
The result? A scalable LLM solution that effortlessly handled surges in workload without breaking a sweat.
So, there you have it, folks! NuPIC stands out because it achieves impressive 15x speedups and allows for easy integration, making it possible to deliver practical and efficient LLM solutions running on a CPU. Now let's point out some pros and cons of NuPIC.
NuPIC offers notable benefits but also has some limitations to consider, as summarized below:
| Pros | Cons |
|---|---|
| Privacy and control: runs in your infrastructure | Current version only offers BERT variants out of the box |
| Much faster CPU inference speeds | Newer platform with less integration support |
| Better price/performance ratio than GPUs | Documentation still being worked on |
| Avoids GPU shortage and scaling issues | Manual process for existing models |
| Fine-tuning module | |
In this exploration and testing of NuPIC, we've examined the results of applying neuroscience to AI, achieving notable LLM inference speeds on CPUs. Through experiments and usage examples, we've benchmarked NuPIC's capabilities to enhance inference speed, specifically for semantic search.
NuPIC makes a leap in reducing hardware limitations for language models, whether for conversational AI, personalized recommendations, or other language-based AI applications. This is a new tool, so what could the future bring? Exciting times to be in the field of AI inference!
At Tryolabs, we have hands-on experience with NuPIC and are eager to guide you on this new frontier of efficient LLM inference. Our team stays on the cutting edge of AI advances so we can deliver strategic guidance and custom solutions leveraging the latest innovations like NuPIC. Contact us to learn how we can help you build transformative AI applications enhanced by next-generation technology.