From Brain to Binary: can neuro-inspired research make CPUs the future of AI Inference?

Wed, Nov 22, 2023

In the ever-evolving landscape of AI, the demand for powerful Large Language Models (LLMs) has surged. This has led to an unrelenting thirst for GPUs and a shortage that causes headaches for many organizations. But what if there was a way to use LLMs efficiently without relying solely on scarce GPUs?

Numenta, a Bay Area-based company with nearly two decades of neuroscience research (perhaps best known for the influential book A Thousand Brains: A New Theory of Intelligence, published by its co-founder Jeff Hawkins), has recently launched its first commercial product: NuPIC.

The name stands for “Numenta Platform for Intelligent Computing,” and it's an LLM acceleration platform that applies neuroscience principles to achieve remarkable speedups on CPU-based LLM inference, promising even to surpass GPU performance in many scenarios.

Image showing Numenta's identity: "Today's AI is based on neuroscience from the '50s and '60s. Imagine what it could do if it incorporated the latest breakthroughs." Source: Numenta.

Through a partnership with Numenta, we got early access to NuPIC and in this blog post, we will explore the platform’s architecture and then delve into its capabilities, showing you how to solve a classical use case of semantic search. To test the CPU vs GPU claims, we will do comparative benchmarks against Hugging Face's BERT on CPU and GPU and set the stage for discussing its pros, cons, and potential applications.

Join us on this exploration as we examine NuPIC, the claims surrounding its performance, and its potential role in LLM inference.

Full disclosure:

Tryolabs has partnered with Numenta to support the development of NuPIC. Our team worked closely with Numenta engineers to design and implement key infrastructure for NuPIC's modules, including the Inference Server and Training Module.

This hands-on involvement has provided us with valuable technical expertise and experience using NuPIC. While we aim to provide an objective view of the platform, our collaboration with Numenta is relevant to disclose upfront.


But what is NuPIC?

NuPIC aims to accelerate LLM inference on CPUs. But what does that look like in practice? It currently consists of two Docker containers, which allow easy integration into any AI inference infrastructure:

1. NuPIC Optimized Inference Server

This server runs on a CPU and presents a simple interface for running LLM inference on different NLP tasks, such as classification, Q&A, or embeddings.

Key features:

  • Handles multiple simultaneous requests (each inference runs on a single core) and serves several models simultaneously without additional configuration.
  • Achieves better throughput than an NVIDIA A100 GPU when running on a 4th gen Intel Xeon CPU (according to this case study) by leveraging the AVX-512 and AMX instruction sets.

NuPIC is delivered with a set of optimized BERT and BERT Large models.


2. NuPIC Training Module

This module runs on GPUs and allows fine-tuning the optimized models on your own data.

Key features:

  • Improves application accuracy.
  • Output models can be transferred to the Inference Server for use in your application.

The NuPIC Training Module includes a Weights & Biases integration for experiment tracking.


With NuPIC's architecture in mind, let's see it in action for a semantic search use case.

Imagine you're a developer building a content-rich website for a travel agency. Your goal is to empower users to find their dream destinations effortlessly, making the most out of the intent in their words.

Illustration showing a developer looking through different images of dream destinations.

In the world of travel, words carry a world of meaning. You want users to type in their travel desires, whether it's "beach paradises," "cultural escapades," or "adventurous getaways," and receive tailored recommendations that capture the essence of their desires. Enter semantic search – the art of understanding not just keywords but the intent behind them. But the question is: how to perform this intent understanding?

The challenge: machine intent-understanding

Traditional keyword-based search engines often fall short when capturing semantic nuances; that’s why an AI approach is more suitable here, given the recent rise of LLMs and their interpretation power.

A BERT-based model is appropriate for this task, as such models are specifically trained to represent text with embeddings: a mathematical representation that encapsulates the contextual meaning of the text. In particular, Sentence-BERT, or SBERT for short, is a powerful language model specifically trained to better understand the structure of whole sentences, much more than its predecessors. This makes it doubly appropriate for retrieving meaningful embeddings and paves the way for a more intelligent and intuitive search experience.

The solution: NuPIC's SBERT

  • Pre-trained magic: NuPIC's SBERT model was pre-trained on a vast text corpus, including travel guides, blogs, and reviews. This means it understands the intricacies of travel lingo, from the excitement of discovering hidden gems to the tranquility of sunset beach strolls.
  • Embeddings unleashed: Let's say a user searches for "secluded mountain retreats." NuPIC's SBERT converts this query into an embedding that captures the essence of this phrase.
  • Mapping meanings: Your website's content, from travel package descriptions to user reviews, is also transformed into embeddings using SBERT and stored in a vector database. Now, it's not just about matching keywords; it's about matching meanings.
  • Tailored recommendations: You can provide excellent recommendations by retrieving the closest embeddings to the user request! But you're not just providing recommendations based on keyword matches; you're offering suggestions that resonate with the user's intent. If users seek the serenity of mountains, they won't be bombarded with options for bustling city breaks.

Implementation steps

Now, we'll guide you through implementing your search functionality using NuPIC's SBERT model. From infrastructure setup to real-time embeddings and personalized recommendations, let's explore the journey to enriching user interactions and understanding their intent.

Image showing 4 different steps for using NuPIC's SBERT model.
0. Getting access to NuPIC

As of September 2023, NuPIC is available, and you can contact Numenta to request a demo.

1. Setting up the inference server: deploy NuPIC

To get started, you'll need a Linux machine with Docker installed, equipped with an Intel CPU that supports the AMX instruction set and at least 6 GB of RAM.

Then, you must download NuPIC’s Inference Server container using the Numenta-provided download.sh script and run it with the run_inference_server.sh script.

This server will play a vital role in generating embeddings and facilitating semantic searches.

2. Building your knowledge base: vector dataset population

Vector databases, such as Chroma or Pinecone, are a great solution for storing, managing, and querying embeddings. You must configure your vector database and store all the travel guide summaries in it as embeddings.

You'll need to convert the travel guide summaries into embeddings using NuPIC's SBERT model. Thankfully, it's very straightforward: here is example code showing how to generate the embeddings with the NuPIC client in Python for a fixed set of sentences:
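
The exact client interface is not reproduced here, so treat this as a minimal sketch: the import path, class name, and method name (NuPICClient, get_embeddings), as well as the host, port, and model name, are assumptions for illustration only.

```python
import numpy as np

# NOTE: the import path, class, and method names below are assumptions made for
# illustration; check the NuPIC client documentation for the exact interface.
from nupic_client import NuPICClient

# Connect to the running NuPIC Inference Server (host and port are placeholders).
client = NuPICClient(host="localhost", port=8000)

travel_summaries = [
    "A secluded mountain lodge surrounded by pine forests and hiking trails.",
    "A lively coastal city with white-sand beaches and a vibrant nightlife.",
    "A historic town known for its museums, cathedrals, and local cuisine.",
]

# Request one embedding per summary from the optimized SBERT model.
embeddings = client.get_embeddings(travel_summaries, model="numenta-sbert-2-v2")
embeddings = np.asarray(embeddings)
print(embeddings.shape)  # (number of summaries, embedding dimension)
```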

The generated embeddings capture the contextual meaning of the text and lay the foundation for your semantic search functionality.
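
If you go with Chroma as the vector database, populating it can look roughly like the snippet below; the collection name is illustrative, and travel_summaries and embeddings come from the previous sketch.

```python
import chromadb

# Create an in-memory Chroma client and a collection for the travel guides.
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="travel_guides")

# Store each summary together with its precomputed embedding.
collection.add(
    ids=[f"guide-{i}" for i in range(len(travel_summaries))],
    documents=travel_summaries,
    embeddings=[embedding.tolist() for embedding in embeddings],
)
```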

3. On-the-fly semantics: real-time embedding generation

Enable your backend system to instantly convert user input into embeddings. For this purpose, take advantage of NuPIC's Python client, which simplifies interaction with the NuPIC Inference Server.

By making a single API call using this client, your system can swiftly generate embeddings in real-time in the same way as in the previous step.
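
For instance, the backend could wrap that single call in a small helper, reusing the hypothetical client names from the earlier sketch:

```python
def embed_query(client, text: str) -> list[float]:
    """Convert a single user query into an embedding via the NuPIC Inference Server."""
    # One API call per query; method and model names are assumptions for illustration.
    return client.get_embeddings([text], model="numenta-sbert-2-v2")[0]

query_embedding = embed_query(client, "secluded mountain retreats")
```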

4. Matching meanings, not just keywords: content retrieval

With the user input transformed into an embedding, proceed to query the vector database with it to get the content closest to the input embedding. This content aligns with the user's intent and allows you to pull the corresponding travel guide information. Now, present these tailored recommendations in an appealing and user-friendly manner on your website.

By following these steps, you can harness the capabilities of LLMs to craft a semantic search experience that's both intuitive and intelligent. This approach transcends traditional keyword-based search engines, allowing you to offer your users personalized interactions that truly grasp the essence of their request.
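
As a concrete sketch of the retrieval step, continuing the Chroma example and using the query_embedding produced above (the number of results is arbitrary):

```python
# Retrieve the travel guides whose embeddings are closest to the query embedding.
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
)

# Results are grouped per query; we sent a single query embedding, hence index 0.
for guide_id, document in zip(results["ids"][0], results["documents"][0]):
    print(guide_id, "->", document)
```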

This semantic search example is just one simple use case of NuPIC. You can improve this same example by merging duplicate requests or prioritizing them by importance. You can also enhance other applications, such as question answering (Q&A), to speed up the retrieval of important information in contracts or legal documents. You can also use techniques like Retrieval-Augmented Generation (RAG) with the BERT models to add specific domain knowledge and reduce hallucination in an LLM.


Benchmarking NuPIC against BERT

Now, let's put NuPIC to the test by comparing its performance against Hugging Face's BERT on CPU and GPU. We'll conduct a straightforward experiment to test the efficiency and speed of NuPIC.

You can find all the code and instructions in this repo.


Experiment setup

We'll use the Financial Sentiment Analysis dataset and measure the time required to generate embeddings for each model:

  1. NuPIC's SBERT Large
  2. Hugging Face's BERT Large (CPU)
  3. Hugging Face's BERT Large (GPU)

All CPU experiments ran on a 4th Generation Intel Xeon from an m7i.2xlarge AWS EC2 instance, while the GPU experiment ran on an NVIDIA A100.


NuPIC's SBERT

First, we need to import and set up the NuPIC client to connect to our running NuPIC Inference Server. Then, we iterate over the sentences and use NuPIC's SBERT to get each embedding. Finally, we measure the time it takes with the following script:
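
As before, the client names are assumptions; this sketch also assumes the dataset's sentences have already been loaded into a list called sentences.

```python
import time

# Hypothetical client names, as in the earlier sketches; the real interface may differ.
from nupic_client import NuPICClient

client = NuPICClient(host="localhost", port=8000)

# `sentences` is assumed to hold the texts from the Financial Sentiment Analysis dataset.
start = time.perf_counter()
for sentence in sentences:
    embedding = client.get_embeddings([sentence], model="numenta-sbert-2-v2")[0]
elapsed = time.perf_counter() - start

print(f"Average inference time per sentence: {elapsed / len(sentences) * 1000:.1f} ms")
```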

Hugging Face's BERT on CPU

To make a fair comparison to NuPIC, we will use Hugging Face’s BERT Large model, which has almost the same architecture as the numenta-sbert-2-v2 model we used before. We will only use one thread for inference as the NuPIC Inference Server assigns one thread for each request.

First, let's import the necessary libraries. Next, load the required BERT model and tokenizer. Finally, measure the time it takes to run inference, from text to tokens to embedding, over the whole dataset.
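
A minimal version of that script might look as follows; the exact checkpoint and pooling strategy may differ from the original benchmark, so treat bert-large-uncased and the mean pooling here as assumptions (the sentences list is the same as before).

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

device = "cpu"
torch.set_num_threads(1)  # one thread, mirroring NuPIC's one-thread-per-request setup

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased").to(device).eval()

start = time.perf_counter()
with torch.no_grad():
    for sentence in sentences:  # batch size of 1, matching the NuPIC run
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True).to(device)
        outputs = model(**inputs)
        embedding = outputs.last_hidden_state.mean(dim=1)  # simple mean pooling
elapsed = time.perf_counter() - start

print(f"Average inference time per sentence: {elapsed / len(sentences) * 1000:.1f} ms")
```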

Hugging Face's BERT on GPU

Lastly, let's measure the inference time for Hugging Face's BERT on GPU, by simply replacing the "cpu" with "cuda" at the beginning of the previous script. Remember that this experiment also uses a batch size of 1 for inference.

Results and comparison

Now that we've measured the inference time for NuPIC's SBERT and Hugging Face's BERT on both CPU and GPU, we can compare the results, which show the efficiency of NuPIC's SBERT for CPU-based NLP tasks.

|  | NuPIC | HF (CPU) | HF (GPU) |
| --- | --- | --- | --- |
| Execution time | 21.2 ms | 302.9 ms | 23.6 ms |
| Max hourly throughput* | ~1,358,000 | ~95,000 | ~1,220,000 |
| Cost per 1M requests (USD)** | 1.18 | 16.84 | 26.80 |

The CPU experiments were run on an AWS EC2 m7i.2xlarge instance, and the GPU experiment was run on Google Colab with an NVIDIA A100.

*With a maximum of 8 concurrent requests or a batch size of 8 (i.e., hourly throughput ≈ 8 × 3,600 s / per-request execution time).

**Using AWS as a reference, with a p4d.24xlarge as GPU instance.


In our experiments, NuPIC's SBERT demonstrates a remarkable ~15x speedup over Hugging Face's BERT on CPU, a ~10% speed improvement over the GPU run, and a 14x to 22x cost reduction, making it a compelling choice for CPU-based LLMs, especially in scenarios where GPU resources may be limited or costly to deploy.

One important point to note is that NuPIC handles each request independently and does not require any additional batching to make efficient use of computational resources, as a GPU would. Another important point is that NuPIC can handle several models simultaneously while keeping the same throughput, whereas GPUs see significant throughput drops whenever you need to switch to a different model. This means that, in practice, the throughput gains of NuPIC can be much higher than what this benchmark shows.

How to Scale NuPIC

To ensure NuPIC's scalability, we tested deploying it using infrastructure as code, as sketched below:

  • We began by creating a template instance with the NuPIC inference server, ensuring that every new instance inherits the same configuration and settings.
  • Then, we set up the autoscaling service to allow us to adjust the computational resources to match real-time demand.
  • And finally, we set up the required network services, such as the load balancer, firewall, and proxies.
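
As a rough illustration of this kind of setup (not the exact configuration we used), an AWS CDK sketch in Python could look like the following; the machine image, instance type, ports, and scaling threshold are placeholders.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_autoscaling as autoscaling
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_elasticloadbalancingv2 as elbv2
from constructs import Construct

class NuPICStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        vpc = ec2.Vpc(self, "NuPICVpc", max_azs=2)

        # Auto Scaling group of CPU instances running the NuPIC Inference Server.
        # The machine image below is a placeholder; in practice it would be a
        # template image with the inference server container pre-installed.
        asg = autoscaling.AutoScalingGroup(
            self,
            "NuPICServers",
            vpc=vpc,
            instance_type=ec2.InstanceType("m7i.2xlarge"),
            machine_image=ec2.MachineImage.latest_amazon_linux2(),
            min_capacity=1,
            max_capacity=10,
        )
        # Adjust capacity to real-time demand based on CPU utilization.
        asg.scale_on_cpu_utilization("ScaleOnCpu", target_utilization_percent=70)

        # Load balancer in front of the inference servers.
        lb = elbv2.ApplicationLoadBalancer(self, "NuPICLoadBalancer", vpc=vpc, internet_facing=True)
        listener = lb.add_listener("Listener", port=80)
        listener.add_targets("InferenceTargets", port=8000, targets=[asg])  # port is a placeholder

app = App()
NuPICStack(app, "NuPICScalingStack")
app.synth()
```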

The result? A scalable LLM solution that effortlessly handled surges in workload without breaking a sweat.


So, there you have it, folks! NuPIC stands out by achieving impressive 15x speedups and integrating easily, making it possible to deliver practical and efficient LLM solutions running on a CPU. Now let's point out some pros and cons of NuPIC.

Pros and Cons of NuPIC

NuPIC offers notable benefits but also has some limitations to consider, as summarized below:

| Pros | Cons |
| --- | --- |
| Privacy and control: runs in your infrastructure | Current version only offers variants of BERT out of the box |
| Much faster CPU inference speeds | Newer platform with less integration support |
| Better price/performance ratio than GPUs | Documentation still being worked on |
| Avoids GPU shortage and scaling issues | Manual process for existing models |
| Fine-tuning module |  |

Pros

  • Privacy and control. Runs fully in your infrastructure so you maintain control over data and models. Whether you operate in the cloud or prefer an on-premise environment, NuPIC's privacy policy allows you to make strategic decisions aligned with your organization's unique needs.
  • CPU efficiency. Optimized to achieve much faster LLM inference on CPUs. In our benchmarking experiments, NuPIC running on a CPU outperformed Hugging Face's BERT on both CPU and GPU.
  • Cost-effectiveness. Better price/performance ratio compared to expensive GPUs. A cost-effective solution can be achieved without compromising performance by using CPUs for LLM inference.
  • Availability. Avoids GPU shortage issues. Can scale CPU resources more easily.
  • Fine-tuning module. NuPIC includes a simple training module that allows you to fine-tune NuPIC models with your own data while keeping all the performance optimization.

Cons

  • It is a paid framework. As we've said before, you'll need to contact Numenta's team to get your copy of NuPIC. You'll need to do the math, weighing its pricing against the money you'll be saving on GPUs (which are expensive).
  • Limited model selection. The current version only offers variants of BERT out of the box, with GPT models likely coming in the future. If you are currently working with a different model architecture, you should consider moving to a BERT-like model to get the full optimization that NuPIC offers.
  • Emerging community. As a newer platform, NuPIC has an emerging user community. While resources are still limited today, early adopters can help shape future growth through real-world usage and feedback.
  • Manual process for existing models. You can't just plug-and-play existing models yet. You will need the help of Numenta to put your model through a manual process to convert it to NuPIC format.

Closing thoughts

In this exploration and testing of NuPIC, we've examined what applying neuroscience to AI can yield: notably faster LLM inference on CPUs. With experiments and usage examples, we've benchmarked NuPIC's ability to enhance inference speed, specifically for semantic search.

NuPIC makes a leap in reducing hardware limitations for language models, be it for conversational AI, personalized recommendations, or other language-based AI applications. This is a new tool, so what could the future bring? Exciting times to be in the field of AI inference!

At Tryolabs, we have hands-on experience with NuPIC and are eager to guide you on this new frontier of efficient LLM inference. Our team stays on the cutting edge of AI advances so we can deliver strategic guidance and custom solutions leveraging the latest innovations like NuPIC. Contact us to learn how we can help you build transformative AI applications enhanced by next-generation technology.
