Cost Of Self Hosting Llama-3 8B-Instruct

Sid Premkumar


How much does it cost to self host an LLM?

⚡️ TLDR: Assuming 100% utilization of your model, Llama-3 8B-Instruct costs about $17 per 1M tokens when self hosted on AWS EKS, whereas ChatGPT charges about $1 per 1M tokens for the same workload. Buying and hosting the hardware yourself drops the cost to roughly $0.64 per 1M tokens, but it takes ~5.5 years to break even on the upfront hardware.

A question we often get is how to scale with ChatGPT. One thing we wanted to try was determining the cost of self hosting an open model such as Llama-3.


Determining the best hardware

Context: All tests were run on an EKS cluster. Each test was allocated the resources of the entire node; nothing else was running on that node other than system pods (prometheus, nvidia-daemon, etc.).

We wanted to start small: could we run it on a single NVIDIA Tesla T4 GPU? I started with AWS's g4dn.2xlarge instance, which offers exactly that (source).

The short answer is that neither the 8B param nor the 70B param version of Llama 3 worked on this hardware. We initially tried the 70B param version and constantly ran into OOM issues. We then sobered up and tried the 8B param version. Although this time we were able to get a response, generating it took what felt like ~10 minutes. I checked nvidia-smi and the single GPU was being used; it just wasn't enough.

For context, 8B vs 70B refers to the number of parameters in the model, which generally translates to the quality of the model's output. The advantage of the 8B param version is its smaller size and resource footprint.

(Screenshot: CUDA out-of-memory error)

So I decided to switch to the g4dn.12xlarge instance type, which comes with four T4 GPUs (source).

Initial Implementation

My initial implementation was to copy the example code from the Llama-3 Hugging Face model card:

import time

import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the 8B instruct model, sharding it across the available GPUs in bfloat16
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Stop on either the tokenizer's EOS token or Llama-3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

now = time.time()
print(f"Starting pipeline now: {now}...")
outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
model_output = outputs[0]["generated_text"][-1]
later = time.time()
difference = int(later - now)
print(f"Prompt took: {difference} seconds to respond: {model_output}")

These results seemed a lot more promising, with the response time dropping from ~10 minutes to consistently under 10 seconds.

Initial Implementation Results

I was able to get a response in a much more reasonable 5-7 seconds. At this point I wanted to start calculating the cost of this request.

g4dn.12xlarge costs $3.912 per hour on demand (source).

If we assume a full month of use, that costs:

Per hour cost * 24 * 30 = Total Cost
$3.912 * 24 * 30 = $2,816.64

I had trouble getting the token count out of Hugging Face, so I ended up using llama-tokenizer-js to get an approximation of the tokens used.

(Screenshot: llama-tokenizer-js token count)

Note: This ended up being an incorrect way of calculating the tokens used.
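For what it's worth, the token count can also be read directly from the model's own tokenizer on the server side instead of a browser-based approximation. A minimal sketch, assuming you have access to the gated meta-llama repo on Hugging Face:

from transformers import AutoTokenizer

# The same tokenizer the model uses, so counts match what the pipeline actually processes
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

text = "Arrr, I be a pirate chatbot!"
token_ids = tokenizer.encode(text)
print(f"{len(token_ids)} tokens")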

If we look at this result, we assume we used 39 tokens over ~6 seconds. Assuming I'm processing tokens 24/7 and extrapolating, we get:

39 tokens over 6 seconds = 6.5 tokens/s

With 6.5 tokens per second, over the month I can process:

Tokens Per Second * 60 (to get to minutes) * 60 (to get to hours) * 24 (to get to days) * 30 (to get to month)  = Total Tokens Per Month
6.5 * 60 * 60 * 24 * 30 = 16,848,000

With 16,848,000 tokens per month, every million tokens costs:

Total Cost / Total Tokens processed => Cost per token * 1,000,000 => Cost Per 1M token
( 2,816.64 / 16,848,000 ) * 1,000,000 = $167.17
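To keep the arithmetic in one place, here is a small helper that reproduces these numbers; the figures are the ones measured above, nothing new is assumed:

HOURS_PER_MONTH = 24 * 30  # the post's simplifying assumption of a 30-day month


def monthly_cost(hourly_cost: float) -> float:
    """On-demand cost of running the instance for a full month."""
    return hourly_cost * HOURS_PER_MONTH


def tokens_per_month(tokens_per_second: float) -> float:
    """Tokens generated at 100% utilization for a full month."""
    return tokens_per_second * 60 * 60 * HOURS_PER_MONTH


def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    return monthly_cost(hourly_cost) / tokens_per_month(tokens_per_second) * 1_000_000


print(cost_per_million_tokens(3.912, 6.5))  # ≈ $167 per 1M tokens with the Hugging Face pipeline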

😵 Yikes. It is not looking good. Let's compare this to what ChatGPT 3.5 can get us.

Looking at their pricing, ChatGPT 3.5 Turbo charges $0.50 per 1M input tokens and $1.50 per 1M output tokens. For the sake of simplicity, assuming a roughly even split between input and output tokens, they charge about $1 per 1M tokens, and that's the number to beat. Far from my $167.17.

Realizing Something Was Wrong

At this point I felt I was doing something wrong. Llama 3 with 8B params is hard to run, but it shouldn't be this hard, especially considering I had 4 GPUs available 🤔.

I decided to try vLLM to host an API server instead of attempting to do it myself via the Hugging Face libraries.

This was dead simple: I just installed ray and vllm via pip3 and changed my Docker entrypoint to:

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 4 --dtype=half --port 8000

Specifically note that I pass --tensor-parallel-size 4 to ensure we use all 4 GPUs.

Using this got significantly better results.

(Screenshot: vLLM response showing latency and token usage)

In this example you can see the query took 2,044ms to complete. This was also when I realized my previous method of calculating token usage was incorrect: vLLM returns the token counts it used, which was perfect.
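For reference, here is roughly how that server can be queried. The endpoint path and the usage field follow the OpenAI-compatible API that vLLM exposes; the host, port, and sampling parameters below are assumptions based on the entrypoint above, not the exact client I used:

import requests

# vLLM's OpenAI-compatible chat completions endpoint (port 8000 from the entrypoint above)
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
            {"role": "user", "content": "Who are you?"},
        ],
        "max_tokens": 256,
        "temperature": 0.8,
    },
)
body = resp.json()
print(body["choices"][0]["message"]["content"])
# vLLM reports token usage itself, so no client-side tokenizer is needed
print(body["usage"])  # prompt_tokens, completion_tokens, total_tokens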

Going back to cost, now we can calculate tokens per second:

Total Tokens / Total Time => Token/s
124/2.044 => 60.6 Token/s

Again assuming I produce tokens 24/7, this means that I can process a total of:

Tokens Per Second * 60 (to get to minutes) * 60 (to get to hours) * 24 (to get to days) * 30 (to get to month)
60.6 * 60 * 60 * 24 * 30 = 157,075,200

Which means every 1 million tokens would cost me:

Total Cost / Total Tokens processed => Cost per token * 1,000,000 => Cost Per 1M token
( 2,816.64 / 157,075,200 ) * 1,000,000 = $17.93
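Plugging the vLLM throughput into the helper sketched earlier gives the same figure:

print(cost_per_million_tokens(3.912, 60.6))  # ≈ $17.93 per 1M tokens, still ~18x ChatGPT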

Unfortunately this is still well above what ChatGPT offers: roughly $17 more per 1M tokens than ChatGPT's ~$1 😭

An Unconventional Approach

Instead of using AWS, another approach is to self host the hardware as well. Even after factoring in energy, this dramatically lowers the price.

Assuming we want to mirror our setup in AWS, we'd need 4x NVIDIA Tesla T4s. You can buy them for about $700 each on eBay.

(Screenshot: eBay listing for an NVIDIA Tesla T4)

Add in $1,000 to set up the rest of the rig and you have a final price of around:

$2,800 + $1,000 = $3,800

If we calculate the energy cost for this, we get roughly $50 a month, estimating power consumption at 70W per GPU plus 20W of overhead.

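As a rough reconstruction of that math (the electricity rate here is my assumption of about $0.23/kWh; your local rate will move the number):

4 GPUs * 70W + 20W overhead = 300W
300W * 24 hours * 30 days = 216 kWh per month
216 kWh * ~$0.23/kWh ≈ $50 per month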

On top of the ~$3,800 upfront cost, you have a running cost of ~$50 a month; let's round up to ~$100 to cover any misc items we might have missed.

Re-calculating our cost per 1M tokens now:

Total Cost / Total Tokens processed => Cost per token * 1,000,000 => Cost Per 1M token
( 100 / 157,075,200 ) * 1,000,000 ≈ $0.64

Which finally comes in below ChatGPT's ~$1 per 1M tokens.

To determine when you'd break even: assuming you want to produce those 157,075,200 tokens per month with ChatGPT instead, you're looking at a monthly bill of roughly:

157 * Cost per 1M token ~= 157 * $1 = $157

You have a running cost of ~$100 a month, which results in a 'profit' of about $57 per month. To make up the initial hardware cost of ~$3,800, you'd need about 66 months, or ~5.5 years, to benefit from this approach.
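As a final sanity check, a tiny sketch of the break-even math, using the post's own numbers (~$3,800 upfront, ~$100/month to run, ~$157/month of equivalent ChatGPT spend):

upfront_hardware = 3_800           # 4x Tesla T4 plus the rest of the rig
monthly_running = 100              # power + misc, rounded up
monthly_chatgpt_equivalent = 157   # ~157M tokens/month at ~$1 per 1M

monthly_savings = monthly_chatgpt_equivalent - monthly_running
break_even_months = upfront_hardware / monthly_savings
print(f"{break_even_months:.1f} months (~{break_even_months / 12:.1f} years)")
# ≈ 66.7 months, i.e. roughly 5.5 years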

Although this approach does come with downsides, such as having to manage and scale your own hardware, it does seem possible, at least in theory, to undercut ChatGPT's prices. In practice, however, you'd have to evaluate how often you actually utilize your LLM: these hypotheticals all assume 100% utilization, which is not realistic and would have to be tailored per use case.

👨‍💻

📣 If you are building with AI/LLMs, please check out our project lytix. It's an observability + evaluation platform for everything that happens in your LLM.

📖 If you liked this, sign up for our newsletter below. You can check out our other posts here.

© lytix