Controlling randomness in LLMs: Temperature and Seed

Tags: llm, openai, python
Author: Dylan Castillo
Published: June 25, 2025
Modified: June 25, 2025

Temperature and seed are commonly used parameters when interacting with Large Language Models (LLMs). They’re also a source of confusion for many people. In this post, I’ll show you what they are and how they work.

Temperature controls the randomness of the output by scaling the logits of the tokens before the softmax function is applied. Seed controls the randomness of how the model selects tokens during text generation: it sets the initial state of the random number generator that is used to sample tokens during the generation process.

Temperature is available for most providers, while seed is only available for OpenAI and open-weight models (that I know of).

Let’s get started.

How LLMs generate text

To understand how seed and temperature work, we first need to understand how LLMs generate text. Provided with a prompt, a model uses what’s called a decoding strategy to generate the next token.

There are many strategies, but for this post, we’ll focus on just two: greedy search and sampling.

In greedy search, the model picks the token with the highest probability at each step. In sampling, the model picks a token based on the probability distribution of the tokens in the vocabulary. In both cases, the model will calculate the probability of each token in the vocabulary¹, and use that to pick the next token. Let’s see an example.

Take the following prompt:

What’s the favorite dish of Chuck Norris?

These might be the top 5 most likely next tokens:

| Rank | Token        | Probability |
|------|--------------|-------------|
| 1    | ‘Dynamite’   | 0.5823      |
| 2    | ‘Venom’      | 0.2891      |
| 3    | ‘Himself’    | 0.0788      |
| 4    | ‘Radiation’  | 0.0354      |
| 5    | ‘You’        | 0.0144      |

If the model uses greedy search, it will pick the token with the highest probability, which is ‘Dynamite’.

If it uses sampling, it will make a random selection based on those probabilities. So, the model has a 58% chance of picking ‘Dynamite’, a 29% chance of picking ‘Venom’, an 8% chance of picking ‘Himself’, a 4% chance of picking ‘Radiation’, and a 1% chance of picking ‘You’.
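
To make the difference concrete, here’s a minimal sketch of the two strategies, using the probabilities from the table above as a stand-in for a real model’s output:

import numpy as np

tokens = ['Dynamite', 'Venom', 'Himself', 'Radiation', 'You']
probs = np.array([0.5823, 0.2891, 0.0788, 0.0354, 0.0144])

# Greedy search: always pick the token with the highest probability
print("Greedy:", tokens[int(np.argmax(probs))])

# Sampling: draw a token according to the probability distribution
rng = np.random.default_rng()
print("Sampled:", rng.choice(tokens, p=probs / probs.sum()))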

Let’s see how this works in practice, and how seed and temperature have an effect on the output.

Logits are the raw scores that the model assigns to each token. To go from logits to probabilities, you must apply the softmax function:

\[\text{P}(w_i) = \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}\]

Where:

  • \(P(w_i)\) is the probability of token \(w_i\)
  • \(z_i\) is the logit for token \(w_i\)
  • \(n\) is the total number of possible tokens

The softmax function turns these raw scores into the probability assigned to each token.
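
As a quick check, applying the softmax to a set of example logits (the same values used in the simulation later in this post) reproduces the probabilities from the Chuck Norris table above:

import numpy as np

logits = np.array([2.5, 1.8, 0.5, -0.3, -1.2])

# Exponentiate and normalize
probs = np.exp(logits) / np.sum(np.exp(logits))
print(np.round(probs, 4))  # [0.5823 0.2891 0.0788 0.0354 0.0144]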

Temperature

Temperature is a parameter that typically ranges from 0 to 1 (or up to 2, depending on the provider), and it’s used to influence the randomness of the output. It does so by dividing the logits of the tokens by the temperature value.

You cannot know for sure how proprietary providers (OpenAI, Anthropic, etc.) implement temperature, but you can get a good idea of how it works by looking at TemperatureLogitsWarper in the transformers library.

Its magic boils down to this:

scores_processed = scores / self.temperature

Temperature (\(T\)) simply scales the scores (logits) of the tokens by the temperature value. This, in turn, will change the probabilities of the tokens:

\[P(w_i) = \frac{e^{z_i / T}}{\sum_{j=1}^{n} e^{z_j / T}}\]

The example below shows how temperature affects the probabilities of the tokens:

import numpy as np

tokens = ['Dynamite', 'Venom', 'Himself', 'Radiation', 'You']
logits = np.array([2.5, 1.8, 0.5, -0.3, -1.2])

temperatures = [0.1, 0.5, 1.0, 1.5, 1.999999999]

for temperature in temperatures:
    probs = np.exp(logits / temperature) / np.sum(np.exp(logits / temperature))
    print(f"\nTemperature: {temperature:.2f}")
    print("What's the favorite dish of Chuck Norris?")
    print("Rank | Token      | Probability")
    print("-----|------------|------------")
    for i, (token, prob) in enumerate(zip(tokens, probs), 1):
        print(f"{i:4d} | '{token:10s}' | {prob:.4f}")
    print(f"Sum of probabilities: {np.sum(probs):.4f}")

This code simulates the impact of different temperature values on the next token probability. Given some initial logits and assuming this is the full vocabulary, we can calculate the probabilities of the tokens for a given temperature.

For a temperature of 0.1, you get the following probabilities:

| Rank | Token        | Probability |
|------|--------------|-------------|
| 1    | ‘Dynamite’   | 0.9991      |
| 2    | ‘Venom’      | 0.0009      |
| 3    | ‘Himself’    | 0.0000      |
| 4    | ‘Radiation’  | 0.0000      |
| 5    | ‘You’        | 0.0000      |

For a temperature of 2, you get the following probabilities:

| Rank | Token        | Probability |
|------|--------------|-------------|
| 1    | ‘Dynamite’   | 0.4038      |
| 2    | ‘Venom’      | 0.2846      |
| 3    | ‘Himself’    | 0.1486      |
| 4    | ‘Radiation’  | 0.0996      |
| 5    | ‘You’        | 0.0635      |

You can see that for lower temperature values, the model becomes more deterministic. For temperature 0.1, the probability of picking ‘Dynamite’ is >99.9%, while for temperature 2, it’s only 40%.

In essence, temperature impacts the randomness of the output by changing the probabilities of selecting the next token. This should give you a good idea of how temperature works. But let’s try it with a real LLM instead of a simulation.

First, let’s import the required libraries and load the model.

import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "unsloth/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

For the sake of this example, we’ll use unsloth/Qwen3-1.7B, but what you see here applies to most LLMs. We’ll define generate_text as our text generation function.

def generate_text(prompt, temperature, seed=None, print_top_k=False):
    if seed is not None:
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(seed)

    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    if temperature > 0:
        model_params = {
            "do_sample": True,
            "temperature": temperature if temperature < 2 else 1.9999999,
        }
    else:
        model_params = {
            "do_sample": False,
        }
    outputs = model.generate(
        **model_inputs,
        **model_params,
        max_new_tokens=1,
        output_scores=True,
        return_dict_in_generate=True,
        pad_token_id=tokenizer.eos_token_id
    )

    output_token_id = outputs.sequences[0][-1].tolist()
    selected_token = tokenizer.decode([output_token_id])

    if not print_top_k:
        return selected_token
    
    probs = F.softmax(outputs.scores[0][0], dim=-1)
    top_k_probs, top_k_indices = torch.topk(probs, 10)

    print("Top-10 most likely tokens:")
    for i, (prob, idx) in enumerate(zip(top_k_probs, top_k_indices)):
        token_text = tokenizer.decode([idx.item()])
        is_selected = "← SELECTED" if idx.item() == output_token_id else ""
        print(f"  {i+1}. '{token_text}' (prob: {prob.item():.4f}, logit: {outputs.scores[0][0][idx.item()].item():.4f}) {is_selected}")

    return selected_token

At a high level, this function takes a prompt, a temperature value, and an optional seed, and returns the next token, optionally printing the 10 most likely tokens with their probabilities and logits. The implementation looks a bit complicated, so let’s break it down.

  1. First, it takes a prompt, a temperature value, and optionally a seed. If a seed is provided, it sets the random number generator to that value. Then, it applies the chat template to the prompt to create the input the model expects.

  2. Next, it decides whether to sample from the model or not, based on the temperature value. If temperature is 0, the model uses a greedy search strategy (do_sample=False).

  3. Finally, it returns the completion token and optionally prints the 10 most likely tokens with their probabilities and logits.

Similar to what you saw in the previous example, you can try low and high temperature values.

This is what you get for a temperature of 0.1:

token = generate_text("Tell me a joke about dogs", temperature=0.1, print_top_k=True)
Top-10 most likely tokens:
  1. 'Why' (prob: 1.0000, logit: 330.0000) ← SELECTED
  2. '!' (prob: 0.0000, logit: -inf) 
  3. '"' (prob: 0.0000, logit: -inf) 
  4. '#' (prob: 0.0000, logit: -inf) 
  5. '$' (prob: 0.0000, logit: -inf) 
  6. '%' (prob: 0.0000, logit: -inf) 
  7. '&' (prob: 0.0000, logit: -inf) 
  8. ''' (prob: 0.0000, logit: -inf) 
  9. '(' (prob: 0.0000, logit: -inf) 
  10. ')' (prob: 0.0000, logit: -inf) 
token = generate_text("Tell me a joke about dogs", temperature=1.99, print_top_k=True)
Top-10 most likely tokens:
  1. 'Why' (prob: 0.5742, logit: 16.5829) ← SELECTED
  2. 'Sure' (prob: 0.3939, logit: 16.2060) 
  3. 'Here' (prob: 0.0319, logit: 13.6935) 
  4. '!' (prob: 0.0000, logit: -inf) 
  5. '"' (prob: 0.0000, logit: -inf) 
  6. '#' (prob: 0.0000, logit: -inf) 
  7. '$' (prob: 0.0000, logit: -inf) 
  8. '%' (prob: 0.0000, logit: -inf) 
  9. '&' (prob: 0.0000, logit: -inf) 
  10. ''' (prob: 0.0000, logit: -inf) 

You should see similar results. For the “Tell me a joke about dogs” prompt, when using a temperature of 0.1, the model had a ~100% probability of picking ‘Why’, while with a temperature of 1.99, it was only 57%.

Note that when temperature is 0, the model uses a greedy search strategy, which is the same as always picking the most likely token. No sampling is done, so the results are deterministic.
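
As a quick sanity check (a sketch reusing the generate_text function defined above), you can run the greedy case a few times and verify that the output never changes, even without a seed:

# With temperature=0, generate_text sets do_sample=False (greedy search),
# so repeated runs should always return the same token.
tokens = [generate_text("Tell me a joke about dogs", temperature=0) for _ in range(10)]
assert len(set(tokens)) == 1
print(set(tokens))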

Seed

The seed parameter controls the randomness of how a model selects tokens. It sets the initial state for the random number generator used in the token sampling process.

Let’s revisit the example from the previous section to see this in action. By setting the seed to a fixed value, you ensure the generation process is deterministic. This means you will get an identical result on every run, provided all other parameters (like temperature) remain the same in those runs.
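
Under the hood, generate_text does this with torch.manual_seed, which fixes the state of PyTorch’s random number generator before sampling. Here’s a minimal sketch of that mechanism, independent of any model, using a made-up two-token distribution (roughly the ‘Why’/‘Sure’ split you’ll see below):

import torch

probs = torch.tensor([0.68, 0.32])  # assumed next-token distribution, for illustration

# Fixing the RNG state makes the sampled indices identical across runs
torch.manual_seed(42)
first_run = [torch.multinomial(probs, num_samples=1).item() for _ in range(5)]

torch.manual_seed(42)
second_run = [torch.multinomial(probs, num_samples=1).item() for _ in range(5)]

assert first_run == second_run
print(first_run)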

We can start by setting our seed to 42 and temperature to 1 to verify which token is generated.

generate_text("Tell me a joke about dogs", temperature=1, seed=42, print_top_k=True)
Top-10 most likely tokens:
  1. 'Why' (prob: 0.6792, logit: 33.0000) 
  2. 'Sure' (prob: 0.3208, logit: 32.2500) ← SELECTED
  3. '!' (prob: 0.0000, logit: -inf) 
  4. '"' (prob: 0.0000, logit: -inf) 
  5. '#' (prob: 0.0000, logit: -inf) 
  6. '$' (prob: 0.0000, logit: -inf) 
  7. '%' (prob: 0.0000, logit: -inf) 
  8. '&' (prob: 0.0000, logit: -inf) 
  9. ''' (prob: 0.0000, logit: -inf) 
  10. '(' (prob: 0.0000, logit: -inf) 
'Sure'

In this case, the model selected “Sure” as the next token, even though its probability is lower than that of ‘Why’. Now, we can verify that this stays the same over multiple runs.

tokens = []
for i in range(100):
    token = generate_text("Tell me a joke about dogs", temperature=1, seed=42)
    tokens.append(token)
assert len(set(tokens)) == 1
print(set(tokens))
{'Sure'}

This code runs the text generation process 100 times and verifies that “Sure” was picked in all runs. Next, we should verify that this consistency is lost when we omit the seed parameter.

tokens = []
for i in range(100):
    token = generate_text("Tell me a joke about dogs", temperature=1)
    tokens.append(token)
assert len(set(tokens)) > 1
print(set(tokens))
{'Sure', 'Why'}

In this case, you see that after 100 generations, the model picked two different tokens: ‘Sure’ and ‘Why’. This is expected, since no seed was set.

You can also test this with a proprietary model. Let’s try it with gpt-4.1-nano from OpenAI.

import numpy as np
import openai

from dotenv import load_dotenv

load_dotenv()

client = openai.OpenAI()

def generate_text_openai(prompt, temperature=1, seed=None, top_p=None, print_top_k=False):
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        seed=seed,
        top_p=top_p,
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    selected_token = response.choices[0].message.content
    if print_top_k:
        logprobs = response.choices[0].logprobs.content[0].top_logprobs
        print("Top 10 most likely tokens:")
        for idx, token_info in enumerate(logprobs):
            token = token_info.token
            logprob = token_info.logprob
            prob = np.round(np.exp(logprob)*100,2)
            token_text = f"{idx+1}. '{token}': {prob:.4f} ({logprob:.4f})"
            is_selected = "← SELECTED" if token_info.token == selected_token else ""
            print(f"{token_text} {is_selected}")
    return selected_token

Similar to the previous function, you provide a prompt, a temperature value, and optionally a seed, and it returns the completion token, optionally printing the 10 most likely tokens.

In this case, instead of providing you with the logits, OpenAI provides you with logprobs, which are the logarithmic (natural log) probabilities of the tokens:

\[\text{logprob}(w_i) = \ln P(w_i) = \ln\left(\frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}\right)\]
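
Since a logprob is just the natural logarithm of a probability, exponentiating it recovers the probability, which is what the np.exp call in generate_text_openai does. For example, taking the logprob reported for ‘Why’ in the output below as an illustrative input:

import numpy as np

logprob = -0.5232  # value taken from the API output shown below
prob = np.exp(logprob)
print(f"{prob:.4f}")  # ~0.5926, i.e. ~59.26%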

First, let’s check the completion token we get for a temperature of 1 and a seed of 42.

token = generate_text_openai("Tell me a joke about dogs", temperature=1, seed=42, print_top_k=True)
Top 10 most likely tokens:
1. 'Why': 59.2600 (-0.5232) ← SELECTED
2. 'Sure': 40.7300 (-0.8982) 
3. ' Why': 0.0000 (-10.6482) 
4. ' sure': 0.0000 (-11.0232) 
5. ' why': 0.0000 (-11.2732) 
6. '为什么': 0.0000 (-11.6482) 
7. ' Sure': 0.0000 (-11.8982) 
8. 'Pourquoi': 0.0000 (-12.2732) 
9. 'why': 0.0000 (-12.3982) 
10. 'sure': 0.0000 (-12.6482) 

In this case, we get ‘Why’ as the completion token. You can see that the top 10 most likely tokens are not the same as the ones we got with Qwen3-1.7B. This is expected, as the model is different.

Then, we can try to generate 100 tokens with a temperature of 1 and a seed of 42.

tokens = []
for i in range(100):
    token = generate_text_openai("Tell me a joke about dogs", temperature=1, seed=42)
    tokens.append(token)
assert len(set(tokens)) == 1
print(set(tokens))
{'Why'}

Similar to the previous example, we run 100 generations with the same seed and temperature and check if the completion token is the same.

This should generally work, but OpenAI doesn’t guarantee that the same seed will always produce the same output. Your request might be handled by a model instance with a different configuration, in which case you’ll get different results.

You can also verify that not using a seed will result in different tokens.

tokens = []
for i in range(100):
    token = generate_text_openai("Tell me a joke about dogs", temperature=1)
    tokens.append(token)
assert len(set(tokens)) > 1
print(set(tokens))
{'Sure', 'Why'}

Now, you can see that the output is not the same in all runs. Some runs picked “Why”, and others picked “Sure”.

In essence, seed influences the output by setting the initial state of the random number generator, which is then used for the sampling of the tokens during the generation process.

top-k and top-p

In addition to temperature, there are two other parameters that are commonly used to control the randomness of the output of a language model: top-k and top-p.

top-k

Top-k sampling is a technique that limits the set of tokens the model can sample from. It keeps only the k tokens with the highest probabilities, typically by setting the logits of every other token to -inf, so that after the softmax only the top-k tokens have a non-zero probability.
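
Here’s a minimal sketch of that filtering step on the example logits from earlier, assuming k=2 (the value is arbitrary, just for illustration):

import numpy as np

tokens = ['Dynamite', 'Venom', 'Himself', 'Radiation', 'You']
logits = np.array([2.5, 1.8, 0.5, -0.3, -1.2])
k = 2

# Keep the k highest logits, mask the rest with -inf
top_k_indices = np.argsort(logits)[-k:]
masked_logits = np.full_like(logits, -np.inf)
masked_logits[top_k_indices] = logits[top_k_indices]

# After the softmax, only the top-k tokens have non-zero probability
probs = np.exp(masked_logits) / np.sum(np.exp(masked_logits))
for token, prob in zip(tokens, probs):
    print(f"{token}: {prob:.4f}")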

This parameter isn’t available for OpenAI models. They provide a top_logprobs parameter, but it’s not the same as top-k sampling. It’s a parameter that returns the top N most likely tokens with their logprobs, but it doesn’t change the sampling process.

top-p

Top-p sampling (also called nucleus sampling) is a technique that limits the set of tokens the model can sample from. It keeps only the most probable tokens whose cumulative probability reaches the threshold p. In other words, it selects the fewest tokens needed to cover at least a fraction p of the total probability mass.
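
Here’s a sketch of that selection step on the example distribution from earlier, assuming p=0.9 (an arbitrary cutoff, just for illustration):

import numpy as np

tokens = ['Dynamite', 'Venom', 'Himself', 'Radiation', 'You']
probs = np.array([0.5823, 0.2891, 0.0788, 0.0354, 0.0144])
p = 0.9

# Sort by probability (descending) and keep the smallest set whose cumulative mass reaches p
order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])
cutoff = int(np.searchsorted(cumulative, p)) + 1
kept = order[:cutoff]

# Renormalize over the kept tokens; sampling would then draw only from these
nucleus_probs = probs[kept] / probs[kept].sum()
for idx, prob in zip(kept, nucleus_probs):
    print(f"{tokens[idx]}: {prob:.4f}")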

This is available for most providers. As an example, you can call the OpenAI function defined earlier with top_p=0.50: since ‘Why’ alone already accounts for more than 50% of the probability mass, it should be the only token in the nucleus and is always selected.

generate_text_openai("Tell me a joke about dogs", top_p=0.50, print_top_k=True)
Top 10 most likely tokens:
1. 'Why': 59.2600 (-0.5232) ← SELECTED
2. 'Sure': 40.7300 (-0.8982) 
3. ' Why': 0.0000 (-10.6482) 
4. ' sure': 0.0000 (-11.0232) 
5. ' why': 0.0000 (-11.2732) 
6. '为什么': 0.0000 (-11.6482) 
7. ' Sure': 0.0000 (-11.8982) 
8. 'Pourquoi': 0.0000 (-12.2732) 
9. 'why': 0.0000 (-12.3982) 
10. 'sure': 0.0000 (-12.6482) 
'Why'

Seed and temperature in practice

Now that you understand how seed and temperature work, here are some things to keep in mind when using them:

  1. seed is only available for OpenAI and open-weight models.
  2. To get the most deterministic output for a given prompt, set temperature to 0. This minimizes randomness.
  3. If you want creative results that are still reproducible, set temperature to a value greater than 0 and use a fixed seed. This allows for varied outputs that you can generate again.
  4. If you don’t need reproducible results and want unique outputs on every run, you can omit the seed parameter entirely.
  5. Be aware that even if you set a temperature of 0 and a seed, outputs are not guaranteed to be identical. Providers might change model configurations in ways that affect the output. For OpenAI models, you can monitor such changes by keeping track of the system_fingerprint provided in the responses, as shown in the sketch below.
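
Here’s a minimal sketch of that last point, reusing the OpenAI client defined earlier (the prompt and settings are just examples):

fingerprints = set()
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": "Tell me a joke about dogs"}],
        temperature=0,
        seed=42,
        max_tokens=1,
    )
    fingerprints.add(response.system_fingerprint)

# More than one fingerprint means your requests were served by different backend
# configurations, so identical outputs are not guaranteed even with a fixed seed.
print(fingerprints)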

Conclusion

In this post, we explored how the temperature and seed parameters control the output of Large Language Models.

You learned that temperature adjusts the level of randomness: low values (near 0) produce more predictable, deterministic outputs, while higher values (closer to the maximum, 1 or 2 depending on the provider) encourage more creative and varied results. In contrast, the seed makes the generation process reproducible. While the specific seed value isn’t important, fixing it ensures you get the same output for a given prompt and set of parameters.

Finally, remember that while temperature is a near-universal setting, seed is only available (at the time of writing) for OpenAI and open-weight models.

I hope you found this post useful. If you have any questions, let me know in the comments below.

Footnotes

  1. Modern LLMs often have a vocabulary of 100k+ tokens↩︎

Citation

BibTeX citation:
@online{castillo2025,
  author = {Castillo, Dylan},
  title = {Controlling Randomness in {LLMs:} {Temperature} and {Seed}},
  date = {2025-06-25},
  url = {https://dylancastillo.co/posts/seed-temperature-llms.html},
  langid = {en}
}
For attribution, please cite this work as:
Castillo, Dylan. 2025. “Controlling Randomness in LLMs: Temperature and Seed.” June 25, 2025. https://dylancastillo.co/posts/seed-temperature-llms.html.