Exercise #6: Decoding Parameters



  1. Review the models' support for the various decoding hyperparameters.
  2. Learn to use the decoding hyperparameters.

Part-1 Review inference/decoding parameters supported by models

Google Gemini


| Parameter | Description | Type | Default |
|---|---|---|---|
| candidate_count | Number of generated responses to return. | Integer | Model-dependent |
| stop_sequences | The set of character sequences (up to 5) that will stop output generation. | List of Strings | None |
| max_output_tokens | The maximum number of tokens to include in a candidate. | Integer | Model-dependent |
| temperature | Controls the randomness of the output. | Float (0.0 - 1.0) | Model-dependent |
| top_p | Maximum cumulative probability of tokens to consider when sampling. | Float (0.0 - 1.0) | Model-dependent |
| top_k | Maximum number of tokens to consider when sampling. | Integer | 40 |
| response_mime_type | Output response mimetype. | String (text/plain, application/json) | text/plain |
| response_schema | Specifies the format of the JSON response. | JSON Schema | None |

  1. candidate_count is model-specific. It instructs the model to generate the specified number of responses per query.

  2. presence_penalty and frequency_penalty are NOT supported.

  3. The models support JSON output via the decoding hyperparameter response_schema, used together with response_mime_type set to application/json.
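
To make the sampling parameters in the table above concrete, here is a minimal sketch of how temperature, top_k and top_p could shape a next-token distribution. It is not Gemini's implementation: the function name `next_token_distribution`, the toy logits, and the order in which the filters are applied are all assumptions made for illustration.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration; real decoders may order or
    combine these steps differently.
    """
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature rescales the logits before the softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # top_k keeps only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p keeps the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}

toy_logits = {"cat": 2.0, "dog": 1.0, "fish": 0.5, "rock": -1.0}
low = next_token_distribution(toy_logits, temperature=0.2)
high = next_token_distribution(toy_logits, temperature=1.0)
filtered = next_token_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(low["cat"] > high["cat"], sorted(filtered))
```

Lower temperatures concentrate probability on the top token, while top_k and top_p remove the unlikely tail before sampling.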

AI21 Jurassic-2 Models


| Parameter | Description | Minimum | Maximum | Default |
|---|---|---|---|---|
| temperature | Controls randomness of response. | 0 | 1 | 0.5 |
| topP | Controls probability threshold for token selection. | 0 | 1 | 0.5 |
| maxTokens | Maximum number of tokens in the generated response. | 0 | 8,191 (Mid, Ultra, Large); 2,048 (Other) | |
| stopSequences | Sequences to terminate generation. | N/A | N/A | N/A |
| presencePenalty.scale | Penalizes tokens that appear in the prompt or completion. | 0 | 5 | 0 |
| countPenalty.scale | Penalizes tokens based on frequency of appearance. | 0 | 1 | 0 |
| frequencyPenalty.scale | Penalizes tokens based on normalized frequency. | 0 | 500 | 0 |
| countPenalty.applyToWhitespaces, countPenalty.applyToPunctuations, countPenalty.applyToNumbers, countPenalty.applyToStopwords, countPenalty.applyToEmojis | Applies penalty to special characters. | N/A | N/A | True |
| countPenalty.applyToWhitespaces | Applies penalty to whitespaces and new lines. | N/A | N/A | True |
| countPenalty.applyToPunctuations | Applies penalty to punctuation. | N/A | N/A | True |
| countPenalty.applyToNumbers | Applies penalty to numbers. | N/A | N/A | True |
| countPenalty.applyToStopwords | Applies penalty to stop words. | N/A | N/A | True |
| countPenalty.applyToEmojis | Excludes emojis from penalty. | N/A | N/A | True |
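
The table does not give the formulas behind the presence, count and frequency penalties, so the following is only a sketch of one commonly described additive formulation: a flat presence penalty for any token already seen, plus a frequency penalty that grows with the number of occurrences. The function `apply_penalties`, its parameter names, and the toy values are hypothetical and are not AI21's implementation.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    """Lower the logits of tokens that were already generated.

    Assumed formulation (illustrative only):
        logit -= frequency_penalty * count + presence_penalty * (count > 0)
    """
    counts = Counter(generated_tokens)
    adjusted = {}
    for tok, value in logits.items():
        count = counts.get(tok, 0)
        penalty = frequency_penalty * count
        if count > 0:
            penalty += presence_penalty
        adjusted[tok] = value - penalty
    return adjusted

logits = {"the": 3.0, "model": 2.5, "tokens": 2.0}
history = ["the", "model", "the"]
adjusted = apply_penalties(logits, history, presence_penalty=0.5, frequency_penalty=0.25)
print(adjusted)
```

Tokens that have not appeared yet keep their original logit, so raising either penalty nudges the model toward new words.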

Mistral parameters


| Parameter | Description | Type | Default | Minimum | Maximum |
|---|---|---|---|---|---|
| max_tokens | Maximum number of tokens in the generated response. | Integer | Model-dependent | 1 | Model-dependent |
| stop | List of stop sequences to terminate generation. | List of Strings | 0 | 0 | 10 |
| temperature | Controls randomness of predictions. | Float (0.0 - 1.0) | Model-dependent | 0 | 1 |
| top_p | Controls diversity of generated text. | Float (0.0 - 1.0) | Model-dependent | 0 | 1 |
| top_k | Controls the number of most-likely candidates considered. | Integer | Model-dependent | 1 | 200 |

Cohere command


| Parameter | Description | Type | Default | Minimum | Maximum |
|---|---|---|---|---|---|
| temperature | Controls randomness of predictions. | Float | 0.9 | 0 | 5 |
| p | Top P; controls probability threshold for token selection. | Float | 0.75 | 0 | 1 |
| k | Top K; controls number of token choices for next token. | Integer | 0 | 0 | 500 |
| max_tokens | Maximum number of tokens in the generated response. | Integer | 20 | 1 | 4096 |
| stop_sequences | List of sequences to terminate generation. | List of Strings | N/A | N/A | N/A |
| return_likelihoods | Specifies if and how token likelihoods are returned. | String (GENERATION, ALL, NONE) | NONE | N/A | N/A |
| stream | Indicates if the response should be streamed. | Boolean | False | N/A | N/A |
| num_generations | Maximum number of generations to return. | Integer | 1 | 1 | 5 |
| logit_bias | Prevents or encourages certain tokens. | Dictionary (token_id: bias) | N/A | -10 | 10 |
| truncate | Specifies how to handle inputs longer than the maximum token length. | String (NONE, START, END) | END | N/A | N/A |
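
As an illustration of the logit_bias row, the sketch below adds a per-token bias to the logits before sampling and enforces the -10 to 10 range from the table. The helper `apply_logit_bias` and the token IDs are hypothetical; this is an assumed reading of the parameter, not Cohere's implementation.

```python
def apply_logit_bias(logits, logit_bias, minimum=-10.0, maximum=10.0):
    """Add a bias to selected token IDs; a large negative bias suppresses a token.

    Hypothetical helper: the additive behaviour is an assumption.
    """
    for token_id, bias in logit_bias.items():
        if not minimum <= bias <= maximum:
            raise ValueError(f"bias for token {token_id} must be in [{minimum}, {maximum}]")
    return {
        token_id: value + logit_bias.get(token_id, 0.0)
        for token_id, value in logits.items()
    }

toy_logits = {11: 1.0, 42: 1.0}          # made-up token IDs
biased = apply_logit_bias(toy_logits, {42: -10})
print(biased)
```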



Anthropic Claude

| Parameter | Description | Type | Required | Default | Minimum | Maximum |
|---|---|---|---|---|---|---|
| max_tokens_to_sample | Maximum number of tokens to generate (recommended limit: 4,000). | Integer | Yes | 200 | 0 | 4096 (Model-dependent) |
| stop_sequences | Optional sequences to stop generation (default includes “\n\nHuman:”). | List of Strings | No | N/A | N/A | N/A |
| temperature | Controls randomness of response (0: less random, 1: more creative). | Float | No | 1 | 0 | 1 |
| top_p | Probability threshold for nucleus sampling (use only one of temperature or top_p). | Float | No | 1 | 0 | 1 |
| top_k | Sample only from the top K most likely tokens. | Integer | No | 250 | 0 | 500 |

HuggingFace endpoint Parameters

These parameters apply to multiple HuggingFace models exposed as endpoints on the HuggingFace platform.


| Parameter | Description | Type | Default |
|---|---|---|---|
| min_length | Minimum length of the output summary in tokens. | Integer | None |
| max_length | Maximum length of the output summary in tokens. | Integer | None |
| top_k | Top tokens considered within the sample operation. | Integer | None |
| top_p | Cumulative probability threshold for token selection. | Float | None |
| temperature | Controls randomness of predictions. | Float (0.0-100.0) | 1.0 |
| repetition_penalty | Penalizes tokens based on frequency of appearance. | Float (0.0-100.0) | None |
| max_time | Maximum time for query execution in seconds. | Float (0-120.0) | None |
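
The repetition_penalty row can be pictured with one widely used formulation: for every token that has already been generated, divide a positive logit by the penalty and multiply a negative one by it. The sketch below assumes that formulation; the hosted endpoints may behave differently, and `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, generated_tokens, repetition_penalty=1.0):
    """Make already-generated tokens less likely (assumed formulation).

    A penalty of 1.0 leaves the logits unchanged; values above 1.0
    discourage repetition.
    """
    seen = set(generated_tokens)
    adjusted = {}
    for tok, value in logits.items():
        if tok in seen:
            # Both branches push the logit downward for penalty > 1.
            value = value / repetition_penalty if value > 0 else value * repetition_penalty
        adjusted[tok] = value
    return adjusted

toy_logits = {"a": 2.0, "b": -1.0, "c": 1.0}
penalized = apply_repetition_penalty(toy_logits, ["a", "b"], repetition_penalty=2.0)
print(penalized)
```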

Part-2 Learn to use the inference parameter

You will use Google Gemini for testing, but feel free to use any other model of your choice. The intent is to observe how model behavior changes as the parameters are adjusted.

Familiarize yourself with the Gemini inference API

Understanding the API is optional. You may use any model to check how its behavior changes as the hyperparameters change.

Review the API request body for Google models. You will find that the endpoint supports multiple configuration parameters. Our interest is in the decoder parameters, which are specified by an object referred to as GenerationConfig, shown below for your convenience.

Gemini API

Setting the parameters

# DO NOT COPY to notebook
# This is for reference only
# Notice that presence_penalty & frequency_penalty are NOT supported
    candidate_count: (int | None) = None,
    stop_sequences: (Iterable[str] | None) = None,
    max_output_tokens: (int | None) = None,
    temperature: (float | None) = None,
    top_p: (float | None) = None,
    top_k: (int | None) = None,
    response_mime_type: (str | None) = None,
    response_schema: (protos.Schema | Mapping[str, Any] | None) = None

Create a new notebook

Use the path in your template folder for creating the notebook.



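1. Set up the Google API key

import getpass
import google.generativeai as genai
from google.generativeai import GenerationConfig, GenerativeModel

# Prompt for the key without echoing it, then register it with the SDK
google_api_key = getpass.getpass()
genai.configure(api_key=google_api_key)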

2. Create the LLM client with default parameters

Use the code below to set up the LLM client with default values for the parameters.

model = "gemini-1.5-flash"

# Defaults

# Create the model with default parameter set values
llm = GenerativeModel(model)

3. Run a query with default parameters

query = "Explain LLM briefly"

# No generation_config is passed, so the model defaults apply
response = llm.generate_content([query])

# Extract the content from the response
response_text = response.candidates[0].content.parts[0].text
print(response_text)

4. Test out with different parameter values

1. max_output_tokens, stop_sequences

Change the values of various parameters to see a change in the responses.

# Set the parameters to test
generation_config = GenerationConfig(
    max_output_tokens=100,   # example value; lower it to see the response cut short
    stop_sequences=None,     # try e.g. ["**", "\n\n"]
)

# Generate a response
response = llm.generate_content([query], generation_config=generation_config)

# Extract the content from the response
response_text = response.candidates[0].content.parts[0].text

# Print the results
print(response_text)
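
To build intuition for stop_sequences without calling the API, the following sketch cuts a string at the earliest stop sequence. It assumes the stop sequence itself is left out of the returned text; `truncate_at_stop` and the sample string are hypothetical, and the real truncation happens on the server during generation.

```python
def truncate_at_stop(text, stop_sequences=None):
    """Return text up to (not including) the earliest stop sequence.

    Local illustration only; with stop_sequences=None the text is unchanged.
    """
    if not stop_sequences:
        return text
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

sample = "An LLM is a language model.\n\n**Key points** follow here."
print(repr(truncate_at_stop(sample, ["**", "\n\n"])))
```

A response that ends earlier than expected is often explained by a stop sequence that also appears in ordinary output, such as a blank line.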

2. temperature, top_p, top_k

Change the values of various parameters to see a change in the responses.

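query = "describe an LLM with 5 sentences"

# Start from the defaults, then set temperature, top_p and top_k one at a time,
# e.g. GenerationConfig(temperature=0.2, top_p=0.9, top_k=20)
generation_config = GenerationConfig()

llm = GenerativeModel(model, generation_config=generation_config)

# Print the response for the current settings
response = llm.generate_content([query])
response_text = response.candidates[0].content.parts[0].text
print(response_text)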
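
Changing one value at a time by hand gets tedious, so one option is to generate the combinations up front. The sketch below builds a small grid of temperature, top_p and top_k settings as plain dictionaries; `build_sweep` and the chosen values are hypothetical, and the block makes no API calls.

```python
from itertools import product

def build_sweep(temperatures, top_ps, top_ks):
    """Return one settings dictionary per combination of the given values."""
    return [
        {"temperature": t, "top_p": p, "top_k": k}
        for t, p, k in product(temperatures, top_ps, top_ks)
    ]

sweep = build_sweep([0.0, 0.5, 1.0], [0.5, 0.95], [1, 40])
for settings in sweep[:3]:
    print(settings)
```

In the notebook, each dictionary could be passed as GenerationConfig(**settings) to llm.generate_content([query], generation_config=...), printing the response next to the settings that produced it.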


