NAME

OpenAPI::Client::OpenAI::Path::realtime-sessions - Documentation for the /realtime/sessions path.

OPERATIONS

POST /realtime/sessions

create-realtime-session

$client->create_realtime_session({
    body => { ... },
});

Create an ephemeral API token for use in client-side applications with the Realtime API. The session can be configured with the same parameters as the session.update client event.

The response is a session object plus a client_secret key, which contains an ephemeral API token that can be used to authenticate browser clients for the Realtime API.

Returns the created Realtime session object, plus an ephemeral key.
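The call above can be sketched end to end as follows. This is a minimal sketch, not the module's canonical usage: build_session_body is a hypothetical helper, the field values are illustrative, and the code assumes the client returns a Mojolicious transaction (as OpenAPI::Client does) whose JSON body is available through $tx->res->json. The network call is skipped unless OPENAI_API_KEY is set and the client module is installed.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper: builds a request body from fields listed under
# RealtimeSessionCreateRequest. The default values are illustrative only.
sub build_session_body {
    my (%opt) = @_;
    return {
        instructions => $opt{instructions} // 'You are a friendly assistant.',
        voice        => $opt{voice}        // 'alloy',
        modalities   => $opt{modalities}   // ['text'],
    };
}

my $body = build_session_body(voice => 'verse');
print JSON::PP->new->canonical->pretty->encode($body);

# Only attempt the request when a key is configured and the client loads.
if ($ENV{OPENAI_API_KEY} && eval { require OpenAPI::Client::OpenAI; 1 }) {
    my $client = OpenAPI::Client::OpenAI->new;
    my $tx     = $client->create_realtime_session({ body => $body });

    # Assumption: $tx is a Mojo transaction and the body decodes to a hash.
    my $session = $tx->res->json;
    print "session id: $session->{id}\n" if ref $session eq 'HASH';
}
```

Keeping the body construction in its own function makes it possible to inspect or validate the payload before any request is sent.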

Responses

200 - Session created successfully.

Content-Type: application/json

Example:

{
   "audio" : {
      "input" : {
         "format" : {
            "rate" : 24000,
            "type" : "audio/pcm"
         },
         "noise_reduction" : null,
         "transcription" : {
            "model" : "whisper-1"
         },
         "turn_detection" : null
      },
      "output" : {
         "format" : {
            "rate" : 24000,
            "type" : "audio/pcm"
         },
         "speed" : 1,
         "voice" : "alloy"
      }
   },
   "expires_at" : 1742188264,
   "id" : "sess_001",
   "instructions" : "You are a friendly assistant.",
   "max_output_tokens" : "inf",
   "model" : "gpt-realtime",
   "object" : "realtime.session",
   "output_modalities" : [
      "audio"
   ],
   "prompt" : null,
   "tool_choice" : "none",
   "tools" : [],
   "tracing" : "auto",
   "truncation" : "auto"
}
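As a rough illustration of consuming this response, the sketch below decodes a trimmed copy of the example with the core JSON::PP module and pulls out the fields a client is likely to need. summarize_session and its output keys are hypothetical names chosen for this example. The example response does not show a client_secret key, so the sketch only checks whether one is present.

```perl
use strict;
use warnings;
use JSON::PP ();

# Trimmed copy of the example response shown above.
my $json = <<'JSON';
{
   "audio" : { "output" : { "speed" : 1, "voice" : "alloy" } },
   "expires_at" : 1742188264,
   "id" : "sess_001",
   "max_output_tokens" : "inf",
   "object" : "realtime.session",
   "output_modalities" : [ "audio" ]
}
JSON

# Hypothetical helper: summarize a decoded session relative to a given time.
sub summarize_session {
    my ($session, $now) = @_;
    return {
        id           => $session->{id},
        voice        => $session->{audio}{output}{voice},
        seconds_left => $session->{expires_at} - $now,
        unlimited    => (($session->{max_output_tokens} // '') eq 'inf') ? 1 : 0,
        has_secret   => exists $session->{client_secret} ? 1 : 0,
    };
}

my $session = JSON::PP->new->decode($json);

# A fixed "now" keeps the example deterministic; real code would use time().
my $summary = summarize_session($session, 1742188204);
printf "%s expires in %d s\n", $summary->{id}, $summary->{seconds_left};
```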

SCHEMAS

AudioTranscriptionResponse

Properties:

  • language (string) - The language of the input audio.

  • model (anyOf) - The model used for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper.

  • prompt (string) - The prompt configured for input audio transcription, when present.

NoiseReductionType

Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.

Prompt

Reference to a prompt template and its variables.

RealtimeAudioFormats

The PCM audio format. Only a 24kHz sample rate is supported.

RealtimeFunctionTool

Properties:

  • description (string) - The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).

  • name (string) - The name of the function.

  • parameters (object) - Parameters of the function in JSON Schema.

  • type (string) - The type of the tool, i.e. function.

    Allowed values: function
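For illustration, a tool definition with these four properties might be assembled as below. This is a sketch under assumptions: make_function_tool is a hypothetical helper, and the get_weather tool, its description, and its parameter schema are invented for the example.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper: returns a hash shaped like RealtimeFunctionTool.
sub make_function_tool {
    my ($name, $description, $parameters) = @_;
    return {
        type        => 'function',    # the only allowed value
        name        => $name,
        description => $description,
        parameters  => $parameters,   # a JSON Schema object
    };
}

# Illustrative tool; none of these values come from the API reference.
my $tool = make_function_tool(
    'get_weather',
    'Look up the current weather. Call it when the user asks about weather.',
    {
        type       => 'object',
        properties => { city => { type => 'string' } },
        required   => ['city'],
    },
);

print JSON::PP->new->canonical->pretty->encode({ tools => [$tool] });
```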

RealtimeSessionCreateRequest

Properties:

  • client_secret (object, required) - Ephemeral key returned by the API.

  • input_audio_format (string) - The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.

  • input_audio_transcription (object) - Configuration for input audio transcription. Defaults to off, and can be set to null to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously and should be treated as rough guidance rather than the representation understood by the model.

  • instructions (string) - The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance on the desired behavior. Note that the server sets default instructions, which are used if this field is not set and are visible in the session.created event at the start of the session.

  • max_response_output_tokens (oneOf) - Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.

  • modalities (unknown) - The set of modalities the model can respond with. To disable audio, set this to ["text"].

  • output_audio_format (string) - The format of output audio. Options are pcm16, g711_ulaw, or g711_alaw.

  • prompt (Prompt)

    See "Prompt" above for shape.

  • speed (number) - The speed of the model's spoken response. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.

    Default: 1

  • temperature (number) - Sampling temperature for the model, limited to [0.6, 1.2]. Defaults to 0.8.

  • tool_choice (string) - How the model chooses tools. Options are auto, none, required, or a specific function.

  • tools (array of object) - Tools (functions) available to the model.

  • tracing (oneOf) - Configuration options for tracing. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.

    auto will create a trace for the session with default values for the workflow name, group id, and metadata.

  • truncation (RealtimeTruncation)

    See "RealtimeTruncation" below for shape.

  • turn_detection (object) - Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.

  • voice (VoiceIdsOrCustomVoice) - The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once.

    See "VoiceIdsOrCustomVoice" below for shape.
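Several of the properties above carry documented ranges or enumerations, which a client may want to check before sending a request. The following is a minimal sketch of such a check; validate_session_request is a hypothetical helper, not part of this module, and it covers only the limits stated in this section.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in a request body.
# An empty list means none of the documented limits were violated.
sub validate_session_request {
    my ($body) = @_;
    my @errors;

    my $t = $body->{temperature};
    push @errors, 'temperature must be within [0.6, 1.2]'
        if defined $t && ($t < 0.6 || $t > 1.2);

    my $s = $body->{speed};
    push @errors, 'speed must be within [0.25, 1.5]'
        if defined $s && ($s < 0.25 || $s > 1.5);

    my $m = $body->{max_response_output_tokens};
    if (defined $m) {
        my $ok = $m eq 'inf' || ($m =~ /\A[0-9]+\z/ && $m >= 1 && $m <= 4096);
        push @errors, 'max_response_output_tokens must be 1..4096 or "inf"'
            unless $ok;
    }

    for my $key (qw(input_audio_format output_audio_format)) {
        my $f = $body->{$key};
        next unless defined $f;
        push @errors, "$key must be pcm16, g711_ulaw, or g711_alaw"
            unless $f =~ /\A(?:pcm16|g711_ulaw|g711_alaw)\z/;
    }

    return @errors;
}

my @errors = validate_session_request({
    temperature                => 1.5,
    speed                      => 1,
    max_response_output_tokens => 'inf',
});
print "$_\n" for @errors;
```

The server remains the authority on what it accepts; a local check like this only catches obvious mistakes earlier.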

RealtimeSessionCreateResponse

Properties:

  • audio (object) - Configuration for input and output audio for the session.

  • expires_at (integer) - Expiration timestamp for the session, in seconds since epoch.

  • id (string) - Unique identifier for the session that looks like sess_1234567890abcdef.

  • include (array of string) - Additional fields to include in server outputs.

    item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.

  • instructions (string) - The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance on the desired behavior.

    Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.

  • max_output_tokens (oneOf) - Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.

  • model (string) - The Realtime model used for this session.

  • object (string) - The object type. Always realtime.session.

  • output_modalities (unknown) - The set of modalities the model can respond with. To disable audio, set this to ["text"].

  • tool_choice (string) - How the model chooses tools. Options are auto, none, required, or a specific function.

  • tools (array of RealtimeFunctionTool) - Tools (functions) available to the model.

  • tracing (oneOf) - Configuration options for tracing. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.

    auto will create a trace for the session with default values for the workflow name, group id, and metadata.

  • turn_detection (object) - Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.

RealtimeTruncation

When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.

Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.

Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.

Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
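The oldest-first behavior described above can be illustrated with a small client-side model. This is only a sketch of the idea, not the server's algorithm: truncate_conversation, the message shape ({ id, tokens }), and the $retain fraction are all hypothetical names introduced here. When the conversation is over the limit, messages are dropped from the front until the total fits within $limit * $retain.

```perl
use strict;
use warnings;

# Hypothetical model of oldest-first truncation.
#   $messages: array ref of { id => ..., tokens => N }, oldest first
#   $limit:    maximum input tokens allowed in the context
#   $retain:   fraction of $limit to keep once truncation triggers (default 1)
sub truncate_conversation {
    my ($messages, $limit, $retain) = @_;
    $retain //= 1.0;

    my @kept  = @$messages;
    my $total = 0;
    $total += $_->{tokens} for @kept;

    # Under the limit: nothing is dropped.
    return \@kept if $total <= $limit;

    # Over the limit: drop oldest messages until the target is met.
    my $target = $limit * $retain;
    while (@kept && $total > $target) {
        my $dropped = shift @kept;
        $total -= $dropped->{tokens};
    }
    return \@kept;
}

my @conversation = (
    { id => 'm1', tokens => 400 },
    { id => 'm2', tokens => 300 },
    { id => 'm3', tokens => 200 },
    { id => 'm4', tokens => 100 },
);

my $plain    = truncate_conversation(\@conversation, 800);
my $retained = truncate_conversation(\@conversation, 800, 0.5);
print join(',', map { $_->{id} } @$plain),    "\n";
print join(',', map { $_->{id} } @$retained), "\n";
```

With a retain fraction below 1, more messages are dropped at once, so the next few turns are less likely to trigger another truncation. That is the trade-off the text describes between context size and cache rate.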

VoiceIdsOrCustomVoice

A built-in voice name or a custom voice reference.

VoiceIdsShared

See https://platform.openai.com/docs/api-reference for details.

SEE ALSO

OpenAPI::Client::OpenAI::Path

COPYRIGHT AND LICENSE

Copyright (C) 2023-2026 by Nelson Ferraz

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.0 or, at your option, any later version of Perl 5 you may have available.