NAME
OpenAPI::Client::OpenAI::Path::realtime-transcription_sessions - Documentation for the /realtime/transcription_sessions path.
OPERATIONS
POST /realtime/transcription_sessions
create-realtime-transcription-session
    $client->create_realtime_transcription_session({
        body => { ... },
    });
Creates an ephemeral API token for client-side applications that use the Realtime API for realtime transcription. The session can be configured with the same parameters as the transcription_session.update client event.
The response is a session object plus a client_secret key, which contains an ephemeral API token that can be used to authenticate browser clients for the Realtime API.
Returns the created Realtime transcription session object, plus an ephemeral key.
Responses
200 - Session created successfully.
Content-Type: application/json
Example:
    {
       "client_secret" : null,
       "expires_at" : 1742188264,
       "id" : "sess_BBwZc7cFV3XizEyKGDCGL",
       "input_audio_format" : "pcm16",
       "input_audio_transcription" : {
          "language" : null,
          "model" : "gpt-4o-transcribe",
          "prompt" : ""
       },
       "modalities" : [
          "audio",
          "text"
       ],
       "object" : "realtime.transcription_session",
       "turn_detection" : {
          "prefix_padding_ms" : 300,
          "silence_duration_ms" : 200,
          "threshold" : 0.5,
          "type" : "server_vad"
       }
    }
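As a minimal sketch of how a caller might inspect such a response, the snippet below decodes a trimmed copy of the example with the core JSON::PP module and reports whether an ephemeral key is present and how long the session has left. The helper name summarize_session and the $now argument are hypothetical and not part of this distribution; only the field names come from the example above.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: summarize a session response given as JSON text.
# $now is an epoch timestamp, passed in so the result is deterministic.
sub summarize_session {
    my ($json_text, $now) = @_;
    my $session = decode_json($json_text);

    return {
        id           => $session->{id},
        model        => $session->{input_audio_transcription}{model},
        # JSON null decodes to undef, so this is 0 for the documented example.
        has_secret   => defined $session->{client_secret} ? 1 : 0,
        seconds_left => $session->{expires_at} - $now,
    };
}

# Trimmed copy of the documented example response.
my $example = <<'JSON';
{
  "client_secret": null,
  "expires_at": 1742188264,
  "id": "sess_BBwZc7cFV3XizEyKGDCGL",
  "input_audio_transcription": {
    "language": null,
    "model": "gpt-4o-transcribe",
    "prompt": ""
  },
  "object": "realtime.transcription_session"
}
JSON

# Pretend "now" is 60 seconds before the documented expiry.
my $summary = summarize_session($example, 1742188204);
printf "%s uses %s, secret present: %s, expires in %d s\n",
    @{$summary}{qw(id model has_secret seconds_left)};
```

Passing the current time explicitly, rather than calling time() inside the helper, keeps the expiry check easy to test.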
SCHEMAS
AudioTranscription
Properties:
delay(string) - Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
Allowed values: minimal, low, medium, high, xhigh
language(string) - The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
model(anyOf) - The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
prompt(string) - An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
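The constraints described for delay and prompt can be checked on the client before a request is sent. The following is only a sketch: check_audio_transcription is a hypothetical helper, not part of this distribution, and it applies the gpt-realtime-whisper rules unconditionally even though the schema scopes them to GA Realtime sessions.

```perl
use strict;
use warnings;

# Allowed values for "delay", as listed in the AudioTranscription schema.
my %ALLOWED_DELAY = map { $_ => 1 } qw(minimal low medium high xhigh);

# Hypothetical helper: returns a list of human-readable problems
# (an empty list means no problem was found).
sub check_audio_transcription {
    my ($cfg) = @_;
    my @problems;
    my $model = $cfg->{model} // '';

    if (defined $cfg->{delay}) {
        push @problems, "delay '$cfg->{delay}' is not an allowed value"
            unless $ALLOWED_DELAY{ $cfg->{delay} };
        push @problems, "delay is only supported with gpt-realtime-whisper"
            unless $model eq 'gpt-realtime-whisper';
    }

    # Assumption: an empty prompt counts as "no prompt".
    if (defined $cfg->{prompt} && length $cfg->{prompt}
        && $model eq 'gpt-realtime-whisper') {
        push @problems, "prompt is not supported with gpt-realtime-whisper";
    }

    return @problems;
}

my @ok = check_audio_transcription({
    model    => 'gpt-4o-transcribe',
    language => 'en',
    prompt   => 'expect words related to technology',
});
my @bad = check_audio_transcription({
    model  => 'gpt-realtime-whisper',
    delay  => 'slow',
    prompt => 'expect words related to technology',
});

print scalar(@ok), " problem(s) in the first configuration\n";
print "problem: $_\n" for @bad;
```

The API remains the authority on what is accepted; a local check like this only catches obvious mistakes early.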
AudioTranscriptionResponse
Properties:
language(string) - The language of the input audio.
model(anyOf) - The model used for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper.
prompt(string) - The prompt configured for input audio transcription, when present.
NoiseReductionType
Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
RealtimeTranscriptionSessionCreateRequest
Properties:
include(array of string) - The set of items to include in the transcription. Currently available items: item.input_audio_transcription.logprobs
input_audio_format(string) - The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw. For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.
Allowed values: pcm16, g711_ulaw, g711_alaw
Default: pcm16
input_audio_noise_reduction(object) - Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
Default: null
input_audio_transcription(AudioTranscription) - Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
See "AudioTranscription" above for shape.
turn_detection(object) - Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
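A request body using these properties might be assembled as in the sketch below, which relies only on the core JSON::PP module. The turn_detection fields are copied from the example response earlier in this document, and the { type => 'near_field' } shape for input_audio_noise_reduction is an assumption based on the NoiseReductionType description rather than something this schema spells out.

```perl
use strict;
use warnings;
use JSON::PP ();

# Illustrative request body; every value is an example, not a recommendation.
my %body = (
    input_audio_format        => 'pcm16',
    input_audio_transcription => {
        model    => 'gpt-4o-transcribe',
        language => 'en',
    },
    # Assumed shape: an object carrying a NoiseReductionType in "type".
    input_audio_noise_reduction => { type => 'near_field' },
    # Field names taken from the example response in this document.
    turn_detection => {
        type                => 'server_vad',
        threshold           => 0.5,
        prefix_padding_ms   => 300,
        silence_duration_ms => 200,
    },
    include => ['item.input_audio_transcription.logprobs'],
);

# canonical() sorts keys, so the printed JSON is stable between runs.
my $json = JSON::PP->new->canonical->pretty->encode(\%body);
print $json;

# With a configured client object (not constructed here), the same hash
# would be passed as the body of the operation documented above:
#   $client->create_realtime_transcription_session({ body => \%body });
```

Printing the encoded body before sending it is a convenient way to confirm that numeric values such as threshold are serialized as numbers rather than strings.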
RealtimeTranscriptionSessionCreateResponse
Properties:
client_secret(object, required) - Ephemeral key returned by the API. Only present when the session is created on the server via the REST API.
input_audio_format(string) - The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
input_audio_transcription(AudioTranscriptionResponse) - Configuration of the transcription model.
See "AudioTranscriptionResponse" above for shape.
modalities(unknown) - The set of modalities the model can respond with. To disable audio, set this to ["text"].
turn_detection(object) - Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
SEE ALSO
COPYRIGHT AND LICENSE
Copyright (C) 2023-2026 by Nelson Ferraz
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.0 or, at your option, any later version of Perl 5 you may have available.