NAME

OpenAPI::Client::OpenAI::Path::realtime-calls-call_id-accept - Documentation for the /realtime/calls/{call_id}/accept path.

OPERATIONS

POST /realtime/calls/{call_id}/accept

accept-realtime-call

$client->accept_realtime_call({
    call_id => $call_id,
    body    => { ... },
});

Accept an incoming SIP call and configure the realtime session that will handle it.

Path/query parameters

  • call_id (in path, required, string) - The identifier for the call provided in the realtime.call.incoming webhook.

Responses

200 - Call accepted successfully.
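A minimal sketch of assembling the arguments for this operation. It assumes the generated client accepts the path parameter under its own name (call_id) in the same hash reference as body; the helper name build_accept_params and the call id value are hypothetical, and the model name is simply one mentioned in the schema descriptions in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the argument hash for accept_realtime_call.
# Assumes the path parameter is passed as call_id alongside the body.
sub build_accept_params {
    my ($call_id, %session) = @_;
    die "call_id is required\n" unless defined $call_id && length $call_id;

    return {
        call_id => $call_id,
        body    => {
            %session,
            type => 'realtime',    # required; the only allowed value
        },
    };
}

my $params = build_accept_params(
    'call_123',                    # placeholder; use the id from the webhook
    model        => 'gpt-realtime-2',
    instructions => 'Greet the caller and ask how you can help.',
);

# With a configured client, the call would then look like:
#   my $tx = $client->accept_realtime_call($params);
```

Placing type last in the body means a caller cannot accidentally override the one allowed value.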

SCHEMAS

AudioTranscription

Properties:

  • delay (string) - Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.

    Allowed values: minimal, low, medium, high, xhigh

  • language (string) - The language of the input audio. Supplying the input language in ISO 639-1 format (e.g. en) will improve accuracy and latency.

  • model (anyOf) - The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.

  • prompt (string) - Optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
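The constraints in the property descriptions above can be checked before a request is sent. The following is a sketch only: build_transcription is a hypothetical helper, not part of the client, and it enforces just the rules stated in this section (the allowed delay values, delay being limited to gpt-realtime-whisper, and prompt being unsupported with that model).

```perl
use strict;
use warnings;

my %ALLOWED_DELAY = map { $_ => 1 } qw(minimal low medium high xhigh);

# Hypothetical helper: returns an AudioTranscription hash reference, or
# dies on combinations that the property descriptions rule out.
sub build_transcription {
    my (%args) = @_;
    my $model = $args{model} // '';

    if (defined $args{delay}) {
        die "unknown delay '$args{delay}'\n"
            unless $ALLOWED_DELAY{ $args{delay} };
        die "delay is only supported with gpt-realtime-whisper\n"
            unless $model eq 'gpt-realtime-whisper';
    }
    die "prompt is not supported with gpt-realtime-whisper\n"
        if defined $args{prompt} && $model eq 'gpt-realtime-whisper';

    # Keep only the documented properties that were actually supplied.
    my %out = map { $_ => $args{$_} }
              grep { defined $args{$_} } qw(delay language model prompt);
    return \%out;
}

my $whisper = build_transcription(
    model    => 'gpt-realtime-whisper',
    delay    => 'low',
    language => 'en',
);
my $gpt4o = build_transcription(
    model  => 'gpt-4o-transcribe',
    prompt => 'expect words related to technology',
);
```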

NoiseReductionType

Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.

Prompt

Reference to a prompt template and its variables.

RealtimeAudioFormats

The PCM audio format. Only a 24kHz sample rate is supported.

RealtimeReasoning

Properties:

RealtimeReasoningEffort

Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.

RealtimeSessionCreateRequestGA

Properties:

  • audio (object) - Configuration for input and output audio.

  • include (array of string) - Additional fields to include in server outputs.

    item.input_audio_transcription.logprobs : Include logprobs for input audio transcription.

  • instructions (string) - The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The model is not guaranteed to follow the instructions, but they provide guidance on the desired behavior.

    Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.

  • max_output_tokens (oneOf) - Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.

  • model (anyOf) - The Realtime model used for this session.

  • output_modalities (array of string) - The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.

    Default: ["audio"]

  • parallel_tool_calls (boolean) - Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.

  • prompt (Prompt)

    See "Prompt" under SCHEMAS for shape.

  • reasoning (RealtimeReasoning)

    See "RealtimeReasoning" under SCHEMAS for shape.

  • tool_choice (oneOf) - How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.

    Default: auto

  • tools (array of object) - Tools available to the model.

  • tracing (oneOf) - The Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.

    auto will create a trace for the session with default values for the workflow name, group id, and metadata.

    Default: null

  • truncation (RealtimeTruncation)

    See "RealtimeTruncation" below for shape.

  • type (string, required) - The type of session to create. Always realtime for the Realtime API.

    Allowed values: realtime
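Several of the properties above carry constraints that are easy to check client-side. The sketch below is a hypothetical validator (session_problems is not part of the client) covering only three rules stated in this section: the required type, the max_output_tokens range, and the single-modality restriction on output_modalities.

```perl
use strict;
use warnings;

# Hypothetical validator for a RealtimeSessionCreateRequestGA body.
# Returns a list of problems; an empty list means none were found.
sub session_problems {
    my ($body) = @_;
    my @problems;

    push @problems, 'type must be "realtime"'
        unless ($body->{type} // '') eq 'realtime';

    if (defined(my $max = $body->{max_output_tokens})) {
        my $ok = $max eq 'inf'
            || ($max =~ /\A[0-9]+\z/ && $max >= 1 && $max <= 4096);
        push @problems, 'max_output_tokens must be 1..4096 or "inf"'
            unless $ok;
    }

    if (my $mods = $body->{output_modalities}) {
        # Exactly one modality: audio and text cannot be combined.
        my $ok = @$mods == 1 && $mods->[0] =~ /\A(?:audio|text)\z/;
        push @problems, 'output_modalities must be ["audio"] or ["text"]'
            unless $ok;
    }

    return @problems;
}

my @ok = session_problems({
    type              => 'realtime',
    max_output_tokens => 1024,
    output_modalities => ['text'],
});
my @bad = session_problems({
    type              => 'realtime',
    max_output_tokens => 9000,
    output_modalities => ['audio', 'text'],
});
```

Here @ok is empty, while @bad reports both the out-of-range token limit and the combined modalities.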

RealtimeTruncation

When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.

Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.

Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.

Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
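To make the retention idea concrete, here is an illustrative simulation. It is not the server's algorithm: truncate_context, the message shape, and the token counts are all assumptions made for the example. It drops the oldest messages once the total exceeds the limit, and keeps dropping until the total fits within a fraction of that limit.

```perl
use strict;
use warnings;

# Illustrative only: once the total exceeds $limit, drop the oldest
# messages until the total fits under $limit * $retain.
# Each message is a hash reference: { id => ..., tokens => ... }.
sub truncate_context {
    my ($messages, $limit, $retain) = @_;
    $retain //= 1.0;

    my $total = 0;
    $total += $_->{tokens} for @$messages;
    return @$messages if $total <= $limit;    # under the limit: keep all

    my @kept   = @$messages;
    my $target = $limit * $retain;
    while (@kept && $total > $target) {
        $total -= shift(@kept)->{tokens};     # oldest message goes first
    }
    return @kept;
}

# Five messages of 300 tokens each, 1500 tokens in total.
my @history = map { +{ id => $_, tokens => 300 } } 1 .. 5;

my @full    = truncate_context(\@history, 1000);         # keeps ids 3, 4, 5
my @partial = truncate_context(\@history, 1000, 0.5);    # keeps id 5 only
```

With a retention fraction of 0.5, more is dropped in one pass, which leaves headroom so that the following turns do not each trigger another truncation.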

RealtimeTurnDetection

Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.

Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.

Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
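The rules above can be sketched as a small selection helper. This is an assumption-laden example: turn_detection_for is hypothetical, and the type strings server_vad and semantic_vad are assumed identifiers for the two modes, which this document does not spell out. In Perl, undef stands in for null, since common JSON encoders serialize it that way.

```perl
use strict;
use warnings;

# Hypothetical helper: picks a turn detection setting for a session.
# The type strings server_vad and semantic_vad are assumptions here.
# Returning undef corresponds to JSON null, which turns detection off.
sub turn_detection_for {
    my ($model, $mode) = @_;

    # VAD is not supported for gpt-realtime-whisper transcription sessions.
    return undef if $model eq 'gpt-realtime-whisper';
    return undef if !defined $mode || $mode eq 'off';

    die "unknown mode '$mode'\n"
        unless $mode =~ /\A(?:server_vad|semantic_vad)\z/;
    return { type => $mode };
}

my $semantic = turn_detection_for('gpt-realtime-2', 'semantic_vad');
my $whisper  = turn_detection_for('gpt-realtime-whisper', 'server_vad');
```

When the result is undef, the client is responsible for triggering model responses itself.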

VoiceIdsOrCustomVoice

A built-in voice name or a custom voice reference.

SEE ALSO

OpenAPI::Client::OpenAI::Path

COPYRIGHT AND LICENSE

Copyright (C) 2023-2026 by Nelson Ferraz

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.0 or, at your option, any later version of Perl 5 you may have available.