Audio Processing Models

We provide several popular audio processing models for scenarios such as TTS (Text-to-Speech) and ASR (Automatic Speech Recognition), also known as STT (Speech-to-Text).

Before you start

You need three values before you start: SERVICE_ID, API_KEY, and MODEL. You can find them on our dashboard.
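As a minimal sketch, the three values can be kept out of source code and read from environment variables. The URL pattern below is copied from the usage examples on this page; the helper names `build_base_url` and `load_settings`, and the choice of environment variable names, are assumptions made for illustration.

```python
import os

# Host taken from the usage examples on this page.
API_HOST = "https://modelapi.holmesai.xyz"


def build_base_url(service_id: str) -> str:
    """Return the base URL for a given SERVICE_ID (hypothetical helper)."""
    if not service_id:
        raise ValueError("SERVICE_ID must not be empty")
    return f"{API_HOST}/{service_id}/v1"


def load_settings(env=os.environ) -> dict:
    """Collect SERVICE_ID, API_KEY, and MODEL from a mapping such as os.environ.

    The variable names are an assumption; use whatever your deployment prefers.
    """
    return {
        "base_url": build_base_url(env.get("SERVICE_ID", "")),
        "api_key": env.get("API_KEY", ""),
        "model": env.get("MODEL", ""),
    }
```

The resulting `base_url` and `api_key` can then be passed to `openai.OpenAI(...)` as in the usage section.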

API reference

TTS

Supported models

fish-speech-1.4

/v1/audio/speech

Request Parameters

Parameter name	Type	Description	Required
model	String	Model to use. Supported TTS model: fish-speech-1.4	Yes
input	String	The text to generate audio for. Maximum length is 4096.	Yes
response_format	String	Output audio format: mp3, wav, or pcm. Defaults to wav.	No
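The constraints in the table above can be checked on the client before a request is sent. The following is a sketch only: `build_speech_payload` is a hypothetical helper, and it assumes the 4096 limit counts characters, which this page does not state.

```python
MAX_INPUT_LENGTH = 4096  # assumed to be a character count
AUDIO_FORMATS = {"mp3", "wav", "pcm"}


def build_speech_payload(model: str, text: str, response_format: str = "wav") -> dict:
    """Validate the inputs and return the JSON body for /v1/audio/speech."""
    if not text:
        raise ValueError("input must not be empty")
    if len(text) > MAX_INPUT_LENGTH:
        raise ValueError(f"input is longer than {MAX_INPUT_LENGTH} characters")
    if response_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    return {"model": model, "input": text, "response_format": response_format}
```

Failing early in this way avoids a round trip for requests the service would presumably reject.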

Response Elements

The response body contains the audio file content.

ASR

Supported models

fish-speech-1.4
Whisper-large-v3
Whisper-large-v3-turbo

/v1/audio/transcriptions

Request Parameters

Parameter name	Type	Description	Required
model	String	model type: fish-speech-1.4, Whisper-large-v3, Whisper-large-v3-turbo	Yes
file	File	The audio file to transcribe. Must be in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm	Yes
language	String	The language of the audio file, as an ISO-639-1 code (for example, en).	Yes
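A small pre-flight check for the `file` and `language` parameters might look like the sketch below. Both function names are hypothetical, the extension test looks only at the file name rather than the file contents, and the language test checks only the two-letter shape of an ISO-639-1 code, not whether the code is actually assigned.

```python
import re
from pathlib import Path

SUPPORTED_EXTENSIONS = {
    "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm",
}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension is in the supported list."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


def looks_like_iso_639_1(language: str) -> bool:
    """Return True if the value has the shape of an ISO-639-1 code (two lowercase letters)."""
    return re.fullmatch(r"[a-z]{2}", language) is not None
```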

Response Elements

Parameter name	Type	Description
text	String	The transcribed text
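If you call the endpoint without an SDK, the transcript has to be pulled out of the response body yourself. The sketch below assumes the body is a JSON object carrying the `text` field listed above; `extract_transcript` is a hypothetical helper, not part of any library.

```python
import json


def extract_transcript(body: str) -> str:
    """Parse a response body and return its `text` field."""
    data = json.loads(body)
    if not isinstance(data, dict) or not isinstance(data.get("text"), str):
        raise ValueError("response does not contain a string 'text' field")
    return data["text"]
```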

Usage

python

/v1/audio/speech

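from pathlib import Path

import openai

# Replace $SERVICE_ID, $API_KEY, and $MODEL with the values from your dashboard.
client = openai.OpenAI(
    base_url="https://modelapi.holmesai.xyz/$SERVICE_ID/v1",
    api_key="$API_KEY",
)

output_file_path = Path(__file__).parent / "output.wav"
response = client.audio.speech.create(
    model="$MODEL",
    input="The quick brown fox jumped over the lazy dog.",
    # The openai SDK requires a `voice` argument. The value here is a
    # placeholder; whether the service uses it is not documented on this page.
    voice="alloy",
    response_format="wav",
)
# Write the returned audio bytes to disk.
response.write_to_file(output_file_path)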

/v1/audio/transcriptions

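import openai

# Replace $SERVICE_ID, $API_KEY, and $MODEL with the values from your dashboard.
client = openai.OpenAI(
    base_url="https://modelapi.holmesai.xyz/$SERVICE_ID/v1",
    api_key="$API_KEY",
)

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("input.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="$MODEL",
        file=audio_file,
        language="en",  # ISO-639-1 code of the spoken language
    )
print(transcript.text)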

curl

/v1/audio/speech

curl -v https://modelapi.holmesai.xyz/$SERVICE_ID/v1/audio/speech \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  --output output.wav \
  -d "{
    \"model\": \"$MODEL\",
    \"input\": \"The quick brown fox jumped over the lazy dog.\"
  }"

/v1/audio/transcriptions

curl -v https://modelapi.holmesai.xyz/$SERVICE_ID/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@input.wav" \
  -F "metadata={\"model\":\"$MODEL\"}"
