# Speech AI - Pronunciation, STT & TTS MCP server

Pronunciation scoring, speech-to-text, and text-to-speech for language learning.

## Links
- Registry page: https://www.getdrio.com/mcp/io-github-fasuizu-br-speech-ai
- Repository: https://github.com/fasuizu-br/speech-ai-examples
- Website: https://brainiall.com

## Install
- Endpoint: https://apim-ai-apis.azure-api.net/mcp/pronunciation/mcp
- Auth: Not captured

## Setup notes
- Remote endpoint: https://apim-ai-apis.azure-api.net/mcp/pronunciation/mcp
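
As a rough illustration of how a client might open a session against the remote endpoint, the sketch below builds (but does not send) an MCP `initialize` request. It assumes the server follows the standard MCP Streamable HTTP conventions (JSON-RPC 2.0 over POST, with an `Accept` header covering JSON and SSE). The protocol version string and the `clientInfo` values are illustrative assumptions, not values taken from this listing.

```python
import json
import urllib.request

ENDPOINT = "https://apim-ai-apis.azure-api.net/mcp/pronunciation/mcp"


def build_initialize_request(endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) an MCP `initialize` POST request."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            # Assumed protocol revision; use whichever one your client supports.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": "example-client", "version": "0.1.0"},
        },
    }
    headers = {
        "Content-Type": "application/json",
        # Streamable HTTP servers may reply with plain JSON or an SSE stream.
        "Accept": "application/json, text/event-stream",
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

The request could then be sent with `urllib.request.urlopen`. Since the listing does not record an auth scheme, a real call may need additional headers.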

## Tools
- assess_pronunciation (Assess Pronunciation) - Assess English pronunciation quality from audio.

Scores pronunciation at four levels: overall, sentence, word, and phoneme.
Each score is 0-100. Phonemes are returned in both IPA and ARPAbet notation.
Inference latency is under 300 ms.

Args:
    audio_base64: Base64-encoded audio data. Supports WAV, MP3, OGG, and WebM formats.
    text: The reference English text that the speaker was expected to read aloud.
    audio_format: Audio format hint — one of 'wav', 'mp3', 'ogg', 'webm'. Defaults to 'wav'.

Returns:
    dict with keys:
        - overallScore (int 0-100): Overall pronunciation quality
        - sentenceScore (int 0-100): Sentence-level fluency and accuracy
        - words (list): Per-word scores, each containing:
            - word (str): The word
            - score (int 0-100): Word pronunciation score
            - phonemes (list): Per-phoneme scores with IPA/ARPAbet notation
        - decodedTranscript (str): What the model heard (ASR transcript)
        - transcript (str): Reference text
        - confidence (float 0-1): Scoring confidence
        - warnings (list[str]): Quality warnings if any
        - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)
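
A minimal sketch of how the arguments above might be packaged and how the result might be read. It assumes the tool is invoked through a standard MCP `tools/call` JSON-RPC message; the helper names `build_assess_request` and `weakest_words`, and the default threshold of 60, are hypothetical and not part of the server.

```python
import base64


def build_assess_request(audio_bytes: bytes, text: str,
                         audio_format: str = "wav", request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` payload for assess_pronunciation."""
    allowed = {"wav", "mp3", "ogg", "webm"}
    if audio_format not in allowed:
        raise ValueError(f"audio_format must be one of {sorted(allowed)}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "assess_pronunciation",
            "arguments": {
                # The tool expects the raw audio as a base64 string.
                "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
                "text": text,
                "audio_format": audio_format,
            },
        },
    }


def weakest_words(result: dict, threshold: int = 60) -> list:
    """Return (word, score) pairs scoring below `threshold`, lowest first."""
    low = [(w["word"], w["score"])
           for w in result.get("words", [])
           if w["score"] < threshold]
    return sorted(low, key=lambda pair: pair[1])
```

Sorting the lowest-scoring words first gives a learner an obvious place to start practicing; the threshold is a judgment call and would likely need tuning.
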
- check_pronunciation_service (Check Pronunciation Service) - Check if the pronunciation assessment service is healthy and ready.

Returns:
    dict with keys:
        - status (str): 'healthy' or error state
        - modelLoaded (bool): Whether the scoring model is loaded
        - version (str): API version
- get_phoneme_inventory (Get Phoneme Inventory) - Get the full phoneme inventory supported by the pronunciation scorer.

Returns a list of all English phonemes the engine can assess, including
ARPAbet symbol, IPA equivalent, example word, and phoneme category
(vowel, consonant, diphthong).

Returns:
    list of dicts, each with keys:
        - arpabet (str): ARPAbet symbol (e.g. 'AA', 'TH')
        - ipa (str): IPA notation
        - example (str): Example word containing the phoneme
        - category (str): vowel, consonant, or diphthong
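
The inventory is a flat list, so a client may want to index it once for lookups. The sketch below assumes entries shaped exactly as described above; `index_inventory` is a hypothetical helper, and the sample entries are standard ARPAbet/IPA pairs written by hand, not captured server output.

```python
from collections import defaultdict


def index_inventory(inventory: list) -> tuple:
    """Return (ARPAbet -> IPA mapping, category -> list of ARPAbet symbols)."""
    arpabet_to_ipa = {}
    by_category = defaultdict(list)
    for entry in inventory:
        arpabet_to_ipa[entry["arpabet"]] = entry["ipa"]
        by_category[entry["category"]].append(entry["arpabet"])
    return arpabet_to_ipa, dict(by_category)


# Hand-written sample entries for illustration only.
sample_inventory = [
    {"arpabet": "AA", "ipa": "ɑ", "example": "father", "category": "vowel"},
    {"arpabet": "TH", "ipa": "θ", "example": "think", "category": "consonant"},
    {"arpabet": "AY", "ipa": "aɪ", "example": "my", "category": "diphthong"},
]
to_ipa, categories = index_inventory(sample_inventory)
```
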
- transcribe_audio (Transcribe Audio) - Transcribe audio to text with word-level timestamps.

Converts spoken English audio into text with optional word-level timestamps
and per-word confidence scores.

Args:
    audio_base64: Base64-encoded audio data (WAV, MP3, OGG, FLAC, WebM).
    audio_format: Audio format hint. Auto-detected from magic bytes if omitted.
    include_timestamps: Whether to include word-level timing (default: true).

Returns:
    dict with keys:
        - text (str): Full decoded transcript
        - words (list): Per-word results with timestamps, each containing:
            - word (str): The transcribed word
            - start (float): Start time in seconds
            - end (float): End time in seconds
            - confidence (float 0-1): Word-level confidence
        - audioDurationMs (int): Audio duration in milliseconds
        - metadata (dict): Processing time, audio length, model version
        - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)
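
Two small post-processing helpers illustrate how the returned fields could be used in a language-learning setting. Both are hypothetical and assume only the response shape documented above; the 0.8 confidence cutoff is an arbitrary example value.

```python
def low_confidence_words(result: dict, min_confidence: float = 0.8) -> list:
    """Return the word entries whose confidence falls below `min_confidence`."""
    return [w for w in result.get("words", [])
            if w["confidence"] < min_confidence]


def words_per_minute(result: dict) -> float:
    """Estimate speaking rate from the word count and audioDurationMs."""
    duration_ms = result.get("audioDurationMs", 0)
    if duration_ms <= 0:
        return 0.0
    return len(result.get("words", [])) * 60000 / duration_ms
```

Both helpers tolerate a missing `words` list, which may be the case when `include_timestamps` is set to false.
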
- check_stt_service (Check STT Service) - Check if the speech-to-text service is healthy and ready.

Returns:
    dict with keys:
        - status (str): 'healthy' or error state
        - modelLoaded (bool): Whether the STT model is loaded
        - version (str): API version
- synthesize_speech (Synthesize Speech) - Generate natural speech audio from English text.

Produces high-quality speech with 12 English voices.
Returns base64-encoded WAV audio (16-bit PCM, 24kHz mono) along with metadata.

Available voices:
- af_heart (default), af_bella, af_nicole, af_sarah, af_sky (American female)
- am_adam, am_michael (American male)
- bf_emma, bf_isabella (British female)
- bm_george, bm_lewis, bm_daniel (British male)

Args:
    text: English text to synthesize (1-5000 characters).
    voice: Voice ID. See list above. Defaults to 'af_heart'.
    speed: Speed multiplier from 0.5 to 2.0 (default: 1.0).

Returns:
    dict with keys:
        - audio_base64 (str): Base64-encoded WAV audio (16-bit PCM, 24kHz)
        - duration_ms (str): Audio duration in milliseconds
        - voice (str): Voice ID used
        - text_length (str): Input text character count
        - processing_ms (str): Synthesis time in milliseconds
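
The documented limits can be checked on the client before a call is made, and the returned audio can be inspected with the standard library. This is a sketch under the constraints listed above; `validate_tts_args` and `wav_info` are hypothetical helper names, and whether the server enforces the same limits is not confirmed here.

```python
import base64
import io
import wave

# Voice IDs copied from the list above.
VOICES = {
    "af_heart", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael",
    "bf_emma", "bf_isabella",
    "bm_george", "bm_lewis", "bm_daniel",
}


def validate_tts_args(text: str, voice: str = "af_heart", speed: float = 1.0) -> dict:
    """Check arguments against the documented limits and return them as a dict."""
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return {"text": text, "voice": voice, "speed": speed}


def wav_info(result: dict) -> dict:
    """Decode audio_base64 and read basic properties from the WAV header."""
    raw = base64.b64decode(result["audio_base64"])
    with wave.open(io.BytesIO(raw), "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
```

For a response matching the description, `wav_info` would be expected to report a 24000 Hz sample rate, one channel, and a 2-byte sample width. Writing the decoded bytes to a `.wav` file is enough to play the audio.
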
- list_tts_voices (List TTS Voices) - List all available text-to-speech voices with metadata.

Returns:
    dict with keys:
        - voices (list): Available voices, each with id, name, gender, accent, grade
        - defaultVoice (str): Default voice ID
- check_tts_service (Check TTS Service) - Check if the text-to-speech service is healthy and ready.

Returns:
    dict with keys:
        - status (str): 'healthy' or error state
        - modelLoaded (bool): Whether the TTS model is loaded
        - version (str): API version
- transcribe_audio_pro (Transcribe Audio Pro (Whisper)) - Transcribe audio with Whisper Large V3 Turbo — multilingual STT.

Supports 99 languages with automatic language detection, word-level
timestamps, per-word confidence scores, and optional speaker diarization
(identifies who spoke each word). Reported word error rate (WER): about 2%.

Args:
    audio_base64: Base64-encoded audio (WAV, MP3, OGG, FLAC, WebM).
    language: Language code. Auto-detected if omitted. Supports 99 languages.
    diarize: Enable speaker diarization (default: false). When true, each word
        includes a speaker label (e.g. SPEAKER_00, SPEAKER_01).

Returns:
    dict with keys:
        - text (str): Full decoded transcript
        - words (list): Per-word results with timestamps, each containing:
            - word (str), start (float), end (float), confidence (float 0-1)
            - speaker (str|null): Speaker label when diarize=true
        - speakers (dict|null): Speaker info with count and labels
        - audioDurationMs (int): Audio duration in milliseconds
        - metadata (dict): Processing time, language, languageProbability
        - audioQuality (dict): Audio metrics (SNR, peak/RMS dB, etc.)
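
With `diarize` set to true, each word carries a speaker label, so consecutive words can be merged into speaker turns. The helper below is a hypothetical sketch that assumes only the per-word fields documented above; the sample words are hand-written, not server output.

```python
def speaker_turns(words: list) -> list:
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for w in words:
        speaker = w.get("speaker")  # None when diarization is off
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            turns.append({
                "speaker": speaker,
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns


# Hand-written sample for illustration only.
sample_words = [
    {"word": "Hi", "start": 0.0, "end": 0.3, "confidence": 0.9, "speaker": "SPEAKER_00"},
    {"word": "there", "start": 0.3, "end": 0.6, "confidence": 0.9, "speaker": "SPEAKER_00"},
    {"word": "Hello", "start": 0.9, "end": 1.3, "confidence": 0.8, "speaker": "SPEAKER_01"},
]
turns = speaker_turns(sample_words)
```

Without diarization every word has a null speaker, so the whole transcript collapses into a single turn.
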
- check_whisper_service (Check Whisper Service) - Check if the Whisper STT Pro service is healthy and ready.

Returns:
    dict with keys:
        - status (str): 'healthy' or error state
        - modelLoaded (bool): Whether the Whisper model is loaded
        - diarizeLoaded (bool): Whether the diarization pipeline is loaded
        - version (str): API version
        - modelName (str): Whisper model name (e.g. 'large-v3-turbo')
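
The four health-check tools share the `status` and `modelLoaded` keys, so one readiness check can cover all of them. The sketch below is a hypothetical helper; treating a missing flag as "not ready" is a design assumption, not documented server behavior.

```python
def is_ready(health: dict, extra_flags: tuple = ()) -> bool:
    """Return True when a health-check result reports a usable service."""
    if health.get("status") != "healthy":
        return False
    if not health.get("modelLoaded", False):
        return False
    # Optional extra flags, e.g. "diarizeLoaded" for check_whisper_service.
    return all(health.get(flag, False) for flag in extra_flags)
```

For example, `is_ready(result, extra_flags=("diarizeLoaded",))` could gate a `transcribe_audio_pro` call that sets `diarize` to true.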

## Resources
Not captured

## Prompts
Not captured

## Metadata
- Owner: io.github.fasuizu-br
- Version: 2.3.0
- Runtime: Streamable HTTP
- Transports: HTTP
- License: Not captured
- Language: Not captured
- Stars: Not captured
- Updated: Mar 5, 2026
- Source: https://registry.modelcontextprotocol.io