Microsoft Speech to Text¶
Overview¶
The Microsoft Speech-to-Text Service in AIForged uses Microsoft's Azure AI Speech to accurately transcribe audio into text across many languages. Transcripts are stored in the document's Result property for downstream search, analytics, or workflow automation. Models can be tailored to improve accuracy for domain-specific vocabulary.
Tip: Use this service to quickly turn recorded meetings, calls, or podcasts into searchable, actionable text. For structured data extraction from documents, use Document Intelligence.
Permissions Required¶
Members must belong to one of the following AIForged user group roles to add and configure this service:
- Owner
- Administrator
- Developer
Tip: Role membership is managed in Organisations > Roles. Assign members to roles to grant agent and service administration access.
Supported Content Types¶
- MP3
- WAV (PCM)
Tip: If your audio is in another format (e.g., M4A, AAC, OGG), transcode it to MP3 or WAV using your preferred media converter before uploading.
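If you prefer to script the conversion, a minimal Python sketch using pydub (which relies on a local FFmpeg install) could look like the following; the file names are placeholders and any converter that outputs MP3 or WAV works equally well.

```python
from pydub import AudioSegment  # pip install pydub; requires FFmpeg on the PATH

# Load an unsupported container (e.g., M4A) and convert it to a
# speech-friendly WAV: single channel, 16 kHz, 16-bit PCM.
audio = AudioSegment.from_file("interview.m4a")
audio = audio.set_channels(1).set_frame_rate(16000).set_sample_width(2)
audio.export("interview.wav", format="wav")
```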
Possible Use Cases¶
- Generate meeting minutes or summaries from recorded sessions.
- Transcribe customer calls for QA, analytics, or compliance.
- Produce captions/subtitles for training videos and webinars.
- Extract music lyrics or spoken content from audio tracks (subject to licensing).
Service Setup¶
Follow these steps to add and configure the Microsoft Speech-to-Text Service to your agent:
- Open the Agent View: Navigate to the agent where you want to add the service.
- Add the Microsoft Speech-to-Text Service: Click the Add Service button.
- Select Service Type: Choose Microsoft Speech-to-Text Service from the available service types.
- Configure the Service Wizard: Open the Service Configuration Wizard.
- Step 1: General Settings. Configure the service name, description, and core settings. Default settings are sufficient for most use cases.
Service Configuration Settings¶
Most users can proceed with the default settings. Advanced configuration is available for custom workflows.
| Setting | Type | Required? | Description |
|---|---|---|---|
| ArchivingStrategy | Optional | No | Number of days before documents are deleted. |
| AccessKey | Optional | No | Override the Microsoft cloud access key (typically not required in AIForged). |
| BaseURL | Optional | No | Override the Speech-to-Text endpoint (advanced; usually not required). |
| BatchSize | Hidden | - | Processing batch size. |
| DocumentProcessedStatus | Optional | No | Status applied after successful transcription. |
| Enabled | Hidden | - | Enable or disable the service. |
| ExecuteBeforeProcess | Optional | No | If configured as a child service, execute before the parent service. |
| ExecuteAfterProcess | Optional | No | If configured as a child service, execute after the parent service. |
| Language | Optional | No | Specify the primary spoken language of the audio (e.g., en-US). |
| Password | Optional | No | Authentication/password handling; can be set per document via Custom Code. |
| RemoveComments | Optional | No | Remove human comments/annotations in document metadata (not typical for audio). |
Tip: If unsure, keep defaults unless you have a specific processing or integration requirement. Setting the correct Language improves transcription accuracy.
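For context, the Language setting corresponds to the recognition language used by Azure AI Speech. AIForged manages keys and endpoints for you, but if you want to spot-check how a language choice affects a sample file outside AIForged, a rough sketch with the Azure Speech SDK for Python could look like this; the key, region, and file name are placeholders, and this is not the service's internal implementation.

```python
import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

# Placeholders: supply your own Azure Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")
speech_config.speech_recognition_language = "en-GB"  # mirrors the Language setting

audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()  # recognizes a single utterance; long files need continuous recognition
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```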
Add and Process Documents¶
To upload and process audio using the Microsoft Speech-to-Text Service:
- Open the Service: When you open the Microsoft Speech-to-Text Service, you will be presented with the documents currently queued or processed in the Inbox.
- Upload Audio: Click the Upload button or drag and drop files over the document grid (MP3 or WAV).
- Select Category (Optional): If you know the category for the audio, select it. Otherwise, select No category.
- Process Documents: After uploading, select the audio files to process and click Process Checked.
Tip: For new services, process a small batch first to verify transcription quality before scaling up.
View Processed Documents¶
- Select Outbox in the usage filter in the Microsoft Speech-to-Text Service.
- Open any processed document to view the transcript in the Result property.
Troubleshooting Tips¶
- Transcript missing words or inaccurate?
  - Ensure clear audio: minimize background noise, echo, or music.
  - Set the correct Language (e.g., en-GB vs en-US).
  - Prefer mono recordings with consistent levels; avoid clipping.
- Long files take a while to complete?
  - Longer recordings may be processed asynchronously by the provider and take more time.
  - Split very long audio into smaller segments to keep processing responsive (see the sketch after this list).
- Audio won't process?
  - Confirm the file format is MP3 or WAV and not DRM-protected/encrypted.
  - Re-export the audio with a standard codec and a constant sample rate (e.g., 16 kHz mono WAV or 128 kbps mono MP3).
- Multiple speakers in a single recording?
  - Overlapping speakers and crosstalk reduce accuracy; use separate microphones when possible or pre-segment the audio.
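As a rough illustration of the re-export and splitting suggestions above, the Python sketch below uses pydub (FFmpeg required); the file names and the 10-minute segment length are arbitrary choices, not requirements of the service.

```python
from pydub import AudioSegment  # pip install pydub; requires FFmpeg on the PATH

# Re-export to a predictable format: 16 kHz, mono, 16-bit PCM.
audio = AudioSegment.from_file("long_call.mp3")
audio = audio.set_channels(1).set_frame_rate(16000).set_sample_width(2)

# Split into 10-minute segments so each upload stays small and responsive.
segment_ms = 10 * 60 * 1000
for index, start in enumerate(range(0, len(audio), segment_ms)):
    chunk = audio[start:start + segment_ms]
    chunk.export(f"long_call_part_{index:03d}.wav", format="wav")
```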
Best Practices¶
- Record at a consistent level in a quiet environment; reduce background noise and reverberation.
- Use mono channels for speech; 16 kHz or higher sample rate is recommended for better accuracy.
- Set the correct Language to match the audio content.
- Trim long silences and split long recordings into smaller parts for faster, more reliable processing.
- Validate a representative sample before large-scale processing, and standardize your capture/export settings across sources.
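To help standardize capture and export settings across sources, a quick pre-flight check such as the sketch below (Python with pydub, FFmpeg required) can flag files that deviate before you upload them; the thresholds are illustrative only.

```python
from pydub import AudioSegment  # pip install pydub; requires FFmpeg on the PATH

def preflight(path: str) -> list[str]:
    """Return a list of warnings for audio that may transcribe poorly."""
    audio = AudioSegment.from_file(path)
    warnings = []
    if audio.channels != 1:
        warnings.append(f"{path}: {audio.channels} channels; mono is preferred for speech")
    if audio.frame_rate < 16000:
        warnings.append(f"{path}: sample rate {audio.frame_rate} Hz; 16 kHz or higher is recommended")
    if audio.max_dBFS >= 0:
        warnings.append(f"{path}: peaks at 0 dBFS; the recording may be clipped")
    return warnings

for issue in preflight("meeting.wav"):
    print(issue)
```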