Diarization Mode
Diarization determines how the system identifies and handles multiple speakers.
Action Phrase offers two diarization modes:
- SpeakerManager
- SortFormer
They solve different problems, so understanding the tradeoffs matters when choosing between them.
SpeakerManager
SpeakerManager is designed for identifying a single known speaker.
Key Characteristics
- Supports one enrolled voice profile
- Recognizes that specific person’s voice
- Best when only one person is speaking at a time
Voice Enrollment
SpeakerManager allows you to:
- Record and store one voice profile.
- Improve recognition reliability for that person.
- Reduce accidental triggers from unrelated voices.
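To illustrate the idea of a single enrolled profile, here is a minimal sketch in Python. It is not Action Phrase's implementation. It assumes voices are represented as numeric embedding vectors and compared by cosine similarity, and the names `SingleProfileStore`, `enroll`, `matches`, and the threshold of 0.75 are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SingleProfileStore:
    """Hypothetical store that keeps at most one enrolled voice profile."""

    def __init__(self, threshold=0.75):
        # Assumed similarity cutoff; a real system would tune this value.
        self.threshold = threshold
        self.profile = None

    def enroll(self, embedding):
        # Enrolling again replaces the previous profile: only one is kept.
        self.profile = list(embedding)

    def matches(self, embedding):
        # With no enrolled profile, nothing can match.
        if self.profile is None:
            return False
        return cosine_similarity(self.profile, embedding) >= self.threshold
```

In this sketch, a voice that does not resemble the enrolled profile falls below the threshold and is ignored, which is one way unrelated voices could be kept from triggering actions.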
Important Limitation
SpeakerManager performs best when only one person is speaking. It is not designed for overlapping speech: if multiple voices speak simultaneously, recognition accuracy for the enrolled voice may decrease.
SortFormer
SortFormer is a transformer-based diarization model designed for multi-speaker environments.
Key Characteristics
- Handles overlapping speech more effectively
- Better at isolating a calibrated voice when others are speaking
- Does not support stored voice profiles
- Includes optional Silence Detection
Calibrated Voice Recognition
SortFormer can still identify and prioritize the calibrated voice when other voices are present, often more reliably than SpeakerManager in overlapping scenarios.
However:
- You cannot store persistent voice profiles, so enrollment-style voice storage is not supported.
- It relies on real-time modeling rather than saved identity profiles.
Silence Detection
When enabled, Silence Detection:
- Identifies pauses between speech segments
- Helps finalize transcript segments more cleanly
- Improves responsiveness during natural breaks
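As a rough illustration of how pause detection can finalize transcript segments, here is a minimal Python sketch. It is not how Action Phrase implements Silence Detection. It assumes audio has already been reduced to one energy value per frame, and the function name `split_on_silence`, the energy threshold, and the minimum pause length are hypothetical.

```python
def split_on_silence(frame_energies, threshold=0.01, min_silent_frames=3):
    """Return (start, end) frame index pairs for speech segments.

    `end` is exclusive. A segment is finalized once at least
    `min_silent_frames` consecutive frames fall below `threshold`.
    """
    segments = []
    start = None      # index of the first frame of the open segment
    silent_run = 0    # consecutive silent frames seen inside the segment

    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Pause is long enough: close the segment at the last speech frame.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0

    # Close a segment that is still open when the input ends,
    # trimming any trailing silent frames.
    if start is not None:
        segments.append((start, len(frame_energies) - silent_run))
    return segments
```

Short dips in energy (fewer than `min_silent_frames` frames) stay inside the current segment, so a brief hesitation does not split a phrase, while a longer natural break closes the segment.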