Diarization Mode
Diarization determines how the system identifies and handles multiple speakers.
Action Phrase offers three diarization modes:
- SpeakerManager
- SortFormer
- LS-EEND
They solve different problems, so it helps to understand the tradeoffs before choosing one.
SpeakerManager
SpeakerManager is designed for identifying a single known speaker.
Key Characteristics
- Supports one enrolled voice profile
- Recognizes that specific person’s voice
- Best when only one person is speaking at a time
Voice Enrollment
SpeakerManager allows you to:
- Record and store one voice profile.
- Improve recognition reliability for that person.
- Reduce accidental triggers from unrelated voices.
Important Limitation
SpeakerManager performs best when only one person is speaking. It is not designed for overlapping speech: if multiple voices speak simultaneously, recognition accuracy for the enrolled voice may decrease.
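To make the idea of matching against one enrolled profile concrete, here is a minimal sketch. It assumes voices are compared as embedding vectors using cosine similarity, which is a common approach in speaker verification. This guide does not describe how SpeakerManager compares voices internally, so the function names, the vectors, and the 0.7 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_enrolled_voice(enrolled, candidate, threshold=0.7):
    """Return True when the candidate is close enough to the enrolled profile.

    `enrolled` and `candidate` are hypothetical voice embeddings; the
    threshold is an illustrative value, not a documented setting.
    """
    return cosine_similarity(enrolled, candidate) >= threshold
```

Under this assumption, overlapping speech would blend several voices into one candidate vector and pull it away from the enrolled profile, which is consistent with the accuracy drop described above.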
SortFormer
SortFormer is a transformer-based diarization model designed for multi-speaker environments.
Key Characteristics
- Handles overlapping speech more effectively
- Better at isolating a calibrated voice when others are speaking
- Does not support stored voice profiles
- Includes optional Silence Detection
Calibrated Voice Recognition
SortFormer can still identify and prioritize the calibrated voice when other voices are present, often more reliably than SpeakerManager in overlapping scenarios.
However:
- You cannot store persistent voice profiles.
- Enrollment-style voice storage is not supported.
- It relies on real-time modeling rather than saved identity profiles.
Silence Detection
When enabled, Silence Detection:
- Identifies pauses between speech segments
- Helps finalize transcript segments more cleanly
- Improves responsiveness during natural breaks
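The following sketch illustrates how pause detection can finalize segments. It assumes a simple energy threshold over fixed-size audio frames. This guide does not describe how Silence Detection is actually implemented, so the function name, the threshold, and the minimum pause length are hypothetical.

```python
def split_on_silence(frame_energies, energy_threshold=0.1, min_silence_frames=3):
    """Split a sequence of per-frame energies into speech segments.

    Returns a list of (start, end) frame indices, with `end` exclusive.
    A segment is finalized once at least `min_silence_frames` consecutive
    frames fall below `energy_threshold`. All parameters are illustrative.
    """
    segments = []
    start = None      # first frame of the segment currently being built
    silent_run = 0    # consecutive quiet frames seen inside that segment
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Pause is long enough: close the segment at the last speech frame.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended mid-segment: close it, dropping any trailing quiet frames.
        segments.append((start, len(frame_energies) - silent_run))
    return segments
```

For example, `split_on_silence([0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0])` yields two segments, `(1, 3)` and `(6, 8)`, because the three quiet frames in the middle count as a pause.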
LS-EEND
LS-EEND is an end-to-end diarization model that handles overlap by processing all speech activity together, but it has different strengths than SortFormer.
Key Characteristics
- Handles overlapping speech in a single pass
- Good for detecting speech activity across multiple participants
- Best used when you need conversational speaker segmentation, not stable speaker identity tracking
- Speaker labels are inferred per session and are not tied to a persistent identity
- Up to 10 speakers can be identified during a session
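As an illustration of single-pass overlap handling, the sketch below assumes the model emits a per-frame activity score for each speaker slot, as end-to-end diarization models typically do. The output format Action Phrase actually uses is not documented here, so the function name, the data layout, and the 0.5 threshold are hypothetical. Only the 10-speaker limit comes from this guide.

```python
MAX_SPEAKERS = 10  # session limit stated in this guide


def activity_to_segments(activity, threshold=0.5):
    """Convert frame-level speaker activity into per-speaker segments.

    `activity` is a list of frames; each frame is a list of scores, one per
    session-local speaker slot (hypothetical layout). Returns a dict mapping
    speaker index to a list of (start, end) frame pairs, `end` exclusive.
    """
    if not activity:
        return {}
    num_speakers = len(activity[0])
    if num_speakers > MAX_SPEAKERS:
        raise ValueError("at most 10 speakers can be identified per session")
    segments = {}
    for spk in range(num_speakers):
        start = None
        for t, frame in enumerate(activity):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.setdefault(spk, []).append((start, t))
                start = None
        if start is not None:
            segments.setdefault(spk, []).append((start, len(activity)))
    return segments
```

Because each speaker slot is thresholded independently, two speakers can be active in the same frame. The indices are session-local labels only and carry no persistent identity.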
How LS-EEND Differs from SortFormer
- LS-EEND can collapse acoustically similar speakers into a single speaker in some conversations
- In practice, it does not always split people into distinct speaker channels
- When separation is critical, SortFormer can be more reliable for room-style continuity
- SortFormer remains the recommended choice for “Rooms” because it tends to maintain better speaker separation in real usage
Important Limitation
- LS-EEND is not ideal when you need distinct, repeatable speaker identity across a room session
- In busy environments, you may get merged speaker tracks that remain merged for long stretches
Best Use Case
- Use LS-EEND for trigger scenarios involving multiple simultaneous speakers where coarse multi-speaker coverage matters
- Use SortFormer for room scenarios where distinguishing individuals consistently is more important
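The guidance in this section can be condensed into a small decision helper. This is only a sketch of the recommendations above. The enum, the function, and its parameters are hypothetical names for illustration and are not part of any Action Phrase API.

```python
from enum import Enum


class DiarizationMode(Enum):
    SPEAKER_MANAGER = "SpeakerManager"
    SORTFORMER = "SortFormer"
    LS_EEND = "LS-EEND"


def recommend_mode(overlapping_speech, needs_consistent_separation):
    """Suggest a diarization mode from two yes/no questions.

    overlapping_speech: will several people talk at the same time?
    needs_consistent_separation: must individuals stay distinct over the
        whole session (for example, in Rooms)?
    """
    if not overlapping_speech:
        # One speaker at a time: the enrolled-profile mode fits best.
        return DiarizationMode.SPEAKER_MANAGER
    if needs_consistent_separation:
        # Overlap plus stable separation: the choice recommended for Rooms.
        return DiarizationMode.SORTFORMER
    # Overlap where coarse multi-speaker coverage is enough.
    return DiarizationMode.LS_EEND
```

For instance, `recommend_mode(overlapping_speech=True, needs_consistent_separation=True)` returns `DiarizationMode.SORTFORMER`, matching the recommendation for Rooms.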
Silence Detection
When enabled with LS-EEND, Silence Detection helps by:
- Marking cleaner segment boundaries during pauses
- Reducing fragmented transcriptions
- Making multi-speaker sessions easier to parse post-session