Diarization Mode

Diarization determines how the system identifies and handles multiple speakers.

Action Phrase offers three diarization modes:

SpeakerManager
SortFormer
LS-EEND

They solve different problems, so it is important to understand the tradeoffs.

SpeakerManager

SpeakerManager is designed for identifying a single known speaker.

Key Characteristics

Supports one enrolled voice profile
Recognizes that specific person’s voice
Best when only one person is speaking at a time

Voice Enrollment

SpeakerManager allows you to:

Record and store one voice profile.
Improve recognition reliability for that person.
Reduce accidental triggers from unrelated voices.
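
To make the idea of matching against a single enrolled profile concrete, here is a minimal sketch. It assumes voices are represented as fixed-length embedding vectors compared by cosine similarity, which is a common approach but is not confirmed to be how Action Phrase works internally. The function names, the 0.7 threshold, and the toy vectors are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_enrolled_speaker(enrolled, candidate, threshold=0.7):
    """Hypothetical check: accept the candidate voice only if it is close to the one stored profile."""
    return cosine_similarity(enrolled, candidate) >= threshold

# Toy embeddings (illustrative values only, not real voice data).
enrolled_profile = [0.9, 0.1, 0.3]
same_voice = [0.85, 0.15, 0.35]
other_voice = [-0.2, 0.9, -0.4]

print(is_enrolled_speaker(enrolled_profile, same_voice))
print(is_enrolled_speaker(enrolled_profile, other_voice))
```

In a scheme like this, rejecting voices that fall below the threshold is what reduces accidental triggers from unrelated speakers.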

Important Limitation

SpeakerManager performs best when only one person is speaking at a time.

It is not designed for overlapping speech. If multiple voices speak simultaneously, recognition accuracy for the enrolled voice may decrease.

SortFormer

SortFormer is a transformer-based diarization model designed for multi-speaker environments.

Key Characteristics

Handles overlapping speech more effectively
Better at isolating a calibrated voice when others are speaking
Does not support stored voice profiles
Includes optional Silence Detection

Calibrated Voice Recognition

SortFormer can still identify and prioritize the calibrated voice when other voices are present, often more reliably than SpeakerManager when speech overlaps.

However:

You cannot store persistent voice profiles; enrollment-style voice storage is not supported.
It relies on real-time modeling rather than saved identity profiles.

Silence Detection

When enabled, Silence Detection:

Identifies pauses between speech segments
Helps finalize transcript segments more cleanly
Improves responsiveness during natural breaks
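
As a rough illustration of how pauses can be used to close off transcript segments, the sketch below splits a sequence of per-frame energy values into speech segments. It assumes a simple energy-threshold approach; the function name, the threshold, and the minimum pause length are hypothetical and do not describe Action Phrase's actual Silence Detection.

```python
def split_on_silence(frame_energies, energy_threshold=0.01, min_silent_frames=3):
    """Return (start, end) frame-index pairs for speech segments; end is exclusive.

    A segment is finalized once `min_silent_frames` consecutive frames fall
    below `energy_threshold`. Shorter pauses stay inside the segment.
    """
    segments = []
    start = None      # index where the current speech segment began
    silent_run = 0    # consecutive silent frames seen inside the current segment
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # The segment ends at the first silent frame of the pause.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended mid-segment; drop any trailing silent frames.
        segments.append((start, len(frame_energies) - silent_run))
    return segments

energies = [0.0, 0.2, 0.3, 0.0, 0.0, 0.0, 0.4, 0.0]
print(split_on_silence(energies))
```

Finalizing a segment as soon as the pause is long enough is what lets the transcript respond during natural breaks instead of waiting for the audio to end.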

LS-EEND

LS-EEND is an end-to-end diarization model that handles overlapping speech by processing all speech activity together. Its strengths differ from SortFormer's.

Key Characteristics

Handles overlapping speech in a single pass
Good for detecting speech activity across multiple participants
Best used when you need conversational speaker segmentation, not stable speaker identity tracking
Speaker labels are inferred per session and are not tied to a persistent identity
Up to 10 speakers can be identified during a session

How LS-EEND Differs from SortFormer

LS-EEND can collapse acoustically similar speakers into a single speaker in some conversations
In practice, it may not always split people into distinct speaker channels consistently
When speaker separation is critical, SortFormer can be more reliable at keeping speakers distinct over the course of a room session
SortFormer remains the recommended choice for “Rooms” because it tends to maintain better speaker separation in real usage

Important Limitation

LS-EEND is not ideal when you need distinct, repeatable speaker identity across a room session
In busy environments, you may get merged speaker tracks that remain merged for long stretches

Best Use Case

Use LS-EEND for trigger scenarios involving multiple simultaneous speakers where coarse multi-speaker coverage matters
Use SortFormer for room scenarios where distinguishing individuals consistently is more important
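
The guidance on this page can be condensed into a small rule of thumb. The helper below is only a sketch: the function name, the scenario strings, and the `overlapping_speech` flag are hypothetical and are not Action Phrase settings, and the fallback to SpeakerManager for single-speaker triggers is an interpretation of the SpeakerManager section above.

```python
def suggest_diarization_mode(scenario, overlapping_speech):
    """Suggest a diarization mode name from a coarse description of the use case.

    scenario: "room" or "trigger" (hypothetical labels for this sketch).
    overlapping_speech: True if several people are expected to talk at once.
    """
    if scenario == "room":
        # Consistent separation of individuals matters most in rooms.
        return "SortFormer"
    if scenario == "trigger":
        if overlapping_speech:
            # Coarse multi-speaker coverage is acceptable here.
            return "LS-EEND"
        # One person speaking at a time, optionally with an enrolled profile.
        return "SpeakerManager"
    raise ValueError(f"unknown scenario: {scenario!r}")

print(suggest_diarization_mode("room", overlapping_speech=True))
print(suggest_diarization_mode("trigger", overlapping_speech=True))
print(suggest_diarization_mode("trigger", overlapping_speech=False))
```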
