Diarization Mode

Diarization determines how the system identifies and handles multiple speakers.

Action Phrase offers three diarization modes:

  • SpeakerManager
  • SortFormer
  • LS-EEND

They solve different problems, so it helps to understand the tradeoffs before choosing one.

SpeakerManager

SpeakerManager is designed for identifying a single known speaker.

Key Characteristics

  • Supports one enrolled voice profile
  • Recognizes that specific person’s voice
  • Best when only one person is speaking at a time

Voice Enrollment

SpeakerManager allows you to:

  • Record and store one voice profile.
  • Improve recognition reliability for that person.
  • Reduce accidental triggers from unrelated voices.
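
To make the idea of a stored voice profile concrete, here is a minimal sketch of one common way speaker matching is done: compare a stored embedding vector against the embedding of incoming audio using cosine similarity. This is an illustration only. The function names, the three-number "embeddings", and the 0.75 threshold are all assumptions, and nothing here describes how SpeakerManager is actually implemented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_enrolled_speaker(profile, candidate, threshold=0.75):
    """Hypothetical gate: accept audio only if it resembles the stored profile."""
    return cosine_similarity(profile, candidate) >= threshold

# Toy vectors standing in for real voice embeddings (assumed values).
enrolled_profile = [0.9, 0.1, 0.3]
same_voice = [0.85, 0.15, 0.35]
other_voice = [0.1, 0.9, 0.2]

print(is_enrolled_speaker(enrolled_profile, same_voice))   # similar vector passes the gate
print(is_enrolled_speaker(enrolled_profile, other_voice))  # dissimilar vector is rejected
```

A gate like this is what lets a stored profile reduce accidental triggers: audio that does not resemble the enrolled voice never reaches the trigger logic.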

Important Limitation

SpeakerManager performs best when only one person is speaking. It is not designed for overlapping speech: if multiple voices speak simultaneously, recognition accuracy for the enrolled voice may decrease.


SortFormer

SortFormer is a transformer-based diarization model designed for multi-speaker environments.

Key Characteristics

  • Handles overlapping speech more effectively
  • Better at isolating a calibrated voice when others are speaking
  • Does not support stored voice profiles
  • Includes optional Silence Detection

Calibrated Voice Recognition

SortFormer can still identify and prioritize the calibrated voice when other voices are present, and it often does so more reliably than SpeakerManager in overlapping scenarios.

However:

  • You cannot store persistent voice profiles; enrollment-style voice storage is not supported.
  • It relies on real-time modeling rather than saved identity profiles.

Silence Detection

When enabled, Silence Detection:

  • Identifies pauses between speech segments
  • Helps finalize transcript segments more cleanly
  • Improves responsiveness during natural breaks
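
As a rough illustration of how pauses can be used to finalize segments, the sketch below splits a sequence of per-frame loudness values into speech segments whenever enough consecutive quiet frames appear. It is a simplified energy-threshold approach under assumed parameters (`silence_threshold`, `min_silence_frames`), not a description of how Action Phrase detects silence internally.

```python
def split_on_silence(frame_levels, silence_threshold=0.1, min_silence_frames=3):
    """Return (start, end) frame index pairs, end exclusive, for each speech segment.

    A segment is finalized once `min_silence_frames` consecutive frames fall
    below `silence_threshold`. Shorter pauses stay inside the segment.
    """
    segments = []
    start = None      # index where the current segment began, or None
    silent_run = 0    # consecutive quiet frames seen inside the segment
    for i, level in enumerate(frame_levels):
        if level >= silence_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Close the segment at the first frame of the pause.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended mid-segment: trim any trailing quiet frames.
        segments.append((start, len(frame_levels) - silent_run))
    return segments

levels = [0.0, 0.5, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0]
print(split_on_silence(levels))  # [(1, 5), (8, 10)]
```

The one-frame dip at index 3 does not end the first segment, while the three-frame pause at indices 5 to 7 does. That is the tradeoff such a setting controls: a shorter required pause finalizes segments sooner but risks cutting a sentence in half.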

LS-EEND

LS-EEND is an end-to-end diarization model that handles overlap by processing all speech activity together, but it has different strengths than SortFormer.

Key Characteristics

  • Handles overlapping speech in a single pass
  • Good for detecting speech activity across multiple participants
  • Best used when you need conversational speaker segmentation, not stable speaker identity tracking
  • Speaker labels are inferred per session and are not tied to a persistent identity
  • Up to 10 speakers can be identified during a session

How LS-EEND Differs from SortFormer

  • LS-EEND can collapse acoustically similar speakers into a single speaker in some conversations
  • In practice, it does not always split people into distinct speaker channels
  • When separation is critical, SortFormer can be more reliable for room-style continuity
  • SortFormer remains the recommended choice for “Rooms” because it tends to maintain better speaker separation in real usage

Important Limitation

  • LS-EEND is not ideal when you need distinct, repeatable speaker identity across a room session
  • In busy environments, you may get merged speaker tracks that remain merged for long stretches

Best Use Case

  • Use LS-EEND for trigger scenarios involving multiple simultaneous speakers where coarse multi-speaker coverage matters
  • Use SortFormer for room scenarios where distinguishing individuals consistently is more important
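
The guidance across the three modes can be summarized as a small decision helper. This is only a sketch of the recommendations in this section: the `DiarizationMode` enum, the `recommend_mode` function, and its three yes/no parameters are hypothetical names for illustration and are not part of Action Phrase.

```python
from enum import Enum

class DiarizationMode(Enum):
    SPEAKER_MANAGER = "SpeakerManager"
    SORTFORMER = "SortFormer"
    LS_EEND = "LS-EEND"

def recommend_mode(overlapping_speech, needs_stored_profile, needs_consistent_separation):
    """Map three yes/no questions about a setup to a suggested mode."""
    if needs_stored_profile:
        # Only SpeakerManager supports an enrolled, stored voice profile.
        return DiarizationMode.SPEAKER_MANAGER
    if not overlapping_speech:
        # One person speaking at a time is SpeakerManager's best case.
        return DiarizationMode.SPEAKER_MANAGER
    if needs_consistent_separation:
        # Room-style sessions where individuals must stay distinct.
        return DiarizationMode.SORTFORMER
    # Overlapping speech where coarse multi-speaker coverage is enough.
    return DiarizationMode.LS_EEND

print(recommend_mode(False, False, False).value)  # SpeakerManager
print(recommend_mode(True, False, True).value)    # SortFormer
print(recommend_mode(True, False, False).value)   # LS-EEND
```

Note that asking for a stored profile always yields SpeakerManager here, even with overlapping speech. In that situation the limitation described for SpeakerManager still applies, so accuracy for the enrolled voice may decrease.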

Silence Detection

When enabled with LS-EEND, Silence Detection helps by:

  • Marking cleaner segment boundaries during pauses
  • Reducing fragmented transcriptions
  • Making multi-speaker sessions easier to parse post-session
