
Combine OpenAI's Whisper with diart and you get speaker-aware captions!

Demo of the speaker-aware transcription system we will build in this post

In this post, I'll show you how to combine OpenAI's Whisper speech recognition with streaming speaker diarization to obtain real-time transcriptions colored by speaker, as shown above.

How does it work?

Diart is an AI-based Python library for streaming speaker diarization (i.e. determining "who speaks when"), built on top of the pyannote.audio models and designed specifically for live audio streams such as a microphone.

With just a few lines of code, diart gives you real-time speaker labels like this:

Streaming speaker diarization with diart

Whisper, on the other hand, is a model recently released by OpenAI and trained for automatic speech recognition (ASR). It is particularly robust to noisy conditions, which makes it a great fit for real-life use cases.

Setting everything up

  • Install diart following the instructions here
  • Install whisper-timestamped (a quick import check follows this list):
    • pip install git+https://github.com/linto-ai/whisper-timestamped
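
If you want to verify that both packages were installed correctly, a minimal check (my own suggestion, not part of the original setup) is simply importing them:

# Both imports should succeed if the installation worked
import diart
import whisper_timestamped

print("diart and whisper_timestamped imported successfully")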

For the rest of this post, I'll rely on RxPY (the reactive programming extension for Python) for the streaming part. If you're not familiar with it, I suggest you take a look at this documentation page to get the basics.

In a nutshell, reactive programming is about operating on items (audio chunks, in our case) emitted by a given source (here, the microphone).
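
To make this concrete, here is a tiny RxPY sketch, unrelated to the final script, showing the source, operators and subscriber pattern we will apply to audio chunks (the numbers are just placeholders):

import rx
import rx.operators as ops

# A source that emits three items, transformed by a small chain of operators
rx.of(1, 2, 3).pipe(
    ops.map(lambda x: x * 10),      # transform each emitted item
    ops.filter(lambda x: x > 10),   # drop some of them
).subscribe(on_next=print)          # prints 20, then 30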

Combining diarization and transcription

Let's start with an overview of the source code and then break it down into pieces to understand it better.

import logging
import traceback
import diart.operators as dops
import rich
import rx.operators as ops
from diart import OnlineSpeakerDiarization, PipelineConfig
from diart.sources import MicrophoneAudioSource

# Suppress whisper-timestamped warnings for a clean output
logging.getLogger("whisper_timestamped").setLevel(logging.ERROR)

config = PipelineConfig(
    duration=5,
    step=0.5,
    latency="min",
    tau_active=0.5,
    rho_update=0.1,
    delta_new=0.57
)
dia = OnlineSpeakerDiarization(config)
source = MicrophoneAudioSource(config.sample_rate)

asr = WhisperTranscriber(model="small")

transcription_duration = 2
batch_size = int(transcription_duration // config.step)
source.stream.pipe(
    dops.rearrange_audio_stream(
        config.duration, config.step, config.sample_rate
    ),
    ops.buffer_with_count(count=batch_size),
    ops.map(dia),
    ops.map(concat),
    ops.filter(lambda ann_wav: ann_wav[0].get_timeline().duration() > 0),
    ops.starmap(asr),
    ops.map(colorize_transcription),
).subscribe(on_next=rich.print, on_error=lambda _: traceback.print_exc())

print("Listening...")
source.read()

Creating the speaker diarization module

First, we create the streaming (also known as "online") speaker diarization system, along with an audio source connected to the local microphone.

We configure the system to use a 5-second sliding window with a step of 500 ms (the default), and we set the latency to the minimum (500 ms) to increase responsiveness.

# If you have a GPU, you can also set device=torch.device("cuda")
config = PipelineConfig(
    duration=5,
    step=0.5,
    latency="min",
    tau_active=0.5,
    rho_update=0.1,
    delta_new=0.57
)
dia = OnlineSpeakerDiarization(config)
source = MicrophoneAudioSource(config.sample_rate)

Three additional parameters in this configuration regulate the sensitivity of speaker recognition:

  • tau_active=0.5: only recognize speakers whose speech probability is above 50%.
  • rho_update=0.1: diart automatically gathers information about speakers to improve itself (don't worry, this is done locally and is not shared with anyone). Here, we only use speakers with more than 100 ms of speech for this self-improvement.
  • delta_new=0.57: this is an internal threshold between 0 and 2 that regulates new speaker detection. The lower the value, the more sensitive the system is to differences between voices.

These values work well for me, but feel free to play with them to get the best performance! If you have your own dataset, you can also tune these parameters automatically with diart.tune, as shown here.

Creating the ASR module

Next, we load the speech recognition model using WhisperTranscriber, a class I created for this post.

# If you have a GPU, you can also set device="cuda"
asr = WhisperTranscriber(model="small")

The class is defined as follows:

import os
import sys
import numpy as np
import whisper_timestamped as whisper
from pyannote.core import Segment
from contextlib import contextmanager

@contextmanager
def suppress_stdout():
    # Auxiliary function to suppress Whisper logs (it is quite verbose)
    # All credit goes to: https://thesmithfam.org/blog/2012/10/25/temporarily-suppress-console-output-in-python/
    with open(os.devnull, "w") as devnull:
        old_stdout = sys.stdout
        sys.stdout = devnull
        try:
            yield
        finally:
            sys.stdout = old_stdout

class WhisperTranscriber:
    def __init__(self, model="small", device=None):
        self.model = whisper.load_model(model, device=device)
        self._buffer = ""

    def transcribe(self, waveform):
        """Transcribe audio using Whisper"""
        # Pad/trim audio to fit 30 seconds as required by Whisper
        audio = waveform.data.astype("float32").reshape(-1)
        audio = whisper.pad_or_trim(audio)

        # Transcribe the given audio while suppressing logs
        with suppress_stdout():
            transcription = whisper.transcribe(
                self.model,
                audio,
                # We use past transcriptions to condition the model
                initial_prompt=self._buffer,
                verbose=True  # to avoid progress bar
            )

        return transcription

    def identify_speakers(self, transcription, diarization, time_shift):
        """Iterate over transcription segments to assign speakers"""
        speaker_captions = []
        for segment in transcription["segments"]:

            # Crop diarization to the segment timestamps
            start = time_shift + segment["words"][0]["start"]
            end = time_shift + segment["words"][-1]["end"]
            dia = diarization.crop(Segment(start, end))

            # Assign a speaker to the segment based on diarization
            speakers = dia.labels()
            num_speakers = len(speakers)
            if num_speakers == 0:
                # No speakers were detected
                caption = (-1, segment["text"])
            elif num_speakers == 1:
                # Only one speaker is active in this segment
                spk_id = int(speakers[0].split("speaker")[1])
                caption = (spk_id, segment["text"])
            else:
                # Multiple speakers, select the one that speaks the most
                max_speaker = int(np.argmax([
                    dia.label_duration(spk) for spk in speakers
                ]))
                caption = (max_speaker, segment["text"])
            speaker_captions.append(caption)

        return speaker_captions

    def __call__(self, diarization, waveform):
        # Step 1: Transcribe
        transcription = self.transcribe(waveform)
        # Update transcription buffer
        self._buffer += transcription["text"]
        # The audio may not be the beginning of the conversation
        time_shift = waveform.sliding_window.start
        # Step 2: Assign speakers
        speaker_transcriptions = self.identify_speakers(transcription, diarization, time_shift)
        return speaker_transcriptions

The transcriber implements a simple operation: it receives an audio chunk together with its diarization and follows these steps (a usage sketch comes right after the list):

  • It transcribes the audio chunk with Whisper (with word timestamps)
  • It assigns a speaker to each segment of the transcription by aligning the timestamps of the words and the speakers
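
For illustration, here is a hypothetical standalone call to WhisperTranscriber (in the real pipeline it receives the output of concat, defined below). The silent audio and the hand-made diarization are placeholders, just to show the expected input and output types:

import numpy as np
from pyannote.core import Annotation, Segment, SlidingWindow, SlidingWindowFeature

asr = WhisperTranscriber(model="small")

# 2 seconds of 16 kHz audio (silence here, only to show the call signature)
samples = np.zeros((32000, 1), dtype=np.float32)
resolution = SlidingWindow(duration=1/16000, step=1/16000, start=0.0)
waveform = SlidingWindowFeature(samples, resolution)

# A hand-made diarization saying "speaker0" talks during the whole chunk
diarization = Annotation()
diarization[Segment(0.0, 2.0)] = "speaker0"

# Returns a list of (speaker_id, text) pairs (typically empty for silent audio)
captions = asr(diarization, waveform)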

Putting the two modules together

Now that we have created the diarization and transcription modules, we define the chain of operations to apply to each audio chunk:

import traceback
import rich
import rx.operators as ops
import diart.operators as dops

# Split the stream into 2s chunks for transcription
transcription_duration = 2
# Apply models in batches for better efficiency
batch_size = int(transcription_duration // config.step)

# Chain of operations to apply on the stream of microphone audio
source.stream.pipe(
    # Format audio stream to sliding windows of 5s with a step of 500ms
    dops.rearrange_audio_stream(
        config.duration, config.step, config.sample_rate
    ),
    # Wait until a batch is full
    # The output is a list of audio chunks
    ops.buffer_with_count(count=batch_size),
    # Obtain diarization prediction
    # The output is a list of pairs `(diarization, audio chunk)`
    ops.map(dia),
    # Concatenate 500ms predictions/chunks to form a single 2s chunk
    ops.map(concat),
    # Ignore this chunk if it does not contain speech
    ops.filter(lambda ann_wav: ann_wav[0].get_timeline().duration() > 0),
    # Obtain speaker-aware transcriptions
    # The output is a list of pairs `(speaker: int, caption: str)`
    ops.starmap(asr),
    # Color transcriptions according to the speaker
    # The output is plain text with color references for rich
    ops.map(colorize_transcription),
).subscribe(
    on_next=rich.print,  # print colored text
    on_error=lambda _: traceback.print_exc()  # print stacktrace if error
)

With the code above, every audio chunk coming from the microphone is pushed through the chain of operations we defined.

In this chain of operations, we first format the audio with rearrange_audio_stream into chunks of 5 seconds with a step of 500 ms between them. We then use buffer_with_count to fill the next batch and apply diarization. Note that the batch size was defined to match the size of the transcription window: with a 2-second window and a 500 ms step, batch_size = 4.

Next, we concatenate the non-overlapping 500 ms diarization predictions in the batch and apply WhisperTranscriber to obtain a speaker-aware transcription, but only if the audio contains speech. If no speech is detected, we skip this chunk and wait for the next one.
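
The "contains speech" check is the ops.filter predicate from the chain above. A small illustration of what it evaluates, using pyannote.core objects (the example values are my own):

from pyannote.core import Annotation, Segment

empty = Annotation()
print(empty.get_timeline().duration() > 0)    # False -> the chunk is skipped

speech = Annotation()
speech[Segment(0.0, 0.7)] = "speaker0"
print(speech.get_timeline().duration() > 0)   # True -> the chunk is transcribed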

Finally, we colorize the text and print it to standard output with the rich library.
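
The coloring relies on rich's console markup. As a quick illustration (with made-up captions), this is roughly what gets printed for two speakers:

import rich

# Each caption line is prefixed with a rich color tag chosen from the speaker's ID
rich.print("[bright_red]Hello, how are you?\n[bright_blue]Fine, thanks!")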

Since this whole chain of operations may be a bit obscure, I also prepared a diagram of how it runs, which I hope clarifies the algorithm:

Diagram of a single transcription step with diart and Whisper

You may have noticed that I haven't defined concat and colorize_transcription yet, but they are very simple utility functions:

import numpy as np
from pyannote.core import Annotation, SlidingWindowFeature, SlidingWindow

def concat(chunks, collar=0.05):
    """
    Concatenate predictions and audio
    given a list of `(diarization, waveform)` pairs
    and merge contiguous single-speaker regions
    with pauses shorter than `collar` seconds.
    """
    first_annotation = chunks[0][0]
    first_waveform = chunks[0][1]
    annotation = Annotation(uri=first_annotation.uri)
    data = []
    for ann, wav in chunks:
        annotation.update(ann)
        data.append(wav.data)
    annotation = annotation.support(collar)
    window = SlidingWindow(
        first_waveform.sliding_window.duration,
        first_waveform.sliding_window.step,
        first_waveform.sliding_window.start,
    )
    data = np.concatenate(data, axis=0)
    return annotation, SlidingWindowFeature(data, window)

def colorize_transcription(transcription):
    """
    Unify a speaker-aware transcription represented as
    a list of `(speaker: int, text: str)` pairs
    into a single text colored by speakers.
    """
    colors = 2 * [
        "bright_red", "bright_blue", "bright_green", "orange3", "deep_pink1",
        "yellow2", "magenta", "cyan", "bright_magenta", "dodger_blue2"
    ]
    result = []
    for speaker, text in transcription:
        if speaker == -1:
            # No speaker found for this text, use the default terminal color
            result.append(text)
        else:
            result.append(f"[{colors[speaker]}]{text}")
    return "\n".join(result)

If you're not familiar with the Annotation and SlidingWindowFeature classes used in pyannote.audio, I suggest you check out their official documentation pages.

Here, we use SlidingWindowFeature as a numpy array wrapper for audio chunks that also carries the timestamps provided by a SlidingWindow instance.
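
A minimal illustration of that wrapper (the values are arbitrary):

import numpy as np
from pyannote.core import SlidingWindow, SlidingWindowFeature

samples = np.random.randn(16000, 1).astype(np.float32)   # 1 s of 16 kHz audio
resolution = SlidingWindow(duration=1/16000, step=1/16000, start=2.5)
chunk = SlidingWindowFeature(samples, resolution)

print(chunk.data.shape)             # (16000, 1): the raw numpy samples
print(chunk.sliding_window.start)   # 2.5: the chunk starts at t=2.5s in the stream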

We also use Annotation as the preferred data structure to represent diarization predictions. It can be seen as an ordered list of segments, each carrying a speaker ID along with start and end timestamps.
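
And a small sketch of the Annotation operations that identify_speakers relies on (again with made-up values):

from pyannote.core import Annotation, Segment

ann = Annotation(uri="demo")
ann[Segment(0.0, 1.2)] = "speaker0"
ann[Segment(1.0, 2.5)] = "speaker1"

cropped = ann.crop(Segment(0.0, 0.8))     # keep only this time range
print(cropped.labels())                   # ['speaker0']
print(ann.label_duration("speaker1"))     # 1.5 (seconds of speech)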

Last step: start listening

The last thing we need to do is tell the microphone audio source to start listening:

print("Listening...")
source.read()

This immediately starts sending audio from the microphone to our source object, which redirects it through our pipeline.

Conclusion

In this post, we combined the diart streaming speaker diarization library with OpenAI's Whisper to obtain real-time transcriptions colored by speaker.

For convenience, the full script is available in this GitHub gist.

I hope this is useful to everyone looking for a streaming tool that provides high-quality transcription and diarization.

If you think I missed something or that anything could be improved, feel free to leave a comment or open an issue/discussion in the diart repository on GitHub! Would you like to see streaming ASR features in the library? We're always looking for new ideas and contributors!
