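What I did today

Implementation investigation

Investigating the implementation

Today I want to make the data received from sounddevice usable as input to the speech-detection model. Each stage expects its input and produces its output in a particular format, so the data has to be converted between them. Let me summarize.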

Role	Component	Input	Output
Audio capture	sounddevice	-	np.ndarray[np.ndarray[np.int16]]
Speech detection	snakers4/silero-vad	torch.Tensor	torch.Tensor
Transcription	reazon-research/reazonspeech-nemo-v2	AudioData	AudioResult?

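What I probably want is to convert np.ndarray[np.ndarray[np.int16]] into torch.Tensor. torch.from_numpy does that. From what I found, the dtype has to be torch.float32, and since torch.from_numpy keeps the dtype of the source array, the int16 samples must be cast to float32 first.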
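As a minimal sketch of that conversion, the following uses only NumPy and a synthetic block in place of real microphone input. The helper name `int16_block_to_float32` is hypothetical, and the sketch assumes the block has shape (frames, 1) and dtype int16, which is what a single-channel int16 stream would deliver.

```python
import numpy as np


def int16_block_to_float32(indata: np.ndarray) -> np.ndarray:
    """Convert an int16 block of shape (frames, 1) to float32 of shape (frames,).

    Hypothetical helper for illustration; samples are scaled into [-1.0, 1.0).
    """
    mono = indata[:, 0]                     # drop the channel axis
    return mono.astype(np.float32) / 32768  # int16 range is -32768..32767


# Synthetic stand-in for one 512-sample block from the sound card.
block = np.zeros((512, 1), dtype=np.int16)
block[0, 0] = 16384
block[1, 0] = -32768

samples = int16_block_to_float32(block)
print(samples.shape, samples.dtype, samples[0], samples[1])
```

Passing `samples` to torch.from_numpy would then give a torch.float32 tensor that shares memory with the array.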

import time

import numpy as np
import sounddevice
import torch
from nemo.collections.asr.models import EncDecRNNTBPEModel
from reazonspeech.nemo.asr.audio import audio_from_tensor
from reazonspeech.nemo.asr.transcribe import transcribe


torch.set_num_threads(1)

THRESHOLD = 0.1  # speech-detection threshold (e.g. 0.5)
SAMPLING_RATE = 16000
WINDOW_SIZE_SAMPLES = 512   # 512 if SAMPLING_RATE == 16000 else 256

# Voice activity detection model
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
# Speech recognition model
rez_model = EncDecRNNTBPEModel.from_pretrained(
    'reazon-research/reazonspeech-nemo-v2',
    map_location="cpu")

(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

model.reset_states()

speech = None  # audio accumulated for the utterance in progress


def callback_sound(indata: np.ndarray, outdata, frames, time_info, status):
    """Called for every block; indata has shape (frames, 1), dtype int16."""
    global speech

    # Nothing is played back, so fill the output buffer with silence.
    outdata.fill(0)

    # Convert int16 PCM to float32 in [-1.0, 1.0) and drop the channel axis.
    audio_float32 = indata[:, 0].astype(np.float32) / 32768
    chunk = torch.from_numpy(audio_float32)  # dtype: torch.float32

    speech_prob = model(chunk, SAMPLING_RATE).item()
    print(speech_prob)
    if speech_prob > THRESHOLD:  # speech detected
        # Keep accumulating while the speaker is still talking.
        if speech is None:
            speech = chunk
        else:
            speech = torch.cat((speech, chunk), dim=0)
        return
    elif speech is not None:
        # Speech just ended: transcribe what was accumulated.
        # Note that this blocks the audio callback while it runs.
        print(speech)
        if speech.shape[0] >= WINDOW_SIZE_SAMPLES:
            audio = audio_from_tensor(speech, SAMPLING_RATE)
            ret = transcribe(rez_model, audio)
            for segment in ret.segments:
                print(segment, end="")
            print("\n")
            speech = None


with sounddevice.Stream(samplerate=SAMPLING_RATE, dtype="int16",
                        channels=1, blocksize=WINDOW_SIZE_SAMPLES,
                        callback=callback_sound):
    while True:
        time.sleep(0.01)

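Accuracy aside, I confirmed that it runs. It feels like one more thing I wanted to do is done. One more foolish little project completed, and that makes me a bit happy. As one step toward building a copy of myself on a computer, this was something I definitely needed.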
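The accumulate-then-release logic inside the callback can be separated from the audio and model code, which makes it testable with made-up probabilities. The sketch below is only an illustration: the class name `UtteranceBuffer` and its `feed` method are hypothetical, and chunks are plain Python lists here rather than the torch tensors the script concatenates.

```python
from typing import List, Optional


class UtteranceBuffer:
    """Collect chunks while speech is detected; release them when it stops.

    Hypothetical helper: chunks are plain lists of samples here, whereas the
    script above concatenates torch tensors.
    """

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._chunks: List[List[float]] = []

    def feed(self, chunk: List[float],
             speech_prob: float) -> Optional[List[float]]:
        """Return a finished utterance, or None while none is ready."""
        if speech_prob > self.threshold:
            self._chunks.append(chunk)
            return None
        if not self._chunks:
            return None  # silence before any speech: nothing to release
        utterance = [sample for c in self._chunks for sample in c]
        self._chunks = []
        return utterance


buffer = UtteranceBuffer(threshold=0.1)
probs = [0.02, 0.8, 0.9, 0.05, 0.01, 0.7, 0.03]  # made-up VAD outputs
utterances = []
for index, prob in enumerate(probs):
    finished = buffer.feed([float(index)], prob)  # one fake sample per chunk
    if finished is not None:
        utterances.append(finished)

print(utterances)
```

With these made-up probabilities, chunks 1 and 2 form the first utterance and chunk 5 the second; the silent chunks in between release nothing.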

Remnants

Here are the remnants left over from before I got to this point.

# Remnant 1: print the raw blocks delivered by sounddevice.
import time
import sounddevice
import numpy as np

SAMPLE_RATE = 16000


def callback_sound(indata: np.ndarray, outdata, frames, time_info, status):
    print(indata)
    
with sounddevice.Stream(samplerate=SAMPLE_RATE, dtype="int16",
                        channels=1, blocksize=512,
                        callback=callback_sound):
    while True:
        time.sleep(0.01)

# Remnant 2: load the Silero VAD model and read a WAV file.
import torch
torch.set_num_threads(1)

SAMPLING_RATE = 16000
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils
wav = read_audio("zun.wav", sampling_rate=SAMPLING_RATE)

Reading a book

I decided to read a book at the library, using the approach from https://blog.symdon.info/posts/1724291055/#headline-2. I did improve the script a little, though.

import subprocess
import sys

import ipadic
import MeCab

tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
text = sys.stdin.read()
node = tagger.parseToNode(text)

sentence = ""

with subprocess.Popen(["/usr/bin/say", "--rate", "250"],
                      stdin=subprocess.PIPE,
                      text=True) as p:
    while node:
        word = node.surface
        feature = node.feature.split(",")

        hinshi = feature[0]  # part of speech
        detail = feature[1]  # part-of-speech subcategory

        sentence += word
        if hinshi == "記号" and detail == "句点":
            # End of sentence: hand it to the speech command.
            p.stdin.write(sentence + "\n")
            p.stdin.flush()
            sentence = ""
        node = node.next

    # Send any text left over after the last full stop.
    if sentence:
        p.stdin.write(sentence + "\n")
# Leaving the with block closes stdin and waits for say to finish.

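The script name pdfreader.py may not be a good one. What actually reads the PDF is poppler's pdftotext; the script only passes the data from standard input to the text-to-speech command and manages that command's process. So it is not really a PDF reader.

Run it like this.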

pdftotext -f 35 -l 35 foo.pdf - | python pdfreader.py
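The sentence-splitting step can be tried on its own, without `say` or MeCab installed. This sketch rests on a simplifying assumption: it splits on the literal character 。 instead of using MeCab's part-of-speech tags as the script does, and the function name `split_sentences` is hypothetical.

```python
from typing import Iterator


def split_sentences(text: str, terminator: str = "。") -> Iterator[str]:
    """Yield sentences from text, each including its terminator.

    Assumption: a sentence ends at the literal terminator character. Text
    after the last terminator is yielded as a final, unterminated sentence.
    """
    sentence = ""
    for char in text:
        sentence += char
        if char == terminator:
            yield sentence.strip()
            sentence = ""
    if sentence.strip():
        yield sentence.strip()


page = "吾輩は猫である。名前はまだ無い。\nどこで生れたか"
sentences = list(split_sentences(page))
print(sentences)
```

Each yielded string could then be written to the speech command's standard input one line at a time.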

Reading the book again

I read just 5 pages.