Update

Akito-UzukiP · Jan 4, 2024 · c7571a7
1 parent b03f68e
commit c7571a7
Showing 12 changed files with 213 additions and 54 deletions.
diff --git a/README.md b/README.md
@@ -87,14 +87,16 @@ model_assets
 
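 - This is for people who want to use styles other than the default "Neutral" style.
 - Double-click `Style.bat` or run `python webui_style_vectors.py` to launch the WebUI.
-- It is independent of training, so you can do it during training, and you can redo it as many times as you like after training finishes.
-- For details about styles, see [clustering.ipynb](clustering.ipynb).
+- It is independent of training, so you can do it during training, and you can redo it as many times as you like after training finishes (preprocessing must already be finished).
+- For details of the style specification, see [clustering.ipynb](clustering.ipynb).
 
 ### Creating a dataset
 
 - Double-click `Dataset.bat` or run `python webui_dataset.py` to launch a WebUI for creating a dataset from audio files. With it, you can train from audio files alone.
 
-Note: If you want to manually fix or denoise the dataset, or build a higher-quality dataset, we recommend using [Aivis](https://github.com/tsukumijima/Aivis) or [Aivis Dataset](https://github.com/litagin02/Aivis-Dataset), the Windows-compatible version of its dataset component.
+Note: If you want to make fine-grained fixes such as manual corrections or noise removal, [Aivis](https://github.com/tsukumijima/Aivis) or [Aivis Dataset](https://github.com/litagin02/Aivis-Dataset), the Windows-compatible version of its dataset component, may be a good choice. However, when there are many files, simply slicing them with this tool to build the dataset may well be sufficient.
+
+Experiment for yourself to find out what kind of dataset works best.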
 
 ### API Server
 

diff --git a/colab.ipynb b/colab.ipynb
@@ -54,11 +54,23 @@
         "!python initialize.py"
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Run this cell if you use Google Drive.\n",
+        "\n",
+        "from google.colab import drive\n",
+        "drive.mount(\"/content/drive\")"
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
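-        "## 1. Initial setup\n",
+        "## 1. Initial setup (and dataset creation)\n",
         "\n",
         "Specify the names of the directories where the training data and results are saved.\n",
         "If you use Google Drive, run this cell as-is; if you want to customize the paths, change them before running."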
@@ -70,38 +82,49 @@
       "metadata": {},
       "outputs": [],
       "source": [
+        "# Directory where files needed for training and intermediate results are saved\n",
         "dataset_root = \"/content/drive/MyDrive/Style-Bert-VITS2/Data\"\n",
+        "\n",
+        "# Directory where training results (the files needed for speech synthesis) are saved\n",
         "assets_root = \"/content/drive/MyDrive/Style-Bert-VITS2/model_assets\"\n",
         "\n",
         "import yaml\n",
         "\n",
+        "\n",
         "with open(\"configs/paths.yml\", \"w\", encoding=\"utf-8\") as f:\n",
+        "\n",
         "    yaml.dump({\"dataset_root\": dataset_root, \"assets_root\": assets_root}, f)"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
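-        "## 2. Connecting Google Drive and placing the data\n",
+        "### Creating a dataset from audio files (skip if you already have one)\n",
         "\n",
-        "If you save your data to Google Drive, run the next cell to connect to Google Drive."
+        "If you do not have a dataset of audio files (roughly 2-12 seconds per file) and their transcriptions, you can create one from audio files alone with the steps below. By default, place your audio files (wav format; a single file or several) in `Style-Bert-VITS2/inputs/` on Google Drive and run the cell below to create the dataset."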
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "xzbmRmVfP29m",
-        "outputId": "bd492fba-3cba-4002-b6d9-f144f701fce6"
-      },
+      "metadata": {},
       "outputs": [],
       "source": [
-        "from google.colab import drive\n",
-        "drive.mount(\"/content/drive\")"
+        "# Directory containing the source audio files (wav format)\n",
+        "input_dir = \"/content/drive/MyDrive/Style-Bert-VITS2/inputs\"\n",
+        "# Enter the model name (speaker name)\n",
+        "model_name = \"your_model_name\"\n",
+        "\n",
+        "!python slice.py -i {input_dir} -o {dataset_root}/{model_name}/raw\n",
+        "!python transcribe.py -i {dataset_root}/{model_name}/raw -o {dataset_root}/{model_name}/esd.list --speaker_name {model_name} --compute_type float16"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 2. Placing the data"
       ]
     },
     {
@@ -186,7 +209,7 @@
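         "# Name of the folder you created above: `Data/{model_name}/`\n",
         "model_name = \"your_model_name\"\n",
         "\n",
-        "# Training batch size. Around 4 may be optimal. Adjust it according to how far VRAM usage overflows.\n",
+        "# Training batch size. Adjust it according to how far VRAM usage overflows.\n",
         "batch_size = 4\n",
         "\n",
         "# Number of training epochs (how many full passes over the dataset).\n",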

diff --git a/configs/config.json b/configs/config.json
@@ -66,5 +66,5 @@
     "use_spectral_norm": false,
     "gin_channels": 256
   },
-  "version": "1.2"
+  "version": "1.3"
 }
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
@@ -1,9 +1,20 @@
 # Changelog
 
+## v1.3
+
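+- Audio slicing and transcription in `Dataset.bat` are now more customizable (slice durations, the Whisper model used for transcription, the language, and so on).
+- Added a new method (DBSCAN) for style creation in `Style.bat` (you cannot specify the number of styles with it, but it may bring out style characteristics better).
+- Path settings are now consolidated in `configs/paths.yml` so that paths can be specified there when running in the cloud (the colab [notebook](http://colab.research.google.com/github/litagin02/Style-Bert-VITS2/blob/master/colab.ipynb) was changed accordingly, so please use the new one). The defaults are `dataset_root: Data` and `assets_root: model_assets`. Change them if you run in the cloud or a similar environment.
+- Added a script that uses [SpeechMOS](https://github.com/tarepan/SpeechMOS) as one indicator (among others) of which step count gives the best output:
+```bash
+python speech_mos.py -m <model_name>
+```
+This should display a naturalness rating for each step. To change the sentences that are read aloud, edit the script yourself. This is only one rough guide, so actually listening to the synthesized speech and choosing by ear is probably the best approach.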
+
 ## v1.2 (2023-12-31)
 
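 - Support speech synthesis for users without a GPU; install with `Install-Style-Bert-VITS2-CPU.bat`.
-- Support training on Google Colab; added a [notebook](../colab.ipynb)
+- Support training on Google Colab; added a [notebook](http://colab.research.google.com/github/litagin02/Style-Bert-VITS2/blob/master/colab.ipynb)
 - Added an API server for speech synthesis; start it with `python server_fastapi.py`. After starting it, check the API specification at `/docs`. (This is a PR by @darai0512, thank you!)
 - The default style Neutral is now generated automatically during training. If you do not need to specify styles, you can try speech synthesis right after training. You can still create your own styles as before.
 - New merge feature: `Merge.bat`, `webui_merge.py`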

diff --git a/docs/CLI.md b/docs/CLI.md
@@ -0,0 +1,56 @@
+# CLI
+
+
+
+## Dataset
+
+`Dataset.bat` webui (`python webui_dataset.py`) consists of **slice audio** and **transcribe wavs**.
+
+### Slice audio
+
+```bash
+python slice.py -i <input_dir> -o <output_dir> -m <min_sec> -M <max_sec>
+```
+
+Required:
+- `input_dir`: Path to the directory containing the audio files to slice.
+- `output_dir`: Path to the directory where the sliced audio files will be saved.
+
+Optional:
+- `min_sec`: Minimum duration of the sliced audio files in seconds (default 2).
+- `max_sec`: Maximum duration of the sliced audio files in seconds (default 12).
+
+### Transcribe wavs
+
+```bash
+python transcribe.py -i <input_dir> -o <output_file> --speaker_name <speaker_name>
+```
+
+Required:
+- `input_dir`: Path to the directory containing the audio files to transcribe.
+- `output_file`: Path to the file where the transcriptions will be saved.
+- `speaker_name`: Name of the speaker.
+
+Optional:
+- `--initial_prompt`: Initial prompt to use for the transcription (default value is specific to Japanese).
+- `--device`: `cuda` or `cpu` (default: `cuda`).
+- `--language`: `ja`, `en`, or `zh` (default: `ja`).
+- `--model`: Whisper model, default: `large-v3`
+- `--compute_type`: default: `bfloat16`
+
+## Train
+
+`Train.bat` webui (`python webui_train.py`) consists of the following.
+
+### Preprocess audio
+```bash
+python resample.py -i <input_dir> -o <output_dir> [--normalize] [--trim]
+```
+
+Required:
+- `input_dir`: Path to the directory containing the audio files to preprocess.
+- `output_dir`: Path to the directory where the preprocessed audio files will be saved.
+
+TO BE WRITTEN (WIP)
+
+Is this needed?
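
The two dataset commands documented in `docs/CLI.md` above can also be assembled from Python. The following is a minimal sketch and is not part of this commit: `build_dataset_commands` and `run_dataset_pipeline` are hypothetical helpers, and the sketch uses only flags that appear in this diff (`-i`, `-o`, `--speaker_name`, `--compute_type`) together with the directory layout from the colab cell (`<dataset_root>/<model_name>/raw` and `esd.list`).

```python
import subprocess
import sys
from pathlib import Path


def build_dataset_commands(input_dir, dataset_root, model_name, compute_type="float16"):
    """Return the argv lists for slice.py and transcribe.py (hypothetical helper)."""
    model_dir = Path(dataset_root) / model_name
    raw_dir = model_dir / "raw"
    # Step 1: slice the source audio into short clips under <model_dir>/raw.
    slice_cmd = [sys.executable, "slice.py", "-i", str(input_dir), "-o", str(raw_dir)]
    # Step 2: transcribe the sliced clips into <model_dir>/esd.list.
    transcribe_cmd = [
        sys.executable, "transcribe.py",
        "-i", str(raw_dir),
        "-o", str(model_dir / "esd.list"),
        "--speaker_name", model_name,
        "--compute_type", compute_type,
    ]
    return slice_cmd, transcribe_cmd


def run_dataset_pipeline(input_dir, dataset_root, model_name):
    """Run both steps in order; check=True stops at the first failing step."""
    for cmd in build_dataset_commands(input_dir, dataset_root, model_name):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems when directory names contain spaces. `run_dataset_pipeline` is meant to be called from the repository root, where `slice.py` and `transcribe.py` live.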
diff --git a/docs/tutorial.md b/docs/tutorial.md
diff --git a/resample.py b/resample.py
@@ -56,12 +56,14 @@ def process(item):
     )
     parser.add_argument(
         "--in_dir",
+        "-i",
         type=str,
         default=config.resample_config.in_dir,
         help="path to source dir",
     )
     parser.add_argument(
         "--out_dir",
+        "-o",
         type=str,
         default=config.resample_config.out_dir,
         help="path to target dir",

diff --git a/slice.py b/slice.py
@@ -9,7 +9,6 @@
 
 from common.log import logger
 from common.stdout_wrapper import SAFE_STDOUT
-from resample import normalize_audio
 
 vad_model, utils = torch.hub.load(
     repo_or_dir="snakers4/silero-vad",
@@ -59,7 +58,6 @@ def split_wav(
     min_sec=2,
     max_sec=12,
     min_silence_dur_ms=700,
-    normalize=False,
 ):
     margin = 200  # Padding in milliseconds added before and after each speech segment
     speech_timestamps = get_stamps(
@@ -87,9 +85,6 @@ def split_wav(
         end_sample = int(end_ms / 1000 * sr)
         segment = data[start_sample:end_sample]
 
-        if normalize:
-            segment = normalize_audio(segment, sr)
-
         sf.write(os.path.join(target_dir, f"{file_name}-{i}.wav"), segment, sr)
         total_time_ms += end_ms - start_ms
 
@@ -113,14 +108,11 @@ def split_wav(
     )
     parser.add_argument(
         "--output_dir",
-        "-t",
+        "-o",
         type=str,
         default="raw",
         help="Directory of output wav files",
     )
-    parser.add_argument(
-        "--normalize", action="store_true", help="Whether to normalize loudness"
-    )
     parser.add_argument(
         "--min_silence_dur_ms",
         "-s",
@@ -135,7 +127,6 @@ def split_wav(
     min_sec = args.min_sec
     max_sec = args.max_sec
     min_silence_dur_ms = args.min_silence_dur_ms
-    normalize = args.normalize
 
     wav_files = Path(input_dir).glob("**/*.wav")
     wav_files = list(wav_files)
@@ -151,7 +142,6 @@ def split_wav(
             min_sec=min_sec,
             max_sec=max_sec,
             min_silence_dur_ms=min_silence_dur_ms,
-            normalize=normalize,
         )
         total_sec += time_sec
 

diff --git a/speech_mos.py b/speech_mos.py
@@ -7,6 +7,7 @@
 import pandas as pd
 import torch
 from tqdm import tqdm
+import numpy as np
 
 from common.log import logger
 from common.tts_model import Model
@@ -35,8 +36,7 @@
 
 parser = argparse.ArgumentParser()
 parser.add_argument("--model_name", "-m", type=str, required=True)
-parser.add_argument("--device", "-d", type=str, default="cuda")
-parser.add_argument("--output", "-o", type=str, default="mos.csv")
+parser.add_argument("--device", "-d", type=str, default="cuda")
 
 args = parser.parse_args()
 
@@ -84,12 +84,14 @@ def get_model(model_file: Path):
 for model_file, step, scores in results:
     logger.info(f"{model_file}: {scores[-1]}")
 
-with open(args.output, "w", encoding="utf-8", newline="") as f:
+with open(f"mos_{model_name}.csv", "w", encoding="utf-8", newline="") as f:
     writer = csv.writer(f)
     writer.writerow(["model_path"] + ["step"] + test_texts + ["mean"])
     for model_file, step, scores in results:
         writer.writerow([model_file] + [step] + scores)
 
+logger.info(f"mos_{model_name}.csv has been saved.")
+
 # Initialize the lists that store the step counts and each MOS value
 steps = []
 mos_values = []
@@ -117,8 +119,24 @@ def get_model(model_file: Path):
 plt.xlabel("Step Count")
 plt.ylabel("MOS")
 
+# Show the step-axis tick labels in units of 1000
+plt.xticks(
+    ticks=np.arange(0, max(steps) + 1000, 1000),
+    labels=[f"{int(x/1000)}k" for x in np.arange(0, max(steps) + 1000, 1000)],
+)
+
+# Add vertical grid lines
+plt.grid(True, axis="x")
+
+# Place the legend outside the plot
+plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
+
 # Show the legend
 plt.legend()
 
 # Show the plot
 plt.show()
+
+plt.savefig(f"mos_{model_name}.png", bbox_inches="tight")
+
+logger.info(f"mos_{model_name}.png has been saved.")
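
The `plt.xticks` call in this diff needs tick positions and labels of the same length, otherwise Matplotlib raises an error. Below is a minimal sketch, not taken from the commit, that derives both from a single sequence so they cannot drift apart. `step_ticks` is a hypothetical helper name, and it uses plain `range` instead of `np.arange` only to stay dependency-free.

```python
def step_ticks(max_step, interval=1000):
    """Return (ticks, labels) of equal length for a step-count axis.

    Hypothetical helper: labels are whole thousands ("3k"), so intervals
    that are not multiples of 1000 would produce repeated labels.
    """
    ticks = list(range(0, max_step + interval, interval))
    labels = [f"{t // 1000}k" for t in ticks]
    return ticks, labels


ticks, labels = step_ticks(3000)
print(ticks)   # [0, 1000, 2000, 3000]
print(labels)  # ['0k', '1k', '2k', '3k']
```

The result can be passed straight through as `plt.xticks(ticks=ticks, labels=labels)`.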
diff --git a/transcribe.py b/transcribe.py
@@ -19,10 +19,10 @@ def transcribe(wav_path, initial_prompt=None, language="ja"):
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument("--input_dir", type=str, default="raw")
-    parser.add_argument("--output_file", type=str, default="esd.list")
+    parser.add_argument("--input_dir", "-i", type=str, default="raw")
+    parser.add_argument("--output_file", "-o", type=str, default="esd.list")
     parser.add_argument(
-        "--initial_prompt", type=str, default="こんにちは。元気、ですかー？私は……ちゃんと元気だよ！"
+        "--initial_prompt", type=str, default="こんにちは。元気、ですかー？ふふっ、私は……ちゃんと元気だよ！"
     )
     parser.add_argument(
         "--language", type=str, default="ja", choices=["ja", "en", "zh"]
@@ -43,13 +43,15 @@ def transcribe(wav_path, initial_prompt=None, language="ja"):
     device = args.device
     compute_type = args.compute_type
 
+    os.makedirs(os.path.dirname(os.path.abspath(output_file)), exist_ok=True)
+
     logger.info(
         f"Loading Whisper model ({args.model}) with compute_type={compute_type}"
     )
     try:
         model = WhisperModel(args.model, device=device, compute_type=compute_type)
     except ValueError as e:
-        logger.warning(f"Failed to load model: {e}")
+        logger.warning(f"Failed to load model, so use `auto` compute_type: {e}")
         model = WhisperModel(args.model, device=device)
 
     wav_files = [