Commit 9a604e2
1 Parent(s): 97eec03
add JVS and VCTK models
- models/tts/tungnaa_117_jvs.ckpt +3 -0
- models/tts/tungnaa_117_jvs.md +45 -0
- models/tts/tungnaa_119_vctk.ckpt +3 -0
- models/tts/tungnaa_119_vctk.md +46 -0
- models/vocoder/042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts +3 -0
- models/vocoder/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts +3 -0
models/tts/tungnaa_117_jvs.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2e30ae3127807958238196228b76d493de4f5f8483364b15161afddc43eaa80f
+size 1711942430
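The `.ckpt` entry above is a Git LFS pointer rather than the checkpoint itself: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. A minimal sketch of reading such a pointer, assuming the pointer text is already in hand (the `parse_lfs_pointer` helper is hypothetical and is not part of tungnaa or git-lfs):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:2e30ae3127807958238196228b76d493de4f5f8483364b15161afddc43eaa80f
size 1711942430
"""
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])
```

The `size` field can be compared against a downloaded file to catch a checkout that still holds the pointer instead of the roughly 1.7 GB checkpoint.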
models/tts/tungnaa_117_jvs.md
ADDED
@@ -0,0 +1,45 @@
+---
+block_size: 2048
+sample_rate: 44100
+latent_size: 12
+vocoder: "042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts"
+dataset: "John Van Stan (LibriTTS)"
+vocoder_type: "RAVE"
+alignment_type: "DCA"
+likelihood_type: "NSF"
+text_encoder_type: "CANINE"
+---
+
+# tungnaa_117_jvs
+
+### dimensions
+
+block size: 2048
+
+sample rate: 44100
+
+latent size: 12
+
+### dataset
+
+JVS (Hi-Fi TTS speaker 9017)
+
+### vocoder
+
+`models/vocoder/042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts`
+
+### training
+
+tungnaa commit `09ecdcd532eac3d454a8b4e28e896bca5bccbf9f`
+
+```bash
+tungnaa trainer --experiment 117-jvs-e2emulti-mask-ends --model-dir /data/users/victor/ivoice-models --log-dir /data/users/victor/ivoice-logs --manifest /data/users/victor/tmp/ivoice_prep_100m_0abe_multi/9017_manifest_clean_train.json --rave-model /data/users/victor/rave-v2/runs/042-jvs-100m-xfermulti_0abe2b072b/version_0/checkpoints/042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts --lr 3e-4 --lr-text 3e-5 --epoch-size 200 --save-epochs 20 --device cuda:0 train
+```
+
+### notes
+
+trained with the full JVS dataset, no annotations.
+
+uses a 12-dimensional vocoder trained with a subset of JVS, fine-tuned from a multivoice model.
+
+this model uses a neural spline flow likelihood.
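The front matter of the model card above is a flat `key: value` block, so it can be read without a YAML library. The sketch below does that and derives a latent frame rate from `sample_rate` and `block_size`. It assumes `block_size` is the number of audio samples covered by one vocoder latent frame, which the card does not state explicitly; `read_front_matter` is a hypothetical helper, not a tungnaa function.

```python
def read_front_matter(md_text: str) -> dict:
    """Read the flat `key: value` block between the leading '---' fences."""
    lines = md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing fence: end of front matter
        key, sep, value = line.partition(":")
        if sep:
            # Drop surrounding whitespace and optional double quotes.
            meta[key.strip()] = value.strip().strip('"')
    return meta


card = '''---
block_size: 2048
sample_rate: 44100
latent_size: 12
vocoder_type: "RAVE"
---
'''
meta = read_front_matter(card)
block_size = int(meta["block_size"])
sample_rate = int(meta["sample_rate"])

# Assumption: one latent frame spans block_size audio samples.
frames_per_second = sample_rate / block_size
ms_per_frame = 1000 * block_size / sample_rate
print(f"{frames_per_second:.2f} frames/s, {ms_per_frame:.1f} ms per frame")
```

Under that assumption the same arithmetic applies to any card in this commit by swapping in its `sample_rate` and `block_size`.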
models/tts/tungnaa_119_vctk.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8f83755dfa999b03881a1a4386cad9d5ad3c89453c925ec020bba2a602906165
+size 1711642462
models/tts/tungnaa_119_vctk.md
ADDED
@@ -0,0 +1,46 @@
+---
+block_size: 2048
+sample_rate: 48000
+latent_size: 11
+vocoder: "046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts"
+dataset: "VCTK"
+vocoder_type: "RAVE"
+alignment_type: "DCA"
+likelihood_type: "NSF"
+text_encoder_type: "CANINE"
+---
+
+# tungnaa_119_vctk
+
+### dimensions
+
+block size: 2048
+
+sample rate: 48000
+
+latent size: 11
+
+### dataset
+
+VCTK
+
+### vocoder
+
+`models/vocoder/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts`
+
+
+### training
+
+```bash
+tungnaa prep --datasets '{kind:"vctk", path:"/data/datasets/VCTK"}' --rave-path /data/users/victor/rave-v2/runs/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc/version_0/checkpoints/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts --out-path /data/users/victor/tmp/ivoice_prep_824a/
+
+tungnaa trainer --experiment 119-vctk --model-dir /data/users/victor/ivoice-models --log-dir /data/users/victor/ivoice-logs --manifest /data/users/victor/tmp/ivoice_prep_824a/vctk.json --concat-speakers 2 --speaker-annotate --device cuda:1 --batch-size 32 --rave-model /data/users/victor/rave-v2/runs/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc/version_0/checkpoints/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts --lr 3e-4 --lr-text 3e-5 --epoch-size 200 --save-epochs 20 train
+```
+
+### notes
+
+trained with concatenation of utterance pairs plus speaker annotations. example syntax: `[p225] this is an utterance. [p330] this is another.`
+
+uses a multi-dataset vocoder which was *not* fine-tuned to only VCTK, so it should have a lot of play in the latent biases.
+
+this model uses a neural spline flow likelihood.
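The notes above give the input format for this model: each utterance is prefixed with a bracketed VCTK speaker id. A minimal sketch of building such a string from (speaker, text) pairs follows. The `annotate` helper is hypothetical and only mirrors the example syntax in the notes; how tungnaa itself applies `--speaker-annotate` during training is not shown in this commit.

```python
def annotate(utterances):
    """Join (speaker_id, text) pairs as '[speaker_id] text' segments."""
    return " ".join(f"[{speaker}] {text}" for speaker, text in utterances)


prompt = annotate([
    ("p225", "this is an utterance."),
    ("p330", "this is another."),
])
print(prompt)
```

Since training used pairs of utterances (`--concat-speakers 2`), prompts with one or two annotated segments are likely closest to what the model saw.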
models/vocoder/042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4ab33a3050e7269b20b455438811662231a42a825648400025e57720c39061ee
+size 149351311
models/vocoder/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5a1b28220e41a9148147286ea80f7286c8c249862639ba985f776492d4631845
+size 150205512