<html>
<head><title>Installing and training DeepSpeech</title>
</head>
<body>

Last updated: <tt>Sun Jun 23 05:43:28 BST 2019</tt>

<h3>Download DeepSpeech code:</h3>

<pre>
# Need git Large File Storage for deepspeech
$ sudo apt-get install git-lfs
# you only need to do this once to initialize git LFS
$ git lfs install
$ git lfs clone --depth 1 https://github.com/mozilla/DeepSpeech.git
</pre>

<b>NOTE:</b> You only need <tt>git lfs</tt> if you are planning to use the bundled language models.

<h3>Set up a virtualenv:</h3>

<pre>
$ virtualenv -p python3 $HOME/tmp/deepspeech-venv/

$ source $HOME/tmp/deepspeech-venv/bin/activate
</pre>

<h3>Install DeepSpeech python bindings:</h3>
<pre>
$ pip3 install deepspeech
$ pip3 install six
$ cd DeepSpeech
$ cd native_client
$ python3 ../util/taskcluster.py --target .
$ cd ..
</pre>

<h3>Install a tonne of requirements for training</h3>
<pre>
$ pip3 install -r requirements.txt
$ pip3 install $(python3 util/taskcluster.py --decoder)
</pre>

Do you have a GPU? If so, run:

<pre>
$ pip3 uninstall tensorflow
$ pip3 install 'tensorflow-gpu==1.13.1'
</pre>

You'll also need to install version 10.0 of CUDA and version 7.5 of cuDNN. Note that these aren't free software,
so you'll probably have to fill in some NVIDIA spam form to download them.

<h3>Install kenlm:</h3>
<pre>
$ wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
$ sudo apt-get install libboost-all-dev libeigen3-dev zlib1g-dev
</pre>

Then:

<pre>
$ mkdir kenlm/build
$ cd kenlm/build
$ cmake ..
$ make -j2
$ cd ../../
</pre>

<h3>Set up the data</h3>
<!-- $ python3 bin/import_digits.py ~/source/Turkic_TTS/corpus/chv/speakers/digits/ chv -->
<pre>
$ cd data
$ wget http://ilazki.thinkgeek.co.uk/~spectre/chv_digits.tar.gz
$ tar -xzvf chv_digits.tar.gz
$ cd chv_digits
$ mv deepspeech/* .

# rewrite the path prefix of the audio files
$ sed -Ei 's|/frans/home/path/|/YOUR/ABS/PATH/|g' *.csv
</pre>
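
If you would rather not fight with <tt>sed</tt> escaping, the same rewrite can be sketched in Python. This is only a minimal sketch: it assumes the audio path lives in a column called <tt>wav_filename</tt> (the DeepSpeech CSV layout), and the function name <tt>rewrite_prefix</tt> is made up for illustration.

```python
import csv
import io

def rewrite_prefix(csv_text, old_prefix, new_prefix, column="wav_filename"):
    """Return csv_text with old_prefix swapped for new_prefix at the start of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        path = row[column]
        # Only touch paths that really start with the old prefix.
        if path.startswith(old_prefix):
            row[column] = new_prefix + path[len(old_prefix):]
        writer.writerow(row)
    return out.getvalue()

# Tiny made-up sample in the DeepSpeech CSV layout.
sample = "wav_filename,wav_filesize,transcript\n/frans/home/path/0001.wav,1234,abc\n"
print(rewrite_prefix(sample, "/frans/home/path/", "/YOUR/ABS/PATH/"))
```

To apply it to real files you would read each <tt>.csv</tt>, pass its contents through the function and write the result back.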

If you want to set up the language model yourself you can use:
<pre>
$ cat vocab.txt | ../../kenlm/build/bin/lmplz --discount_fallback -o 3 --arpa vocab.arpa
$ ../../kenlm/build/bin/build_binary -T -s vocab.arpa lm.binary
$ ../../native_client/generate_trie alphabet.txt lm.binary vocab.txt trie
</pre>

Otherwise continue:

<pre>
$ cd ../../
$ mkdir models
</pre>

<h3> Train the model </h3>
<pre>
$ time python3 DeepSpeech.py \
    --train_files data/chv_digits/chv_digits.train.csv \
    --dev_files data/chv_digits/chv_digits.dev.csv \
    --test_files data/chv_digits/chv_digits.test.csv \
    --alphabet_config_path data/chv_digits/alphabet.txt \
    --lm_binary_path data/chv_digits/lm.binary \
    --lm_trie_path data/chv_digits/trie \
    --validation_step 1 \
    --test_batch_size 5 \
    --dev_batch_size 15 \
    --early_stop True \
    --export_dir models/digits \
    --epoch 100 \
    --report_count 100 \
    --n_hidden 494 \
    --learning_rate 0.00095 \
    --dropout_rate 0.22 \
    --max_to_keep 2 \
    --log_level 0 \
    --lm_weight 5 \
    --word_count_weight 1.0 \
    --valid_word_count_weight 1.0
</pre>

<h3> Training with Common Voice </h3>

You can download data from Common Voice <a href="https://voice.mozilla.org/en/datasets">here</a>. For this example, click on Chuvash.

You will have to enter an email address and tick two checkboxes. Any email address will do; it doesn't have to
be real.

You may even be able to just click on <a href="https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-2/cv.tar.gz">this link</a>.

After you download the <tt>cv.tar.gz</tt> file, make a new directory in your <tt>data/</tt> directory and uncompress the data:

<pre>
$ mkdir data/chv
$ mv cv.tar.gz data/chv
$ cd data/chv
$ tar -xzf cv.tar.gz
$ cat *.tsv | cut -f3 | tail -n +2 | sed 's/./ &amp; /g' | sed 's/  */ /g' | tr ' ' '\n' | uconv -x lower | sort -ur &gt; alphabet.txt
</pre>

<b>WARNING:</b> Make sure that the space symbol, <tt>' '</tt>, is the last symbol in the file, not the first.
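
If the pipeline above puts the space in the wrong place, the alphabet can also be built in Python. The following is a rough sketch rather than what DeepSpeech itself does: <tt>build_alphabet</tt> is a made-up name, and Python's <tt>str.lower()</tt> is assumed to be a good enough stand-in for <tt>uconv -x lower</tt>.

```python
def build_alphabet(sentences):
    """Return the distinct lower-cased characters, reverse-sorted, with space last."""
    chars = set()
    for sentence in sentences:
        chars.update(sentence.lower())
    # Tabs and newlines are separators, not part of the alphabet.
    chars.discard("\n")
    chars.discard("\t")
    has_space = " " in chars
    chars.discard(" ")
    # Reverse order mirrors the `sort -ur` used in the shell pipeline.
    alphabet = sorted(chars, reverse=True)
    if has_space:
        alphabet.append(" ")
    return alphabet

print(build_alphabet(["Ab ba", "cab"]))
```

Writing the returned list to <tt>alphabet.txt</tt>, one symbol per line, gives a file in which the space is guaranteed to come last.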

You then need to run the <tt>bin/import_cv2.py</tt> import script:

<pre>
$ cd ../../
$ bin/import_cv2.py --filter_alphabet data/chv/alphabet.txt /home/ftyers/source/DeepSpeech/data/chv/

</pre>

<b>WARNING:</b> The path should be an absolute path.

Train the LM:

<pre>
$ cd data/chv
$ ../../kenlm/build/bin/lmplz --discount_fallback -o 3 --arpa lm/chv.arpa &lt; lm/chv.txt
$ ../../kenlm/build/bin/build_binary -T -s lm/chv.arpa lm/lm.binary
$ ../../native_client/generate_trie alphabet.txt lm/lm.binary lm/trie
$ cd ../../
</pre>

The file <tt>lm/chv.txt</tt> is the plain text corpus.
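
If you have no separate text corpus, one option is to reuse the Common Voice sentences themselves. This is a sketch under assumptions: it takes the third tab-separated column as the sentence (as the <tt>cut -f3</tt> command earlier does), and <tt>corpus_lines</tt> is a made-up helper name. A language model built only from the training transcripts will be small, so treat this as a starting point.

```python
def corpus_lines(tsv_text, column=2):
    """Yield lower-cased sentences from tab-separated text, skipping the header row."""
    rows = tsv_text.splitlines()[1:]
    for row in rows:
        fields = row.split("\t")
        # Skip rows that are too short or have an empty sentence.
        if len(fields) > column and fields[column].strip():
            yield fields[column].strip().lower()

# Made-up sample with a header row and one empty sentence.
sample = "client_id\tpath\tsentence\nabc\t1.mp3\tSalam Tus\nabc\t2.mp3\t\n"
print("\n".join(corpus_lines(sample)))
```

The yielded lines, written one per line to <tt>lm/chv.txt</tt>, are in the plain-text form that <tt>lmplz</tt> reads.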

Then you can train the model:

<pre>
$ python3 DeepSpeech.py --train_files data/chv/clips/train.csv --dev_files data/chv/clips/dev.csv --test_files data/chv/clips/test.csv --alphabet_config_path data/chv/alphabet.txt --lm_binary_path data/chv/lm/lm.binary --lm_trie_path data/chv/lm/trie

</pre>

<h3> Troubleshooting </h3>

<h4> Check your data </h4>

Sometimes your data might be too quiet, or there might be too little amplitude difference between the
speech parts and the silence parts. You can check this using <tt>sox</tt>:

In this example the amplitude difference is too low:
<pre>
$ sox 0001.wav -n stat
Samples read:            610816
Length (seconds):     13.850703
Scaled by:         2147483647.0
Maximum amplitude:     0.037994
Minimum amplitude:    -0.048950
Midline amplitude:    -0.005478
Mean    norm:          0.002312
Mean    amplitude:     0.000000
RMS     amplitude:     0.005450
Maximum delta:         0.010834
Minimum delta:         0.000000
Mean    delta:         0.000222
RMS     delta:         0.000594
Rough   frequency:          765
Volume adjustment:       20.429
</pre>

Note the small difference between the maximum amplitude (0.038) and the minimum (-0.049).

In this example the amplitude difference is ok:
<pre>
$ sox 0001.wav -n stat
Samples read:            610816
Length (seconds):     13.850703
Scaled by:         2147483647.0
Maximum amplitude:     0.785187
Minimum amplitude:    -1.000000
Midline amplitude:    -0.107407
Mean    norm:          0.047779
Mean    amplitude:     0.000002
RMS     amplitude:     0.112630
Maximum delta:         0.223907
Minimum delta:         0.000000
Mean    delta:         0.004599
RMS     delta:         0.012279
Rough   frequency:          765
Volume adjustment:        1.000
</pre>

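You can increase the gain using <tt>sox 0001.wav 0001_gain.wav gain -n 0.1</tt>, which writes the adjusted audio to a new file (the output file name here is just an example).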
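
To check many files at once without reading <tt>sox</tt> output by eye, the minimum and maximum amplitude can be computed with Python's standard <tt>wave</tt> module. This is a minimal sketch that assumes 16-bit PCM WAV files; <tt>amplitude_range</tt> and <tt>make_wav</tt> are made-up names, and the second exists only to make the example self-contained. The values should be roughly comparable to the maximum and minimum amplitude that <tt>sox</tt> reports.

```python
import io
import wave

def amplitude_range(wav_file):
    """Return (min, max) sample amplitude, scaled to [-1, 1], of a 16-bit PCM WAV."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    # WAV samples are little-endian signed integers, two bytes each.
    samples = [
        int.from_bytes(frames[i:i + 2], "little", signed=True)
        for i in range(0, len(frames), 2)
    ]
    if not samples:
        raise ValueError("no samples found")
    return min(samples) / 32768.0, max(samples) / 32768.0

def make_wav(values, rate=16000):
    """Hypothetical demo helper: build a mono 16-bit WAV in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"".join(v.to_bytes(2, "little", signed=True) for v in values))
    buf.seek(0)
    return buf

# A very quiet made-up signal, similar in range to the first sox example.
low, high = amplitude_range(make_wav([0, 1245, -1604]))
print(round(low, 6), round(high, 6))
```

<tt>amplitude_range</tt> also accepts a file name, so it can be run over every clip in a directory to flag recordings whose range is as narrow as in the first example.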

<!--

<elpimous_robot> sox /home/nvidia/DeepSpeech/data/chv_digits/0240.wav /home/nvidia/DeepSpeech/test_wav.wav gain -n 0.1
<elpimous_robot> with a gain of 0.1, a better amplitude curve

<spectie> and what should the delta be?
<elpimous_robot> mean amplitude around +- 0.5
<elpimous_robot> here it is 0.99, because of the first high wave
<elpimous_robot> but this config looks good to me!
-->


</body>


</html>
<!--

python -u DeepSpeech.py \
  ##train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
  ##dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
  ##test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
  ##train_batch_size 90 \
  ##dev_batch_size 80 \
  ##test_batch_size 70 \
  ##n_hidden 375 \
  ##epoch 400 \
  ##validation_step 1 \
  ##early_stop True \  # early stop activated
  ##earlystop_nsteps 8 \ # if validation stop doesn't shut down after 8, stop
  ##estop_mean_thresh 0.001 \ # fine-tuning params for early stop nsteps, to know the loss position
  ##estop_std_thresh 0.001 \ # same
  ##dropout_rate 0.012 \ # a param to avoid overfitting. (works for me)
  ##learning_rate 0.001 \ # learning speed (too small = too long ; too high = we'll perhaps miss the best loss point value)
  ##beam_width 1024 \ # number of probabilities to keep in memory, to choose the best result (higher = more memory)
  ##lm_weight 5 \ # params for LM/trie integration in model creation
  ##word_count_weight 1.0 \ # same
  ##valid_word_count_weight 1.0 \ # same
  ##export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
  ##checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
  ##decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
  ##alphabet_config_path /home/nvidia/DeepSpeech/data/alphabet.txt \
  ##lm_binary_path /home/nvidia/DeepSpeech/data/lm.binary \
  ##lm_trie_path /home/nvidia/DeepSpeech/data/trie \
  "$@"

-->