Word Error Rate Algorithm - J.Notebook 1.0

WER

Word error rate (WER) is a common metric for the performance of a speech recognition system.

The formula is WER = (I + D + S) / N.

Given an original (reference) text of N words and a recognized (hypothesis) text:

S: number of substitutions
D: number of deletions
I: number of insertions
N: total number of words in the original (reference) text
The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level.
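
As a quick illustration of the formula, here is a minimal Python sketch. The function name `wer_from_counts` is hypothetical, and the error counts are assumed to be already known.

```python
def wer_from_counts(substitutions, deletions, insertions, n_reference_words):
    """Return WER = (I + D + S) / N for already-counted errors."""
    if n_reference_words <= 0:
        raise ValueError("the reference must contain at least one word")
    return (insertions + deletions + substitutions) / n_reference_words


# Example: 2 substitutions, 0 deletions, 0 insertions, 4 reference words.
print(wer_from_counts(2, 0, 0, 4))  # 0.5
```

Because insertions are counted in the numerator but not in N, the result can exceed 1.0 when the recognizer outputs many extra words.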

Levenshtein Distance

Levenshtein distance is the minimum number of single-element edits (substitutions, insertions, deletions) needed to turn one sequence into another.
For instance, the reference string "abcdef" and the hypothesis string "azced" are three edits apart.

		a	b	c	d	e	f
	0	1	2	3	4	5	6
a	1	0	1	2	3	4	5
z	2	1	1	2	3	4	5
c	3	2	2	1	2	3	4
e	4	3	3	2	2	2	3
d	5	4	4	3	2	3	3

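With i indexing the reference (columns) and j indexing the hypothesis (rows), each cell d[i][j] is the minimum of:

match: d[i-1][j-1] (only when the two elements are equal)
substitute: d[i-1][j-1] + 1
insert: d[i][j-1] + 1 (the cell above; moves up/down)
delete: d[i-1][j] + 1 (the cell to the left; moves left/right)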
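
The rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `levenshtein_matrix` is hypothetical, and the table is stored row-first, so `table[r][c]` holds the cell in hypothesis row r and reference column c.

```python
def levenshtein_matrix(reference, hypothesis):
    """Build the edit-distance table: one row per hypothesis element and
    one column per reference element, plus the empty-prefix row and column."""
    rows, cols = len(hypothesis) + 1, len(reference) + 1
    table = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        table[r][0] = r  # r insertions against an empty reference
    for c in range(cols):
        table[0][c] = c  # c deletions against an empty hypothesis
    for r in range(1, rows):
        for c in range(1, cols):
            if hypothesis[r - 1] == reference[c - 1]:
                table[r][c] = table[r - 1][c - 1]  # match, no extra cost
            else:
                table[r][c] = 1 + min(
                    table[r - 1][c - 1],  # substitute
                    table[r - 1][c],      # insert (cell above)
                    table[r][c - 1],      # delete (cell to the left)
                )
    return table


table = levenshtein_matrix("abcdef", "azced")
print(table[-1][-1])  # 3
```

The bottom-right cell holds the distance. The function accepts any two sequences, so passing lists of words instead of strings gives the word-level table.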

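We can trace the result back from the bottom-right cell to the top-left cell. The path runs through the cells with values 3, 2, 2, 1, 1, 0, 0, and each step that lowers the value corresponds to one edit:

f → d (substitution)

d x (deletion)

b → z (substitution)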
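
The trace-back can be sketched in Python as below. This is an illustrative sketch under assumptions: the names `edit_table` and `backtrace` are hypothetical, and when several paths are equally short the code prefers a substitution, then a deletion, then an insertion, which is only one of several valid tie-breaking choices.

```python
def edit_table(reference, hypothesis):
    """Edit-distance table, rows = hypothesis, columns = reference."""
    rows, cols = len(hypothesis) + 1, len(reference) + 1
    d = [[max(r, c) if r == 0 or c == 0 else 0 for c in range(cols)]
         for r in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            cost = 0 if hypothesis[r - 1] == reference[c - 1] else 1
            d[r][c] = min(d[r - 1][c - 1] + cost, d[r - 1][c] + 1, d[r][c - 1] + 1)
    return d


def backtrace(reference, hypothesis):
    """Walk from the bottom-right cell to the top-left cell and collect
    (operation, reference element, hypothesis element) tuples."""
    d = edit_table(reference, hypothesis)
    r, c = len(hypothesis), len(reference)
    edits = []
    while r > 0 or c > 0:
        if r > 0 and c > 0 and hypothesis[r - 1] == reference[c - 1]:
            r, c = r - 1, c - 1  # match: move diagonally, nothing to record
        elif r > 0 and c > 0 and d[r][c] == d[r - 1][c - 1] + 1:
            edits.append(("substitute", reference[c - 1], hypothesis[r - 1]))
            r, c = r - 1, c - 1
        elif c > 0 and d[r][c] == d[r][c - 1] + 1:
            edits.append(("delete", reference[c - 1], None))
            c -= 1  # reference element has no counterpart
        else:
            edits.append(("insert", None, hypothesis[r - 1]))
            r -= 1  # hypothesis element has no counterpart
    return edits


print(backtrace("abcdef", "azced"))
# [('substitute', 'f', 'd'), ('delete', 'd', None), ('substitute', 'b', 'z')]
```

The edits come out in right-to-left order because the walk starts at the end of both strings; reverse the list to read them left to right.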

Modification for WER at the word level

		their	fresh	new	results
	0	1	2	3	4
their	1	0	1	2	3
first	2	1	1	2	3
few	3	2	2	2	3
results	4	3	3	3	2

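new → few (substitution)

fresh → first (substitution)

Two substitutions against N = 4 reference words give WER = 2 / 4 = 0.5.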

		what	extent
	0	1	2
what's	1	1	2
the	2	2	2
extent	3	3	2
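
The two word-level tables above can be reproduced with a short Python sketch. The function name `word_error_rate` is hypothetical, and the tokenization is an assumption: words are split on whitespace and compared case-sensitively, with no punctuation or contraction normalization. Only two rows of the table are kept in memory at a time, since the distance alone does not need the full table.

```python
def word_error_rate(reference_text, hypothesis_text):
    """Return the word-level Levenshtein distance divided by the number
    of reference words (assumes whitespace tokenization)."""
    ref = reference_text.split()
    hyp = hypothesis_text.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    previous = list(range(len(ref) + 1))  # row for the empty hypothesis
    for r, hyp_word in enumerate(hyp, start=1):
        current = [r]  # first column: r insertions
        for c, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[c - 1] + cost,  # match or substitute
                previous[c] + 1,         # insert
                current[c - 1] + 1,      # delete
            ))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("their fresh new results", "their first few results"))  # 0.5
print(word_error_rate("what extent", "what's the extent"))  # 1.0
```

In the second example the distance is 2 against a two-word reference, so the WER is 1.0 even though one word was recognized correctly.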

Reference: this YouTube video has a nice explanation:
https://www.youtube.com/watch?v=We3YDTzNXEk

Python code for WER and alignment:
https://github.com/zszyellow/WER-in-python