Word error rate (WER) is a common metric for the performance of a speech recognition system.

The formula is WER = (S + D + I) / N.

Given an original (reference) text of N words and a recognized (hypothesis) text,

  • S: number of substitutions
  • D: number of deletions
  • I: number of insertions
  • N: total number of words in the reference

  • The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level.
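
As a quick illustration of the formula, here is a minimal sketch that computes WER from edit counts that are already known. The function name `wer_from_counts` and the example counts are hypothetical, chosen only for illustration.

```python
def wer_from_counts(substitutions: int, deletions: int, insertions: int,
                    n_ref_words: int) -> float:
    """WER = (S + D + I) / N, where N is the number of words in the reference."""
    if n_ref_words <= 0:
        raise ValueError("the reference must contain at least one word")
    return (substitutions + deletions + insertions) / n_ref_words


# Hypothetical counts: 2 substitutions over a 4-word reference.
print(wer_from_counts(2, 0, 0, 4))  # 0.5

# Insertions are not bounded by N, so WER can exceed 1.
print(wer_from_counts(1, 0, 2, 2))  # 1.5
```

Because insertions count as errors but N only counts reference words, WER is not a percentage in the strict sense and has no upper bound of 1.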

Levenshtein Distance

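Levenshtein distance is the minimum number of single-symbol edits (substitutions, insertions, and deletions) needed to turn one sequence into another.
For instance, turning the reference string "abcdef" into the hypothesis string "azced" takes a minimum of three edits.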

(columns: reference "abcdef"; rows: hypothesis "azced")

       a  b  c  d  e  f
    0  1  2  3  4  5  6
a   1  0  1  2  3  4  5
z   2  1  1  2  3  4  5
c   3  2  2  1  2  3  4
e   4  3  3  2  2  2  3
d   5  4  4  3  2  3  3
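
Each cell d[i][j], where i is the position in the reference (column) and j is the position in the hypothesis (row), is the minimum of:

  • substitute: d[i-1][j-1] + 1 (diagonal; + 0 when the two symbols match)
  • insert: d[i][j-1] + 1 (from the cell above)
  • delete: d[i-1][j] + 1 (from the cell to the left)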
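
The three update rules can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the function name `levenshtein_table` is hypothetical. Note that the Python list is indexed row first (hypothesis position, then reference position), so that printing it reproduces the table layout used here.

```python
def levenshtein_table(ref, hyp):
    """Build the full edit-distance table.

    Rows correspond to prefixes of the hypothesis, columns to prefixes
    of the reference.
    """
    rows, cols = len(hyp) + 1, len(ref) + 1
    d = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        d[0][c] = c  # empty hypothesis: every reference symbol is deleted
    for r in range(rows):
        d[r][0] = r  # empty reference: every hypothesis symbol is inserted
    for r in range(1, rows):
        for c in range(1, cols):
            cost = 0 if hyp[r - 1] == ref[c - 1] else 1
            d[r][c] = min(
                d[r - 1][c - 1] + cost,  # match or substitution (diagonal)
                d[r - 1][c] + 1,         # insertion (from the cell above)
                d[r][c - 1] + 1,         # deletion (from the cell to the left)
            )
    return d


table = levenshtein_table("abcdef", "azced")
for row in table:
    print(row)
print(table[-1][-1])  # 3
```

The bottom-right cell holds the distance between the two full strings. The function accepts any pair of sequences, so it works on lists of words as well as on strings.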

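We can trace back the result from the bottom-right cell, moving at each step to the neighboring cell that produced the minimum. The path passes through the values 3, 2, 2, 1, 1, 0, 0 and gives the following edits (reference → hypothesis):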

f → d

d → (deleted)

b → z
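
The traceback can be sketched as below. This is one possible implementation under an assumed tie-breaking rule (prefer the diagonal move, then deletion, then insertion); other rules can yield a different but equally short alignment. The names `edit_table` and `trace_back` are hypothetical.

```python
def edit_table(ref, hyp):
    """Edit-distance table: rows follow the hypothesis, columns the reference."""
    rows, cols = len(hyp) + 1, len(ref) + 1
    d = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if r == 0 or c == 0:
                d[r][c] = r + c  # first row and first column
            else:
                cost = 0 if hyp[r - 1] == ref[c - 1] else 1
                d[r][c] = min(d[r - 1][c - 1] + cost,
                              d[r - 1][c] + 1,
                              d[r][c - 1] + 1)
    return d


def trace_back(ref, hyp):
    """Walk from the bottom-right cell to the top-left, collecting edits.

    Assumed tie-breaking: diagonal first, then deletion, then insertion.
    Edits are returned in traceback order (end of the strings first).
    """
    d = edit_table(ref, hyp)
    r, c = len(hyp), len(ref)
    edits = []
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            cost = 0 if hyp[r - 1] == ref[c - 1] else 1
            if d[r][c] == d[r - 1][c - 1] + cost:
                if cost:  # a real substitution, not a match
                    edits.append(("substitute", ref[c - 1], hyp[r - 1]))
                r, c = r - 1, c - 1
                continue
        if c > 0 and d[r][c] == d[r][c - 1] + 1:
            edits.append(("delete", ref[c - 1], None))
            c -= 1
        else:
            edits.append(("insert", None, hyp[r - 1]))
            r -= 1
    return edits


print(trace_back("abcdef", "azced"))
# [('substitute', 'f', 'd'), ('delete', 'd', None), ('substitute', 'b', 'z')]
```

For this pair of strings the sketch recovers the same three edits listed above, in the same end-to-start order.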

Modification for WER: word level

(columns: reference "their fresh new results"; rows: hypothesis "their first few results")

                     their    fresh      new  results
                0        1        2        3        4
their           1        0        1        2        3
first           2        1        1        2        3
few             3        2        2        2        3
results         4        3        3        3        2

new → few

fresh → first

(columns: reference "what extent"; rows: hypothesis "what's the extent")

                    what  extent
               0       1       2
what's         1       1       2
the            2       2       2
extent         3       3       2
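
The two word-level examples can be reproduced with a short sketch. It assumes plain whitespace tokenization with no case or punctuation normalization, which is a simplification; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by splitting on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")

    # Only the previous row of the table is needed to compute the next one.
    prev = list(range(len(ref) + 1))
    for r, hyp_word in enumerate(hyp, start=1):
        curr = [r]
        for c, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[c - 1] + cost,  # match or substitution
                            prev[c] + 1,         # insertion
                            curr[c - 1] + 1))    # deletion
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("their fresh new results", "their first few results"))  # 0.5
print(word_error_rate("what extent", "what's the extent"))  # 1.0
```

In the first example two of the four reference words are substituted, giving 2 / 4. In the second, two edits against a two-word reference give 2 / 2.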

Reference: this YouTube video has a nice explanation:

Python code for WER and alignment: