The task of CTC is to align and decode the output characters from the per-frame predictions of sequence models such as LSTMs and Transformers.

CTC is a simple loss function used for training deep learning models. Its goal is to find the alignment between an input X and an output Y. It does not require pre-aligned data, because it computes the probability of every possible alignment from X to Y. It only requires an input image (the image's feature matrix) and the corresponding ground-truth text.
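For a concrete sense of how this is typically wired up, here is a minimal sketch assuming PyTorch and its built-in nn.CTCLoss; all shapes and values below are illustrative placeholders, not taken from the text:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: T = time steps coming out of the encoder (e.g. image width),
# N = batch size, C = alphabet size + 1 (index 0 reserved for the blank).
T, N, C = 50, 4, 28

# Per-frame log-probabilities from a sequence model (LSTM/Transformer head).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Ground-truth text for each sample, padded to a maximum length of 10 labels.
targets = torch.randint(low=1, high=C, size=(N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)          # frames per input
target_lengths = torch.randint(1, 11, (N,), dtype=torch.long)  # labels per target

# CTC sums the probability of every alignment internally, so no per-frame
# character annotations are needed -- only the features and the text.
criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```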

How does it work?

Step 1: Encoding Text

Handling duplicate characters: CTC merges consecutive duplicate characters into one. For example, "ffoooccusss" becomes "focus".

Managing intrinsic duplicates: to handle words that genuinely contain repeated characters, CTC inserts a blank character ("-") between them. For example, "meet" can be encoded as "mm-ee-ee-t" or "mmm-e-ee-tt". During decoding, the blank indicates that the characters on either side of it should both be retained.
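A small sketch of these two rules as a collapse function (a hypothetical helper, not from the text) that merges repeats and then removes blanks, applied to the encodings above:

```python
def collapse(path: str, blank: str = "-") -> str:
    """Apply the CTC decoding rules: merge consecutive duplicates, then drop blanks."""
    merged = []
    prev = None
    for ch in path:
        if ch != prev:          # rule 1: merge runs of the same character
            merged.append(ch)
        prev = ch
    return "".join(c for c in merged if c != blank)  # rule 2: remove blanks

print(collapse("ffoooccusss"))   # -> "focus"
print(collapse("mm-ee-ee-t"))    # -> "meet"  (blank keeps the two e's separate)
print(collapse("mmm-e-ee-tt"))   # -> "meet"
```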

Step 2: Loss Calculation

The CTC loss is computed from the total probability of all alignments (paths) that collapse to the target text. For example, suppose the network outputs probabilities for the character "a" and the blank "-" over three time steps. The paths that collapse to the single character "a" are: aaa, a--, -a-, aa-, -aa, --a. Summing the probabilities of these paths gives the score of "a":

$$ P(a) = 0.4\times0.3\times0.4 + 0.4\times0.7\times0.6 + 0.4\times0.7 + 0.4\times0.3\times0.6 + 0.1\times0.3\times0.4 + 0.1\times0.7\times0.4 = 0.608 $$

The loss is then the negative log of this total probability:

$$ Loss = -\log_{10} 0.608 \approx 0.216 $$
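This kind of score can also be computed by brute force on a toy problem: enumerate every length-T path over {"a", "-"}, keep those that collapse to the target, and sum their path probabilities. The per-frame probability table below is a made-up placeholder, so the printed value depends on it rather than reproducing the numbers above:

```python
from itertools import product

# Hypothetical per-frame probabilities for the character "a" and the blank "-"
# over T = 3 time steps (placeholder values for illustration only).
probs = {
    "a": [0.4, 0.3, 0.4],
    "-": [0.6, 0.7, 0.6],
}
T = 3

def collapse(path, blank="-"):
    """Merge consecutive repeats, then drop blanks (the CTC decoding rules)."""
    merged = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in merged if c != blank)

# Sum the probability of every length-T path that decodes to the target "a".
score = 0.0
for path in product("a-", repeat=T):
    if collapse(path) == "a":
        p = 1.0
        for t, ch in enumerate(path):
            p *= probs[ch][t]
        score += p

print(score)  # total alignment probability of "a"; the CTC loss is -log(score)
```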

Step 3: Decoding Text

The decoding process for an unseen image occurs as follows: