Optical Character Recognition

Using Subgradient Methods Maximum Margin Structured Classification

Nathan Ratliff, J. Andrew Bagnell, Martin Zinkevich

One of the first applications of maximum margin structured classification (MMSC) studied in the seminal paper on the subject (Taskar et al., 2004) was optical character recognition (OCR). This approach leverages interrelationships among individual letters of a given word to improve classification accuracy on the entire sequence of handwritten characters. Taskar et al. proposed a simple sequential minimal optimization (SMO) algorithm for solving the MMSC quadratic program in the dual space. Unfortunately, this SMO variant, along with the collection of algorithms that followed it, all worked in the dual space making them prohibitively memory intensive on large data sets. Authors repeatedly reported limiting their experiments to relatively small subsets of the available data to accommodate this deficiency. In a set of experiments documented in (Ratliff et al., 2007c), we show that optimization routines within LEARCH successfully circumvent this problem.

The prediction accuracy cited by Taskar et al. for the OCR problem was computed using only 600 the original 6100 examples to train the system (Taskar et al., 2004). The linear scaling of LEARCH allowed us to train on a much larger subset of the data in our experiments without encountering memory problems. By training on a data set containing 5500 example, we attained a classification error of .13, which is about 35% lower than the error reported under SMO optimization. Additionally, this error is approximately the same as that reported for the more powerful quadratic and cubic kernels using the SMO algorithm.

We additionally used the OCR problem to demonstrate the importance of the structured margin. As a point of comparison, we implemented two additional gradient-based optimization procedures for this problem found in the literature: the Perceptron algorithm (Collins et al. 2002); and the subgradient method applied to an unstructured margin objective of the flavor presented in (LeCun et al., 1998). The above figure shows the results of our experiments in terms of both the per-character hamming loss and the full-word classification error. Each of ten validation runs are shown in the figures with the structured margin in green, the unstructured margin in red, and the Perceptron algorithm in blue. These results fit a very distinct pattern: the degree of overfitting increases as the complexity of the margin decreases.