Overview

The "ICFHR 2014 Handwritten Text Recognition on the tranScriptorium Dataset (HTRtS)" competition is organised in the framework of the ICFHR 2014 competitions by the Pattern Recognition and Human Language Technologies research centre. It aims to bring together researchers working on off-line Handwritten Text Recognition (HTR) and to provide them with a suitable benchmark for comparing their techniques on the task of transcribing typical historical handwritten documents.

The proposed dataset consists of a series of documents from the Bentham collection, which have been prepared in the tranScriptorium project. This dataset includes manuscripts written by Jeremy Bentham (1748-1832) himself over a period of sixty years, as well as fair copies written by Bentham's secretarial staff. Handwriting in this collection is complex enough to challenge the HTR software: manuscripts written by secretarial staff will provide variety, while Bentham's manuscripts are often complicated by deletions, marginalia, interlineal additions and other features (Gatos, 2014).

The dataset for this competition is composed of 433 pages; most of the pages consist of a single block with many difficulties for line detection and extraction (see page samples below). The ground truth is in PAGE format (Pletschacher, 2010) and will be provided annotated at line level in the PAGE files. To make participation easier, the data will be provided in several formats, as described below.
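As an illustration of line-level ground truth in PAGE files, the following minimal sketch reads the transcript of each text line from a PAGE XML document using only the Python standard library. The embedded sample document, its identifiers, and the schema version in its namespace are assumptions made for the example, not taken from the competition data; the function name read_line_transcripts is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE fragment (not from the competition data).
# The namespace version is an assumption; real files may use another one.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="r1_l1">
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l3">
        <TextEquiv><Unicode></Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_line_transcripts(page_xml):
    """Return {line id: transcript} for every TextLine in a PAGE document."""
    root = ET.fromstring(page_xml)
    # Take the namespace from the root tag so any schema version is accepted.
    namespace = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    ns = {"pc": namespace}
    transcripts = {}
    for line in root.findall(".//pc:TextLine", ns):
        # Only the TextEquiv that is a direct child of the line is read,
        # so word-level transcripts (if present) are ignored.
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", ns)
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        transcripts[line.get("id")] = text
    return transcripts


if __name__ == "__main__":
    for line_id, text in read_line_transcripts(SAMPLE_PAGE).items():
        print(line_id, repr(text))
```

A line whose transcript field is empty is returned as an empty string.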

The dataset is divided into three parts for the competition: training, validation, and test. The training part consists of about 9,200 lines, whereas the validation part consists of about 1,400 lines. The test set will be used for evaluating the submitted systems. The number of writers in each partition is unknown. The training and validation parts will be provided as soon as the competition opens, while the test part will be kept hidden and released in due time, solely to obtain the results to be evaluated and compared.


Description and goals

The systems entering this contest should try to obtain the most accurate recognition results in the test partition. The available data will consist of:

The first pair of items is redundant with the second and is provided for those who wish to try improving results by using specific image preprocessing and line extraction tools.

The test images, with the transcript fields empty, will eventually be provided in the same (redundant) formats for evaluation purposes (see schedule below).

A baseline system based on HTK hidden Markov modelling and SRILM language modelling will be provided, including a set of scripts to perform a basic training and test experiment (using the provided validation partition for testing). Participants can use this baseline as a starting point for their own systems and may improve it by changing one or more of the following steps:

Several submissions per participant will be allowed and all the results will be considered when presenting the competition results. In each submission, the participant must provide a brief description of the characteristics of the submitted system, emphasising the main differences between the submitted system and the baseline system. The final goal is to analyse the different proposals of the participants.

Evaluation modalities

The evaluation will be performed on the transcription result provided by each recognition system. The evaluation metric will be the Word Error Rate (WER) between the reference transcript and the transcript provided by the system for each line. The winner will be the system that obtains the lowest WER on the test set. A web-based platform will be available for the participants to check their test results.
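For reference, WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between reference and system transcripts, divided by the number of reference words. The sketch below follows that common definition. Whitespace tokenisation, case-sensitive matching, and pooling errors over all lines are assumptions of this example; the exact normalisation used by the competition's evaluation platform is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """WER in percent over (reference, hypothesis) line pairs.

    Assumption: words are whitespace-separated tokens, and errors are
    summed over all lines before dividing by the total reference words.
    """
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref_tokens = reference.split()
        hyp_tokens = hypothesis.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        ref_words += len(ref_tokens)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return 100.0 * errors / ref_words


if __name__ == "__main__":
    # Made-up example lines, not taken from the dataset.
    lines = [
        ("the quick brown fox", "the quick brown dog"),
        ("jumps over", "jumps over"),
    ]
    print(f"WER = {word_error_rate(lines):.1f}%")
```

With the two made-up lines above there is one substitution against six reference words, so the sketch reports a WER of about 16.7%.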

Two tracks are planned in this competition:

The baseline system will be prepared only for the restricted track.

Registration and access to data

To register for this contest, send an e-mail to jandreu_AT_dsic_DOT_upv_DOT_es with the subject "ICFHR 2014 HTRtS competition registration". In the message you must provide the following data:

A username and password will be given to each registered participant, which will grant access to the data and the evaluation page.

The dataset is currently available on the tranScriptorium web page.

Registered participants

  1. A2iA
  2. CITlab
  3. LIMSI/A2iA
  4. DIVA
  5. I2R-NUS
  6. LI-RFAI
  7. IUPR

Results

Best WER / CER (in %) of the submitted systems on the test set


Restricted track Unrestricted track
A2iA 8.6 / 2.9
CITlab 14.6 / 5.0
LIMSI 15.0 / 5.5 11.0 / 3.9

More details about the competition, the submitted systems and the results will be available in the ICFHR 2014 proceedings.

Schedule

Organisers

References