The "ICFHR 2014 Handwritten Text Recognition on the tranScriptorium Dataset (HTRtS)" competition is organised in the framework of the ICFHR 2014 competitions by the Pattern Recognition and Human Language Technologies research centre. It aims to bring together researchers working on off-line Handwritten Text Recognition (HTR) and to provide them with a suitable benchmark for comparing their techniques on the task of transcribing typical historical handwritten documents.
The proposed dataset consists of a series of documents from the Bentham collection, which have been prepared in the tranScriptorium project. This dataset includes manuscripts written by Jeremy Bentham (1748-1832) himself over a period of sixty years, as well as fair copies written by Bentham's secretarial staff. Handwriting in this collection is complex enough to challenge the HTR software: manuscripts written by secretarial staff will provide variety, while Bentham's manuscripts are often complicated by deletions, marginalia, interlineal additions and other features (Gatos, 2014).
The dataset for this competition is composed of 433 pages; most of the pages consist of a single block of text that poses many difficulties for line detection and extraction (see page samples below). The ground truth is in PAGE format (Pletschacher, 2010) and is annotated at line level in the PAGE files. To make participation easier, the data will be provided in several formats, as described below.
The dataset is divided for the competition into three parts: training, validation, and test. The training part consists of about 9,200 lines, whereas the validation part consists of about 1,400 lines. The test set will be used for evaluating the submitted systems. The number of writers in each partition is unknown. The training and validation parts will be provided as soon as the competition opens, while the test part will be kept hidden and released in due time, solely to obtain the results to be evaluated and compared.
Description and goals
The systems entering this contest should try to obtain the most accurate recognition results in the test partition. The available data will consist of:
- The original images of all the training and validation pages
- The PAGE file corresponding to each page image. For each text line in this image, the PAGE file contains a bounding polygon and the corresponding correct transcript.
- The preprocessed and extracted line images for all the lines of the training and validation sets in grayscale (see examples below)
- The corresponding transcripts of each of these lines
The first pair of items is redundant with the second and is provided for those who wish to try improving results by using specific image preprocessing and line extraction tools.
The test images, with the transcript fields empty, will eventually be provided in the same (redundant) formats for evaluation purposes (see schedule below).
A baseline system based on HTK hidden Markov modelling and SRILM language modelling will be provided, including a set of scripts to perform a basic training and test experiment (using the provided validation partition for testing). Participants can use this baseline as a starting point for their own systems and may improve on it by changing one or more of the following steps:
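As an illustration of how the line-level ground truth might be read, the following minimal Python sketch extracts the bounding polygon and transcript of each text line from a PAGE document. It is not part of the competition material. It assumes that each `TextLine` element carries a `Coords` element (either with a `points` attribute or with `Point` children, depending on the schema version) and a `TextEquiv`/`Unicode` transcript; the sample document, its namespace version, file name and identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{ns}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def read_lines(page_xml):
    """Return a list of (line_id, polygon, transcript) for every TextLine."""
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        polygon, transcript = [], ""
        for child in elem:
            name = local_name(child.tag)
            if name == "Coords":
                points = child.get("points")
                if points:
                    # Assumed newer layout: points="x1,y1 x2,y2 ..."
                    polygon = [tuple(int(v) for v in p.split(","))
                               for p in points.split()]
                else:
                    # Assumed older layout: <Point x="..." y="..."/> children
                    polygon = [(int(p.get("x")), int(p.get("y")))
                               for p in child]
            elif name == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        transcript = sub.text or ""
        lines.append((elem.get("id"), polygon, transcript))
    return lines


# Hypothetical, hand-written sample; real files will be larger.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <TextLine id="r1_l1">
        <Coords points="10,10 90,10 90,30 10,30"/>
        <TextEquiv><Unicode>an example line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for line_id, polygon, transcript in read_lines(SAMPLE):
    print(line_id, polygon, transcript)
```

Matching on local element names rather than on a fixed namespace keeps the sketch independent of the exact PAGE schema version used in the distributed files.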
- page-level pre-processing and line extraction
- line pre-processing and normalisation
- feature extraction
- recognition system and/or approach
- types of character and/or language models
Several submissions per participant will be allowed and all the results will be considered when presenting the competition results. In each submission, the participant must provide a brief description of the characteristics of the submitted system, emphasising the main differences between the submitted system and the baseline system. The final goal is to analyse the different proposals of the participants.
The evaluation will be performed on the transcription result provided by each recognition system. The evaluation metric will be the Word Error Rate (WER) between the reference transcript and the transcript produced by the system for each line. The winner will be the system that obtains the lowest WER on the test set. A web-based platform will be available for the participants to check their test results.
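For reference, WER is commonly computed as the word-level edit distance (substitutions, insertions and deletions) between hypothesis and reference, divided by the number of reference words. The sketch below is one possible Python implementation, not the official scoring tool. It assumes whitespace tokenisation and that errors are accumulated over all lines before dividing; the official evaluation may tokenise or aggregate differently. The same routine applied to characters gives the Character Error Rate (CER).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Percentage of edit errors over all lines, relative to reference length."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of lines")
    errors = sum(edit_distance(tokenize(r), tokenize(h))
                 for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return 100.0 * errors / total


def wer(refs, hyps):
    """Word Error Rate, assuming whitespace-separated words."""
    return error_rate(refs, hyps, str.split)


def cer(refs, hyps):
    """Character Error Rate, counting every character including spaces."""
    return error_rate(refs, hyps, list)


# Hypothetical reference and system transcripts, one string per line.
refs = ["the cat sat", "on the mat"]
hyps = ["the cat sat", "on a mat"]
print(f"WER = {wer(refs, hyps):.1f}%  CER = {cer(refs, hyps):.1f}%")
```

Accumulating errors over the whole test set, rather than averaging per-line rates, prevents very short lines from dominating the score.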
Two tracks are planned in this competition:
- Restricted track: in this track the participants can use only the data provided by the organisers for training and tuning their systems
- Unrestricted track: in this track the participants can use any data of their choice
The baseline system will be prepared only for the restricted track.
Registration and access to data
To register for this contest, send an e-mail to jandreu_AT_dsic_DOT_upv_DOT_es with the subject "ICFHR 2014 HTRtS competition registration". In the message you must provide the following data:
- Group name and acronym
- Participants and e-mail
- Contact person
The dataset is currently available on the tranScriptorium web page.
Best WER / CER of the submitted systems on the test set
|Restricted track||Unrestricted track|
|A2iA||8.6 / 2.9|
|CITlab||14.6 / 5.0|
|LIMSI||15.0 / 5.5||11.0 / 3.9|
More details about the competition, the submitted systems and the results will be available in the ICFHR 2014 proceedings.
Schedule
- 16 Feb 2014 Competition opens, start of registration period, training and validation data available, baseline system available.
- 1 Apr 2014 Registration deadline (no further participants will be admitted).
- 7 Apr 2014 Test data available
- 14 Apr 2014 Deadline for systems results
- 19 Apr 2014 Deadline for sending short description of the submitted systems
Organisers
- Dr. Joan Andreu Sánchez, Pattern Recognition and Human Language Technologies
- Dr. Verónica Romero, Pattern Recognition and Human Language Technologies
- Dr. Alejandro H. Toselli, Pattern Recognition and Human Language Technologies
- Dr. Enrique Vidal, Pattern Recognition and Human Language Technologies
References
- (Pletschacher, 2010) S. Pletschacher and A. Antonacopoulos, "The PAGE (page analysis and ground-truth elements) format framework," in Proc. ICPR, 2010, pp. 257-260.
- (Gatos, 2014) B. Gatos, G. Louloudis, T. Causer, K. Grint, V. Romero, J. A. Sánchez, A. H. Toselli, and E. Vidal, "Ground-truth production in the transcriptorium project," in 11th IAPR International Workshop on Document Analysis Systems (DAS) 2014.