The "ICDAR2015 Competition HTRtS: Handwritten Text Recognition on the tranScriptorium Dataset" competition is organised in the framework of the ICDAR 2015 competitions by the Pattern Recognition and Human Language Technologies research centre with the collaboration of the tranScriptorium partners. This contest aims to bring together researchers working on off-line Handwritten Text Recognition (HTR) and provide them with a suitable benchmark to compare their techniques on the task of transcribing typical historical handwritten documents. The first edition of this contest, HTRtS2014, was organised at ICFHR 2014 (Sánchez, 2014).
The proposed dataset consists of a series of documents from the Bentham collection, which has been prepared in the tranScriptorium project. This dataset includes manuscripts written by Jeremy Bentham (1748-1832) himself over a period of sixty years, as well as fair copies written by Bentham's secretarial staff. Handwriting in this collection is complex enough to challenge the HTR software: manuscripts written by secretarial staff will provide variety, while Bentham's manuscripts are often complicated by deletions, marginalia, interlineal additions and other features (Gatos, 2014). The data used in this contest is closely related to the data used in the ICDAR2015 Competition on Keyword Spotting for Handwritten Documents.
The dataset for this competition is composed of 796 pages; most of the pages consist of a single block with many difficulties for line detection and extraction (see page samples below). The dataset is divided into 3 batches for the competition: 2 batches for training and 1 batch for test. The number of writers is unknown.
The first batch is composed of 433 pages. This set was used in the HTRtS2014 contest. The ground-truth in this set is in PAGE format (Pletschacher, 2010) and it will be provided annotated at line level in the PAGE files. To make participation easier, the data will be provided in several formats, as described below.
The second batch is composed of 313 pages. The ground-truth in this batch is in PAGE format but it will be provided annotated at text block level. The line transcripts for the blocks will be provided in a separate file, with a newline character at the end of each line. The idea of this second batch is that the entrants in the contest use this training set with their own methods. This training set simulates a real situation in which transcripts exist for a collection but the lines in the images are not annotated in correspondence with the transcripts.
Training data will be provided as soon as the competition becomes open.
The third batch is a test set of 50 pages that will be kept hidden and released in due time, solely to obtain the results to be evaluated and compared.
Description and goals
The systems entering this contest should try to obtain the most accurate recognition results in the test partition.
The available data for the first batch will consist of:
- The original images of all the training pages
- The PAGE file corresponding to each page image. For each text line in this image, the PAGE file contains a bounding polygon and the corresponding correct transcript.
- The preprocessed and extracted line images for all the lines of the training and validation sets in grayscale (see examples below)
- A sequence of feature vectors for each line processed according to (Kozielski, 2013)
- The corresponding transcripts of each of these lines
The first two items are redundant with the third and fifth items and are provided for those who wish to try improving results by using specific image preprocessing and line extraction tools. The fourth item is provided for those who do not wish to try improving results at the pre-processing and feature extraction level.
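For participants who want to work directly from the PAGE files, the following is a minimal sketch of how the bounding polygon and transcript of each text line might be read. It assumes the usual PAGE element names (TextLine, Coords, TextEquiv, Unicode) and ignores the XML namespace, since the schema version of the distributed files is not stated here. The sample document, the namespace URL in it and the function names are illustrative assumptions, not part of the competition material.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_coords(coords):
    """Read a polygon from a Coords element.

    Handles both a 'points' attribute ("x1,y1 x2,y2 ...") and
    child Point elements with 'x' and 'y' attributes.
    """
    points = coords.get("points")
    if points:
        return [tuple(int(v) for v in p.split(",")) for p in points.split()]
    return [(int(pt.get("x")), int(pt.get("y")))
            for pt in coords if local(pt.tag) == "Point"]


def read_page_lines(xml_text):
    """Return a list of (line_id, polygon, transcript) tuples."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        polygon, transcript = [], ""
        # Only direct children are inspected, so word-level
        # annotations nested inside the line are not mixed in.
        for child in elem:
            name = local(child.tag)
            if name == "Coords":
                polygon = parse_coords(child)
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        transcript = sub.text or ""
        lines.append((elem.get("id"), polygon, transcript))
    return lines


# Hypothetical sample, written by hand for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.jpg" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Coords points="0,0 90,0 90,20 0,20"/>
        <TextEquiv><Unicode>an example line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for line_id, polygon, transcript in read_page_lines(SAMPLE):
    print(line_id, polygon, transcript)
```

For a real page, the file content would be read from disk and passed to read_page_lines; the polygon can then be used to crop the line from the page image with any image library.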
The available data for the second batch will consist of:
- The original images of all the training pages
- The PAGE file corresponding to each page image. The PAGE file contains the bounding polygon for the text regions, not for the line regions
- For the text regions, a separate file with the corresponding correct transcripts will be provided
The test images, with the transcript fields empty, will eventually be provided in the same (redundant) formats as the first batch for evaluation purposes (see schedule below).
A baseline system based on HTK hidden Markov models and SRILM language modelling will be provided, including a set of scripts to perform a basic training and test experiment (using the first batch). The participants can use this baseline system as an initial approach to their own systems, where they will be allowed to improve this baseline by changing one or several of the following steps:
- page-level pre-processing and line extraction
- line pre-processing and normalisation
- feature extraction
- recognition system and/or approach
- types of character, lexical and/or language models
Several submissions per participant will be allowed and all the results will be considered when presenting the competition results. In each submission, the participant must provide a brief description of the submitted system, emphasising its main characteristics. The final goal is to analyse the different proposals of the participants.
The evaluation will be performed on the transcription results provided by each recognition system. The evaluation metric will be the Word Error Rate (WER) between the reference transcript and the transcript provided by the system for each line. The winner will be the system which obtains the lowest WER on the test set. A web-based platform will be available for the participants to check their test results.
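As an illustration of the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between reference and hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenisation, case-sensitive matching and accumulation of errors over all lines before dividing. These choices and the function names are assumptions made for the example; the official scoring tool may differ in such details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of words."""
    # prev[j] holds the distance between the reference prefix seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(pairs):
    """WER over (reference, hypothesis) line pairs, as a fraction."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += edit_distance(ref_words, hypothesis.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("the reference contains no words")
    return errors / words


# Hypothetical line pairs: (reference transcript, system output).
pairs = [
    ("the quick brown fox", "the quick brown box"),
    ("jumps over", "jumps"),
]
print(f"WER = {100 * word_error_rate(pairs):.1f}%")
```

Accumulating errors over all lines weights long lines more heavily than averaging per-line rates would, which is the usual convention when reporting a single figure for a test set.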
Two tracks are planned in this competition:
- Restricted track: in this track the participants can use only the data provided by the organisers for training and tuning their systems
- Unrestricted track: in this track the participants can use any data of their choice
The baseline system will be prepared only for the restricted track. It is mandatory that the entrants participating in the "Unrestricted track" participate in the "Restricted track". The idea of this obligation is to be able to compare several systems in analogous training conditions.
Registration and access to data
To register in this contest send an e-mail to jandreu_AT_prhlt_DOT_upv_DOT_es with the subject ICDAR 2015 HTRtS competition registration. In the message you must provide the following data:
- Group name and acronym
- Participants and e-mail
- Contact person
Schedule
- 19 Jan 2015 Competition opens, start of registration period, training data available, baseline system available.
- 31 March 2015 Registration deadline (no more participants will be admitted).
- 31 March 2015 Test data available
- 7 Apr 2015 Deadline for systems results
- 15 Apr 2015 Deadline for sending short description of the submitted systems
Organisers
- Dr. Joan Andreu Sánchez, Pattern Recognition and Human Language Technologies
- Dr. Verónica Romero, Pattern Recognition and Human Language Technologies
- Dr. Alejandro H. Toselli, Pattern Recognition and Human Language Technologies
- Dr. Enrique Vidal, Pattern Recognition and Human Language Technologies
References
- (Pletschacher, 2010) S. Pletschacher and A. Antonacopoulos, "The PAGE (page analysis and ground-truth elements) format framework," in Proc. ICPR, 2010, pp. 257-260.
- (Kozielski, 2013) M. Kozielski, P. Doetsch, and H. Ney. "Improvements in RWTH's system for off-line handwriting recognition," in Proc. ICDAR, 2013, pp. 935-939.
- (Gatos, 2014) B. Gatos, G. Louloudis, T. Causer, K. Grint, V. Romero, J. A. Sánchez, A. H. Toselli, and E. Vidal, "Ground-truth production in the transcriptorium project," in Proc. DAS, 2014, pp. 237-241.
- (Sánchez, 2014) J. A. Sánchez, V. Romero, A. H. Toselli, and E. Vidal, "ICFHR2014 competition on handwritten text recognition on transcriptorium datasets (HTRtS)," in Proc. ICFHR, 2014, pp. 181-186.