The "ICFHR2016 Competition on Handwritten Text Recognition on the READ Dataset" is organized in the framework of the ICFHR 2016 competitions by the Pattern Recognition and Human Language Technologies research centre in collaboration with the READ partners. This contest aims to bring together researchers working on off-line Handwritten Text Recognition (HTR) and to provide them with a suitable benchmark for comparing their techniques on the task of transcribing typical historical handwritten documents. Previous editions of this contest were organized at ICFHR 2014 (Sánchez, 2014) and at ICDAR 2015 (Sánchez, 2015).
The proposed dataset consists of a subset of documents from the Ratsprotokolle collection, which is composed of minutes of the council meetings held from 1470 to 1805 (about 30,000 pages) and will be used in the READ project. This dataset is written in Early Modern German. The number of writers is unknown. Handwriting in this collection is complex enough to challenge HTR software.
The dataset for this competition is composed of 450 pages; most of the pages consist of a single text block that poses many difficulties for line detection and extraction (see page samples below). The dataset is divided into two batches for the competition: one batch for training and one batch for testing.
The first batch is composed of 400 pages. The ground truth for this set is in PAGE format (Pletschacher, 2010) and will be provided annotated at line level in the PAGE files. To make participation easier, several tools will be provided for extracting the lines, as described below. Training data will be available from 1 March 2016.
The second batch is a test set of 50 pages that will be kept hidden and released in due time, solely so that participants can produce the results to be evaluated and compared.
Description and goals
The systems entering this contest should try to obtain the most accurate recognition results in the test partition.
The available data for the first batch will consist of:
- The original images of all the training pages
- The PAGE file corresponding to each page image. For each text line in this image, the PAGE file contains a baseline and an automatically obtained bounding polygon (Romero, 2015), and the corresponding diplomatic transcript. All baselines have been checked and corrected manually.
A series of tools will be provided for extracting the lines from the polygon information and the corresponding transcript.
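As an illustration of the information described above, the following minimal sketch reads the text lines of a PAGE file with the Python standard library. It is not one of the official tools. It assumes that coordinates are stored in a `points` attribute of the form `x1,y1 x2,y2 ...`; files that encode points differently would need adjustments. The sample XML, the function names and the transcript text are made up for the example.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the code does not depend on the schema version."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer pairs (assumed encoding)."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_text_lines(page_xml):
    """Yield (line_id, polygon, baseline, transcript) for each TextLine element."""
    root = ET.fromstring(page_xml)
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        coords = child(elem, "Coords")
        baseline = child(elem, "Baseline")
        text_equiv = child(elem, "TextEquiv")
        unicode_el = child(text_equiv, "Unicode") if text_equiv is not None else None
        yield (
            elem.get("id"),
            parse_points(coords.get("points", "")) if coords is not None else [],
            parse_points(baseline.get("points", "")) if baseline is not None else [],
            (unicode_el.text or "") if unicode_el is not None else "",
        )


# Made-up sample page with a single text line, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1">
        <Coords points="5,5 95,5 95,20 5,20"/>
        <Baseline points="5,18 95,18"/>
        <TextEquiv><Unicode>made-up transcript</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

lines = list(read_text_lines(SAMPLE))
for line_id, polygon, baseline, transcript in lines:
    print(line_id, polygon, baseline, transcript)
```

The polygon can then be used to crop the line image, and the transcript serves as the training label for that line. For a test page, the transcript field would simply come back empty.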
The test images, with the transcript fields empty, will eventually be provided in the same format as the first batch for evaluation purposes (see schedule below).
A baseline system based on HTK hidden Markov models and SRILM language modelling will be provided, including a set of scripts to perform a basic training and test experiment. The participants can use this baseline system as a starting point for their own systems and may improve it by changing one or several of the following steps:
- page-level pre-processing and line extraction
- line pre-processing and normalization
- feature extraction
- recognition system and/or approach
- types of character, lexical and/or language models
Several submissions per participant will be allowed and all the results will be considered when presenting the competition results. The tokenization of the transcripts in each submission must match that of the training data as closely as possible. With each submission, the participant must provide a brief description of the submitted system, emphasizing its main characteristics. The final goal is to analyze the different proposals of the participants.
The evaluation will be performed on the transcription results provided by each recognition system. The evaluation metric will be a linear combination of the Word Error Rate (WER) and the Character Error Rate (CER) (50% each) between the reference transcript and the transcript provided by the system for each line. The winner will be the system that obtains the lowest value of the linear combination on the test set. A web-based platform will be available for the participants to submit their test results.
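The metric described above can be sketched as follows. This is only an illustration, not the official evaluation tool. It assumes that words are obtained by splitting on whitespace, that spaces count as characters for the CER, and that each rate is the total number of edit operations divided by the total reference length over all lines. The official scoring may normalize or tokenize differently, and all function names and example lines are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (all operations cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit operations over total reference length (assumed normalization)."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    length = sum(len(tokenize(r)) for r in refs)
    return errors / length


def combined_score(refs, hyps, weight=0.5):
    """Return (weight * WER + (1 - weight) * CER, WER, CER)."""
    wer = error_rate(refs, hyps, str.split)  # whitespace tokenization (assumption)
    cer = error_rate(refs, hyps, list)       # every character, including spaces
    return weight * wer + (1 - weight) * cer, wer, cer


# Made-up reference and hypothesis lines, for illustration only.
refs = ["the council met", "on monday"]
hyps = ["the counsil met", "on monday"]
score, wer, cer = combined_score(refs, hyps)
print(f"WER={wer:.4f} CER={cer:.4f} score={score:.4f}")
```

In this example one of the five reference words and one of the 24 reference characters is wrong, so the WER is 1/5, the CER is 1/24, and the combined score is their average.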
Two tracks are planned in this competition:
- Restricted track: in this track the participants can use only the data provided by the organizers for training and tuning their systems
- Unrestricted track: in this track the participants can use any data of their choice
The baseline system will be prepared only for the restricted track. Entrants participating in the "Unrestricted track" must also participate in the "Restricted track". The purpose of this requirement is to allow the systems to be compared under analogous training conditions.
Registration and access to data
To register for this contest, send an e-mail to jandreu_AT_prhlt_DOT_upv_DOT_es with the subject "ICFHR 2016 HTR competition registration". The message must provide the following data:
- Group name and acronym
- Participants and e-mail
- Contact person
The data are now available to registered participants.
The baseline system is now available.
Registered participants
- RWTH Aachen University, Germany
- Telecom ParisTech (ENST), France and University of Balamand (UOB), Lebanon
- LITIS, France
- Q2B, Sweden
- Computational Intelligence Laboratory (CILAB), China
- BYU Computer Science Department, USA
- A2ia, France
- 3@CNU Chonnam National University, South Korea
- Digitalizzazione di Archivi, BIblioteche e MUSei - D.A.BI.MUS.
Schedule
- 1 March 2016 Competition opens, start of registration period, training data available, baseline system available.
- 31 May 2016 Registration deadline (no further participants will be admitted).
- 12 June 2016 Test data available
- 24 June 2016 Deadline for systems results
- 26 June 2016 Deadline for sending a short description of the submitted systems
- 23-26 October 2016 Winners and final ranking of all teams will be made public at the ICFHR 2016 conference.
Organizers
- Dr. Joan Andreu Sánchez, Pattern Recognition and Human Language Technologies
- Dr. Verónica Romero, Pattern Recognition and Human Language Technologies
- Dr. Alejandro H. Toselli, Pattern Recognition and Human Language Technologies
- Dr. Enrique Vidal, Pattern Recognition and Human Language Technologies
References
- (Pletschacher, 2010) S. Pletschacher and A. Antonacopoulos, "The PAGE (page analysis and ground-truth elements) format framework," in Proc. ICPR, 2010, pp. 257-260.
- (Romero, 2015) V. Romero, J. A. Sánchez, V. Bosch, K. Depuydt, and J. de Does, "Influence of text line segmentation in handwritten text recognition," in 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.
- (Sánchez, 2014) J. A. Sánchez, V. Romero, A. H. Toselli, and E. Vidal, "ICFHR2014 competition on handwritten text recognition on tranScriptorium datasets (HTRtS)," in Proc. ICFHR, 2014, pp. 181-186.
- (Sánchez, 2015) J. A. Sánchez, A. H. Toselli, V. Romero, and E. Vidal, "ICDAR 2015 competition HTRtS: handwritten text recognition on the tranScriptorium dataset," in 13th International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 1166-1170.