The "ICFHR2016 Competition on Handwritten Text Recognition on the READ Dataset" is organized in the framework of the ICFHR 2016 competitions by the Pattern Recognition and Human Language Technologies research centre, in collaboration with the READ partners. This contest aims to bring together researchers working on off-line Handwritten Text Recognition (HTR) and to provide them with a suitable benchmark for comparing their techniques on the task of transcribing typical historical handwritten documents. Previous editions of this contest were organized at ICFHR 2014 (Sánchez, 2014) and ICDAR 2015 (Sánchez, 2015).

The proposed dataset consists of a subset of documents from the Ratsprotokolle collection, composed of minutes of the council meetings held from 1470 to 1805 (about 30,000 pages), which will be used in the READ project. The documents are written in Early Modern German. The number of writers is unknown. Handwriting in this collection is complex enough to challenge the HTR software.

The dataset for this competition is composed of 450 pages; most pages consist of a single text block that poses many difficulties for line detection and extraction (see page samples below). The dataset is divided into two batches for the competition: one for training and one for testing.

The first batch is composed of 400 pages. The ground truth for this set is provided in PAGE format (Pletschacher, 2010), annotated at line level. To make participation easier, several tools will be provided for extracting the lines, as described below. Training data will be available from March 1st.

The second batch is a test set of 50 pages that will be kept hidden and released in due time just to obtain the results to be evaluated and compared.

Description and goals

The systems entering this contest should try to obtain the most accurate recognition results in the test partition.

The available data for the first batch will consist of:

  1. The original images of all the training pages
  2. The PAGE file corresponding to each page image. For each text line in this image, the PAGE file contains a baseline and an automatically obtained bounding polygon (Romero, 2015), and the corresponding diplomatic transcript. All baselines have been checked and corrected manually.

A series of tools will be provided for extracting the lines from the polygon information and the corresponding transcript.
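As a rough illustration of what such an extraction involves (this is not the organizers' tooling), the sketch below reads a PAGE file and collects, for each text line, its polygon, baseline, bounding box, and transcript. It assumes the PAGE variant in which coordinates are stored in a `points` attribute of the form "x1,y1 x2,y2 ..."; other PAGE schema versions may encode polygons differently and would need a different parser. The sample XML, the file name inside it, and the helper names (`extract_lines`, `parse_points`, `child`, `local_name`) are made up for this example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-made PAGE-like sample (hypothetical content, not taken from the dataset).
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="50,50 950,50 950,1450 50,1450"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,160 100,160"/>
        <Baseline points="100,150 900,150"/>
        <TextEquiv><Unicode>Example line one</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,180 880,180 880,240 100,240"/>
        <Baseline points="100,230 880,230"/>
        <TextEquiv><Unicode></Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the XML namespace so the lookup does not depend on the schema URI."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points_attr):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points_attr.split()]


def child(element, name):
    """Return the first direct child with the given local name, or None."""
    for c in element:
        if local_name(c.tag) == name:
            return c
    return None


def extract_lines(page_xml):
    """Collect id, polygon, baseline, bounding box and transcript per text line."""
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        coords = child(elem, "Coords")
        baseline = child(elem, "Baseline")
        text_equiv = child(elem, "TextEquiv")
        unicode_el = child(text_equiv, "Unicode") if text_equiv is not None else None
        polygon = parse_points(coords.get("points")) if coords is not None else []
        xs = [x for x, _ in polygon]
        ys = [y for _, y in polygon]
        lines.append({
            "id": elem.get("id"),
            "polygon": polygon,
            "baseline": parse_points(baseline.get("points")) if baseline is not None else [],
            # Axis-aligned box (left, top, right, bottom), usable for cropping the image.
            "bbox": (min(xs), min(ys), max(xs), max(ys)) if polygon else None,
            # Empty transcripts (as in the test batch) become empty strings.
            "text": (unicode_el.text or "") if unicode_el is not None else "",
        })
    return lines


if __name__ == "__main__":
    for line in extract_lines(SAMPLE_PAGE):
        print(line["id"], line["bbox"], repr(line["text"]))
```

The bounding box is only a coarse crop region; a system that wants to suppress neighbouring lines would additionally mask the pixels outside the polygon.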

The test images, with the transcript fields empty, will eventually be provided in the same format as the first batch for evaluation purposes (see schedule below).

A baseline system based on HTK hidden Markov models and SRILM language modelling will be provided, including a set of scripts to perform a basic training and test experiment. Participants can use this baseline as a starting point for their own systems and may improve it by changing one or several of the following steps:

Several submissions per participant will be allowed, and all results will be considered when presenting the competition results. Regarding tokenization, the transcripts in each submission have to be as similar as possible to the training data. With each submission, the participant must provide a brief description of the submitted system, emphasizing its main characteristics. The final goal is to analyze the different proposals of the participants.

Evaluation modalities

The evaluation will be performed on the transcription results provided by each recognition system. The evaluation metric will be a linear combination of the Word Error Rate (WER) and the Character Error Rate (CER), weighted 50% each, computed between the reference transcript and the system transcript of each line. The winner will be the system that obtains the lowest value of this linear combination on the test set. A web-based platform will be available for the participants to submit their test results.
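For orientation, the combined metric can be sketched as follows. This is not the official scoring script: it assumes that words are obtained by whitespace splitting, that spaces count as characters, and that edit distances are summed over all lines before dividing by the total reference length. The function names and the example lines are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance over all lines divided by total reference length."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref), tokenize(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("reference transcripts are empty")
    return errors / total


def combined_score(refs, hyps, wer_weight=0.5):
    """Return (score, WER, CER) with score = w * WER + (1 - w) * CER."""
    wer = error_rate(refs, hyps, str.split)  # assumption: whitespace tokenization
    cer = error_rate(refs, hyps, list)       # assumption: spaces count as characters
    return wer_weight * wer + (1.0 - wer_weight) * cer, wer, cer


if __name__ == "__main__":
    # Made-up example lines, one reference and one hypothesis per line.
    refs = ["der rat hat beschlossen", "am montag"]
    hyps = ["der rat hat beschlosen", "am montag"]
    score, wer, cer = combined_score(refs, hyps)
    print(f"WER={wer:.4f} CER={cer:.4f} score={score:.4f}")
```

In this example one of six reference words and one of 32 reference characters are wrong, so WER = 1/6, CER = 1/32, and the combined score is 0.5 · 1/6 + 0.5 · 1/32 = 19/192 ≈ 0.099.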

Two tracks are planned in this competition:

The baseline system will be prepared only for the restricted track. Entrants participating in the "Unrestricted track" must also participate in the "Restricted track". The purpose of this requirement is to allow several systems to be compared under analogous training conditions.

Registration and access to data

To register for this contest, send an e-mail to jandreu_AT_prhlt_DOT_upv_DOT_es with the subject "ICFHR 2016 HTR competition registration". In the message you must provide the following data:

A username and password will be given to each registered participant, which will grant access to the data and the evaluation page.

Data is now available to registered participants.

Registered participants

  1. RWTH Aachen University, Germany
  2. Telecom ParisTech (ENST), France and University of Balamand (UOB), Lebanon
  3. LITIS, France
  4. Q2B, Sweden
  5. Computational Intelligence Laboratory (CILAB), China
  6. BYU Computer Science Department, USA
  7. A2ia, France
  8. 3@CNU Chonnam National University, South Korea
  9. Digitalizzazione di Archivi, BIblioteche e MUSei - D.A.BI.MUS.