Navigation auf uzh.ch

Suche

Department of Computational Linguistics Text Technologies

Data

Here I would like to share with you data that is free to use to replicate experiments I have done on my own or together with my colleagues. Please mention the use of the data appropriately if it leads to a publication of yours 😊.

Please also have a look at my repos:

zenodo logo
huggingface logo
 
 
github logo

 

Ground truth for Neue Zürcher Zeitung black letter period

NZZ title page

The Neue Zürcher Zeitung (NZZ) had been publishing in black letter from its very first issue in 1780 until 1947. From this time period, we randomly sampled one front page per year, resulting in a total of 167 pages. We chose front pages because they typically contain highly relevant material and because we wanted to make sure not to sample pages containing exclusively advertisements or stock information. During certain periods, the NZZ was published several times a day, and there were supplements, too. Due to incomplete metadata, the sample includes front pages from supplements.

We then manually corrected the text of these 167 pages. The so-created ground truth can serve several purposes, but was initially intended to improve OCR.

The ground truth can be downloaded from several sources:

  • GitHub
  • DOI

The Gwalther Handwriting Ground Truth

This is ground truth for Ruolph Gwalther's (1519-1586) handwriting taken from his book "Lateinische" Gedichte", where he accumulated writings between 1540 and 1580.

Data collection and ground truth creation:

At the time we collected the data, we found 150 images with corresponding transcriptions by Peter Stotz on e-manuscripta (reference: Gwalther, Rudolf: Lateinische Gedichte. Zürich, 1540-1580. Zentralbibliothek Zürich, Ms D 152, https://doi.org/10.7891/e-manuscripta-26750 / Public Domain Mark) . We removed 8 images with too many corrections or vertical texts. Next, we uploaded the images into the Transkribus platform, applied the line recognition tool and manually copied the transcribed text lines into the recognised line boxes. During this process, we made some corrections, which were mainly due to inconsistencies in punctuation and capitalised letters.

  • GitHub Logo
  • Zenodo Badge

     

Bullinger Correspondence HTR Data

This dataset contains 165,673 image and corresponding text line files (.png for images and .txt for the texts) in a random 80/10/10 training, validation and test set split. The source is the extensive correspondence of Swiss reformer Heinrich Bullinger (1504-1575) and his over 800 different correspondents. It therefore contains great variety in handwriting styles. Furthermore, it is multilingual since there are Latin and Early New High German (and sometimes mixed) letters. The data is split into Latin and Early New High German (determined with langid) and put into separate folders (de for Early New High German and la for Latin).

  • GitHub
  • DOI