Taking the data out of paper
Iliyana Kuzmova
Chenopodium vulvaria, species used in one of the pilots, image source: Woodville, William "Medical botany" online at:

Ecological modellers require reliable sources of data for their analyses. Often, these sources are databases, checklists and specimen labels. Yet another rich source is the corpus of biological literature. It is estimated that there are well over 100 million pages of scientific publications, and the volume grows every year. Publishing in advanced XML-based journals, such as ZooKeys, PhytoKeys or the Biodiversity Data Journal, is recommended for new data, but what is the solution for legacy texts?

The EU FP7 project pro-iBiosphere has been piloting the mark-up and extraction of biological information from literature, an approach pioneered by Plazi (Agosti & Egloff, 2009). This Coordination and Support Action was launched to investigate ways to increase the accessibility of biodiversity data, improve the efficiency of their curation and widen the user base of biodiversity data consumers and applications. The project addresses technical and semantic interoperability between the different forms in which data are published, and analyses the sustainability issues related to maintaining and curating biodiversity data and the information and knowledge derived from them. It also encourages the biodiversity community to publish biodiversity data in a way that satisfies the technical requirements of an envisioned Open Biodiversity Knowledge Management System.

To reach these objectives, three pilots for data mark-up and one for interoperability are being conducted. The mark-up pilots are evaluating how accessible data within the literature are for a wide range of organisms and data types, and are testing ways to facilitate the extraction of biological information from literature, including observations, traits, nomenclature, habitat information and interactions between organisms. For example, one pilot is looking at biogeographic data using the species Chenopodium vulvaria as a subject. In another, trait data are being extracted from literature on tropical mistletoes, while yet others are extracting data from papers on spiders, ants, centipedes, mosses and fungi.

To extract these data, one can use either "born-digital" texts or scanned texts converted through text capture. These texts are then progressively marked up into XML documents, with tags defining the meaning of the text they enclose. The granularity of the mark-up and the choice of textual elements to be marked up depend on the type of data to be extracted and on how finely those data are expressed in the text. In taxonomic literature, text is usually divided into individual "treatments", one for each species. Fortunately, most paragraph elements of these texts follow standard formats: for example, separate blocks of text contain the physical description of the organism, details of its distribution and habitat information, often separated by sub-headings.
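As a rough illustration of why this kind of mark-up makes data easier to extract, the sketch below parses a tiny marked-up treatment with Python's standard library. The tag and attribute names (treatment, taxonName, section, type) and the placeholder text are assumptions made up for this example; they are not the schema used by the pro-iBiosphere pilots or by Plazi.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up of a single treatment. Tag names, attribute names
# and section text are illustrative placeholders, not a real schema.
TREATMENT_XML = """
<treatment>
  <taxonName rank="species">Chenopodium vulvaria</taxonName>
  <section type="description">Placeholder description text.</section>
  <section type="distribution">Placeholder distribution text.</section>
  <section type="habitat">Placeholder habitat text.</section>
</treatment>
"""


def extract_treatment(xml_text):
    """Return the taxon name and each tagged section as a flat dictionary."""
    root = ET.fromstring(xml_text.strip())
    record = {"taxon": root.findtext("taxonName")}
    for section in root.findall("section"):
        # The 'type' attribute says what the enclosed text means.
        record[section.get("type")] = (section.text or "").strip()
    return record


record = extract_treatment(TREATMENT_XML)
print(record["taxon"])
print(record["habitat"])
```

Because each block carries a tag stating what it contains, a modeller interested only in habitat information can pull out that block without reading the rest of the treatment.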

The pro-iBiosphere pilots have used several methods for mark-up, but the main tool has been the GoldenGATE Editor, which combines manual and automated methods to identify key text elements. For example, an algorithm identifies Latin names, and an interface then guides the user through verifying the algorithm's results. Once marked up, the XML document can be uploaded to the Plazi document repository. Plazi is a not-for-profit organization devoted to promoting open access to taxonomic literature. You are free to use the data contained in Plazi's repository and, if you wish, to refine the mark-up for your own purposes.
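The combination of automated detection and manual verification described above can be sketched in a few lines of Python. This is not the editor's actual algorithm: the regular expression is a deliberately naive stand-in that flags any capitalised word followed by a lowercase word, and the accept callback is a hypothetical substitute for the user confirming or rejecting each candidate in an interface.

```python
import re

# Naive pattern for candidate binomials: a capitalised word followed by a
# lowercase word of at least three letters. An assumption for illustration,
# far simpler than a real name-finding algorithm.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidate_names(text):
    """Return candidate Latin names in order of first appearance."""
    seen = []
    for match in BINOMIAL.finditer(text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen


def verify(candidates, accept):
    """Keep only the candidates that the reviewer (here a callback) accepts."""
    return [name for name in candidates if accept(name)]


text = ("Specimens of Chenopodium vulvaria were collected. "
        "The plants grew on shingle.")
candidates = find_candidate_names(text)

# Stand-in for manual verification: accept only names on a known list.
known_names = {"Chenopodium vulvaria"}
verified = verify(candidates, lambda name: name in known_names)

print(candidates)
print(verified)
```

The automated step also flags "The plants", which looks like a binomial to such a simple pattern. The verification step discards it, which is why a human check is paired with the algorithm.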

Extracting data from the legacy literature can be expensive. Modern XML-based publications have the additional advantages of linkage via DOIs and immediate dissemination to harvesters such as EOL or GBIF. Yet digitisation and mark-up have the potential to reanimate the data in our publications, making them almost as useful as modern linked publications.

Task 3.4 of EU-BON, led by Plazi (Donat Agosti), is to develop tools to prepare, extract and mine published biodiversity literature. For this task Plazi is looking for rich sources of data from the biodiversity literature, particularly where those data can be applied within other EU-BON tasks. For further information, please contact Plazi.

Agosti, D., & Egloff, W. (2009). Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes, 2(1), 53. doi:10.1186/1756-0500-2-53

Quentin Groom (National Botanic Garden, Belgium) & Donat Agosti (Plazi)

This project has received funding from the European Union’s Seventh Programme for research, technological development and demonstration under grant agreement No 308454.