Abstract

Addressing the goal of the workshop, i.e. to bridge the gap between academic and industrial aspects in regard to scholarly data, we inspect the case of plant phenotyping data publishing. We discuss how the publishers could foster advancements in the field of plant research and data analysis methods by warranting good quality phenotypic data with foreseeable semantics.

Examination of a set of life science journals for their policy on plant phenotyping data publication shows that this type of resource seems largely overlooked by data policy-makers. The current lack of recognition, and the resulting lack of recommended repositories for plant phenotypic data, leads to depreciation of such datasets and their dispersion within general-purpose, unstructured data storages. With no clear incentive for individual researchers to follow data description and deposition guidelines, it is difficult to develop and promote new approaches and tools utilising public phenotypic data resources.

Research data publication guidelines

Having selected a non-exhaustive set of popular life science journals where plant research involving phenotypic data has been published or submitted within the previous months, we have examined their current data policy with respect to plant phenotyping datasets. A summary of the guidelines covering plant phenotypic data is shown in the table below.

Journal Title	Data Category covering plant phenotypic data	Suggested data location	Suggested reporting standards
Plant Journal	other	PR; SI; IR/AuR
New Phytologist	n/a	SI; (PR)
Journal of Applied Genetics	other	PR: Figshare, Dryad
Euphytica	other	PR: Figshare, Dryad; (SI)
Scientific Data	other	PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo
Plant Methods	other	PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo; IR	field-specific standards reported at BioSharing website
Theoretical and Experimental Plant Physiology	n/a	n/a	"By data we mean the minimal dataset that would be necessary to interpret, replicate and build upon the findings reported in the article"
Genetic Resources and Crop Evolution	other	PR: Figshare, Dryad; SI
GigaScience	plant; other	PGP; GigaDB, FigShare, Dryad, Zenodo; IR	minimum reporting guidelines at FAIRsharing Portal
Integrative Biology	other	PR: Figshare, Dryad; IR, SI	MIBBI checklists for reporting biological and biomedical research
BMC Plant Biology	other	PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo
Plant Science	life science; other	PR: Dryad; Mendeley Data, Data in Brief
Frontiers in Plant Science	other; traits	PR: Dryad; TRY
Genetics	other	PR: Figshare, Dryad; (SI)
PLOS ONE	other	PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo	prescriptive checklists for reporting biological and biomedical research at FAIRsharing Portal
Journal of Experimental Botany	other	PR: Dryad, Figshare or Zenodo
The Plant Journal	other	PR: Dryad, Figshare or Zenodo; (SI)	"Reporting standards for large-scale omics datasets are constantly evolving and TPJ will follow common sense standards currently accepted in the field"
Nature Plants	other	PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo	Life Sciences Reporting Summary (internal of Nature Publishing Group)
Nature Genetics	other	PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo	Life Sciences Reporting Summary (internal of Nature Publishing Group)
Scientific Reports	other	PR: Dryad
Heredity	other	PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo	Life Sciences Reporting Summary (internal of Nature Publishing Group)

A summary of data policy of selected life science journals towards plant phenotyping data.
PR = Public Repository; SI = Supplementary Information; IR = Institution Repository; AuR = Available upon Request;
() = if need be; n/a = information not available

Finding definitive information about data policies was not straightforward because the information is frequently scattered across journals' and publishers' websites. Declarations about journal-specific data policies, those of the publisher, supported or adopted external initiatives, recommended data repositories, reporting standards, and data archiving policies appear at different locations. Publishing this information in a standardised form and place would greatly improve its accessibility and comprehensibility.

Most of the journals declare adoption of some open research data and data archiving policy, and require a 'data availability statement' in each paper. With minor exceptions for confidential data, 'data not shown' statements are no longer accepted; instead, the research data underlying all claims and conclusions must be provided. If the data cannot be included in the paper itself, their deposition in public repositories is suggested. Institutional or private repositories are rarely allowed. In some cases, attaching a small dataset as size-constrained supplementary information to the online article is also accepted.

Only a few specific data types are governed by, and validated against, domain-specific metadata standards or reporting guidelines. A set of required repositories is usually explicitly given for well-established measurement types resulting in homogeneous data, such as nucleotide sequences (DDBJ, ENA, GenBank), amino acid sequences (UniProt/Swiss-Prot) and structures (PDB), microarrays (ArrayExpress), metabolites (MetaboLights) or chemical structures (ChemSpider, PubChem). More heterogeneous submissions for selected species are assigned to dedicated repositories, e.g. Arabidopsis genome functional annotation data (TAIR) or human genome-phenome interactions (dbGaP, EGA). Other data types are expected to be put into general-purpose repositories such as Dryad, Figshare or Zenodo, or the more science-oriented GigaDB. For all but one journal this is the case for plant phenotyping data, which are not explicitly mentioned in the data policies described on their websites.

The only journal accounting for plant data covering phenotypes is GigaScience, which directs authors to the dedicated public repository PGP. Another type-specific repository, assigned to an ambiguous category of 'trait data', is the TRY database. Although this storage is indeed dedicated to a wide range of plant trait information, its advertised purpose is not to be a running repository for experimental datasets, but an analytical processing database for environmental compilations.

Some journals direct an ambitious author to the policy register FAIRsharing (formerly MIBBI and BioSharing, and still referred to as such) to identify and follow a suitable standard on their own. Some other links provided by the journals lead to DataCite's re3data.org repository catalogue, which lists institutional, project- or community-specific resources, including phenotyping databases of different scope governed by diverse requirements.

Alas, dataset publishing alone does not ensure sufficient data comprehensibility or reproducibility. The journals do not seem to systematically validate whether a dataset complies with recognised standards. General-purpose storages, contrary to the explicitly recommended dedicated databases, do not impose any constraints on the content and quality of the submissions. Individual repositories for phenotypic data might be governed by provisional guidelines appropriate for current applications rather than for long-term storage with a view to replicability and interoperability. The journals' lack of a consistent data validation policy creates a risk that important metadata will be omitted, and thus that a unique and potentially valuable contribution to the explanation of biological mechanisms will be lost, while the storages fill with non-reusable datasets.

Plant phenotypic data

Data characteristics

Plant phenotypic data constitute one of the fundamental building blocks of plant science. Phenotypic traits result from the interaction of an organism's genome and environment, and are manifested as the organism's observable characteristics. Phenotypic traits differ in nature (qualitative or quantitative) and granularity (e.g. cellular or whole-plant properties). They can be determined by diverse techniques (e.g. visual observation, manual measurement, automated imaging and processing, or analysis of samples by sophisticated devices) in different time regimes (one-time capture, repeated or time-series) and expressed in different scales and units. The general notion of plant phenotyping encompasses many types of observations done at different levels of material granularity and technical complexity, and pertains to a huge fraction of plant research.

As mentioned before, many types of measurements that produce a homogeneous set of specific, in-depth observations are well-standardized; they have a dedicated standard for data description and formatting, as well as measurement-specific data repositories. Meanwhile, a wide range of “traditional” measurements done with a variety of non-standardised, non-automated or low-throughput methods are collected and analysed in laboratories every day. Frequently they apply to basic plant qualities like yield, fruit taste and colour, or disease resistance. Those basic phenotyping data are indispensable for interpretation of the in-depth "-omics" analyses by providing the ultimate observable result of the biological mechanisms under study.

In plant experiments, the relationships between the phenotypic qualities and environmental conditions are to be explained. Thus, a precise description of the experiment's environment is an integral part of all plant phenotyping datasets and makes each of them unique. The well-documented datasets are valuable not only due to specific environmental conditions that contribute to the observed phenotype, but also because they are frequently expensive and time-consuming to produce. Thus each experiment deserves scrupulous data handling.

The huge diversity of plant studies performed to investigate the mechanisms of plant reactions to different stimuli makes it difficult to establish one fixed approach to describing the process and its results. While standardising experimental procedures is undesirable, as it could restrict scientific findings and hinder innovative approaches and methodologies, standardising the description of experimental procedures is necessary. It should be done in a way that is both precise and flexible, so that datasets of different types can be handled in a similar way, and yet the standardisation does not impose constraints that lead to losing details of non-standard aspects of particular research. In the case of plant phenotyping, irrespective of the type of organisms, their treatments and the measurements done, a set of common abstract properties of such experiments can be recognised. A number of common organism properties, environmental properties and foreseeable steps in the preparation of the plant material and its growth can be identified, named and described, and eventually used for data validation, searching, processing and analysis. Identifying and providing those general common experimental metadata is a step towards taming the broad field of plant phenotyping data.
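The idea of a description that is both precise and flexible can be illustrated with a minimal Python sketch. The class and field names below (ExperimentDescription, growth_facility, extra and so on) are hypothetical and chosen only for illustration; they are not taken from any published standard. A few common properties are mandatory, study-specific details are kept in a free-form mapping, and heterogeneous experiments can then be searched in the same way.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExperimentDescription:
    """Common abstract properties plus free-form, study-specific details."""
    organism: str
    growth_facility: str            # e.g. "field" or "greenhouse"
    treatments: list[str]
    observed_traits: list[str]
    # Non-standard aspects of a study are kept here rather than discarded.
    extra: dict[str, Any] = field(default_factory=dict)


def find_by_trait(experiments, trait):
    """Return all experiments that observed the given trait, whatever their type."""
    return [e for e in experiments if trait in e.observed_traits]


# Two made-up experiments of different kinds, described with the same core properties.
experiments = [
    ExperimentDescription("Hordeum vulgare", "field", ["drought"],
                          ["yield", "plant height"]),
    ExperimentDescription("Solanum lycopersicum", "greenhouse", ["control"],
                          ["fruit colour"], extra={"imaging_camera": "RGB"}),
]

hits = find_by_trait(experiments, "yield")
```

The mandatory attributes make searching and validation uniform, while the extra mapping preserves details that a fixed schema would otherwise lose.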

Plant datasets can be used on their own by researchers, breeders and farmers to screen plant varieties and identify genotypes that behave in a special way in certain conditions. Together with genomic marker data, phenotypes form a crucial part of QTL or GWAS analyses, and in conjunction with environmental data they serve in genome-environment interaction analysis. The lack of explicit consideration (in recommendations, validation process, encouragement to publish, and access to databases or keywords) for this non-specific yet important data type, even in biological journals, makes it dissipate in the general repositories and limits the possible biological research findings.

Plant phenotyping community

In the community of plant researchers dealing with phenotyping data, work on new and improved approaches to plant experimental data description, modelling and processing is ongoing. Numerous project- or organism-specific databases and tools exist, and many solutions are successfully implemented to describe and store plant phenotyping datasets at individual institutions, organisations, and companies. Appropriate data description is ensured by requiring users to use specific data submission forms, models or ontologies according to internal standards.

Lately, the work of many groups in the plant phenotyping community has been aimed at facilitating plant phenotypic data exchange and reuse through improved experimental data standardisation and description guidelines.

Proposed solutions and ongoing efforts

In the absence of common standards addressing the description of plant phenotyping experiments, the Minimum Information About a Plant Phenotyping Experiment (MIAPPE) recommendations have been proposed by a joint effort of large projects dedicated to plant infrastructure development: transPLANT, EPPN, and Elixir-Excelerate. MIAPPE constitutes a checklist of general properties needed to describe a plant experiment, so that the observed phenotypes can be interpreted as a result of the interaction of a genome and environment. To standardise the wording used in the description of all elements specified in MIAPPE, a set of ontologies and taxonomies for their semantic annotation has been recommended. Specific vocabularies for environment characterisation are being collaboratively developed by phenotyping community projects (EPPN2020, EMPHASIS).
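How a Minimum Information checklist could be applied mechanically is sketched below in Python. The field names are illustrative assumptions only and do not reproduce the actual MIAPPE checklist, which is defined by the MIAPPE recommendations themselves; the sketch merely shows the principle of reporting which required properties a dataset description lacks.

```python
# Illustrative checklist only: these names are assumptions made for this sketch,
# not the properties actually listed in MIAPPE.
REQUIRED_FIELDS = ("study_title", "organism", "growth_conditions", "observed_variables")


def missing_fields(description, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a dataset description."""
    return [name for name in required if not description.get(name)]


# A made-up description that omits the growth conditions.
description = {
    "study_title": "Barley drought response",
    "organism": "Hordeum vulgare",
    "observed_variables": ["yield"],
}

print(missing_fields(description))  # ['growth_conditions']
```

A check of this kind could run at submission time, so that an incomplete description is returned to the author before the dataset reaches a repository.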

The MIAPPE recommendations are implemented in a number of databases (PGP, EphesIS, PlantPhenoDB and GWA-Portal). Work is ongoing on harmonising implementations of the Breeding API (a standard interface for plant phenotype/genotype databases) with MIAPPE. A semantic representation of a MIAPPE-compliant plant phenotyping experiment model is being developed. A reference implementation for flat file data exchange has been proposed in the ISA-Tab format.
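The appeal of flat file exchange is that a tab-delimited table can be read with standard tooling. The Python sketch below parses a small, made-up tab-delimited table; the column names are hypothetical and the layout is a simplification, not the structure prescribed by the ISA-Tab specification.

```python
import csv
import io

# Hypothetical tab-delimited content, invented for this sketch; a real ISA-Tab
# file follows the column conventions defined in its specification.
FLAT_FILE = (
    "Source Name\tOrganism\tTrait\tValue\tUnit\n"
    "plot_1\tHordeum vulgare\tplant height\t82.5\tcm\n"
    "plot_2\tHordeum vulgare\tplant height\t79.0\tcm\n"
)


def read_observations(text):
    """Parse tab-delimited text into a list of dicts with numeric values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row, Value=float(row["Value"])) for row in reader]


rows = read_observations(FLAT_FILE)
```

Keeping the unit in its own column, rather than inside the value, lets a consumer convert or compare measurements without parsing free text.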

On top of the phenotyping dataset description standard, a number of data processing, validation, quality assessment and statistical analysis tools exist or are being constructed, e.g. a MIAPPE-based configuration to use with ISA-Tools for dataset creation and validation, data parsers in online tools provided by the COPO infrastructure, or individual institutions' data processing tools for exporting datasets as flat files from their systems. Common semantics have proved beneficial for the implementation of distributed search tools for plant data resources, like transPLANT search or WheatIS search. As a follow-up, applications taking advantage of the enhanced data semantics will be built, conceivably offering advanced and innovative approaches for data exploration and analysis.

Limitations

The above-mentioned solutions proposed by the plant phenotyping community work towards establishing best practices in data management across phenotyping facilities. Although the Minimum Information approach might not be sufficient to cover all aspects of specific research types, it helps ensure the presence of the basic common properties in the description of heterogeneous phenotyping studies. Ideally, MIAPPE used as a checklist while preparing a dataset should prompt researchers to document all related, meaningful aspects of the experiment. The approach could become even more relevant if individual researchers, including those with small amounts of phenotypic observations possibly linked to other data types, were incentivised to provide good quality phenotyping datasets to repositories and papers.

Are there enough reasons for publishing research data? As shown by Piwowar et al. for microarray clinical trial publications, there is a correlation between public availability of the research data and the number of paper citations, which should work as a natural motivation for both the researchers and the publishers. Moreover, citations of the dataset itself and its reuse, leading to an increase in the author's indices or altmetrics statistics, could be another measurable benefit for the researcher. Finally, availability of datasets with good quality metadata should enable new research involving integrative and comparative studies or meta-analyses. Additionally, it is bound to result in the development of data-driven applications for the management and analysis of datasets that facilitate researchers' work.

Unfortunately, those benefits might be intangible for many plant phenotypic data producers: their research field is not recognized by the journals, so they are discouraged both from publishing such data and from reusing data whose reliability they perceive as uncertain. The potential benefits might also be unconvincing compared to the amount of work researchers need to do to individually look for and follow the metadata guidelines for the field.

It appears that unless journals put more stress on the publication of complete and well-described phenotypic datasets, the work on harmonising phenotyping resources and tools taking advantage of data semantics will be restricted to the big players, i.e. research institutions administering their own permanent storage systems and implementing their own solutions. Meanwhile, smaller scientific units and researchers unable or not motivated to publish their phenotyping observations together with standardised and adequate metadata will keep producing datasets that only their authors can use, and only once.

Dryad repository: https://www.datadryad.org

Figshare repository: https://figshare.com/

Zenodo repository: https://zenodo.org/

GigaDB repository: http://gigadb.org

GigaScience journal: https://academic.oup.com/gigascience

PGP repository: https://edal.ipk-gatersleben.de/repos/pgp/

TRY database: https://www.try-db.org/TryWeb/Home.php

FAIRsharing standard registry: https://fairsharing.org

DataCite organisation: https://www.datacite.org

Re3data.org repository registry: https://www.re3data.org

MIAPPE website: http://miappe.org

transPLANT project: http://www.transplantdb.eu

EPPN project: http://www.plant-phenotyping-network.eu

Elixir-Excelerate project: https://www.elixir-europe.org/about-us/how-funded/eu-projects/excelerate

EPPN2020 project: https://eppn2020.plant-phenotyping.eu

EMPHASIS infrastructure: https://emphasis.plant-phenotyping.eu

EphesIS database: https://urgi.versailles.inra.fr/ephesis

PlantPhenoDB database: http://cropnet.pl/plantphenodb/

GWA-Portal: https://gwas.gmi.oeaw.ac.at/

Breeding API project: https://brapi.org

Plant Phenotyping Experiment Ontology: http://agroportal.lirmm.fr/ontologies/PPEO

ISA-Tab format: http://isa-tools.org/format/specification/

ISA-Tools software suite: http://isa-tools.org/software-suite/

COPO project: https://copo-project.org

TransPLANT distributed search: http://www.transplantdb.eu/search/transPLANT/

WheatIS distributed search: https://urgi.versailles.inra.fr/wheatis/
