Abstract

Addressing the goal of the workshop, i.e. to bridge the gap between academic and industrial aspects in regard to scholarly data, we inspect the case of plant phenotyping data publishing. We discuss how the publishers could foster advancements in the field of plant research and data analysis methods by warranting good quality phenotypic data with foreseeable semantics.

An examination of a set of life science journals for their policies on plant phenotyping data publication shows that this type of resource is largely overlooked by data policy-makers. The current lack of recognition, and the resulting lack of recommended repositories for plant phenotypic data, leads to the depreciation of such datasets and their dispersion across general-purpose, unstructured data stores. With no clear incentive for individual researchers to follow data description and deposition guidelines, it is difficult to develop and promote new approaches and tools that use public phenotypic data resources.

Background

Understanding plant phenotypes is the ultimate goal of plant research, motivated by the desire to improve plants with respect to traits such as yield, taste or disease resistance, for the benefit of mankind. To discover the complex biological mechanisms behind particular phenotypic traits, a large number of elementary analyses must be carried out and eventually integrated. To allow this, a sufficient number of constituent datasets with adequate experimental metadata must be available.

Quality publications and dataset submissions should ensure that the published data are comprehensible, replicable and reusable. Validation against a set of established requirements, together with deposition and indexing of datasets, should guarantee that good quality plant phenotyping datasets are accessible both to biological researchers and to developers of new methodological approaches to their analysis.

Surprisingly, despite being a valuable and very basic type of resource that plant researchers deal with, plant phenotypic data tend to be overlooked by publishers and their policy-makers. Among the journals examined for requirements and recommendations with respect to plant phenotyping datasets, only one names this kind of data explicitly and mentions a repository where phenotypic traits can be stored. As a result, the data get dispersed among general-purpose, unstructured repositories and lose their chance to stimulate new discoveries and the development of dedicated methodology.

Research data publication guidelines

Having selected a non-exhaustive set of popular life science journals where plant research involving phenotypic data has been published or submitted in recent months, we have examined their current data policies with respect to plant phenotyping datasets. A summary of the guidelines covering plant phenotypic data is shown in the table below.

| Journal title | Data category covering plant phenotypic data | Suggested data location | Suggested reporting standards |
|---|---|---|---|
| Plant Journal | other | PR; SI; IR/AuR | |
| New Phytologist | n/a | SI; (PR) | |
| Journal of Applied Genetics | other | PR: Figshare, Dryad | |
| Euphytica | other | PR: Figshare, Dryad; (SI) | |
| Scientific Data | other | PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo | |
| Plant Methods | other | PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo; IR | field-specific standards reported at BioSharing website |
| Theoretical and Experimental Plant Physiology | n/a | n/a | "By data we mean the minimal dataset that would be necessary to interpret, replicate and build upon the findings reported in the article" |
| Genetic Resources and Crop Evolution | other | PR: Figshare, Dryad; SI | |
| GigaScience | plant; other | PGP; GigaDB, FigShare, Dryad, Zenodo; IR | minimum reporting guidelines at FAIRsharing Portal |
| Integrative Biology | other | PR: Figshare, Dryad; IR, SI | MIBBI checklists for reporting biological and biomedical research |
| BMC Plant Biology | other | PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo | |
| Plant Science | life science; other | PR: Dryad; Mendeley Data, Data in Brief | |
| Frontiers in Plant Science | other; traits | PR: Dryad; TRY | |
| Genetics | other | PR: Figshare, Dryad; (SI) | |
| PLOS ONE | other | PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo | prescriptive checklists for reporting biological and biomedical research at FAIRsharing Portal |
| Journal of Experimental Botany | other | PR: Dryad, Figshare or Zenodo | |
| The Plant Journal | other | PR: Dryad, Figshare or Zenodo; (SI) | "Reporting standards for large-scale omics datasets are constantly evolving and TPJ will follow common sense standards currently accepted in the field" |
| Nature Plants | other | PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo | Life Sciences Reporting Summary (internal of Nature Publishing Group) |
| Nature Genetics | other | PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo | Life Sciences Reporting Summary (internal of Nature Publishing Group) |
| Scientific Reports | other | PR: Dryad | |
| Heredity | other | PR: Dryad, Figshare, Harvard Dataverse, Open Science Framework, Zenodo | Life Sciences Reporting Summary (internal of Nature Publishing Group) |

A summary of the data policies of selected life science journals towards plant phenotyping data.
PR = Public Repository; SI = Supplementary Information; IR = Institutional Repository; AuR = Available upon Request; () = if need be; n/a = information not available

Finding definitive information about data policies was not straightforward, because the information is frequently scattered across journals’ and publishers’ websites. Declarations about journal-specific data policies, publisher policies, supported or adopted external initiatives, recommended data repositories, reporting standards, and data archiving policies appear at different locations. Publishing this information in a standardised form and place would greatly improve its accessibility and comprehensibility.

Most of the journals declare adoption of some open research data and data archiving policy, and require a 'data availability statement' in each paper. With minor exceptions for confidential data, 'data not shown' statements are no longer accepted; instead, the research data underlying all claims and conclusions must be provided. If the data cannot be included in the paper itself, their deposition in public repositories is suggested. Institutional or private repositories are rarely allowed. In some cases, attaching a small dataset to the online article as size-constrained supplementary information is also accepted.

Only a few specific data types are governed by, and validated against, domain-specific metadata standards or reporting guidelines. A set of required repositories is usually given explicitly for well-established measurement types that produce homogeneous data, such as nucleotide sequences (DDBJ, ENA, GenBank), amino acid sequences (UniProt/Swiss-Prot) and structures (PDB), microarrays (ArrayExpress), metabolites (MetaboLights) or chemical structures (ChemSpider, PubChem). More heterogeneous submissions for selected species are assigned to dedicated repositories, e.g. Arabidopsis genome functional annotation data (TAIR) or human genome-phenome interactions (dbGaP, EGA). Other data types are expected to be deposited in general-purpose repositories such as Dryad, figshare or Zenodo, or in the more science-oriented GigaDB. For all but one journal, this is the case for plant phenotyping data, which are not explicitly mentioned in the data policies described on the journals' websites.

The only journal accounting for plant data covering phenotypes is GigaScience, which directs authors to the dedicated public repository PGP. Another type-specific repository, assigned to the ambiguous category of ‘trait data’, is the TRY database. Although it is indeed dedicated to a wide range of plant trait information, its advertised purpose is not to serve as a running repository for experimental datasets, but as an analytical processing database for environmental compilations.

Some journals direct an ambitious author to the policy registry FAIRsharing (formerly MIBBI and BioSharing, and still referred to as such) to identify and follow a suitable standard on their own. Other links provided by the journals lead to DataCite’s re3data.org repository catalogue, which lists institutional, project- or community-specific resources; phenotyping databases of different scope, governed by diverse requirements, can also be found there.

Alas, dataset publishing alone does not ensure sufficient data comprehensibility or reproducibility. The journals do not seem to validate systematically whether a dataset complies with recognised standards. General-purpose repositories, unlike the explicitly recommended dedicated databases, do not impose any constraints on the content and quality of submissions. Individual repositories for phenotypic data might be governed by provisional guidelines suited to current applications rather than to long-term storage with a view to replicability and interoperability. Without a consistent data validation policy on the part of the journals, important metadata may be omitted, so that a unique and potentially valuable contribution to the explanation of biological mechanisms is lost while the repositories fill with non-reusable datasets.

Plant phenotypic data

Data characteristics

Plant phenotypic data constitute one of the fundamental building blocks of plant science. Phenotypic traits result from the interaction of an organism’s genome and its environment, and are manifested as the organism's observable characteristics. Phenotypic traits differ in nature (qualitative or quantitative) and granularity (e.g. cellular or whole-plant properties). They can be determined by diverse techniques (e.g. visual observation, manual measurement, automated imaging and processing, or analysis of samples by sophisticated devices) in different time regimes (one-time capture, repeated or time-series) and expressed in different scales and units. The general notion of plant phenotyping encompasses many types of observations made at different levels of material granularity and technical complexity, and pertains to a huge fraction of plant research.
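
The dimensions listed above (nature of the trait, technique, time regime, scale and unit) can be made concrete with a small data structure. The following Python sketch is only an illustration: the class and field names (`ObservedVariable`, `Observation`, `is_consistent`) and all example values are hypothetical and are not taken from any standard or existing tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union


@dataclass(frozen=True)
class ObservedVariable:
    """A trait together with the technique and scale used to record it."""
    trait: str                  # e.g. "plant height"
    method: str                 # e.g. "manual measurement"
    scale: str                  # "quantitative" or "qualitative"
    unit: Optional[str] = None  # expected only for quantitative scales


@dataclass(frozen=True)
class Observation:
    """One recorded value of a variable for one plant, plot or sample."""
    unit_id: str
    variable: ObservedVariable
    value: Union[float, str]
    timestamp: Optional[datetime] = None  # None for a one-time capture


def is_consistent(obs: Observation) -> bool:
    """Check that the value type matches the declared scale."""
    if obs.variable.scale == "quantitative":
        return isinstance(obs.value, (int, float)) and obs.variable.unit is not None
    return isinstance(obs.value, str)


# Made-up example: a short time series plus a one-time qualitative score.
height = ObservedVariable("plant height", "manual measurement", "quantitative", "cm")
colour = ObservedVariable("fruit colour", "visual observation", "qualitative")

series = [
    Observation("plant-01", height, 41.5, datetime(2017, 5, 1)),
    Observation("plant-01", height, 47.0, datetime(2017, 5, 8)),
    Observation("plant-01", colour, "red"),
]
```

Keeping the variable definition separate from the individual values means that a unit or method is stated once and shared by every observation in a time series.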

As mentioned before, many types of measurements that produce a homogeneous set of specific, in-depth observations are well-standardized; they have a dedicated standard for data description and formatting, as well as measurement-specific data repositories. Meanwhile, a wide range of “traditional” measurements done with a variety of non-standardised, non-automated or low-throughput methods are collected and analysed in laboratories every day. Frequently they apply to basic plant qualities like yield, fruit taste and colour, or disease resistance. Those basic phenotyping data are indispensable for interpretation of the in-depth "-omics" analyses by providing the ultimate observable result of the biological mechanisms under study. 

In plant experiments, the relationships between the phenotypic qualities and environmental conditions are to be explained. Thus, a precise description of the experiment's environment is an integral part of all plant phenotyping datasets and makes each of them unique. The well-documented datasets are valuable not only due to specific environmental conditions that contribute to the observed phenotype, but also because they are frequently expensive and time-consuming to produce. Thus each experiment deserves scrupulous data handling.

The huge diversity of studies performed by researchers to investigate the mechanisms of plant reactions to different stimuli makes it difficult to establish one fixed approach to describing the experimental process and its results. While standardising experimental procedures themselves is undesirable, as it could restrict scientific findings and hinder innovative approaches and methodologies, standardising the description of experimental procedures is necessary. It should be done in a way that is both precise and flexible, so that datasets of different types can be handled in a similar way, and yet the standardisation does not impose constraints that cause details of non-standard aspects of particular research to be lost. In the case of plant phenotyping, irrespective of the type of organisms, their treatments and the measurements made, a set of common abstract properties of such experiments can be recognised. A number of common organism properties, environmental properties and foreseeable steps in the preparation of the plant material and its growth can be identified, named and described, and eventually used for data validation, searching, processing and analysis. Identifying and providing these general, common experimental metadata is a step towards taming the broad field of plant phenotyping data.

Plant phenotypic datasets can be used on their own by researchers, breeders and farmers to screen plant varieties and identify genotypes that behave in a special way under certain conditions. Together with genomic marker data, phenotypes form a crucial part of QTL or GWAS analyses, and in conjunction with environmental data they serve in genome-environment interaction analysis. The lack of explicit consideration (in recommendations, validation process, encouragement to publish, and access to databases or keywords) for this non-specific yet important data type, even in biological journals, makes it dissipate in the general repositories and limits the possible biological research findings.

Plant phenotyping community

In the community of plant researchers dealing with phenotyping data, work on new, improved approaches to plant experimental data description, modelling and processing is ongoing. Numerous project- or organism-specific databases and tools exist, and many solutions have been successfully implemented to describe and store plant phenotyping datasets at individual institutions, organisations and companies. Appropriate data description is ensured by requiring users to use specific data submission forms, models or ontologies according to internal standards.

Lately, the work of many groups in the plant phenotyping community has been aimed at facilitating the exchange and reuse of plant phenotypic data by improving experimental data standardisation and description guidelines.

Proposed solutions and ongoing efforts

In the absence of common standards addressing the description of plant phenotyping experiments, the Minimum Information About a Plant Phenotyping Experiment (MIAPPE) recommendations have been proposed by a joint effort of large projects dedicated to plant infrastructure development: transPLANT, EPPN and Elixir-Excelerate. MIAPPE constitutes a checklist of general properties needed to describe a plant experiment, so that the observed phenotypes can be interpreted as a result of the interaction of a genome and environment. To standardise the wording used in the description of all elements specified in MIAPPE, a set of ontologies and taxonomies for their semantic annotation has been recommended. Specific vocabularies for environment characterisation are being collaboratively developed by phenotyping community projects (EPPN2020, EMPHASIS).
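
The checklist idea can be illustrated with a minimal validation sketch in Python. It assumes a dataset description held as a nested dictionary; the section and field names in `REQUIRED_FIELDS` are hypothetical placeholders chosen for illustration and are not the official MIAPPE attribute names, and `missing_fields` is an invented helper rather than part of any existing tool.

```python
# Hypothetical checklist: illustrative placeholders only,
# NOT the official MIAPPE attribute names.
REQUIRED_FIELDS = {
    "study": ["title", "start_date", "location"],
    "biological_material": ["organism", "genotype_id"],
    "environment": ["growth_facility"],
    "observed_variables": ["trait", "method", "scale"],
}


def missing_fields(metadata: dict) -> list:
    """List every required entry that is absent or empty."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        content = metadata.get(section) or {}
        # A section may hold one record (dict) or several (list of dicts).
        records = content if isinstance(content, list) else [content]
        for index, record in enumerate(records):
            for field in fields:
                if not record.get(field):
                    missing.append(f"{section}[{index}].{field}")
    return missing


# Made-up dataset description with two gaps.
dataset = {
    "study": {"title": "Drought response of barley",
              "start_date": "2017-03-01",
              "location": ""},
    "biological_material": {"organism": "Hordeum vulgare",
                            "genotype_id": "HV-001"},
    "environment": {"growth_facility": "greenhouse"},
    "observed_variables": [
        {"trait": "plant height", "method": "manual measurement", "scale": "cm"},
        {"trait": "yield", "method": "weighing"},
    ],
}

problems = missing_fields(dataset)
```

A report of this kind tells the submitter which parts of the description are incomplete; it says nothing about whether the supplied values are correct, which would require ontology-based or curator checks.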

The MIAPPE recommendations are implemented in a number of databases (PGP, EphesIS, PlantPhenoDB and GWA-Portal). Work is ongoing on harmonising implementations of the Breeding API (a standard interface for plant phenotype/genotype databases) with MIAPPE. A semantic representation of the MIAPPE-compliant plant phenotyping experiment model is being developed. A reference implementation for flat-file data exchange has been proposed in the ISA-Tab format.
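
Because a flat-file exchange format of this kind is tab-delimited, such tables can be read with standard tooling. The Python sketch below parses a small, made-up study table and groups sample names by the value of one experimental factor. The column headers follow the general ISA-Tab naming pattern (`Source Name`, `Characteristics[...]`, `Sample Name`, `Factor Value[...]`), but the table content and the helper `samples_by_factor` are assumptions made for illustration; this is not the MIAPPE reference configuration.

```python
import csv
import io
from collections import defaultdict

# Made-up tab-delimited study table, used only for illustration.
STUDY_TABLE = (
    "Source Name\tCharacteristics[Organism]\tSample Name\tFactor Value[Watering]\n"
    "plant-01\tHordeum vulgare\tsample-01\tcontrol\n"
    "plant-02\tHordeum vulgare\tsample-02\tdrought\n"
    "plant-03\tHordeum vulgare\tsample-03\tdrought\n"
)


def samples_by_factor(table_text: str, factor: str) -> dict:
    """Group sample names by the value of one experimental factor."""
    column = f"Factor Value[{factor}]"
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found")
    groups = defaultdict(list)
    for row in reader:
        groups[row[column]].append(row["Sample Name"])
    return dict(groups)


groups = samples_by_factor(STUDY_TABLE, "Watering")
```

Rejecting a table that lacks the requested factor column, instead of silently returning an empty result, makes an incomplete description visible at the point where the file is read.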

On top of the phenotyping dataset description standard, a number of data processing, validation, quality assessment and statistical analysis tools exist or are being constructed, e.g. a MIAPPE-based configuration for use with ISA-Tools for dataset creation and validation, data parsers in online tools provided by the COPO infrastructure, or individual institutions' data processing tools for exporting datasets as flat files from their systems. Common semantics has proved beneficial for the implementation of distributed search tools for plant data resources, such as transPLANT search or WheatIS search. As a follow-up, applications taking advantage of the enhanced data semantics will be built, conceivably offering advanced and innovative approaches to data exploration and analysis.

Limitations

The above-mentioned solutions proposed by the plant phenotyping community work towards establishing best practices in data management across phenotyping facilities. Although the Minimum Information approach might not be sufficient to cover all aspects of specific research types, it helps ensure the presence of the basic common properties in the description of heterogeneous phenotyping studies. Ideally, MIAPPE used as a checklist while preparing a dataset should prompt researchers to document all related, meaningful aspects of the experiment. The approach could become even more relevant if individual researchers, including those with small amounts of phenotypic observations possibly linked to other data types, were incentivised to provide good quality phenotyping datasets to repositories and papers.

Are there enough reasons for publishing research data? As shown by Piwowar et al. for microarray clinical trial publications, there is a correlation between the public availability of research data and the number of paper citations, which should work as a natural motivation for both researchers and publishers. Moreover, citations of the dataset itself and its reuse, leading to an increase in the author's indices or altmetrics statistics, could be another measurable benefit for the researcher. Finally, the availability of datasets with good quality metadata should enable new research involving integrative and comparative studies or meta-analyses. Additionally, it is bound to result in the development of data-driven applications for the management and analysis of datasets that facilitate researchers' work.

Unfortunately, those benefits might be intangible for many producers of plant phenotypic data, whose research field is not recognised by the journals and who are thus discouraged from publishing such data, while potential users are discouraged from reusing the data by their perceived uncertain reliability. The potential benefits might also be unconvincing compared to the amount of work required of researchers to individually look for and follow the metadata guidelines for the field.

It appears that, unless journals put more emphasis on the publication of complete and well-described phenotypic datasets, the work on harmonising phenotyping resources and tools that take advantage of data semantics will be restricted to large players, i.e. research institutions administering their own permanent storage systems and implementing their own solutions. Meanwhile, smaller scientific units and researchers unable or not motivated to publish phenotyping observations together with standardised and adequate metadata will keep producing single-use-by-author-only datasets.

Conclusions

The field of plant phenotyping data publishing, lacking explicit recommendations, public dedicated repositories and curation, is likely to generate many datasets whose storage is of questionable value. Unsupervised data publications tend to be incomplete, and thus provide data unsuitable for understanding, replication and re-analysis of the original experiment. To avoid the data loss and confusion caused by single-use-by-author-only datasets, individual researchers should be encouraged and assisted in publishing phenotypic observations with standardised and adequate metadata.

Despite the variety of plant phenotyping studies and the difficulty of pointing to one unequivocally accepted data description standard, some approach by the publishers to systematically validating the quality of phenotyping data submissions would be beneficial. Initially, authors could be required to follow some phenotyping-specific reporting guidelines (possibly one of a few reasonable ones), or to state explicitly in which respects this cannot be done and why. In this way, gaps in the existing recommendations would be identified and a commonly accepted standard would gradually take shape.

Explicit consideration of plant phenotypic data by biological journals, especially the plant-focused ones, is desirable. Recommendations for data-aware repositories and a registry of phenotyping datasets could be made available to stimulate progress both in biological research and in the development of smarter tools and algorithms to deal with such data.

We hope that with the ongoing initiatives of the plant phenotyping community, and with the rising awareness of both publishers and researchers, the use of public phenotyping datasets in complex and innovative analyses will be facilitated and promoted. Introducing better semantics to public plant phenotyping data will trigger the development of integrative analysis methods, hopefully leading to great new discoveries.

Acknowledgements

The work was funded by the National Science Centre, Poland, project No. 2016/21/N/ST6/02358.

References

Dryad repository: https://www.datadryad.org

Figshare repository: https://figshare.com/ 

Zenodo repository: https://zenodo.org/ 

GigaDB repository: http://gigadb.org

GigaScience journal: https://academic.oup.com/gigascience 

PGP repository: https://edal.ipk-gatersleben.de/repos/pgp/ 

TRY database: https://www.try-db.org/TryWeb/Home.php 

FAIRsharing standard registry: https://fairsharing.org 

DataCite organisation: https://www.datacite.org 

Re3data.org repository registry: https://www.re3data.org 

MIAPPE website: http://miappe.org

transPLANT project: http://www.transplantdb.eu 

EPPN project: http://www.plant-phenotyping-network.eu 

Elixir-Excelerate project: https://www.elixir-europe.org/about-us/how-funded/eu-projects/excelerate 

EPPN2020 project: https://eppn2020.plant-phenotyping.eu 

EMPHASIS infrastructure: https://emphasis.plant-phenotyping.eu 

EphesIS database: https://urgi.versailles.inra.fr/ephesis 

PlantPhenoDB database: http://cropnet.pl/plantphenodb/ 

GWA-Portal: https://gwas.gmi.oeaw.ac.at/ 

Breeding API project: https://brapi.org 

Plant Phenotyping Experiment Ontology: http://agroportal.lirmm.fr/ontologies/PPEO 

ISA-Tab format: http://isa-tools.org/format/specification/ 

ISA-Tools software suite: http://isa-tools.org/software-suite/ 

COPO project: https://copo-project.org 

TransPLANT distributed search: http://www.transplantdb.eu/search/transPLANT/ 

WheatIS distributed search: https://urgi.versailles.inra.fr/wheatis/