PhytoOracle: Scalable, modular phenomic data processing pipelines

Gonzalez Emmanuel; Zarei Ariyan; Hendler Nathanial; Cosi Michele; Demieville Jeffrey; Calleja Sebastian; Simmons Travis; Ellingson Holly; Merchant Nirav; Lyons Eric; Pauli Duke

doi:10.1002/essoar.10508789.1

Abstract

Previous crop yield improvements have been largely due to the implementation of new management strategies, mechanization, and application of emerging technologies. While these approaches have led to stable, linear improvements, increases in crop yields are currently plateauing. The use and improvement of rapid, automated, and accurate phenomic selection methods leveraging high-resolution data collected throughout a growing season could help identify stress-adaptive traits to meet the growing global food demand. As the capacity of phenomics to generate larger and higher dimensional data sets improves, there is an urgent need to develop and implement robust and scalable data processing pipelines for rapid turnaround of processed results. Current phenomics processing pipelines lack modularity and the ability to exploit the distributed computational infrastructure required for machine learning (ML)-based workloads. To address these challenges, we developed PhytoOracle (PO), a suite of modular, scalable pipelines that aim to improve data processing efficiency for plant science research. PO integrates open-source frameworks for distributed task management on local, cloud, or high-performance computing (HPC) systems. Each pipeline component is available as a standalone container which can be independently deployed or linked into a pipeline. Additionally, researchers can swap between available containers or integrate new ones suited to their specific research. PO extracts phenotype trait values such as volume, height, canopy temperature, and maximum quantum efficiency (Fv/Fm) of photosystem II from data captured in field settings, enabling the study of phenotypic variation for elucidation of the genetic components of quantitative traits.