Corinna Gries et al.

The research data repository of the Environmental Data Initiative (EDI) is a signatory of the FAIR Data Principles. Building on over 30 years of data curation research and experience in the NSF-funded US Long-Term Ecological Research (LTER) program, it provides mature functionality, well-established workflows, and support for ‘long-tail’ environmental data publication. High-quality scientific metadata are enforced through automatic checks against community-developed rules and the Ecological Metadata Language (EML) standard. Although the EDI repository is far along the continuum of making its data FAIR, representatives from EDI and the LTER Information Management community have recently been developing best practices for the edge cases in environmental data publishing. Here we discuss, and seek feedback on, how best to handle the publication of these ‘long-tail’ data when they are accompanied by extensive additional data, e.g., genomics data, physical specimens, or flux tower data. While these latter data are better handled in discipline-specific repositories such as NCBI, iDigBio, and AmeriFlux, they are frequently associated with other data collected at the same time and location, or even from the same samples. This is particularly relevant across the LTER Network, where sites represent integrative research projects. Questions we address (and seek community input on) include: How should documents and images that are themselves data, e.g., field notebooks or time-lapse photographs of plant phenology, be archived? How should data from unmanned vehicles (e.g., drones and underwater gliders), acoustic data, or model outputs, which may be several terabytes in size, be handled? How should processing scripts or modeling code be associated with data? Overall, these best practices address the Findability and Accessibility of data as well as greater transparency of the research process.
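The following is a minimal sketch, in Python, of the kind of automatic metadata check described above. The required-element rules, the XPath expressions, and the sample document are illustrative assumptions, not EDI's actual rule set; a real congruence checker applies a much larger, community-maintained rule set against the full EML schema (namespaces are omitted here for brevity).

# Minimal sketch of an automated EML metadata check (hypothetical rule subset).
from lxml import etree

# Community-style rules expressed as XPath expressions, each of which must
# match at least one element in the EML document (illustrative assumptions).
REQUIRED_XPATHS = {
    "title": "//dataset/title",
    "creator": "//dataset/creator",
    "abstract": "//dataset/abstract",
    "geographic coverage": "//dataset/coverage/geographicCoverage",
    "attribute definitions": "//attributeList/attribute/attributeDefinition",
}

def check_eml(eml_bytes: bytes) -> list[str]:
    """Return human-readable failures for any missing required elements."""
    doc = etree.fromstring(eml_bytes)
    return [
        f"missing {label} ({xpath})"
        for label, xpath in REQUIRED_XPATHS.items()
        if not doc.xpath(xpath)
    ]

if __name__ == "__main__":
    # A deliberately incomplete sample document: the check should flag the
    # missing abstract, geographic coverage, and attribute definitions.
    sample = b"""<eml><dataset><title>Plant phenology photos</title>
      <creator><individualName><surName>Doe</surName></individualName></creator>
    </dataset></eml>"""
    for problem in check_eml(sample):
        print("FAIL:", problem)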

Margaret O'Brien et al.

Data repositories and research networks worldwide are publishing a diverse array of long-term and experimental data for meaningful reuse, repurposing, and integration. However, in synthesis research the largest time investment is still in discovering, cleaning, and combining primary datasets until all are completely understood and converted to a usable format. To accelerate this process, we have developed an approach that defines flexible, domain-specific data models and converts primary data to these models using a lightweight, distributed workflow framework. The approach is based on extensive experience with synthesis research workflows, takes into account the distributed nature of original data curation, satisfies the requirement for regular additions to the original data, and is not determined by a single synthesis research question. Furthermore, all data describing the sampling context are preserved, and the harmonization may be performed by data scientists who are not specialists in each research domain. Our harmonization process has three phases. First, a Design Phase captures essential attributes and considers existing standardization efforts and external vocabularies that disambiguate meaning. Second, an Implementation Phase publishes the data model and best practice guides for reference, followed by conversion of relevant repository contents by data managers and creation of software for data discovery and exploration. Third, a Maintenance Phase implements programmatic workflows that run automatically, via event notification services, when parent data are revised. In this presentation we demonstrate the harmonization process for ecological community survey data and highlight the unique challenges and lessons learned. Additionally, we demonstrate the maintenance workflow and the data exploration and aggregation tools that plug into this data model.
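As an illustration of the Maintenance Phase, the following Python sketch shows one way such a programmatic workflow might be structured: a handler, triggered by an event notification, re-runs the conversion that maps a revised parent package into the harmonized data model. The package identifier, the fetch stub, and the converter logic are hypothetical; a real deployment would call the repository's download API and subscribe to its event notification service instead of simulating the event locally.

# Minimal sketch of an event-driven maintenance workflow (assumed names).
from typing import Callable, Dict

# Registry mapping a parent package identifier to its conversion function.
CONVERTERS: Dict[str, Callable[[str], str]] = {}

def register(package_id: str):
    """Decorator associating a conversion function with a parent package."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        CONVERTERS[package_id] = func
        return func
    return wrap

@register("knb-lter-xyz.123")  # hypothetical parent package identifier
def to_community_model(raw_csv: str) -> str:
    """Placeholder converter: rename columns to the harmonized model's terms.
    A real converter would also reshape tables and attach the controlled
    vocabulary terms captured in the Design Phase."""
    header, *rows = raw_csv.splitlines()
    renamed = header.replace("spp_code", "taxon_id").replace("count", "abundance")
    return "\n".join([renamed, *rows])

def fetch_data(package_id: str, revision: int) -> str:
    """Stub standing in for a repository download call."""
    return "site,date,spp_code,count\nA,2021-06-01,CAREX,12"

def on_revision(package_id: str, revision: int) -> None:
    """Handler invoked by the event notification service on a new revision."""
    convert = CONVERTERS.get(package_id)
    if convert is None:
        return  # this package is not a parent of any harmonized product
    derived = convert(fetch_data(package_id, revision))
    print(f"harmonized {package_id} rev {revision}:\n{derived}")

if __name__ == "__main__":
    # Simulate one notification event for the registered parent package.
    on_revision("knb-lter-xyz.123", revision=5)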

Kristin Vanderbilt et al.

In this era of open data and reproducible science, graduate students need to learn where and how to publish their data and to be conversant with the challenges inherent in reusing someone else’s data. The Environmental Data Initiative partnered with UNM Libraries and the Florida Coastal Everglades LTER to organize a 1-credit, semester-long, distributed graduate seminar to learn whether this approach could be an effective mechanism for transmitting such information. Each week during the Spring 2021 semester, an informatics specialist spoke remotely to students at the University of New Mexico, Florida International University, and the University of Wisconsin-Madison on topics ranging from FAIR principles to data security, and from team science to data provenance. Students prepared for each lecture with one or more readings, and in-class exercises reinforced the material covered. Student assignments included writing quality metadata for their own data and archiving their data in the EDI Repository. The capstone writing assignment, a data management plan for their own research project, allowed the students to integrate much of what they had learned. Student response to the class was positive, and students indicated that they learned a great deal of immediately useful information without the course being a significant time sink. However, the low enrollment at UNM and FIU (6 and 7 students, respectively), where the seminar was not required, suggests a need to better inform both students and their advisors of the opportunity and the value provided by the training. Instructors also learned that it would be easier to create a cohesive flow to the course, without repetition, if the group of instructors took turns lecturing rather than bringing in specialists on each subject. It was also apparent from student comments that many felt this information should be integrated, at an introductory level, into undergraduate classes or classes for new graduate students.