Data structures are the foundation upon which computational tools are built. For example, the simple pointer-to-memory approach, established by languages such as Fortran and C, acts as a _de facto_ standard by which different packages and libraries can interoperate through a single shared array of numerical data in memory. While this simple abstraction for n-dimensional arrays has served us well in the past, there is a clear need for data structures that have richer semantics and make it easy to express and manipulate common forms of (semi-)structured data. This need is highlighted by the popularity of R’s data frames and of Python libraries such as bcolz (column storage), pandas (indexed data frames), and xray (n-dimensional indexed arrays). This paper aims to present the state of the art in data structures, across programming languages and implementations, that are foundational in data science, scientific computing, and statistical applications. It reviews current data representation semantics implemented by various libraries, packages, and languages, with an explicit emphasis on interoperability across languages and process boundaries.
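The shared-memory interoperability described above can be illustrated with a minimal NumPy sketch (NumPy is one concrete instance of the pointer-to-memory model, not a library discussed by name in this abstract): several array objects are views over one underlying buffer, so a mutation through any of them is visible through all.

```python
import numpy as np

# One contiguous block of numerical data in memory.
buf = np.arange(12, dtype=np.float64)

# Reshaping and slicing create new array objects, but no copies:
# both are strided views over the same underlying buffer.
view = buf.reshape(3, 4)
col = view[:, 1]

# A write through the flat buffer is visible through every view.
buf[1] = 99.0
assert col[0] == 99.0
assert np.shares_memory(buf, col)
```

The same buffer could equally be handed to a C or Fortran routine via its pointer, which is exactly the interoperability contract the pointer-to-memory model provides.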
ABSTRACT A central tenet in support of research reproducibility is the ability to uniquely identify research resources, i.e., reagents, tools, and materials that are used to perform experiments. However, current reporting practices for research resources are insufficient to identify the exact resources that are reported or to answer basic questions such as “How did other studies use resource X?”. To address this issue, the Resource Identification Initiative was launched as a pilot project to improve the reporting standards for research resources in the methods sections of papers and thereby improve identifiability and reproducibility. The pilot engaged over 25 biomedical journal editors from most major publishers, as well as scientists and funding officials. Authors were asked to include Research Resource Identifiers (RRIDs) in their manuscripts prior to publication for three resource types: antibodies, model organisms, and tools (i.e., software and databases). RRIDs are assigned by an authoritative database, for example a model organism database, for each type of resource. To make it easier for authors to obtain RRIDs, resources were aggregated from the appropriate databases and their RRIDs made available in a central web portal (scicrunch.org/resources). RRIDs meet three key criteria: they are machine readable, free to generate and access, and consistent across publishers and journals. The pilot was launched in February of 2014, and over 300 papers that report RRIDs have appeared. The number of participating journals has expanded from the original 25 to more than 40. Here, we present an overview of the pilot project and its outcomes to date. We show that authors are able to identify resources and are supportive of the goals of the project. Identifiability of the resources post-pilot showed a dramatic improvement for all three resource types, suggesting that the project has had a significant impact on reproducibility relating to research resources.
ABSTRACT: The time delays between point-like images in gravitational lens systems can be used to measure cosmological parameters as well as probe the dark matter (sub-)structure within the lens galaxy. The number of lenses with measured time delays is growing rapidly due to dedicated efforts. In the near future, the upcoming _Large Synoptic Survey Telescope_ (LSST) will monitor ∼10³ lens systems consisting of a foreground elliptical galaxy producing multiple images of a background quasar. In an effort to assess the present capabilities of the community to accurately measure the time delays in strong gravitational lens systems, and to provide input to dedicated monitoring campaigns and future LSST cosmology feasibility studies, we pose a “Time Delay Challenge” (TDC). The challenge is organized as a set of “ladders,” each containing a group of simulated datasets to be analyzed blindly by participating independent analysis teams. Each rung on a ladder consists of a set of realistic mock observed lensed quasar light curves, with the rungs’ datasets increasing in complexity and realism to incorporate a variety of anticipated physical and experimental effects. The initial challenge described here has two ladders, TDC0 and TDC1. TDC0 has a small number of datasets, and is designed to be used as a practice set by the participating teams as they set up their analysis pipelines. The non-mandatory deadline for completion of TDC0 will be December 1, 2013. The teams that perform sufficiently well on TDC0 will then be able to participate in the much more demanding TDC1. TDC1 will consist of 10³ light curves, a sample designed to provide the statistical power to make meaningful statements about the sub-percent accuracy that will be required to provide competitive dark energy constraints in the LSST era.
In this paper we describe the simulated datasets in general terms, lay out the structure of the challenge and define a minimal set of metrics that will be used to quantify the goodness-of-fit, efficiency, precision, and accuracy of the algorithms. The results for TDC1 from the participating teams will be presented in a companion paper to be submitted after the closing of TDC1, with all TDC1 participants as co-authors.
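As a rough illustration of what such challenge metrics can look like, the sketch below computes one plausible set of per-ladder summary statistics from submitted time-delay estimates: an efficiency (fraction of light curves for which an estimate was returned), a reduced chi-squared against the true delays, a claimed fractional precision, and a fractional bias. The function name and these particular formulas are illustrative assumptions, not the challenge's official definitions, which are laid out in the paper itself.

```python
import numpy as np

def tdc_metrics(est, err, true):
    """Illustrative per-ladder summary statistics for time-delay estimates.

    est, err : arrays of estimated delays and their 1-sigma uncertainties
               (NaN where a team declined to submit an estimate)
    true     : array of the true delays used to generate the mock light curves
    """
    submitted = ~np.isnan(est)
    f = submitted.mean()                       # efficiency: fraction of curves measured
    d, s, t = est[submitted], err[submitted], true[submitted]
    chi2 = np.mean(((d - t) / s) ** 2)         # goodness of fit of the estimates
    precision = np.mean(s / np.abs(t))         # average claimed fractional precision
    accuracy = np.mean((d - t) / t)            # fractional bias of the estimates
    return f, chi2, precision, accuracy

# Hypothetical submission: two estimates returned out of three light curves.
f, chi2, precision, accuracy = tdc_metrics(
    np.array([10.0, np.nan, 20.0]),
    np.array([1.0, 1.0, 2.0]),
    np.array([10.0, 5.0, 20.0]),
)
```

Averaging only over the submitted subset couples the four numbers: a team can trade efficiency for precision by declining to fit the hardest light curves, which is why a challenge of this kind reports all of them together.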