[ToDo: COS] All but three items in ASAPbio’s
requirements are complete (items ii, iii, and vi from Section 2 above).
Integration with Crossref metadata will enable easier discovery of other
versions. Once full text is included, search results will be displayed
with context. OSF’s rendering system supports conversion pipelines, and
a UI to download in alternative formats will be available. For item ix,
we will conduct an accessibility review.
Provide an alerting tool that delivers email or other
notifications to authors when content of interest (by keyword, by
author, citing their article, etc.) appears.
[Done: COS] Users can choose to receive notifications never,
as events occur, or as a daily digest. Notifications are enabled for
behaviors related to commenting and adding files. Users control
notification settings at both the user and project level, so authors can
elect to receive notifications for some preprints and not others.
[ToDo: COS] Add subscriptions to notifications by
author, keyword, discipline, and service, and surface the notification
controls for one’s own preprints more efficiently.
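The notification settings described above (a frequency choice plus user- and project-level control) can be sketched as a small data model. This is a hypothetical illustration of the scheme, not COS’s actual implementation; all names are invented.

```python
from dataclasses import dataclass, field
from enum import Enum


class Frequency(Enum):
    """How often a user receives notifications."""
    NEVER = "never"
    AS_THEY_OCCUR = "instant"
    DAILY_DIGEST = "daily"


@dataclass
class NotificationPrefs:
    """User-level default with optional per-preprint overrides."""
    default: Frequency = Frequency.AS_THEY_OCCUR
    overrides: dict = field(default_factory=dict)  # preprint id -> Frequency

    def frequency_for(self, preprint_id: str) -> Frequency:
        # A project-level override wins over the user-level default,
        # letting authors mute some preprints and not others.
        return self.overrides.get(preprint_id, self.default)


prefs = NotificationPrefs()
prefs.overrides["abc12"] = Frequency.NEVER  # mute a single preprint
```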
Respondents are also invited to highlight other functionality
they would suggest the site should support, although all features and
functionality of the CS will require Governance Body approval.
See section 9 (page 20)
Preprints Commons Machine Interface
API for manuscripts
Make available the full text of preprints (converted version and author’s original files) by RESTful API and bulk download
OAI-PMH endpoint, or equivalent if standards change over time
Capture metrics on API use
[Done: COS] A RESTful API is available for hosted
preprints; converted output will be added.
[ToDo: COS] A similar API will be available for
full text harvested from other services. Bulk downloads will be
accessible as time-based archive files, torrents, and, potentially,
rsync. An OAI-PMH endpoint is due for release in May 2017. A new metrics
API aligned with community efforts (e.g., COUNTER) is scheduled for Q3
to support more robust queries.
Screening training set
Make available all manuscripts (both those that passed
screening and those that did not) and their screening histories for
bulk download, with any sensitive information redacted. This corpus
will be used to train automated screening algorithms.
[ToDo: COS] We need clarification of the
requirements for this item. Creating a full dataset will be
straightforward if preprint services are willing and able to share
full-text of approved and rejected manuscripts.
We also need clarification on how the automated screening algorithms can
be trained if the sensitive information is redacted and the redacted
information is the basis for rejection. Our operating assumption is that
the training dataset will remain secure. Researchers developing
screening algorithms would gain approved access.
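To make concrete the kind of redaction pass under discussion, the sketch below masks obvious identifiers (email addresses and phone-like strings) before text enters a shared corpus. The patterns and mask string are placeholder assumptions; the actual redaction requirements are the subject of the clarification requested above.

```python
import re

# Placeholder patterns for two common classes of sensitive token.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str, mask: str = "[REDACTED]") -> str:
    """Replace sensitive tokens with a fixed mask before corpus export."""
    text = EMAIL.sub(mask, text)
    text = PHONE.sub(mask, text)
    return text
```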
Manuscript Conversion Tool
Convert the full text of manuscripts to XML or equivalent,
tagged according to JATS4R standards. As a minimum it should be
possible to convert a .docx or .tex file to XML. The feasibility of
converting a .pdf file to XML should be discussed.
[Done: Partners] Manuscripts performs document import and
export with a cross-platform component, executable as a web service,
that orchestrates pre-existing open-source and custom document
transformation steps. Heuristics infer and correct format-specific
issues in Word, LaTeX, and Markdown documents. A dozen input formats are
supported (DOC, DOCX, DocBook XML, Evernote XML, EPUB, LaTeX, Markdown,
ODT, OPML, Markdown variants, ReStructured Text, Microsoft-style RTF,
Textile).
The document importers support scholarly needs: footnotes, endnotes,
equations, and cross-references are preserved; embedded citation
metadata from EndNote, Papers, Zotero, and Mendeley field codes is
processed in the case of .docx; and tracked changes are applied
automatically. Citations, equations, figures, and tables with their
captions are captured from LaTeX and converted to XML. The
implementation has been hardened against a large collection of
user-provided documents.
[ToDo: Partners] Prepare for open sourcing. We expect to
achieve basic PDF-to-XML extraction: indexing-grade text, extracted
figures, and the document outline. Experience to date suggests that
inferring structure in PDF documents beyond this carries high
implementation risk and diminishing returns; PDF is a drawing format
intended for print, so translating it into structured rich text only
incrementally improves indexing quality and will not yield XML from
which a readable document can be produced.
Allow authors to preview, proof, and modify the converted XML
through an author-friendly interface
[Done: Partners] The editing experience is fully
WYSIWYG. The format produced by the editor follows the JATS4R
recommendations, and document validity is enforced continuously as user
edits are persisted to a version history. In addition to JATS4R,
Manuscripts can provide HTML5 export normalised to the Scholarly HTML
tagging conventions drafted by the W3C Scholarly HTML Community Group.
[ToDo: Partners] These services will be adapted to
allow preprint authors to preview, proof, and modify the converted XML.
The editing experience will be integrated with the preprints
infrastructure for accounts, storage, and notifications.
The manuscript conversion tool must be able to interoperate with
a wide range of preprint servers and journals that operate on
different technological platforms.
[ToDo: Partners] The service will be modularized so
that preprint services can integrate it into their own workflows. We
will expose the PressRoom document conversion service via a REST API.
The conversion process could occupy several different positions
in the pipeline of manuscripts, and the preferred position will depend
on input from the GB and the ingestion sources. Describe how you would
implement the following options (and any others that you foresee will
be compatible with many ingestion sources) and indicate which you see
as the preferred option.
All material coming into the Commons from ingest sources will
be ingested as structured XML. In this case, the conversion tool is
offered as a software or hosted as a Commons service accessible by
API, and individual ingest sources are responsible for implementing
it or generating XML by alternative means.
Material coming from ingest sources can be provided to the
Commons as the author’s original manuscript file (such as .docx or
.tex). In this case, the Commons would convert the manuscript and
contact authors (i.e., by email) to invite them to proof a rendition of
the manuscript after conversion to XML. The ingest source could
later retrieve the converted manuscript from the Commons.
The conversion tool could function upstream of ingest sources.
A submission tool hosted by the CS could convert manuscripts for
author proofing and then provide options for authors to send the
converted manuscripts (or original files, or PDFs) to other ingest
sources.
[ToDo: Partners] A RESTful web service for document
conversion will accept any of the 14 supported input file formats and
provide HTML5 or JATS XML as output. This comprises the document
conversion service and a native desktop application for proofing.
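A client of such a conversion service would POST the author’s source file and negotiate the output format via headers. The endpoint URL, header choices, and media type below are illustrative assumptions, not the published PressRoom interface:

```python
import urllib.request

# Hypothetical conversion endpoint; the real URL would be published
# with the PressRoom REST API.
CONVERT_URL = "https://pressroom.example.org/convert"


def build_convert_request(file_bytes: bytes, filename: str,
                          accept: str = "application/jats+xml"
                          ) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP request for conversion.

    Pass accept="text/html" to request HTML5 output instead of JATS.
    """
    return urllib.request.Request(
        CONVERT_URL,
        data=file_bytes,
        method="POST",
        headers={
            "Accept": accept,
            "Content-Type": "application/octet-stream",
            "Content-Disposition": f'attachment; filename="{filename}"',
        },
    )
```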
Automated Screening Tool
Flag manuscripts with characteristics similar to those of
manuscripts that have not passed screening. Initial factors may
include single authorship, presence of certain keywords, lack of
scientific writing style, image style, lack of biological subject
matter, presence of human faces, etc.
[ToDo: COS] As the corpus of preprints builds in a
format conducive to programmatic access (e.g., JATS), properties of the
preprints will be extracted and persisted. As submissions are determined
to be spam, non-scientific, etc., these properties will be associated
with the reason for disqualification. These data will be used to compare
incoming submissions as part of the submission pipeline, flagging for
human intervention both distribution outliers and manuscripts similar to
those previously disqualified (compared using multiple metrics that can
feed, e.g., random forest models and support vector machines). We will
ask peer preprint services for access to their screened-out manuscripts
for the corpus, where they have license to share them.
Early comparison data will come from metadata currently harvested by
SHARE and from full text collected for use in The Commons aggregated
search and machine-learning corpus. Heuristic and targeted
identification will flag submissions. Face detection (e.g., with OpenCV)
will be run against extracted figures to notify moderators of potential
ethical violations. Lee Giles (PSU; CiteSeerx) will consult on these
tasks, especially for advanced use cases.
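The pipeline described above (extract properties, then flag outliers and near-matches for human review) can be illustrated with a toy feature extractor. The keyword list, thresholds, and rule-based flag below are placeholders; in the planned system these features would feed trained classifiers (e.g., random forests or SVMs) rather than fixed rules.

```python
import re

# Placeholder keyword list for illustration only.
SPAM_KEYWORDS = {"casino", "lottery", "viagra"}


def extract_features(text: str, n_authors: int) -> dict:
    """Persistable submission properties, per the pipeline sketch."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "single_author": n_authors == 1,
        "spam_keyword_hits": sum(w in SPAM_KEYWORDS for w in words),
        "word_count": len(words),
    }


def flag_for_review(features: dict) -> bool:
    """Rule-based stand-in for a trained model: route to a moderator."""
    return (features["spam_keyword_hits"] > 0
            or features["word_count"] < 100)
```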
Use external services (like plagiarism detection) but ideally
develop a viable open source alternative in the long term.
[ToDo: COS] Membership in Crossref supports use of its
Similarity Check service (powered by iThenticate) to flag suspected
plagiarism. Other screening tools will be made open source, but
plagiarism detection in particular would require a corpus that is most
likely impractical to license from publishers, compared with using a
service like Similarity Check, whose access is pre-negotiated for
members.
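A long-term open-source alternative would likely rest on standard text-overlap measures. The sketch below computes word n-gram (shingle) Jaccard similarity between two documents; it is a minimal illustration of the underlying technique, far from a production plagiarism detector.

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets (0.0–1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

High similarity against any document in the corpus would flag a submission for the same human-review queue used by the other screening checks.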
Provide service accessible by API
[ToDo: COS] Pipelines created for screening will be
accessible by API.