Plan

Basic plan

Come up with a great bacronym for Habeas e.g. Helpfully Annotated Basic Encyclopedia of Academic Software
Take the CORD-19 Software Mentions dataset and clean it
Sort the dataset in order of popularity
Starting with the most popular software, split the software between us and manually identify whether the software has an associated code repository
Create the first Habeas Corpus from this exercise, which will just contain: name of the software, URL of associated code repository, list of identifiers of papers that mention it
Use this version of Habeas Corpus to test analyses that answer the questions we'd like to pose, and identify what information/metadata is missing
Go back to incorporate that metadata where simple
Present some of the questions we're now able to answer using Habeas Corpus

Based on our experience of 1. create instructions for cleaning the CORD-19 Software Mentions dataset
Based on our experience of 1. create a "software name deduplication mapping tool" which lets people know the aliases of common software titles
Create entries in Wikidata for each piece software, including its aliases, to give it a unique identifier
Create tools to make it easier to automatically

Question	Data / Metadata required	Notes
Most popular software	Software mentions, software aliases
Usage	?	How do we identify number of users? Number of unique publication authors who mention it? Number of unique publications that mention it?
Type of License	Link to code repository or license file	Should this be dynamically fetched?
Filter software by purpose	?	Requires classification of purpose
How software is being used	Where software is mentioned, other context?	Is there a good way of doing this in an automated way?
Analyse dependencies	Link to code repo
Software metrics	Link to code repo
Software project metrics (e.g. how often developed, contributors)	Link to code repo
Academic disciplines a piece of software is used in	Links to paper IDs that mention it
Who funded the software	Links to grants, funders	Perhaps through acknowledgements in the paper that mentions it? Only works if author created
Who funded the use of the software to do a piece of research	Link to paper ID and ability to extract research funder info from that
How detailed is the mention of the software (e.g. citation, version number, release year)	Full text encoding?	This might be better applied to the SoftCite dataset which has this information already coded and annotated
Summarise by platform, e.g. all Python packages	Platform metadata	Or some external source that allows understanding of what software is "similar"

Dependencies: we could create a utility that leverages GitHub's support for understanding dependencies
License type: we could use existing utilities for finding the license of a GitHub repo
Academic disciplines: we could create a tool that extracts JATS4R Subject and Keyword tags: https://www.niso.org/publications/rp-32-2019-jats4r-subjects-and-keywords-v1
Funders: we could use PID Graph / OpenAIRE funder database, or scrape from papers