Revamp C2M2 docs (#239)

* Update C2M2 text (#210) * Fix link rendering * Handle external C2M2 doc rendering (#167)
MaayanLab · Apr 24, 2024 · 5c4d229 · 5c4d229
1 parent 24e471c
commit 5c4d229
Show file tree

Hide file tree

Showing 11 changed files with 277 additions and 280 deletions.
diff --git a/drc-portals/app/info/documentation/C2M2/[doc]/page.tsx b/drc-portals/app/info/documentation/C2M2/[doc]/page.tsx
@@ -0,0 +1,50 @@
+import { Grid } from "@mui/material";
+import ReactMarkdown from 'react-markdown'
+import remarkGfm from "remark-gfm"
+import { fetchC2m2Markdown, LinkRenderer, HeadingRenderer } from '@/components/misc/ReactMarkdownRenderers'
+
+export default async function StandardsPage(
+  { params } : { params: { doc: string } }
+) {
+    function fixProblematic ( page: string ) {
+      const problematic = [
+        '-subject_phenotype.tsv',
+        '-phenotype_gene.tsv',
+        '-phenotype_disease.tsv',
+        '-phenotype.tsv',
+        '-collection_phenotype.tsv',
+        '-collection_disease.tsv',
+        '-analysis_type.tsv'
+      ]
+      for (let pg of problematic) {
+        if (page.endsWith(pg)) {
+          return page.replace('TableInfo', 'Tableinfo').concat('.md')
+        } 
+      }
+      return page.concat('.md')
+    }
+
+    const suffix = fixProblematic(params.doc)
+    const markdown = await fetchC2m2Markdown(suffix)
+    const title = '# ' + suffix.split('-')[1].split('.md')[0] + '\n\n'
+    return (
+      <Grid container sx={{ml:3, mt:3}}>
+        <Grid item sx={{mb:5}}>
+          <br/>
+          <ReactMarkdown 
+            remarkPlugins={[remarkGfm]}
+            components={{ 
+              a: LinkRenderer,
+              h2: HeadingRenderer,
+              h3: HeadingRenderer
+          }} className="prose">
+            {''.concat(
+              title,
+              markdown.replace('https://docs.nih-cfde.org/en/latest/c2m2/draft-C2M2_specification/', 'https://github.com/nih-cfde/c2m2/blob/master/draft-C2M2_specification/'),
+              '\n#### Return to [C2M2 Documentation](./)' 
+            )}
+          </ReactMarkdown>
+        </Grid>
+      </Grid>
+    )
+}
diff --git a/drc-portals/app/info/documentation/C2M2/intro.md b/drc-portals/app/info/documentation/C2M2/intro.md
@@ -0,0 +1,40 @@
+## Introduction
+The Crosscut Metadata Model (C2M2) is a flexible metadata standard for describing experimental resources in biomedicine and related fields. A complete C2M2 submission, also called an "instance" or a "datapackage", is a zipped folder containing multiple tab-delimited files (TSVs) representing metadata records along with a JSON schema. To read more about datapackages, skip to the [Frictionless Data Packages](#frictionless-data-packages) section. 
+
+Each TSV file is a **data table** containing various **data records** (rows) and their values for different **fields** (columns). **Entity tables** describe various types of data objects, while **association tables** describe the relationships between different entities. 
+
+This page is adapted from the [original C2M2 documentation](https://github.com/nih-cfde/published-documentation) developed by the CFDE Coordination Center (CFDE-CC). 
+
+## Resources
+* The [c2m2-frictionless-dataclass](https://github.com/nih-cfde/c2m2-frictionless-dataclass/tree/main) 
+    - This repository includes the `c2m2-frictionless` Python package, which contains specific helper functions for C2M2 datapackage building and validation. 
+    - This package is not designed to generate a complete C2M2 datapackage from any given data, but should be used in collaboration with the provided schema and ontology preparation scripts, as well as with DCC-specific scripts.
+* The most up-to-date [C2M2 JSON schema](https://osf.io/c63aw/)
+* The most up-to-date [C2M2 ontology preparation script and files](https://osf.io/bq6k9/)
+* The original [C2M2 documentation](https://github.com/nih-cfde/c2m2/blob/master/draft-C2M2_specification/README.md) from the CFDE Coordination Center contains more details on the concepts discussed here.
+
+## Frictionless Data Packages
+C2M2 instances are also known as "datapackages" based on the [Data Package](http://frictionlessdata.io/docs/data-package/) meta-specification from [Frictionless Data](http://frictionlessdata.io/). 
+
+From the original C2M2 documentation: 
+> The Data Package meta-specification is a platform-agnostic toolkit for defining format and content requirements for files so that automatic validation can be performed on those files, just as a database management system stores definitions for database tables and automatically validates incoming data based on those definitions. Using this toolkit, the C2M2 JSON Schema specification defines foreign-key relationships between metadata fields (TSV columns), rules governing missing data, required content types and formats for particular fields, and other similar database management constraints. These architectural rules help to guarantee the internal structural integrity of each C2M2 submission, while also serving as a baseline standard to create compatibility across multiple submissions received from different DCCs.
+
+## Identifiers
+In order to standardize and integrate information across DCCs, there must be a system of assigning unambiguous identifiers to individual DCC concepts and resources. These are the "C2M2 IDs", consisting of a `id_namespace` prefix and `local_id` suffix. Additionally, the C2M2 also allows individual resources to be assigned a `persistent_id`. 
+
+From the original C2M2 documentation: 
+> Optional `persistent_id` identifiers are meant to be stable enough to be scientifically cited, and to provide for further investigation by accessing related resolver services. To be used as a C2M2 `persistent_id`, an ID
+> 1. will represent an explicit commitment by the managing DCC that the attachment of the ID to the resource it represents is permanent and final
+> 2. must be a format-compliant URI or a compact identifier, where the protocol (the "scheme" or "prefix") specified in the ID is registered with at least one of the following (see the given lists for examples of URIs and compact identifiers)
+>     - the IANA (list of registered schemes)
+>     - scheme used must be assigned either "Permanent" or "Provisional" status
+>     - Identifiers.org (list of registered prefixes)
+>     - N2T (Name-To-Thing) (list of registered prefixes)
+>     - protocols not appearing in the above registries but explicitly approved by the CFDE-CC. Currently, this list is limited to one protocol, namely drs:// URIs identifying GA4GH Data Repository Service resources.
+> 3. if representing a file, an ID used as a `persistent_id` cannot be a direct-download URL for that file: it must instead be an identifier permanently attached to the file and only indirectly resolvable (through the scheme or prefix specified within the ID) to the file itself
+
+## C2M2 Tables
+
+*Sourced from the [CFDE Coordination Center Documentation Wiki](https://github.com/nih-cfde/published-documentation/wiki/C2M2-Table-Summary)*
+
+All of the tables in a C2M2 datapackage are inter-linked via foreign key relationships, as shown in the following diagram of the complete C2M2 system. 
diff --git a/drc-portals/app/info/documentation/C2M2/page.tsx b/drc-portals/app/info/documentation/C2M2/page.tsx
@@ -0,0 +1,62 @@
+import { Grid, Typography } from "@mui/material";
+import ReactMarkdown from 'react-markdown'
+import remarkGfm from 'remark-gfm'
+import { fetchC2m2Markdown, LinkRenderer, HeadingRenderer } from '@/components/misc/ReactMarkdownRenderers'
+import { readFileSync } from 'fs'
+
+const toc = `
+## Table of Contents
+* [Introduction](#introduction)
+* [Resources](#resources)
+* [Frictionless Data Packages](#frictionless-data-packages)
+* [Identifiers](#identifiers)
+* [C2M2 Tables](#c2m2-tables)
+* [Submission Prep Script](#submission-prep-script)
+* [General Steps](#general-steps)
+* [Tutorial](#tutorial)
+`
+
+export default async function C2M2Page() {
+  const intro = readFileSync(
+    'app/info/documentation/C2M2/intro.md', 
+    {encoding:'utf8', flag:'r'}
+  ) 
+  const c2m2Tables = await fetchC2m2Markdown(
+    'C2M2-Table-Summary.md'
+  )
+  const submissionPrep = await fetchC2m2Markdown(
+    'submission-prep-script.md'
+  )
+  const tutorial = readFileSync(
+    'app/info/documentation/C2M2/tutorial.md', 
+    {encoding:'utf8', flag:'r'}
+  )
+  return (
+    <Grid container sx={{ml:3, mt:3}}>
+      <Grid item xs={12}>
+        <Typography variant="h1" color="secondary.dark">C2M2</Typography>
+      </Grid>
+      <Grid item sx={{mb:5}}>
+        <br/>
+        <ReactMarkdown 
+          remarkPlugins={[remarkGfm]}
+          components={{ 
+            a: LinkRenderer,
+            h2: HeadingRenderer,
+            h3: HeadingRenderer
+        }} className="prose">
+          {''.concat(
+            toc, // table of contents
+            intro, // intro
+            c2m2Tables.replaceAll('./', './C2M2/').replace('/submission-prep-script', '#submission-prep-script'), // C2M2 table descriptions from original docs
+            '\n## Submission Prep Script\n',
+            '*Sourced from the [CFDE Coordination Center Documentation Wiki](https://github.com/nih-cfde/published-documentation/wiki/submission-prep-script)*\n',
+            submissionPrep.replace('## Overview', '').replace('## Usage', '## General Steps').split('This script is under')[0], //
+            tutorial, // overview of steps
+            '\n#### Return to [Documentation](./)'
+          )}
+        </ReactMarkdown>
+      </Grid>
+    </Grid>
+  )
+}
diff --git a/drc-portals/app/info/documentation/C2M2/tutorial.md b/drc-portals/app/info/documentation/C2M2/tutorial.md
@@ -0,0 +1,104 @@
+## Tutorial
+For the April 2022 CFDE Cross-Pollination meeting, the LINCS DCC demonstrated a Jupyter Notebook tutorial on building the `file`, `biosample`, and `subject` tables for LINCS L1000 signature data. The code and files can be found at the following link: 
+
+[LINCS C2M2 Demo (04-05-2022 Cross-Pollination)](https://github.com/nih-cfde/LINCS-metadata/tree/main/scripts/April-CrossPollination-Demo)
+
+Code snippets from this tutorial corresponding to each step are reproduced below. Note that the C2M2 datapackage building process will vary across DCCs, depending on the types of generated data, ontologies, and access standards in place. In general, the process will be as follows: 
+
+1. Become familiar with the current structure of the C2M2, including the required fields across the entity and association tables, and download the most recent version of the [JSON schema](https://osf.io/c63aw/). Gather any metadata mapping files you may need. 
+
+2. Identify the relevant namespace for all files, and build the `id_namespace` and `dcc` tables first. 
+
+    - For LINCS, the namespace "https://lincsproject.org" is used. The following code will generate the LINCS `id_namespace` table: 
+    ```
+    pd.DataFrame(
+      [
+        [
+          'https://www.lincsproject.org', # id
+          'LINCS', # abbreviation
+          'Library of Integrated Network-Based Cellular Signatures', # name
+          'A network-based understanding of biology' # description
+          ]
+      ], 
+      columns=['id', 'abbreviation', 'name', 'description']
+    ).to_csv('id_namespace.tsv', sep='\t', index=False)
+    ```
+
+3. Identify all relevant projects and their associated files that will be included in the C2M2 datapackage. Generate container entity tables (`project`, `collection`) that describe logic, theme, or funding-based groups of the core entities. Note that projects and collections may be nested, in the `project_in_project` and `collection_in_collection` tables. 
+
+    - In the LINCS tutorial, the data comes from the 2021 release of the LINCS L1000 Connectivity Map dataset. In creating a project representing the files from this dataset, there must also be an overarching root project for all LINCS data. 
+    ```
+    pd.DataFrame(
+      [
+        [ 
+          'https://www.lincsproject.org', # id_namespace
+          'LINCS', # local_id
+          'https://www.lincsproject.org', # persistent_id
+          date(2013, 1, 1), # creation_time
+          'LINCS', # abbreviation
+          'Library of Integrated Network-Based Cellular Signatures', # name
+          'A network-based understanding of biology' # description
+        ], 
+        [
+          'https://www.lincsproject.org', # id_namespace
+          'LINCS-2021', # local_id
+          'https://clue.io/data/CMap2020#LINCS2020', # persistent_id
+          date(2020, 11, 20), # creation_time
+          'LINCS-2021', # abbreviation
+          'LINCS 2021 Data Release', # name
+          'The 2021 beta release of the CMap LINCS resource' # description
+        ]
+      ], 
+      columns=[
+        'id_namespace', 'local_id', 'persistent_id', 'creation_time', 
+        'abbreviation', 'name', 'description'
+      ]
+    ).to_csv('project.tsv', sep='\t', index=False)
+
+    pd.DataFrame(
+      [
+        [
+          'https://www.lincsproject.org', # parent_project_id_namespace
+          'LINCS', # parent_project_local_id
+          'https://www.lincsproject.org/', # child_project_id_namespace
+          'LINCS-2021' # child_project_local_id
+        ]
+      ],
+      columns=[
+        'parent_project_id_namespace', 'parent_project_local_id',
+        'child_project_id_namespace', 'child_project_local_id'
+      ]
+    ).to_csv('project_in_project.tsv', sep='\t', index=False)
+    ```
+
+4. Determine the relationships between files and their associated samples or biological subjects, as well as all relevant assay types, analysis methods, data types, file formats, etc. Also identify all appropriate ontological mappings, if any, corresponding to each value from above. 
+
+    - The LINCS DCC includes internal drug, gene, and cell line identifiers, which were mapped to PubChem, Ensembl, and Disease Ontology/UBERON manually ahead of time, but other DCCs may already make use of the CFDE-supported ontologies. 
+    - For instance, the LINCS signature `L1000_LINCS_DCIC_ABY001_A375_XH_A16_lapatinib_10uM.tsv.gz` comes from the biosample `ABY001_A375_XH_A16_lapatinib_10uM` (in this case an experimental condition) and the subject cell line `A375`; has a data type of expression matrix (`data:0928`); is stored as a TSV file format (`format:3475`) with GZIP compression format (`format:3989`); and has a MIME type of `text/tab-separated-values`. 
+    - The `ABY001_A375_XH_A16_lapatinib_10uM` biosample was obtained via the L1000 sequencing assay type (`OBI:0002965`); comes from a cell line derived from the skin (`UBERON:0002097`); and was treated with the compound lapatinib (`CID:208908`). 
+
+6. Either manually or programmatically, generate each data table, starting with the core entity tables (`file`, `biosample`, `subject`). This step will depend entirely on the format of a DCC's existing metadata and ontology mapping tables. 
+
+7. Generate the inter-entity linkage association tables (`file_describes_biosample`, `file_describes_subject`, `biosample_from_subject`). 
+    
+    - In the LINCS tutorial, since the filenames come directly from the biosamples, once the `file` and `biosample` tables have both been built, `file_describes_biosample` can be generated. 
+    ```
+    fdb = file_df[['id_namespace', 'local_id']].copy()
+    fdb = fdb.rename(
+      columns={
+        'id_namespace': 'file_id_namespace', 
+        'local_id': 'file_local_id'
+      }
+    )
+    fdb['biosample_id_namespace'] = fdb['file_id_namespace']
+    fdb['biosample_local_id'] = fdb['file_local_id'].apply(file_2_biosample_map_function)
+    ```
+
+8. Assign files, biosamples, and subjects to any collections, if applicable, using the `file_in_collection`, `subject_in_collection`, and `biosample_in_collection` tables. 
+
+    - Collections are optional, and can represent files from the same publications or other logical groupings outside of funding. 
+
+9. Use provided [C2M2 submission preparation script and ontology support files](https://osf.io/bq6k9/) to automatically build term entity tables from your created files. 
+
+10. Validate the final datapackage, containing all files and the schema, and submit! 
+