Initial commit - Spring Batch Importer Pipeline #1

Open · ao508 wants to merge 1 commit into master from spring-importer
Conversation

@ao508 commented Aug 18, 2016

General

  • added main class
  • added some data models (rest will be added as needed)
  • added a data file utils class (may not need ultimately)
  • added .gitignore
  • Standardized SQL statements and named parameter types
  • Updated DB version in master pom.xml
  • Datafiles for each datatype will be determined by the metafile data_filename property. Special cases include clinical-supp, where the same metafile is used as clinical; in these cases the default data filenames are also searched for (see the sketch below). Other examples where the same meta filename is used for different datatypes include mutation-germline and mutation-manual, which share the same meta filename as mutation.
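
A minimal sketch of how such a metafile lookup could work. The class name, helper, and the list of default clinical filenames are hypothetical; the PR's actual parsing may differ:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class DataFileResolver {

    // Hypothetical fallbacks searched when datatypes share a metafile (e.g. clinical-supp).
    private static final List<String> DEFAULT_CLINICAL_FILENAMES =
            Arrays.asList("data_clinical.txt", "data_clinical_supp.txt");

    /**
     * Returns the datafile named by the metafile's data_filename property,
     * falling back to conventional default filenames when the property is
     * absent or the named file does not exist.
     */
    public static Path resolveDataFile(Path studyDirectory, Path metaFile) throws IOException {
        Properties meta = new Properties();
        try (FileInputStream in = new FileInputStream(metaFile.toFile())) {
            meta.load(in);
        }
        String dataFilename = meta.getProperty("data_filename");
        if (dataFilename != null) {
            Path candidate = studyDirectory.resolve(dataFilename.trim());
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        for (String fallback : DEFAULT_CLINICAL_FILENAMES) {
            Path candidate = studyDirectory.resolve(fallback);
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        return null; // caller decides whether a missing datafile is fatal
    }
}
```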

Deleting cancer studies:

  • Added option to delete cancer study given a cancer study identifier
  • Added tasklet for deleting cancer study
  • Added job for deleting cancer study
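
As a rough illustration (not the PR's actual classes), a delete-study job in Spring Batch reduces to a single tasklet step that reads the cancer study identifier from the job parameters. The configuration class name, job parameter key, and DAO call below are assumptions:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DeleteCancerStudyConfiguration {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Job deleteCancerStudyJob() {
        return jobBuilderFactory.get("deleteCancerStudyJob")
                .start(deleteCancerStudyStep())
                .build();
    }

    @Bean
    public Step deleteCancerStudyStep() {
        return stepBuilderFactory.get("deleteCancerStudyStep")
                .tasklet(deleteCancerStudyTasklet())
                .build();
    }

    @Bean
    public Tasklet deleteCancerStudyTasklet() {
        return (StepContribution contribution, ChunkContext chunkContext) -> {
            // The identifier arrives as a job parameter from the command line.
            String studyId = (String) chunkContext.getStepContext()
                    .getJobParameters().get("cancerStudyIdentifier");
            // cancerStudyDao.deleteCancerStudy(studyId); // hypothetical DAO call
            return RepeatStatus.FINISHED;
        };
    }
}
```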

Refactored importer:

  • removed persistence classes and models
  • using cbioportal/persistence repo instead

Configurations

  • batch configuration, data source configuration, and datatype configurations added
  • step execution configurations for importing will be organized by datatype

Job Execution:

  • added checks to make sure the cancer study meta file exists and that the data was loaded before running the rest of the importer pipeline

DB Compatibility:

  • Added db schema check to be used by job decider to determine whether or not to run the job
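
The schema check fits Spring Batch's JobExecutionDecider contract. A minimal sketch, assuming the expected version is injected from the pom and that the portal's info table carries a DB_SCHEMA_VERSION column (the class name and the "RUN" status value are illustrative, echoing the flow configuration later in this PR):

```java
import javax.sql.DataSource;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;
import org.springframework.jdbc.core.JdbcTemplate;

public class DbSchemaVersionDecider implements JobExecutionDecider {

    private final JdbcTemplate jdbcTemplate;
    private final String expectedVersion;

    public DbSchemaVersionDecider(DataSource dataSource, String expectedVersion) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        this.expectedVersion = expectedVersion;
    }

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // The portal records its schema version in the info table.
        String dbVersion = jdbcTemplate.queryForObject(
                "SELECT DB_SCHEMA_VERSION FROM info", String.class);
        return expectedVersion.equals(dbVersion)
                ? new FlowExecutionStatus("RUN")
                : FlowExecutionStatus.FAILED;
    }
}
```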

Dao classes and repositories:

  • Created interfaces for Dao classes
  • Moved sql and database interactions to Dao classes
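
A hypothetical example of the resulting pattern: a small Dao interface paired with a JDBC implementation built on NamedParameterJdbcTemplate, which is what the standardized named-parameter SQL style implies. The table and column names follow cBioPortal conventions but should be read as illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

/** Hypothetical Dao interface; the PR defines one per database entity. */
interface SampleJdbcDao {
    Integer getInternalId(String stableId, int cancerStudyId);
}

class SampleJdbcDaoImpl implements SampleJdbcDao {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    SampleJdbcDaoImpl(DataSource dataSource) {
        this.jdbcTemplate = new NamedParameterJdbcTemplate(dataSource);
    }

    @Override
    public Integer getInternalId(String stableId, int cancerStudyId) {
        // Named parameters keep the SQL readable and consistently styled.
        String sql = "SELECT INTERNAL_ID FROM sample "
                   + "WHERE STABLE_ID = :stable_id AND PATIENT_ID IN "
                   + "(SELECT INTERNAL_ID FROM patient WHERE CANCER_STUDY_ID = :cancer_study_id)";
        Map<String, Object> params = new HashMap<>();
        params.put("stable_id", stableId);
        params.put("cancer_study_id", cancerStudyId);
        return jdbcTemplate.queryForObject(sql, params, Integer.class);
    }
}
```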

Cancer Study:

  • moved step/flow configuration out of BatchConfiguration

Clinical Data

  • step execution configurations added for clinical data
  • all clinical datatypes use the same step components - execution is decided by step name (see the sketch after this list)
  • Updated clinical data writer and clinical data configurations
  • Changed import of new patients/samples to single imports vs batch
  • Added support for clinical supp files
  • Added support for bcr clinical files
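
A minimal sketch of the shared-components idea: one processor serves every clinical datatype and branches on the step name, which Spring Batch exposes via StepExecution. The record type and step names are assumptions, and this relies on Spring Batch auto-registering chunk components that carry listener annotations:

```java
import java.util.Map;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;

public class ClinicalDataProcessor
        implements ItemProcessor<Map<String, String>, Map<String, String>> {

    private boolean isSupplementalData;

    @BeforeStep
    public void saveStepName(StepExecution stepExecution) {
        // hypothetical step names, e.g. "clinicalDataStep" vs "clinicalSuppDataStep"
        this.isSupplementalData = stepExecution.getStepName().contains("Supp");
    }

    @Override
    public Map<String, String> process(Map<String, String> record) throws Exception {
        if (isSupplementalData) {
            // supp files only attach extra attributes to existing patients/samples,
            // so records without a known sample id would be filtered out here
        }
        return record;
    }
}
```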

Mutation Data:

  • added a composite maf record class to hold all possible fields in a maf record
  • Added step configuration for importing mutation data
  • Added a reader, processor, writer, listener, and tasklets for mutation data
  • Added utils classes for mutation data processing

Profile (Tab-delim) data:

  • Added a tasklet for loading profile metadata from datafile (header, non-case id columns, map of sample stable id to internal id, normal case id columns)
  • Added a reader for loading all profile (tab-delim) records from a profile datafile
  • Added general model for loading profile data from datafiles
  • Added a composite profile record model for supporting all profile (tab-delim) datatypes when sending data to processor and writer
  • Added a writer for all tab-delim data
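
The composite record is essentially one parsed row of any tab-delimited profile file. A hedged sketch of what such a model might hold, with field names invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * One row of a tab-delimited profile datafile: the non-case-id columns
 * plus a map of internal sample id -> profile value for the case columns.
 */
public class ProfileDataRecord {

    private String hugoSymbol;
    private String entrezGeneId;
    private Map<Integer, String> caseIdToValue = new LinkedHashMap<>();

    public String getHugoSymbol() { return hugoSymbol; }
    public void setHugoSymbol(String hugoSymbol) { this.hugoSymbol = hugoSymbol; }

    public String getEntrezGeneId() { return entrezGeneId; }
    public void setEntrezGeneId(String entrezGeneId) { this.entrezGeneId = entrezGeneId; }

    public Map<Integer, String> getCaseIdToValue() { return caseIdToValue; }
    public void setCaseIdToValue(Map<Integer, String> caseIdToValue) { this.caseIdToValue = caseIdToValue; }
}
```

Because the same record type flows through the processor and writer regardless of datatype, the CNA, expression, methylation, and protein-level steps can share a single reader/writer pair and differ only in their processors.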

CNA data:

  • Added CNA data step configuration
  • Added a CNA data processor

Protein-level data:

  • Added protein-level data step configuration

Gene expression data:

  • Added gene expression data step configuration
  • TO-DO: figure out unknown datatype metadata for mrna-seq-rsem and mrna-seq-fcount

Fusion data:

  • Added fusion data step configuration
  • Added reader and processor for fusion data

Methylation data:

  • Added methylation data step configuration

Mutsig data:

  • Added mutsig data step configuration
  • Added a tasklet for loading mutsig metadata
  • Added a reader for loading mutsig data from datafile
  • Added a writer for importing mutsig records into db
  • Added a model for mutsig records

Timeline data:

  • Added timeline data step configuration
  • Added a tasklet for loading timeline metadata
  • Added a reader for loading timeline data from datafiles
  • Added a processor for transforming timeline records into clinical events and clinical event data
  • Added a writer for importing timeline records into db
  • Added a model for timeline records

Copy number segment data:

  • Added copy number seg data step configuration
  • Added a tasklet for loading copy number seg metadata and importing a copy number seg file record
  • Added a reader for loading copy number seg data from datafiles
  • Added a writer for importing copy number seg records into db
  • Added a model for copy number seg records

Gistic data:

  • Added gistic data step configuration
  • Added a tasklet for loading gistic metadata
  • Added a reader for loading gistic data from datafiles
  • Added a writer for importing gistic records into db
  • Added a model for gistic records

Structural variant data:

  • Added structural variant data step configuration
  • Added a tasklet for loading structural variant metadata
  • Added a reader for loading structural variant data from datafiles
  • Added a writer for importing structural variant records into db
  • Added a model for structural variant records

Case lists:

  • Added a reader for loading case lists from case list files
  • Added a writer for importing case lists into db

Added documentation for:

  • main batch importer job
  • cancer study step
  • clinical data step
  • mutation data step
  • CNA data step
  • protein-level step
  • gene expression step
  • fusion data step
  • methylation data step
  • mutsig data step
  • timeline data step
  • copy number seg data step
  • gistic data step
  • structural variant data step
  • case list step

  • Added documentation: File-Formats.md (same as current version in cBioPortal/cbioportal)

Moved entry-skipping logic out of readers and into processors to improve performance.

  • entries returned as null from processors are automatically filtered out of the list of entries sent to the writers
  • the number of entries skipped can be retrieved from stepExecution.getFilterCount()
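
This is standard Spring Batch chunk semantics: an ItemProcessor that returns null marks the item as filtered (not skipped), the writer never sees it, and the tally is available from StepExecution.getFilterCount(). A minimal sketch with an invented filtering rule:

```java
import org.springframework.batch.item.ItemProcessor;

public class SkippingProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String record) throws Exception {
        if (record.startsWith("#")) {
            // returning null filters the item out of the chunk;
            // it is counted by stepExecution.getFilterCount()
            return null;
        }
        return record.trim();
    }
}
```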

Signed-off-by: Angelica Ochoa [email protected]

To-Do:

  • Add unit/integration tests
  • Move progress monitoring away from plain logging (maybe Spring Batch Admin or other?)
  • Added job decider to check db compatibility before import can begin
  • Moved sql and db interactions to Dao classes
  • Moved to java 8
  • Add documentation for the intended workflow (docs folder)

Datatypes supported:

  • cancer study
  • clinical data
  • mutation data
  • clinical cases (xml)
  • timeline
  • rppa
  • rppa z-scores
  • protein quantification
  • cna
  • gene expression
  • gene expression z-scores
  • seg
  • gistic
  • methylation
  • mutsig (mutation significance)
  • fusion
  • mirna
  • mrna seq
  • structural variant

@ao508 force-pushed the spring-importer branch 4 times, most recently from 9907a40 to 097b924 on August 23, 2016 15:35
```java
private static final Log LOG = LogFactory.getLog(BatchConfiguration.class);

@Bean
public Job batchImporterJob() throws Exception {
```

Contributor: I know we talked about limiting the dependencies to just an external model module and working with the database directly. That means we do have a dependency on the db schema. As a preliminary job step, we should check the info table and make sure we support the version that is installed.


```java
@Override
public void open(ExecutionContext executionContext) throws ItemStreamException {
    this.cancerStudy = (CancerStudy) executionContext.get("cancerStudy");
```

Author (@ao508): @n1zea144 here's where I get the cancer study object from the execution context for clinical data.

@pieterlukasse (Member) commented Sep 6, 2016

@angelicaochoa @n1zea144 can we also add a docs folder with some extra information, like a high-level design (e.g. a small diagram of the intended workflow: what the intended layers are and how they work together)? This will help me review the code as well.

Regarding unit tests and integration tests: are you planning to add these in a next PR? The previous importer code had quite good code coverage; we should aim to keep and improve that 😉 For integration tests, see cBioPortal/cbioportal#1513.

@pieterlukasse (Member)

@n1zea144: @ersinciftci is moving to java 8 in cBioPortal/cbioportal#1633
Shall we do the same here?

@pieterlukasse (Member)

I think there might be something wrong with the pom.xml files as well... at least my IDE (Eclipse) doesn't recognize the project structure very well (have you experienced this as well?)

@ao508 (Author) commented Sep 6, 2016

@pieterlukasse I'll be addressing all of these as I go along after I finish incorporating the new persistence/model repo as a dependency. To address specific questions you had:

  • project structure: I don't think I've had issues. My guess is that maybe I've not committed some changes yet? I've been a little sidetracked with the new persistence repo.
  • unit tests: I'll be incorporating these as I go along, no worries! This PR is more of an intro to the spring batch tech I'll be using for the importer.
  • docs: once I get the new persistence repo incorporated and working with what I have so far, I can put together some docs with the intended workflow and any other extra info as well.
  • java 8: just got the go ahead from @n1zea144 to move to java 8

@pieterlukasse (Member)

Thanks @angelicaochoa, looking forward to these points in subsequent PRs. I would say that the documentation of the intended workflow is the most important for now.

@ao508 force-pushed the spring-importer branch 2 times, most recently from bbae622 to 8580cb7 on September 27, 2016 19:39
@ao508 (Author) commented Sep 27, 2016

@pieterlukasse I've added some documentation in the docs folder outlining the intended workflow and details on the steps that make up the batch importer job. Let me know if you have any suggestions for the documentation whenever you have a chance. Thanks!

@ao508 force-pushed the spring-importer branch 5 times, most recently from dec5551 to 31667a5 on October 4, 2016 18:41
@ao508 force-pushed the spring-importer branch 4 times, most recently from 13a025f to ceaf880 on October 5, 2016 20:51
@ao508 force-pushed the spring-importer branch 4 times, most recently from b2694ff to 1483028 on October 12, 2016 15:50
@ao508 force-pushed the spring-importer branch 5 times, most recently from a5984d6 to 9f594ed on October 19, 2016 21:23
@sheridancbio sheridancbio self-assigned this Oct 20, 2016
@sheridancbio

One commit with 87 files and 17K lines of code. Awesome! ⛰

@n1zea144 (Contributor) left a review

First off, this is a great effort on your part - it was a lot of code/logic to port from the existing scripts package and put into the spring batch framework. Well done. Having said that, I think we should not let PRs comprise such a large amount of code. In this case, we probably should have broken down this PR by datatype.

Generally speaking, I think we need to make this codebase more amenable to code changes/additions. In particular, I think the steps required and the conventions to follow to add support for new datatypes should be as clear and concise as possible. One way to do this may be to organize the code by datatype - as previously explored; maybe we should revisit this approach. In addition, I think we need to clearly define the interface between tasklets/readers/processors/writers and the MultiKeyMap used to carry data between jobs/steps, and also clearly define the keys used. I wonder if a map is the best data structure for this purpose. Maybe that is an implementation detail that should not be exposed to the user.

I like the reuse of GeneticProfileTasklet, ProfileMetadataTasklet, ProfileDataReader/Processor/Writer/Listener for common datatypes. I think this can be carried further. For example, the difference between StructuralMetadataTasklet and MutationDataTasklet is minimal (as well as others).

Also, would the code be more maintainable/clearer if the various database insertions (like phospho-gene or cna events) occurred in their own tasklets instead of within the ProfileDataProcessor?

I like that you organized the *Step classes by type, but during the code review I found myself jumping around the various Step files to put together all the components of a flow. Maybe they should be organized by flow instead of type? I also thought that the flow of data subtypes (like the affy flow calling the affy2 flow) was a bit confusing. I thought that if all the subtypes were brought into the top-level flow, it would be easier to follow.

Lastly, I noticed a lot of screening of records happening within both readers and processors. Is it necessary to do this in both places? Is it necessary to do this at all, assuming the validator will be running?

Most of the comments below, at the source level, reflect the sentiments just expressed. In some cases, I've pointed out where I thought a routine could be broken down into smaller ones. I've also pointed out some desired updates to copyright notices or package paths/names.

Oh, I assume there is a separate PR for the changes to the cBioPortal persistence/model modules?

```java
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

package org.cbio.portal.pipelines.importer;
```

Contributor: I would change all packages to be consistent with the primary portal project. I would also shorten it: org.cbioportal.importer.

```diff
@@ -0,0 +1,130 @@
/*
 * Copyright (c) 2015 Memorial Sloan-Kettering Cancer Center.
```

Contributor: Please update all copyright -- 2016

```java
    System.exit(exitStatus);
}

private static void launchImporterJob(String[] args, String stagingDirectory) throws Exception {
```

Contributor: Let's remove all references to stagingDirectory - replace with studyDirectory.

```java
}

private static void launchDeleteStudyJob(String[] args, String cancerStudyIdentifier) throws Exception {
    SpringApplication app = new SpringApplication(ImporterPipeline.class);
```

Contributor: Minor formatting issue - none of the other methods in this class have a space following the method signature.

```java
 */
@Configuration
@EnableBatchProcessing
@ComponentScan("org.mskcc.cbio.persistence.jdbc")
```

Contributor: Not sure where this is coming from yet, but I don't think we should reference an mskcc package from this code.

```java
 * @param residue
 * @return List<Gene>
 */
private List<Gene> importPhosphoGene(List<Gene> genes, String residue) {
```

Contributor: Should something like this be put into its own step?

```java
public void close() throws ItemStreamException {}

@Override
public void write(List<? extends ProfileDataRecord> list) throws Exception {
```

Contributor: This routine should be broken down into smaller components.

```java
 * @param profileDataRecord
 * @return List<CnaEvent>
 */
private List<CnaEvent> loadCnaEvents(ProfileDataRecord profileDataRecord) {
```

Contributor: Should we do this here?

.on("RUN").to(loadGeneExpressionGeneticProfile())
.next(loadGeneExpressionMetadata())
.next(geneExpressionStepBuilder("geneExpressionAffymetrixStep"))
.next(geneExpressionAffymetrixZscoresStep())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As previously commented, does the next datatype have to be nested in a previous one - can these be moved to the main step-flow?

```java
 * @param mafRecord
 * @return MafRecord
 */
private MafRecord screenMafRecord(MafRecord mafRecord) {
```

Contributor: Is this necessary if data has been validated?
