Initial commit - Spring Batch Importer Pipeline #1

Open · ao508 wants to merge 1 commit into master from spring-importer
Conversation

@ao508 commented Aug 18, 2016

General

  • added main class
  • added some data models (rest will be added as needed)
  • added a data file utils class (may not need ultimately)
  • added .gitignore
  • Standardized SQL statements and named parameter types
  • Updated DB version in master pom.xml
  • Datafiles for each datatype will be determined by the metafile data_filename property. Special cases include clinical-supp, where the same metafile is used as clinical; in these cases the default data filenames are also searched for (see the sketch below). Other examples where the same meta filename is used for different datatypes include mutation-germline and mutation-manual, which share the same meta filename as mutation.
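
A minimal sketch of how such a metafile lookup could work. The class name, helper, and the list of default clinical filenames are hypothetical; the PR's actual parsing may differ:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class DataFileResolver {

    // Hypothetical fallbacks searched when datatypes share a metafile (e.g. clinical-supp).
    private static final List<String> DEFAULT_CLINICAL_FILENAMES =
            Arrays.asList("data_clinical.txt", "data_clinical_supp.txt");

    /**
     * Returns the datafile named by the metafile's data_filename property,
     * falling back to conventional default filenames when the property is
     * absent or the named file does not exist.
     */
    public static Path resolveDataFile(Path studyDirectory, Path metaFile) throws IOException {
        Properties meta = new Properties();
        try (FileInputStream in = new FileInputStream(metaFile.toFile())) {
            meta.load(in);
        }
        String dataFilename = meta.getProperty("data_filename");
        if (dataFilename != null) {
            Path candidate = studyDirectory.resolve(dataFilename.trim());
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        for (String fallback : DEFAULT_CLINICAL_FILENAMES) {
            Path candidate = studyDirectory.resolve(fallback);
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        return null; // caller decides whether a missing datafile is fatal
    }
}
```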

Deleting cancer studies:

  • Added option to delete cancer study given a cancer study identifier
  • Added tasklet for deleting cancer study
  • Added job for deleting cancer study
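
As a rough illustration (not the PR's actual classes), a delete-study job in Spring Batch reduces to a single tasklet step that reads the cancer study identifier from the job parameters. The configuration class name, job parameter key, and DAO call below are assumptions:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DeleteCancerStudyConfiguration {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Job deleteCancerStudyJob() {
        return jobBuilderFactory.get("deleteCancerStudyJob")
                .start(deleteCancerStudyStep())
                .build();
    }

    @Bean
    public Step deleteCancerStudyStep() {
        return stepBuilderFactory.get("deleteCancerStudyStep")
                .tasklet(deleteCancerStudyTasklet())
                .build();
    }

    @Bean
    public Tasklet deleteCancerStudyTasklet() {
        return (StepContribution contribution, ChunkContext chunkContext) -> {
            // The identifier arrives as a job parameter from the command line.
            String studyId = (String) chunkContext.getStepContext()
                    .getJobParameters().get("cancerStudyIdentifier");
            // cancerStudyDao.deleteCancerStudy(studyId); // hypothetical DAO call
            return RepeatStatus.FINISHED;
        };
    }
}
```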

Refactored importer:

  • removed persistence classes and models
  • using cbioportal/persistence repo instead

Configurations

  • batch configuration, data source configuration, and datatype configurations added
  • step execution configurations for importing will be organized by datatype

Job Execution:

  • added checks to make sure the cancer study meta file exists and that the data was loaded before running the rest of the importer pipeline

DB Compatibility:

  • Added db schema check to be used by job decider to determine whether or not to run the job
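
The schema check fits Spring Batch's JobExecutionDecider contract. A minimal sketch, assuming the expected version is injected from the pom and that the portal's info table carries a DB_SCHEMA_VERSION column (the class name and the "RUN" status value are illustrative, echoing the flow configuration later in this PR):

```java
import javax.sql.DataSource;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;
import org.springframework.jdbc.core.JdbcTemplate;

public class DbSchemaVersionDecider implements JobExecutionDecider {

    private final JdbcTemplate jdbcTemplate;
    private final String expectedVersion;

    public DbSchemaVersionDecider(DataSource dataSource, String expectedVersion) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        this.expectedVersion = expectedVersion;
    }

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // The portal records its schema version in the info table.
        String dbVersion = jdbcTemplate.queryForObject(
                "SELECT DB_SCHEMA_VERSION FROM info", String.class);
        return expectedVersion.equals(dbVersion)
                ? new FlowExecutionStatus("RUN")
                : FlowExecutionStatus.FAILED;
    }
}
```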

Dao classes and repositories:

  • Created interfaces for Dao classes
  • Moved sql and database interactions to Dao classes
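
A hypothetical example of the resulting pattern: a small Dao interface paired with a JDBC implementation built on NamedParameterJdbcTemplate, which is what the standardized named-parameter SQL style implies. The table and column names follow cBioPortal conventions but should be read as illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

/** Hypothetical Dao interface; the PR defines one per database entity. */
interface SampleJdbcDao {
    Integer getInternalId(String stableId, int cancerStudyId);
}

class SampleJdbcDaoImpl implements SampleJdbcDao {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    SampleJdbcDaoImpl(DataSource dataSource) {
        this.jdbcTemplate = new NamedParameterJdbcTemplate(dataSource);
    }

    @Override
    public Integer getInternalId(String stableId, int cancerStudyId) {
        // Named parameters keep the SQL readable and consistently styled.
        String sql = "SELECT INTERNAL_ID FROM sample "
                   + "WHERE STABLE_ID = :stable_id AND PATIENT_ID IN "
                   + "(SELECT INTERNAL_ID FROM patient WHERE CANCER_STUDY_ID = :cancer_study_id)";
        Map<String, Object> params = new HashMap<>();
        params.put("stable_id", stableId);
        params.put("cancer_study_id", cancerStudyId);
        return jdbcTemplate.queryForObject(sql, params, Integer.class);
    }
}
```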

Cancer Study:

  • moved step/flow configuration out of BatchConfiguration

Clinical Data

  • step execution configurations added for clinical data
  • all clinical datatypes use the same step components - execution is decided by step name (see the sketch after this list)
  • Updated clinical data writer and clinical data configurations
  • Changed import of new patients/samples to single imports vs batch
  • Added support for clinical supp files
  • Added support for bcr clinical files
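
A minimal sketch of the shared-components idea: one processor serves every clinical datatype and branches on the step name, which Spring Batch exposes via StepExecution. The record type and step names are assumptions, and this relies on Spring Batch auto-registering chunk components that carry listener annotations:

```java
import java.util.Map;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;

public class ClinicalDataProcessor
        implements ItemProcessor<Map<String, String>, Map<String, String>> {

    private boolean isSupplementalData;

    @BeforeStep
    public void saveStepName(StepExecution stepExecution) {
        // hypothetical step names, e.g. "clinicalDataStep" vs "clinicalSuppDataStep"
        this.isSupplementalData = stepExecution.getStepName().contains("Supp");
    }

    @Override
    public Map<String, String> process(Map<String, String> record) throws Exception {
        if (isSupplementalData) {
            // supp files only attach extra attributes to existing patients/samples,
            // so records without a known sample id would be filtered out here
        }
        return record;
    }
}
```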

Mutation Data:

  • added a composite maf record class to hold all possible fields in a maf record
  • Added step configuration for importing mutation data
  • Added a reader, processor, writer, listener, and tasklets for mutation data
  • Added utils classes for mutation data processing

Profile (Tab-delim) data:

  • Added a tasklet for loading profile metadata from datafile (header, non-case id columns, map of sample stable id to internal id, normal case id columns)
  • Added a reader for loading all profile (tab-delim) records from a profile datafile
  • Added general model for loading profile data from datafiles
  • Added a composite profile record model for supporting all profile (tab-delim) datatypes when sending data to processor and writer
  • Added a writer for all tab-delim data
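
The composite record is essentially one parsed row of any tab-delimited profile file. A hedged sketch of what such a model might hold, with field names invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * One row of a tab-delimited profile datafile: the non-case-id columns
 * plus a map of internal sample id -> profile value for the case columns.
 */
public class ProfileDataRecord {

    private String hugoSymbol;
    private String entrezGeneId;
    private Map<Integer, String> caseIdToValue = new LinkedHashMap<>();

    public String getHugoSymbol() { return hugoSymbol; }
    public void setHugoSymbol(String hugoSymbol) { this.hugoSymbol = hugoSymbol; }

    public String getEntrezGeneId() { return entrezGeneId; }
    public void setEntrezGeneId(String entrezGeneId) { this.entrezGeneId = entrezGeneId; }

    public Map<Integer, String> getCaseIdToValue() { return caseIdToValue; }
    public void setCaseIdToValue(Map<Integer, String> caseIdToValue) { this.caseIdToValue = caseIdToValue; }
}
```

Because the same record type flows through the processor and writer regardless of datatype, the CNA, expression, methylation, and protein-level steps can share a single reader/writer pair and differ only in their processors.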

CNA data:

  • Added CNA data step configuration
  • Added a CNA data processor

Protein-level data:

  • Added protein-level data step configuration

Gene expression data:

  • Added gene expression data step configuration
  • TO-DO: figure out unknown datatype metadata for mrna-seq-rsem and mrna-seq-fcount

Fusion data:

  • Added fusion data step configuration
  • Added reader and processor for fusion data

Methylation data:

  • Added methylation data step configuration

Mutsig data:

  • Added mutsig data step configuration
  • Added a tasklet for loading mutsig metadata
  • Added a reader for loading mutsig data from datafile
  • Added a writer for importing mutsig records into db
  • Added a model for mutsig records

Timeline data:

  • Added timeline data step configuration
  • Added a tasklet for loading timeline metadata
  • Added a reader for loading timeline data from datafiles
  • Added a processor for transforming timeline records into clinical events and clinical event data
  • Added a writer for importing timeline records into db
  • Added a model for timeline records

Copy number segment data:

  • Added copy number seg data step configuration
  • Added a tasklet for loading copy number seg metadata and importing a copy number seg file record
  • Added a reader for loading copy number seg data from datafiles
  • Added a writer for importing copy number seg records into db
  • Added a model for copy number seg records

Gistic data:

  • Added gistic data step configuration
  • Added a tasklet for loading gistic metadata
  • Added a reader for loading gistic data from datafiles
  • Added a writer for importing gistic records into db
  • Added a model for gistic records

Structural variant data:

  • Added structural variant data step configuration
  • Added a tasklet for loading structural variant metadata
  • Added a reader for loading structural variant data from datafiles
  • Added a writer for importing structural variant records into db
  • Added a model for structural variant records

Case lists:

  • Added a reader for loading case lists from case list files
  • Added a writer for importing case lists into db

Added documentation for:

  • main batch importer job
  • cancer study step
  • clinical data step
  • mutation data step
  • CNA data step
  • protein-level step
  • gene expression step
  • fusion data step
  • methylation data step
  • mutsig data step
  • timeline data step
  • copy number seg data step
  • gistic data step
  • structural variant data step
  • case list step

  • Added documentation: File-Formats.md (same as current version in cBioPortal/cbioportal)

Moved entry-skipping logic out of readers and into processors to improve performance.

  • entries returned as null from processors are automatically filtered out of the list of entries sent to the writers
  • the number of entries skipped can be retrieved from stepExecution.getFilterCount()
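
This is standard Spring Batch chunk semantics: an ItemProcessor that returns null marks the item as filtered (not skipped), the writer never sees it, and the tally is available from StepExecution.getFilterCount(). A minimal sketch with an invented filtering rule:

```java
import org.springframework.batch.item.ItemProcessor;

public class SkippingProcessor implements ItemProcessor<String, String> {

    @Override
    public String process(String record) throws Exception {
        if (record.startsWith("#")) {
            // returning null filters the item out of the chunk;
            // it is counted by stepExecution.getFilterCount()
            return null;
        }
        return record.trim();
    }
}
```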

Signed-off-by: Angelica Ochoa [email protected]

To-Do:

  • Add unit/integration tests
  • Move progress monitoring away from plain logging (maybe Spring Batch Admin or other?)
  • Added job decider to check db compatibility before import can begin
  • Moved sql and db interactions to Dao classes
  • Moved to java 8
  • Add documentation for the intended workflow (docs folder)

Datatypes supported:

  • cancer study
  • clinical data
  • mutation data
  • clinical cases (xml)
  • timeline
  • rppa
  • rppa z-scores
  • protein quantification
  • cna
  • gene expression
  • gene expression z-scores
  • seg
  • gistic
  • methylation
  • mutsig (mutation significance)
  • fusion
  • mirna
  • mrna seq
  • structural variant

@ao508 force-pushed the spring-importer branch 4 times, most recently from 9907a40 to 097b924 on August 23, 2016 15:35
```java
private static final Log LOG = LogFactory.getLog(BatchConfiguration.class);

@Bean
public Job batchImporterJob() throws Exception {
```

Contributor: I know we talked about limiting the dependencies to just an external model module and working with the database directly. That means we do have a dependency on the db schema. As a preliminary job step, we should check the info table and make sure we support the version that is installed.


```java
@Override
public void open(ExecutionContext executionContext) throws ItemStreamException {
    this.cancerStudy = (CancerStudy) executionContext.get("cancerStudy");
```

Author (@ao508): @n1zea144 here's where I get the cancer study object from the execution context for clinical data.

@pieterlukasse (Member) commented Sep 6, 2016

@angelicaochoa @n1zea144 can we also add a docs folder with some extra information, like a high-level design (e.g. a small diagram of the intended workflow: what the intended layers are and how they work together)? This will help me review the code as well.

Regarding unit tests and integration tests: are you planning to add these in a next PR? The previous importer code had quite good code coverage; we should aim to keep and improve that 😉 For integration tests, see cBioPortal/cbioportal#1513.

@pieterlukasse (Member)

@n1zea144: @ersinciftci is moving to java 8 in cBioPortal/cbioportal#1633
Shall we do the same here?

@pieterlukasse (Member)

I think there might be something wrong with the pom.xml files as well... at least my IDE (Eclipse) doesn't recognize the project structure very well (have you experienced this as well?)

@ao508 (Author) commented Sep 6, 2016

@pieterlukasse I'll be addressing all of these as I go along after I finish incorporating the new persistence/model repo as a dependency. To address specific questions you had:

  • project structure: I don't think I've had issues. My guess is that maybe I've not committed some changes yet? I've been a little sidetracked with the new persistence repo.
  • unit tests: I'll be incorporating these as I go along, no worries! This PR is more of an intro to the spring batch tech I'll be using for the importer.
  • docs: once I get the new persistence repo incorporated and working with what I have so far, I can put together some docs with the intended workflow and any other extra info as well.
  • java 8: just got the go ahead from @n1zea144 to move to java 8

@pieterlukasse (Member)

Thanks @angelicaochoa, looking forward to these points in subsequent PRs. I would say that the documentation of the intended workflow is the most important for now.

@ao508 force-pushed the spring-importer branch 2 times, most recently from bbae622 to 8580cb7 on September 27, 2016 19:39
@ao508 (Author) commented Sep 27, 2016

@pieterlukasse I've added some documentation in the docs folder outlining the intended workflow and details on the steps that make up the batch importer job. Let me know if you have any suggestions for the documentation whenever you have a chance. Thanks!

@ao508 force-pushed the spring-importer branch 5 times, most recently from dec5551 to 31667a5 on October 4, 2016 18:41
@ao508 force-pushed the spring-importer branch 4 times, most recently from 13a025f to ceaf880 on October 5, 2016 20:51
@ao508 force-pushed the spring-importer branch 4 times, most recently from b2694ff to 1483028 on October 12, 2016 15:50
@ao508 force-pushed the spring-importer branch 5 times, most recently from a5984d6 to 9f594ed on October 19, 2016 21:23
@sheridancbio sheridancbio self-assigned this Oct 20, 2016
@sheridancbio

One commit with 87 files and 17K lines of code. Awesome! ⛰

@n1zea144 (Contributor) left a review

First off, this is a great effort on your part - it was a lot of code/logic to port from the existing scripts package and put into the spring batch framework. Well done. Having said that, I think we should not let PRs comprise such a large amount of code. In this case, we probably should have broken down this PR by datatype.

Generally speaking, I think we need to make this codebase more amenable to code changes/additions. In particular, I think the steps required and the conventions to follow to add support for new datatypes should be as clear and concise as possible. One way to do this may be to organize the code by datatype - as previously explored; maybe we should revisit this approach. In addition, I think we need to clearly define the interface between tasklets/readers/processors/writers and the MultiKeyMap used to carry data between jobs/steps, and also clearly define the keys used. I wonder if a map is the best data structure for this purpose. Maybe that is an implementation detail that should not be exposed to the user.

I like the reuse of GeneticProfileTasklet, ProfileMetadataTasklet, ProfileDataReader/Processor/Writer/Listener for common datatypes. I think this can be carried further. For example, the difference between StructuralMetadataTasklet and MutationDataTasklet is minimal (as well as others).

Also, would the code be more maintainable/clearer if the various database insertions (like phospho-gene or cna events) occurred in their own tasklets instead of within the ProfileDataProcessor?

I like that you organized the *Step classes by type, but during the code review I found myself jumping around the various Step files to put together all the components of a flow. Maybe they should be organized by flow instead of type? I also thought that the flow of data subtypes (like the affy flow calling the affy2 flow) was a bit confusing. I thought that if all the subtypes were brought into the top-level flow, it would be easier to follow.

Lastly, I noticed a lot of screening of records happening within both readers and processors. Is it necessary to do this in both places? Is it necessary to do this at all, assuming the validator will be running?

Most of the comments below, at the source level, reflect the sentiments just expressed. In some cases, I've pointed out where I thought a routine could be broken down into smaller ones. I've also pointed out some desired updates to copyright notices or package paths/names.

Oh, I assume there is a separate PR for the changes to the cBioPortal persistence/model modules?

```java
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

package org.cbio.portal.pipelines.importer;
```

Contributor: I would change all packages to be consistent with the primary portal project. I would also shorten it: org.cbioportal.importer.

```diff
@@ -0,0 +1,130 @@
/*
 * Copyright (c) 2015 Memorial Sloan-Kettering Cancer Center.
```

Contributor: Please update all copyright -- 2016

```java
    System.exit(exitStatus);
}

private static void launchImporterJob(String[] args, String stagingDirectory) throws Exception {
```

Contributor: Let's remove all references to stagingDirectory - replace with studyDirectory.

```java
}

private static void launchDeleteStudyJob(String[] args, String cancerStudyIdentifier) throws Exception {
    SpringApplication app = new SpringApplication(ImporterPipeline.class);
```

Contributor: Minor formatting issue - none of the other methods in this class have a space following the method signature.

```java
 */
@Configuration
@EnableBatchProcessing
@ComponentScan("org.mskcc.cbio.persistence.jdbc")
```

Contributor: Not sure where this is coming from yet, but I don't think we should reference an mskcc package from this code.

```java
 * @param residue
 * @return List<Gene>
 */
private List<Gene> importPhosphoGene(List<Gene> genes, String residue) {
```

Contributor: Should something like this be put into its own step?

```java
public void close() throws ItemStreamException {}

@Override
public void write(List<? extends ProfileDataRecord> list) throws Exception {
```

Contributor: This routine should be broken down into smaller components.

```java
 * @param profileDataRecord
 * @return List<CnaEvent>
 */
private List<CnaEvent> loadCnaEvents(ProfileDataRecord profileDataRecord) {
```

Contributor: Should we do this here?

.on("RUN").to(loadGeneExpressionGeneticProfile())
.next(loadGeneExpressionMetadata())
.next(geneExpressionStepBuilder("geneExpressionAffymetrixStep"))
.next(geneExpressionAffymetrixZscoresStep())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As previously commented, does the next datatype have to be nested in a previous one - can these be moved to the main step-flow?

```java
 * @param mafRecord
 * @return MafRecord
 */
private MafRecord screenMafRecord(MafRecord mafRecord) {
```

Contributor: Is this necessary if data has been validated?
