Initial commit - Spring Batch Importer Pipeline #1
Conversation
(force-pushed from 9907a40 to 097b924)
private static final Log LOG = LogFactory.getLog(BatchConfiguration.class);

@Bean
public Job batchImporterJob() throws Exception {
I know we talked about limiting the dependencies to just an external model module and work with the database directly. That means we do have a dependency on the db schema. As a preliminary job step, we should check the info table and make sure we support the version that is installed.
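A version guard like this could start from a plain comparison helper along these lines (a sketch only: the table/column names and the required version value are assumptions, and in the pipeline the installed version would be read from the `info` table, e.g. via a `JdbcTemplate` inside a preliminary tasklet or job decider):

```java
import java.util.Arrays;

// Sketch of a schema-version guard for a preliminary job step.
// Version strings like "2.1.0" are compared component by component.
public class SchemaVersionCheck {

    // Returns negative/zero/positive like Comparable.compareTo.
    public static int compareVersions(String a, String b) {
        int[] va = Arrays.stream(a.split("\\.")).mapToInt(Integer::parseInt).toArray();
        int[] vb = Arrays.stream(b.split("\\.")).mapToInt(Integer::parseInt).toArray();
        for (int i = 0; i < Math.max(va.length, vb.length); i++) {
            int x = i < va.length ? va[i] : 0;
            int y = i < vb.length ? vb[i] : 0;
            if (x != y) return Integer.compare(x, y);
        }
        return 0;
    }

    // installedVersion would come from the portal's `info` table at runtime.
    public static boolean isSupported(String installedVersion, String requiredVersion) {
        return compareVersions(installedVersion, requiredVersion) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isSupported("2.1.0", "2.0.0")); // true
        System.out.println(isSupported("1.9.9", "2.0.0")); // false
    }
}
```

If the check fails, the decider would route the job to an early exit instead of running the import steps.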
@Override
public void open(ExecutionContext executionContext) throws ItemStreamException {
    this.cancerStudy = (CancerStudy) executionContext.get("cancerStudy");
@n1zea144 here's where I get the cancer study object from the execution context for clinical data
@angelicaochoa @n1zea144: Regarding unit tests and integration tests, are you planning to add these in a next PR? The previous importer code had quite good code coverage; we should aim to keep and improve that 😉 For integration tests, see cBioPortal/cbioportal#1513.
@n1zea144: @ersinciftci is moving to Java 8 in cBioPortal/cbioportal#1633
I think there might be something wrong with the …
@pieterlukasse I'll be addressing all of these as I go along after I finish incorporating the new persistence/model repo as a dependency. To address specific questions you had:
(force-pushed from 90baed3 to a1f6ab0)
Thanks @angelicaochoa, looking forward to these points in subsequent commits. I would say that the documentation of the intended workflow is the most important for now.
(force-pushed from bbae622 to 8580cb7)
@pieterlukasse I've added some documentation in the …
(force-pushed from dec5551 to 31667a5)
(force-pushed from 13a025f to ceaf880)
(force-pushed from b2694ff to 1483028)
(force-pushed from a5984d6 to 9f594ed)
(force-pushed from 9f594ed to 88881df)
One commit with 87 files and 17K lines of code. Awesome! ⛰
(force-pushed from 88881df to 468f802)
First off, this is a great effort on your part - it was a lot of code/logic to port from the existing scripts package and put into the Spring Batch framework. Well done. Having said that, I think we should not let PRs comprise such a large amount of code. In this case, we probably should have broken down this PR by datatype.
Generally speaking, I think we need to make this codebase more amenable to code changes/additions. In particular, the steps required and the conventions to follow to add support for new datatypes should be as clear and concise as possible. One way to do this may be to organize the code by datatype - as previously explored - and maybe we should revisit this approach. In addition, I think we need to clearly define the interface between tasklets/readers/processors/writers and the MultiKeyMap used to carry data between jobs/steps, and also clearly define the keys used. I wonder if a map is the best data structure for this purpose. Maybe that is an implementation detail that should not be exposed to the user.
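One possible direction for pinning down those keys (purely illustrative; the key names and value types here are assumptions, and in practice the backing map would be the step's `ExecutionContext` or the existing MultiKeyMap rather than a plain `HashMap`):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: typed keys over an untyped step-context map, so the contract
// between tasklets/readers/processors/writers is declared in one place
// and checked at compile time instead of via ad-hoc string lookups.
public class StepContextKeys {

    // A key carries its name and its expected value type.
    public static final class Key<T> {
        final String name;
        final Class<T> type;
        Key(String name, Class<T> type) { this.name = name; this.type = type; }
    }

    // The well-known keys, declared once (names are illustrative).
    public static final Key<String> CANCER_STUDY_ID = new Key<>("cancerStudyId", String.class);
    public static final Key<Integer> SAMPLE_COUNT = new Key<>("sampleCount", Integer.class);

    private final Map<String, Object> backing = new HashMap<>();

    public <T> void put(Key<T> key, T value) { backing.put(key.name, value); }

    public <T> T get(Key<T> key) { return key.type.cast(backing.get(key.name)); }

    public static void main(String[] args) {
        StepContextKeys ctx = new StepContextKeys();
        ctx.put(CANCER_STUDY_ID, "example_study_id");
        System.out.println(ctx.get(CANCER_STUDY_ID));
    }
}
```

A wrapper like this also answers the "should the map be exposed to the user" question: callers only ever see the key constants, never the map itself.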
I like the reuse of GeneticProfileTasklet, ProfileMetadataTasklet, ProfileDataReader/Processor/Writer/Listener for common datatypes. I think this can be carried further. For example, the difference between StructuralMetadataTasklet and MutationDataTasklet is minimal (as well as others).
Also, would the code be more maintainable/clearer if the various database insertions (like phospho-gene or CNA events) occurred in their own tasklets instead of within the ProfileDataProcessor?
I like that you organized the *Step classes by type, but during the code review I found myself jumping around the various Step files to put together all the components of a flow. Maybe they should be organized by flow instead of type? I also thought that the flow of data subtypes (like the affy flow calling the affy2 flow) was a bit confusing. If all the subtypes were brought into the top-level flow, it would be easier to follow.
Lastly, I noticed a lot of screening of records happening within both readers and processors. Is it necessary to do this in both places? Is it necessary to do it at all, assuming the validator will be running?
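For reference, screening once in the processor could look like the following minimal sketch, relying on the Spring Batch convention that an `ItemProcessor` returning null filters the item out (it is counted in `stepExecution.getFilterCount()`). `MafRecord` and its fields here are a simplified stand-in for the pipeline's actual class:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Sketch: do record screening exactly once, processor-side.
// Returning null means "filter this record out"; the reader stays a
// plain pass-through. Fields are illustrative, not the real MAF schema.
public class MafRecordScreen {

    public static final class MafRecord {
        public final String gene;
        public final String chromosome;
        public MafRecord(String gene, String chromosome) {
            this.gene = gene;
            this.chromosome = chromosome;
        }
    }

    // Processor-style screen: null signals a skipped record.
    public static MafRecord screen(MafRecord r) {
        if (r.gene == null || r.gene.isEmpty()) return null;
        if (r.chromosome == null) return null;
        return r;
    }

    // Simulates a chunk pass: nulls are dropped before the writer sees the list.
    public static List<MafRecord> process(List<MafRecord> in) {
        return in.stream().map(MafRecordScreen::screen)
                 .filter(Objects::nonNull).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<MafRecord> records = Arrays.asList(
                new MafRecord("TP53", "17"),
                new MafRecord("", "1"));
        System.out.println(process(records).size()); // prints 1
    }
}
```

If the validator already guarantees these invariants, the screen collapses to nothing and can be removed entirely.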
Most of the comments below, at the source level, reflect the sentiments just expressed. In some cases, I've pointed out where I thought a routine could be broken down into smaller ones. I've also pointed out some desired updates to copyright notices or package paths/names.
Oh, I assume there is a separate PR for the changes to the cBioPortal persistence/model modules?
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

package org.cbio.portal.pipelines.importer;
I would change all packages to be consistent with the primary portal project. I would also shorten it: org.cbioportal.importer.
@@ -0,0 +1,130 @@
/*
 * Copyright (c) 2015 Memorial Sloan-Kettering Cancer Center.
Please update all copyright notices to 2016.
    System.exit(exitStatus);
}

private static void launchImporterJob(String[] args, String stagingDirectory) throws Exception {
Let's remove all references to stagingDirectory and replace them with studyDirectory.
}

private static void launchDeleteStudyJob(String[] args, String cancerStudyIdentifier) throws Exception {
    SpringApplication app = new SpringApplication(ImporterPipeline.class);
Minor formatting issue - none of the other methods in this class have a space following the method signature.
 */
@Configuration
@EnableBatchProcessing
@ComponentScan("org.mskcc.cbio.persistence.jdbc")
Not sure where this is coming from yet, but I don't think we should reference an mskcc package from this code.
 * @param residue
 * @return List<Gene>
 */
private List<Gene> importPhosphoGene(List<Gene> genes, String residue) {
Should something like this be put into its own step?
public void close() throws ItemStreamException {}

@Override
public void write(List<? extends ProfileDataRecord> list) throws Exception {
This routine should be broken down into smaller components.
 * @param profileDataRecord
 * @return List<CnaEvent>
 */
private List<CnaEvent> loadCnaEvents(ProfileDataRecord profileDataRecord) {
Should we do this here?
.on("RUN").to(loadGeneExpressionGeneticProfile())
.next(loadGeneExpressionMetadata())
.next(geneExpressionStepBuilder("geneExpressionAffymetrixStep"))
.next(geneExpressionAffymetrixZscoresStep())
As previously commented, does the next datatype have to be nested in a previous one - can these be moved to the main step-flow?
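A flattened version could look roughly like the following configuration sketch (purely illustrative: it reuses the step names from the snippet above, and the real wiring lives in the step configuration classes; any remaining subtype steps would simply be appended with further `.next(...)` calls rather than nested in a sub-flow):

```java
// Sketch: keep the expression subtypes in one top-level flow instead of
// nesting the affy2 flow inside the affy flow.
@Bean
public Flow geneExpressionFlow() {
    return new FlowBuilder<Flow>("geneExpressionFlow")
        .start(loadGeneExpressionGeneticProfile())
        .next(loadGeneExpressionMetadata())
        .next(geneExpressionStepBuilder("geneExpressionAffymetrixStep"))
        .next(geneExpressionAffymetrixZscoresStep())
        // ...subsequent subtype steps appended here, not nested
        .build();
}
```

This keeps every step of the datatype visible in a single linear chain when reading the flow top to bottom.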
 * @param mafRecord
 * @return MafRecord
 */
private MafRecord screenMafRecord(MafRecord mafRecord) {
Is this necessary if data has been validated?
General
* added main class
* added some data models (rest will be added as needed)
* added a data file utils class (may not need ultimately)
* added .gitignore
* Standardized SQL statements and named parameters type
* Updated DB version in master pom.xml
* Datafiles for each datatype will be determined by the metafile `data_filename` property. Special cases include `clinical-supp`, where the same metafile is used as `clinical`; therefore default data filenames are also searched for in these cases. Other examples where the same meta filename is used for different datatypes include `mutation-germline` and `mutation-manual`, which share the same meta filename as `mutation`.

Deleting cancer studies:
* Added option to delete cancer study given a cancer study identifier
* Added tasklet for deleting cancer study
* Added job for deleting cancer study

Refactored importer:
* removed persistence classes and models
* using cbioportal/persistence repo instead

Configurations
* batch configuration, data source configuration, and datatype configurations added
* step execution configurations for importing will be organized by datatype

Job Execution:
* added checks to make sure cancer study meta file exists and that the data was loaded before running rest of importer pipeline

DB Compatibility:
* Added db schema check to be used by job decider to determine whether or not to run the job

Dao classes and repositories:
* Created interfaces for Dao classes
* Moved sql and database interactions to Dao classes

Cancer Study:
* moved step/flow configuration out of `BatchConfiguration`

Clinical Data
* step execution configurations added for clinical data
* all clinical datatypes use the same step components - execution is decided by step name
* Updated clinical data writer and clinical data configurations
* Changed import of new patients/samples to single imports vs batch
* Added support for clinical supp files
* Added support for bcr clinical files

Mutation Data:
* added a composite maf record class to hold all possible fields in a maf record
* Added step configuration for importing mutation data
* Added a reader, processor, writer, listener, and tasklets for mutation data
* Added utils classes for mutation data processing

Profile (Tab-delim) data:
* Added a tasklet for loading profile metadata from datafile (header, non-case id columns, map of sample stable id to internal id, normal case id columns)
* Added a reader for loading all profile (tab-delim) records from a profile datafile
* Added general model for loading profile data from datafiles
* Added a composite profile record model for supporting all profile (tab-delim) datatypes when sending data to processor and writer
* Added a writer for all tab-delim data

CNA data:
* Added CNA data step configuration
* Added a CNA data processor

Protein-level data:
* Added protein-level data step configuration

Gene expression data:
* Added gene expression data step configuration
* TO-DO: figure out unknown datatype metadata for `mrna-seq-rsem` and `mrna-seq-fcount`

Fusion data:
* Added fusion data step configuration
* Added reader and processor for fusion data

Methylation data:
* Added methylation data step configuration

Mutsig data:
* Added mutsig data step configuration
* Added a tasklet for loading mutsig metadata
* Added a reader for loading mutsig data from datafile
* Added a writer for importing mutsig records into db
* Added a model for mutsig records

Timeline data:
* Added timeline data step configuration
* Added a tasklet for loading timeline metadata
* Added a reader for loading timeline data from datafiles
* Added a processor for transforming timeline records into clinical events and clinical event data
* Added a writer for importing timeline records into db
* Added a model for timeline records

Copy number segment data:
* Added copy number seg data step configuration
* Added a tasklet for loading copy number seg metadata and importing a copy number seg file record
* Added a reader for loading copy number seg data from datafiles
* Added a writer for importing copy number seg records into db
* Added a model for copy number seg records

Gistic data:
* Added gistic data step configuration
* Added a tasklet for loading gistic metadata
* Added a reader for loading gistic data from datafiles
* Added a writer for importing gistic records into db
* Added a model for gistic records

Structural variant data:
* Added structural variant data step configuration
* Added a tasklet for loading structural variant metadata
* Added a reader for loading structural variant data from datafiles
* Added a writer for importing structural variant records into db
* Added a model for structural variant records

Case lists:
* Added a reader for loading case lists from case list files
* Added a writer for importing case lists into db

Added documentation for:
* main batch importer job
* cancer study step
* clinical data step
* mutation data step
* CNA data step
* protein-level step
* gene expression step
* fusion data step
* methylation data step
* mutsig data step
* timeline data step
* copy number seg data step
* gistic data step
* structural variant data step
* case list step
* Added documentation: File-Formats.md (same as current version in cBioPortal/cbioportal)

Moved entry-skipping logic out of readers and into processors to improve performance.
* entries that return as null from processors will automatically be filtered out of the list of entries that get sent to the writers
* number of entries skipped can be retrieved from stepExecution.getFilterCount()

Signed-off-by: Angelica Ochoa <[email protected]>
To-Do:
* `docs` folder

Datatypes supported: