This workflow performs an aggregate association analysis of genotype data with a single phenotype. The primary code is written in R and uses the GENESIS package for model fitting and association testing. The workflow can either generate a null model from phenotype and relatedness data or use a pregenerated null model. It outputs both the full association statistics (p-values, test statistics) and summary plots (Manhattan and QQ).
This workflow is produced and maintained by the Manning Lab. Contributing authors include:
- Tim Majarian ([email protected])
- Alisa Manning ([email protected])
This workflow is written with a WDL wrapper for portability. Each task can be run individually from the base script or within the full workflow.
All code and necessary packages are available through a Docker image as well as through the Github repository.
Italics indicate optional inputs. Each function also requires the user to specify allocated memory and disk space.
This function (fitNull) generates a null model to be used in association testing with GENESIS.
Inputs:
- genotype_file : genotype data for all samples. This is only used to ensure the correct ordering of the phenotype data; it need not contain all genotypes to be tested in aggAssocTest (GDS file)
- phenotype_file : phenotype data for all samples to be included in analysis (CSV or TSV file)
- outcome_name : the outcome to be tested (string)
- outcome_type : the type of outcome being tested (dichotomous or continuous)
- covariates_string : covariates to condition on in linear mixed modeling (comma separated string, default = None)
- sample_file : a file containing a list of sample ids (matching the genotype and phenotype files) to be included, one per line (.txt, optional)
- label : prefix for output filename (string)
- kinship_matrix : relatedness measures for all samples (CSV or TSV file)
- id_col : column name of id column in phenotype file (string)
Outputs:
- model : generated null model (.RDa)
- log : log file containing paths to inputs and memory and cpu usage
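Before running fitNull, it can help to confirm that the phenotype file has every column the inputs refer to. The following is a minimal sketch of such a pre-flight check, not the workflow's own code: the helper names `parse_covariates` and `missing_columns` are hypothetical, and treating the string "None" as "no covariates" is an assumption based on the documented default.

```python
def parse_covariates(covariates_string):
    """Split a comma separated covariate string into a list of column names.

    Assumption: None, an empty string, or the literal "None" all mean
    that no covariates were requested.
    """
    if covariates_string is None:
        return []
    text = covariates_string.strip()
    if text == "" or text.lower() == "none":
        return []
    return [name.strip() for name in text.split(",") if name.strip()]


def missing_columns(header, id_col, outcome_name, covariates_string):
    """Return the required column names that are absent from the header."""
    required = [id_col, outcome_name] + parse_covariates(covariates_string)
    return [name for name in required if name not in header]


# Hypothetical phenotype file header and inputs.
header = ["sample_id", "bmi", "age", "sex"]
print(missing_columns(header, "sample_id", "bmi", "age, sex, pc1"))
```

An empty list means the id, outcome, and covariate columns are all present.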
This function (aggAssocTest) performs an association test to generate a p-value for each aggregation unit included.
Inputs:
- genotype_file : a genotype file containing data for all samples and all variants to be tested (GDS file)
- null_file : output of fitNull or a pregenerated null model (.RDa)
- group_file : RData or CSV/TSV file with the groups to include in the analysis. If RData, it must be saved as a list with unique names, where each entry is a data frame with at least the columns variant.id, position, chromosome, ref, allele, nAlleles, allele.index. If CSV/TSV, it must have at least the columns group_id, position, chromosome, ref, alt (CSV/TSV or RData)
- label : prefix for output filename (string)
- test : SKAT or Burden (string)
- pval : if SKAT: davies, kuonen, or liu; if Burden: Score, Wald, or Firth (string)
- weights : parameters of beta distribution for variant weights (comma separated string in form: "1,25")
- force_maf : flag to force the minor allele to the least frequent allele (default = True)
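To illustrate the weights input above, the sketch below parses a string such as "1,25" and evaluates the beta density at a few minor allele frequencies. It assumes that a variant's weight is the beta density evaluated at its MAF, which is a common convention for aggregate tests; the function names are hypothetical and this is not the workflow's own code.

```python
import math


def parse_weights(weights):
    """Parse a string of the form "a,b" into two beta shape parameters."""
    a_text, b_text = weights.split(",")
    return float(a_text), float(b_text)


def beta_density(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 <= x <= 1."""
    # Normalizing constant 1 / B(a, b), computed on the log scale.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * x ** (a - 1) * (1 - x) ** (b - 1)


a, b = parse_weights("1,25")
for maf in (0.001, 0.01, 0.05):
    # Assumed convention: weight = beta density at the variant's MAF.
    print(maf, beta_density(maf, a, b))
```

With parameters 1 and 25 the density decreases as MAF increases, so rarer variants receive larger weights under this convention.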
Outputs:
- assoc : an RData file of association results (.RData)
- log : log file containing paths to inputs and memory and cpu usage
- groups : CSV file of the groups actually used for analysis. Some reasonable defaults are applied during execution: unless force_maf == F, the alternate allele may be changed so that MAF(alt) < MAF(ref). Variants may also be removed from a group if they do not occur within the sample being tested. Both changes are reflected in the groups output.
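The adjustments described for the groups output can be sketched as follows. This is an illustration under stated assumptions rather than the workflow's implementation: the field names `alt_count` and `n_alleles` are hypothetical, and "does not occur within the sample" is read here as the variant being monomorphic among the tested samples.

```python
def normalize_group(variants, force_maf=True):
    """Drop monomorphic variants and optionally flip alleles so alt is minor."""
    kept = []
    for variant in variants:
        alt_count = variant["alt_count"]
        n_alleles = variant["n_alleles"]
        # Assumed reading: a variant with only one observed allele does not
        # "occur" in the tested samples and is removed from the group.
        if alt_count == 0 or alt_count == n_alleles:
            continue
        updated = dict(variant)  # copy so the input is left unchanged
        if force_maf and 2 * alt_count > n_alleles:
            # The alt allele is the more frequent one: swap ref and alt.
            updated["ref"], updated["alt"] = variant["alt"], variant["ref"]
            updated["alt_count"] = n_alleles - alt_count
        kept.append(updated)
    return kept


group = [
    {"ref": "A", "alt": "G", "alt_count": 90, "n_alleles": 100},
    {"ref": "C", "alt": "T", "alt_count": 0, "n_alleles": 100},
    {"ref": "G", "alt": "T", "alt_count": 5, "n_alleles": 100},
]
for row in normalize_group(group):
    print(row)
```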
This function (summary) generates a summary of the association results, including quantile-quantile (QQ) and Manhattan plots. One set of plots covers all groups, and a second set covers only groups with a cumulative minor allele count above the specified threshold. It also generates CSV files of all results and of the top associated variants.
Inputs:
- label : prefix for output filename (string)
- assoc_files : comma separated list of association results, output of aggAssocTest (string)
- minmac : cumulative minor allele count threshold for plotting (Int, default = 10)
Outputs:
- plots : Manhattan and QQ plots (PNG)
- assoc_res : association results for all groups (CSV)
- assoc_res_variants : association results for all variants in all groups (CSV)
- mac_plots : Manhattan and QQ plots thresholded by minmac (PNG)
- mac_assoc_res : association results for groups thresholded by minmac (CSV)
- mac_assoc_res_variants : association results for all variants in groups thresholded by minmac (CSV)
- log : log file containing paths to inputs and memory and cpu usage
- this_fitNull_memory : amount of memory in GB for fitNull task (int)
- this_aggAssocTest_memory : amount of memory in GB for aggAssocTest task (int)
- this_summary_memory : amount of memory in GB for summary task (int)
- this_disk : amount of disk space in GB to allot for each execution of a task (int)
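The minmac threshold used by the summary task can be illustrated with a short filter. This is a sketch only: the field name `mac` is hypothetical, and the strict comparison follows the description "above the specified threshold", which is an assumption about how ties are handled.

```python
def filter_by_minmac(results, minmac=10):
    """Keep groups whose cumulative minor allele count exceeds minmac."""
    # Assumption: "above the threshold" means strictly greater than minmac.
    return [row for row in results if row["mac"] > minmac]


results = [
    {"group_id": "geneA", "mac": 25},
    {"group_id": "geneB", "mac": 10},
    {"group_id": "geneC", "mac": 3},
]
print([row["group_id"] for row in filter_by_minmac(results, minmac=10)])
```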
This workflow helps make the appropriate aggregation unit file for association testing (as input to Aggregate Association).
This workflow is produced and maintained by the Manning Lab. Contributing authors include:
- Tim Majarian ([email protected])
This workflow is written with a WDL wrapper for portability. Each task can be run individually from the base script or within the full workflow.
All code and necessary packages are available through a Docker image as well as through the Github repository.
Italics indicate optional inputs. Each function also requires the user to specify allocated memory and disk space.
This function (makeGroups) gets variant level information for all variants falling within the input regions or listed in an optional variant file.
Inputs:
- genotype_file : genotype data for all samples (GDS file)
- region_file : file containing regions of interest with a single genomic interval per line. The column format of this file should be chromosome, start, end, name, annotation with name being the proposed group id for that region. Each defined interval (or group of intervals that have the same name) defines a single aggregation unit. (CSV or TSV file)
- variant_file : optional file containing variants to be added to the interval based aggregation units. Column format should be chromosome, position, ref, alt, group_id, annotation. (CSV or TSV file)
- max_maf : maximum minor allele frequency to consider when aggregating variants (float)
- min_maf : minimum minor allele frequency to consider when aggregating variants (float)
- out_pref : prefix for the output file name (string)
Outputs:
- group_file : file containing variant level information for all aggregation units contained in the input genotype file (CSV file)
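The region-based grouping described above can be sketched as follows. This is an illustration, not the workflow's code: the function name `make_groups` and the `maf` field are hypothetical, and treating both the interval ends and the MAF bounds as inclusive is an assumption.

```python
def make_groups(regions, variants, min_maf=0.0, max_maf=1.0):
    """Assign variants to named regions, keeping those within the MAF range.

    regions: (chromosome, start, end, name) tuples; intervals that share a
    name contribute to the same aggregation unit.
    """
    rows = []
    seen = set()
    for chromosome, start, end, name in regions:
        for v in variants:
            # Assumption: interval ends and MAF bounds are inclusive.
            in_region = v["chromosome"] == chromosome and start <= v["position"] <= end
            in_maf_range = min_maf <= v["maf"] <= max_maf
            key = (name, v["chromosome"], v["position"], v["ref"], v["alt"])
            if in_region and in_maf_range and key not in seen:
                seen.add(key)  # avoid duplicates from overlapping intervals
                rows.append({
                    "group_id": name,
                    "chromosome": v["chromosome"],
                    "position": v["position"],
                    "ref": v["ref"],
                    "alt": v["alt"],
                })
    return rows


regions = [("1", 100, 200, "geneA"), ("1", 500, 600, "geneA"), ("2", 100, 200, "geneB")]
variants = [
    {"chromosome": "1", "position": 150, "ref": "A", "alt": "G", "maf": 0.01},
    {"chromosome": "1", "position": 550, "ref": "C", "alt": "T", "maf": 0.20},
    {"chromosome": "2", "position": 120, "ref": "G", "alt": "A", "maf": 0.02},
]
for row in make_groups(regions, variants, max_maf=0.05):
    print(row)
```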
This function combines variant level information output from makeGroups into a single aggregation file.
Inputs:
- these_groups : output from makeGroups (array of CSV files)
- out_pref : prefix for the output file name (string)
Outputs:
- final_groups : file containing variant level information for all aggregation units over all genotype files (CSV file)
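Combining the per-genotype-file outputs amounts to concatenating CSV files while writing the header only once. A minimal sketch, assuming every input file has the same columns; the function name `combine_group_files` is hypothetical and in-memory text buffers stand in for real files.

```python
import csv
import io


def combine_group_files(handles, out_handle):
    """Concatenate CSV inputs into one output, writing a single header row.

    Assumption: every input has the same columns as the first one.
    Returns the number of data rows written.
    """
    writer = None
    n_rows = 0
    for handle in handles:
        reader = csv.DictReader(handle)
        for row in reader:
            if writer is None:
                # Take the column order from the first file that has rows.
                writer = csv.DictWriter(
                    out_handle, fieldnames=reader.fieldnames, lineterminator="\n"
                )
                writer.writeheader()
            writer.writerow(row)
            n_rows += 1
    return n_rows


# In-memory stand-ins for two makeGroups outputs.
first = io.StringIO("group_id,chromosome,position,ref,alt\ngeneA,1,150,A,G\n")
second = io.StringIO("group_id,chromosome,position,ref,alt\ngeneB,2,120,G,A\n")
combined = io.StringIO()
combine_group_files([first, second], combined)
print(combined.getvalue())
```

With real files, the same function can be called on handles opened with `open(path, newline="")`.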