This repository contains the script for performing multinomial regression with lasso to identify and visualize anchors with metadata-dependent target fractions.
- directory: SPLASH run directory
- metadata_file: path to the metadata file
- run_name: prefix for the output folder
- datatype: either 10x (when the input data is 10x) or non10x (when the input data is not 10x)
- input_anchors: list of specific anchors
- sample_fraction_cutoff: cutoff for the minimum fraction of samples in which each anchor should be present
- num_anchors: maximum number of anchors to be tested
Anchors for the GLM test can be selected in one of two ways:
- providing a list of anchors (input_anchors): each anchor in the list will be considered for the GLM test.
- specifying sample_fraction_cutoff and num_anchors: in this case, among the anchors in result.after_correction.scores.csv that are present in at least sample_fraction_cutoff of samples and have avg_hamming_distance_max_target > 5, the top num_anchors with the highest effect size are selected to be tested by the GLM.
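The second selection mode can be sketched as follows. This is a minimal Python illustration of the filtering and ranking described above, not the R script's actual code. The keys `anchor`, `sample_fraction` and `effect_size` are assumptions made for the example (only `avg_hamming_distance_max_target` is named in this README), and the rows are toy values.

```python
def select_anchors(rows, sample_fraction_cutoff, num_anchors, min_hamming=5):
    """Return up to num_anchors anchors passing both filters, ranked by effect size.

    rows: list of dicts, one per anchor. The keys "anchor", "sample_fraction"
    and "effect_size" are hypothetical names used only for this sketch.
    """
    eligible = [
        r for r in rows
        if r["sample_fraction"] >= sample_fraction_cutoff
        and r["avg_hamming_distance_max_target"] > min_hamming
    ]
    # Highest effect size first, then keep at most num_anchors entries.
    eligible.sort(key=lambda r: r["effect_size"], reverse=True)
    return [r["anchor"] for r in eligible[:num_anchors]]


# Toy rows for illustration only.
toy_rows = [
    {"anchor": "A1", "sample_fraction": 0.9, "avg_hamming_distance_max_target": 8, "effect_size": 0.7},
    {"anchor": "A2", "sample_fraction": 0.2, "avg_hamming_distance_max_target": 9, "effect_size": 0.9},
    {"anchor": "A3", "sample_fraction": 0.5, "avg_hamming_distance_max_target": 3, "effect_size": 0.8},
    {"anchor": "A4", "sample_fraction": 0.6, "avg_hamming_distance_max_target": 6, "effect_size": 0.95},
]
selected = select_anchors(toy_rows, sample_fraction_cutoff=0.4, num_anchors=20000)
print(selected)
```

In this toy input, A2 fails the sample fraction cutoff and A3 fails the Hamming distance filter, so only A4 and A1 remain, ordered by effect size.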
- example command with a specific list of anchors:
Rscript SPLASH_spervised_anchor.R /oak/stanford/groups/horence/Roozbeh/SPLASH_10x/runs/CCLE_all/ /oak/stanford/groups/horence/Roozbeh/SPLASH_10x/utility_files/CCLE_metadata_modified.tsv one_unaligned_target_anchors /oak/stanford/groups/horence/Roozbeh/SPLASH_10x/runs/CCLE_all/concatentaion_based_classified_compactors_one_SJ_one_unaligned.tsv
- example command selecting the anchors with the highest effect size:
Rscript SPLASH_spervised_anchor.R /oak/stanford/groups/horence/Roozbeh/SPLASH_10x/runs/CCLE_all/ /oak/stanford/groups/horence/Roozbeh/SPLASH_10x/utility_files/CCLE_metadata_modified.tsv one_unaligned_target_anchors 0.4 20000
The only requirement for the metadata file is that the first and second columns contain the sample names and the class/group/category/cell type assigned to each sample, respectively. The first row must contain column names (these can be anything). The two columns are subsequently renamed to "sample_name" and "group" in the script. Below is an example metadata file:
sample_name type
SRR8788980 carcinoma
SRR8788981 melanoma
SRR8788982 lymphoma
SRR8788983 carcinoma
SRR8657060 carcinoma
The metadata file for 10x data must have an extra column (the second column) for cell barcodes. The other requirements are the same as for non-10x data:
sample_name cell cell_type
S2 AAACCCAAGACACACG AT2
S1 AAACCCAAGCCATGCC NK
S1 AAACCCAAGGGAACAA Mac
S4 AAACCCAAGGGTTGCA NK
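The two metadata layouts above can be checked before running the script with a small loader like the one below. This is a sketch under assumptions, not part of the repository: it assumes the file is tab-separated (as the .tsv extension in the example commands suggests), that for 10x data the group is in the third column (as in the example above), and the key name `cell` is chosen only for this illustration.

```python
import csv
import io


def load_metadata(handle, is_10x=False):
    """Read a tab-separated metadata file and return a list of row dicts.

    Non-10x: columns 1-2 are taken as sample_name and group.
    10x: columns 1-3 are taken as sample_name, cell (barcode) and group.
    The key "cell" is a name chosen for this sketch.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # the first row holds column names, which may be anything
    names = ["sample_name", "cell", "group"] if is_10x else ["sample_name", "group"]
    records = []
    for line_number, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip blank lines
        if len(fields) < len(names):
            raise ValueError(f"line {line_number}: expected at least {len(names)} columns")
        # Extra columns beyond the required ones are ignored.
        records.append(dict(zip(names, fields)))
    return records


example = "sample_name\ttype\nSRR8788980\tcarcinoma\nSRR8788981\tmelanoma\n"
metadata = load_metadata(io.StringIO(example))
print(metadata[0])
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.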
The output directory for the GLM test is: directory/run_name_supervised_metadata/.
- GLM_supervised_anchors.tsv: the output file containing anchors with non-zero GLM_coefficients whose largest GLM coefficient is greater than 1.
- plots: this subfolder contains the plots for visualizing the anchors with metadata-dependent target fraction variation. For each anchor, two plots are generated in a PDF named after the anchor sequence: a box plot showing the fraction of target1 per sample grouped by category, and a scatter plot showing the counts for target1 and target2 per sample, color-coded by metadata category.
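The quantity shown in the box plot can be illustrated with a short sketch. It assumes that the fraction of target1 in a sample is the target1 count divided by the total count of all targets for that anchor in that sample; the README does not state the denominator, so treat this as an assumption. The function name and the input layout are hypothetical, and the counts are toy values.

```python
from collections import defaultdict


def target1_fraction_by_group(counts, sample_to_group):
    """Group per-sample target1 fractions by metadata category.

    counts: {sample_name: (target1_count, total_anchor_count)}
    sample_to_group: {sample_name: group}
    """
    fractions = defaultdict(list)
    for sample, (target1, total) in counts.items():
        if total == 0 or sample not in sample_to_group:
            continue  # no reads for this anchor, or sample missing from metadata
        fractions[sample_to_group[sample]].append(target1 / total)
    return dict(fractions)


# Toy counts for illustration only.
toy_counts = {"S1": (3, 4), "S2": (1, 4), "S3": (0, 0)}
toy_groups = {"S1": "NK", "S2": "AT2", "S3": "NK"}
grouped = target1_fraction_by_group(toy_counts, toy_groups)
print(grouped)
```

Each list in the result holds the values that one box in the box plot would summarize for that category.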
Please contact Roozbeh Dehghannasiri ([email protected]).