Refactored codebase #6

jeff-mandell · 2024-03-19T20:41:40Z

Hi, here's a summary of the contents of the pull request:

I organized scripts and data in accordance with R pacakge conventions. Installation of BISCUT along with all R/Python dependencies is, with luck, 1-2 lines:

	# install remotes package if not already installed
	install.packages('remotes')

	# install the forked version and any dependencies
	remotes::install_github('jeff-mandell/BISCUT-py3')

	# after merging the pull request, installation would be 
	# remotes::install_github('beroukhim-lab/BISCUT-py3')

Please let me know if this doesn't work for you. It's supposed to work, but on one system that I tested, I needed to force re-installation of miniconda in order to get BISCUT's Python dependencies to install correctly (via reticulate::install_miniconda(force = T)).

I wrapped the BISCUT workflow into R functions so that the user can do everything from the R environment. The worflow is to call make_breakpoint_files() to process a segment file (such as PANCAN_ISAR.seg.txt) and produce a breakpoint file directory, and then to call do_biscut() with the
breakpoint file directory as an input. Both functions can use parallelization when the cores argument is specified. I made some other optimizations to speed runtime. Besides printing output files, I also made the main function return peak information within R. Here is example usage:

# Assumes segment file exists in working directory.
library(BISCUT)
make_breakpoint_files(segment_file = 'PANCAN_ISAR.seg.txt', output_dir = 'my_breakpoints', cores = 8)
results = do_biscut(breakpoint_file_dir = 'my_breakpoints', results_dir = 'my_biscut_output', cores = 8)

# see ?do_biscut for documentation of output and function arguments

The output directory and reference files (abslocs, genelocs) can now be user-specified, as can parameters that could previously only be changed by editing source code (output directory, telcent_thres, amplitude_threshold, confidence interval, n bootstraps). The previously hardcoded data and parameter values are used as defaults when unspecified by the user.
The code was doing a lot of error suppression which could prevent the user from knowing about potential problems. I removed most of this and resolved any errors and warnings. There are probably still some that could be triggered by situations or system configurations that I didn't test.

Questions/issues

There appeared to be an error in the chromosome coordinates file (now at inst/extdata/SNP6_hg19_chromosome_locs_200605.txt). Every chromosome has q_start1 = centromere + offset except for chr15, where instead q_start1 = q_end1 (= 2409817111). It appears that the q_start1 value was accidentally copied from q_end1, unless there is something I don't understand about the format. I changed the value to match the other chromosomes. Please let me know if it should be changed back.
FYI, process_for_ggplot_jagged() doesn't run, partly because it wants to load files ending in plotpeaks.txt. I couldn't find any code in the project that generates such files. Therefore, do_biscut() currently just calls process_for_ggplot().
plot_fig_2() is runnable, but the plots look odd. I couldn't tell which plot(s) from the paper they are supposed to resemble. If there isn't a continuing need for the function, I could remove the filter_BISCUT_arms() call from the main function to avoid generating unused files.
Previously, there were hard-coded genes of interest--c('CDKN2A','TERT','MYC','BAP1','TERC','TP53','ARID1A','EGFR','PPM1D')--in BISCUT_peak_finding.R. I removed this declaration, as the user can instead call filter_BISCUT_knowngenes (via an R wrapper function, extract_gene_results()) on any desired genes. However, since process_for_ggplot_jagged() isn't running, I am not sure of the use case for filter_BISCUT_knowngenes().

Suggestions:

Fill in the placeholder information in the package's top-level DESCRIPTION file: title, description, and contributors. The conventions for filling in contributors are described here: https://r-pkgs.org/description.html#sec-description-authors-at-r.
Also in the DESCRIPTION, I temporarily put in version 1.2.0.9000 to indicate a development version. This should be changed to 1.3.0 (or 2.0.0, if you are into semantic versioning, or any other version number you like).
The README should be updated. I copied a lot of the descriptions of files and parameters into the R scripts, so that the information appears when a user calls ?do_biscut() or ?make_breakpoint_files(). Therefore, some of this information could be removed from the README. If you think it is a good idea, a compressed segment file, perhaps covering fewer samples than the PANCAN file, could be saved in the repository so that the README could give runnable code on an example segment file.
Most of the data from do_biscut() is currently both returned within R and printed to output files, in somewhat different formats. If you are interested, I could further prune down output data and make all output file writing optional. I could also reduce the breakpoint file directory to a single breakpoint file, and make do_biscut() accept either the single file or a data.table/data.frame version of it.

… boundary finding and breakpoint file creation; refactor scripts to reduce number of output files; put output files in user-specified output directory; allow key parameters to be passed to function calls rather than hard-coded in source files. Fix (presumed) error in chr coordinate file. Eliminate warnings and errors in R and Python.

…eproducibility

…o occurrences in input samples

…ndpos in breakpoint files

jeff-mandell added 5 commits March 15, 2024 16:43

Put essential results in do_biscut() output; tweak seed-setting for r…

0b977e5

…eproducibility

Installation fix

b35ff8d

Move peak plotting out of parallelized function; other small fixes

700a071

Avoid crash when trying to make a plot on zero peaks

a9697b1

jeff-mandell mentioned this pull request Mar 19, 2024

Clean up package structure and optimize script #5

Open

jeff-mandell added 4 commits April 2, 2024 11:20

Fix for when telcent_thres is not default

4dc84e7

Avoid crash when do_arm_gistic() runs on an amp/del tel/cent with zer…

24cc163

…o occurrences in input samples

In addition to in-arm fractional location, record absolute startpos/e…

b309bf1

…ndpos in breakpoint files

Make validate_chr_coordinates() an exported function

20cf56b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactored codebase #6

Refactored codebase #6

jeff-mandell commented Mar 19, 2024

Refactored codebase #6

Are you sure you want to change the base?

Refactored codebase #6

Conversation

jeff-mandell commented Mar 19, 2024