Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactored codebase #6

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

jeff-mandell
Copy link

Hi, here's a summary of the contents of the pull request:

  • I organized scripts and data in accordance with R pacakge conventions. Installation of BISCUT along with all R/Python dependencies is, with luck, 1-2 lines:
	# install remotes package if not already installed
	install.packages('remotes')

	# install the forked version and any dependencies
	remotes::install_github('jeff-mandell/BISCUT-py3')

	# after merging the pull request, installation would be 
	# remotes::install_github('beroukhim-lab/BISCUT-py3')

Please let me know if this doesn't work for you. It's supposed to work, but on one system that I tested, I needed to force re-installation of miniconda in order to get BISCUT's Python dependencies to install correctly (via reticulate::install_miniconda(force = T)).

  • I wrapped the BISCUT workflow into R functions so that the user can do everything from the R environment. The worflow is to call make_breakpoint_files() to process a segment file (such as PANCAN_ISAR.seg.txt) and produce a breakpoint file directory, and then to call do_biscut() with the
    breakpoint file directory as an input. Both functions can use parallelization when the cores argument is specified. I made some other optimizations to speed runtime. Besides printing output files, I also made the main function return peak information within R. Here is example usage:
# Assumes segment file exists in working directory.
library(BISCUT)
make_breakpoint_files(segment_file = 'PANCAN_ISAR.seg.txt', output_dir = 'my_breakpoints', cores = 8)
results = do_biscut(breakpoint_file_dir = 'my_breakpoints', results_dir = 'my_biscut_output', cores = 8)

# see ?do_biscut for documentation of output and function arguments

  • The output directory and reference files (abslocs, genelocs) can now be user-specified, as can parameters that could previously only be changed by editing source code (output directory, telcent_thres, amplitude_threshold, confidence interval, n bootstraps). The previously hardcoded data and parameter values are used as defaults when unspecified by the user.

  • The code was doing a lot of error suppression which could prevent the user from knowing about potential problems. I removed most of this and resolved any errors and warnings. There are probably still some that could be triggered by situations or system configurations that I didn't test.

Questions/issues

  • There appeared to be an error in the chromosome coordinates file (now at inst/extdata/SNP6_hg19_chromosome_locs_200605.txt). Every chromosome has q_start1 = centromere + offset except for chr15, where instead q_start1 = q_end1 (= 2409817111). It appears that the q_start1 value was accidentally copied from q_end1, unless there is something I don't understand about the format. I changed the value to match the other chromosomes. Please let me know if it should be changed back.

  • FYI, process_for_ggplot_jagged() doesn't run, partly because it wants to load files ending in plotpeaks.txt. I couldn't find any code in the project that generates such files. Therefore, do_biscut() currently just calls process_for_ggplot().

  • plot_fig_2() is runnable, but the plots look odd. I couldn't tell which plot(s) from the paper they are supposed to resemble. If there isn't a continuing need for the function, I could remove the filter_BISCUT_arms() call from the main function to avoid generating unused files.

  • Previously, there were hard-coded genes of interest--c('CDKN2A','TERT','MYC','BAP1','TERC','TP53','ARID1A','EGFR','PPM1D')--in BISCUT_peak_finding.R. I removed this declaration, as the user can instead call filter_BISCUT_knowngenes (via an R wrapper function, extract_gene_results()) on any desired genes. However, since process_for_ggplot_jagged() isn't running, I am not sure of the use case for filter_BISCUT_knowngenes().

Suggestions:

  • Fill in the placeholder information in the package's top-level DESCRIPTION file: title, description, and contributors. The conventions for filling in contributors are described here: https://r-pkgs.org/description.html#sec-description-authors-at-r.
  • Also in the DESCRIPTION, I temporarily put in version 1.2.0.9000 to indicate a development version. This should be changed to 1.3.0 (or 2.0.0, if you are into semantic versioning, or any other version number you like).
  • The README should be updated. I copied a lot of the descriptions of files and parameters into the R scripts, so that the information appears when a user calls ?do_biscut() or ?make_breakpoint_files(). Therefore, some of this information could be removed from the README. If you think it is a good idea, a compressed segment file, perhaps covering fewer samples than the PANCAN file, could be saved in the repository so that the README could give runnable code on an example segment file.
  • Most of the data from do_biscut() is currently both returned within R and printed to output files, in somewhat different formats. If you are interested, I could further prune down output data and make all output file writing optional. I could also reduce the breakpoint file directory to a single breakpoint file, and make do_biscut() accept either the single file or a data.table/data.frame version of it.

… boundary finding and breakpoint file creation; refactor scripts to reduce number of output files; put output files in user-specified output directory; allow key parameters to be passed to function calls rather than hard-coded in source files. Fix (presumed) error in chr coordinate file. Eliminate warnings and errors in R and Python.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant