diff --git a/templates/maldi-pipeline.ipynb b/templates/maldi-pipeline.ipynb index de1c35b..af3318f 100644 --- a/templates/maldi-pipeline.ipynb +++ b/templates/maldi-pipeline.ipynb @@ -4,14 +4,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# MALDI Extraction" + "# MALDI Extraction\n", + "\n", + "This notebook is the analysis pipeline for the MALDI data. For each MALDI run, this notebook runs the following workflow:\n", + "\n", + "- **Spectra extraction**: read m/z spectra and corresponding intensities from the binary files\n", + "- **Peak filtering**: identify most prominent m/z peaks, along with their widths, heights, areas, etc.\n", + "- **Coordinate integration**: map peak information onto the MALDI slide\n", + "- **Glycan matching**: map filtered peaks to master glycan list\n", + "- **Core-level analysis (TMAs only)**: crop out specific cores, extract core-level stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Libraries" + "## 0. Import Modules" ] }, { @@ -34,21 +42,21 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "## File Paths" - ] - }, - { - "cell_type": "code", - "execution_count": null, "metadata": { "tags": [] }, - "outputs": [], "source": [ - "data_name = \"panc2055_imzML\"\n", - "data_file = pathlib.Path(data_name) / \"panc2055.imzML\"" + "## 1. Define Global Constants\n", + "\n", + "### 1.1. File paths\n", + "\n", + "The following variables are defined pointing to paths in your MALDI run. **Only `base_dir` should be changed, the other folders should maintain the existing names and subdirectory structure.**\n", + "\n", + "* `base_dir`: the path to your MALDI data.\n", + "* `imzml_dir`: the path inside `base_dir` to your `imzml` folder. This will contain the `.imzml` and `.ibd` files extracted from SCiLS.\n", + "* `library_dir`: the path inside `base_dir` to your `libraries` folder. This will contain the master list to use for glycan matching.\n", + "* `extraction_dir`: the path inside `base_dir` to your `extracted` folder. Contains the integrated images for each filtered peak and glycan-matched peak across the slide.\n", + "* `debug_dir`: the path inside `base_dir` to your `debug` folder. Individual peak height and width info is saved here for debugging purposes." ] }, { @@ -59,22 +67,18 @@ }, "outputs": [], "source": [ - "base_dir = pathlib.Path(\"../data\")\n", + "base_dir = pathlib.Path(\"../data/panc2055\")\n", "imzml_dir = base_dir / \"imzml\"\n", "library_dir = base_dir / \"libraries\"\n", - "extraction_dir = base_dir / data_name / \"extracted\"\n", - "debug_dir = base_dir / data_name / \"debug\"" + "extraction_dir = base_dir / \"output\" / \"extracted\"\n", + "debug_dir = base_dir / \"output\" / \"debug\"" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], + "cell_type": "markdown", + "metadata": {}, "source": [ - "data_path = imzml_dir / data_file" + "Create the directory structure (if it already exists, nothing is overwritten)." ] }, { @@ -86,7 +90,7 @@ "outputs": [], "source": [ "# Create directories\n", - "for directory in [base_dir, library_dir, extraction_dir, debug_dir]:\n", + "for directory in [base_dir, imzml_dir, library_dir, extraction_dir, debug_dir]:\n", " if not os.path.exists(directory):\n", " directory.mkdir(parents=True, exist_ok=True)" ] @@ -95,59 +99,33 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Plotting Parameters" + "**NOTE: At this point, ensure you have the following files in the following locations:**\n", + "\n", + "* **`.imzml`/`.ibd` files**: extracted using SCiLS. **Either explicitly point to the `imzml` subfolder when extracting, or manually copy these files to the `imzml` subfolder afterwards.**\n", + "* **Master glycan list**: defining a singular master glycan list is a WIP. For now, ask a lab member for the peak list to use. **This file must be manually copied into the `libraries` subfolder.**\n", + "\n", + "And define the following variable:\n", + "\n", + "* `imzml_file`: the name of the `.imzml` file in the `imzml` subfolder." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "plt.rcParams[\"figure.figsize\"] = (20, 13)\n", - "plt.rcParams[\"ytick.color\"] = \"w\"\n", - "plt.rcParams[\"xtick.color\"] = \"w\"\n", - "plt.rcParams[\"axes.labelcolor\"] = \"w\"\n", - "plt.rcParams[\"axes.edgecolor\"] = \"w\"\n", - "plt.rcParams[\"axes.facecolor\"] = \"black\"\n", - "plt.rcParams[\"savefig.edgecolor\"] = \"w\"\n", - "plt.rcParams[\"savefig.facecolor\"] = \"black\"\n", - "plt.rcParams[\"figure.facecolor\"] = \"black\"\n", - "plt.rcParams[\"figure.constrained_layout.use\"] = False" - ] - }, - { - "cell_type": "markdown", "metadata": {}, - "source": [ - "## Load necessary files" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### ImzML Data file" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, "outputs": [], "source": [ - "imz_data = ImzMLParser(data_path, include_spectra_metadata=\"full\")" + "imzml_file = \"panc2055.imzml\"\n", + "imzml_path = imzml_dir / imzml_file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Library Peak List" + "### 1.2. Plotting parameters\n", + "\n", + "Define the pltoting parameters." ] }, { @@ -158,42 +136,37 @@ }, "outputs": [], "source": [ - "library_peak_list = library_dir / \"glycan_peaklist_KL.csv\"\n", - "library_peak_df = pd.read_csv(library_peak_list)\n", - "\n", - "library_peak_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Constants" + "plt.rcParams[\"figure.figsize\"] = (20, 13)\n", + "plt.rcParams[\"ytick.color\"] = \"w\"\n", + "plt.rcParams[\"xtick.color\"] = \"w\"\n", + "plt.rcParams[\"axes.labelcolor\"] = \"w\"\n", + "plt.rcParams[\"axes.edgecolor\"] = \"w\"\n", + "plt.rcParams[\"axes.facecolor\"] = \"black\"\n", + "plt.rcParams[\"savefig.edgecolor\"] = \"w\"\n", + "plt.rcParams[\"savefig.facecolor\"] = \"black\"\n", + "plt.rcParams[\"figure.facecolor\"] = \"black\"\n", + "plt.rcParams[\"figure.constrained_layout.use\"] = False" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "intensity_percentile = 99" - ] - }, - { - "cell_type": "markdown", "metadata": {}, - "source": [ - "## Spectrum Extraction" - ] + "outputs": [], + "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Extract the *m/z* and *intensity* values." + "## 2. Data Loading\n", + "\n", + "### 2.1. Spectra\n", + "\n", + "Load in the spectra information defined by the `.imzml`/`.ibd` file. The following variables are extracted:\n", + "\n", + "* `total_mass_df`: tabulates every m/z value found along with their corresponding intensity (summed across all pixels).\n", + "* `thresholds`: defines the nth intensity value (defined by `intensity_percentile`, computed across all m/z values) found across each pixel." ] }, { @@ -204,51 +177,47 @@ }, "outputs": [], "source": [ + "# define the .imzml/.ibd loader object\n", + "imz_data = ImzMLParser(imzml_path, include_spectra_metadata=\"full\")\n", + "\n", + "# extract the spectra and threshold array\n", "total_mass_df, thresholds = extraction.extract_spectra(\n", " imz_data=imz_data, intensity_percentile=intensity_percentile\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "display(total_mass_df)" + ")\n", + "\n", + "# define the global intensity threshold\n", + "intensity_percentile = 99\n", + "global_intensity_threshold = np.percentile(total_mass_df[\"intensity\"].values, intensity_percentile)\n", + "print(f\"Global Intensity Threshold: {global_intensity_threshold}\")\n", + "\n", + "# display the format and the top few peaks extracted\n", + "display(total_mass_df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Global Intensity Threshold\n", - "\n", - "Display the $n$ largest intensities, as well as the $m$-th intensity percentile, and set that as the *global intensity threshold*." + "For additional verification, set `largest_intensity_count` to see the `n` largest intensities." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [] - }, + "metadata": {}, "outputs": [], "source": [ - "largest_intensity_count = 10" + "largest_intensity_count = 10\n", + "total_mass_df.nlargest(largest_intensity_count, [\"intensity\"])" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], + "cell_type": "markdown", + "metadata": {}, "source": [ - "total_mass_df.nlargest(largest_intensity_count, [\"intensity\"])" + "### 2.2. Master library peaks\n", + "\n", + "Load the master glycan peak list, this will be used by the library matching process." ] }, { @@ -259,22 +228,24 @@ }, "outputs": [], "source": [ - "global_intensity_threshold = np.percentile(total_mass_df[\"intensity\"].values, intensity_percentile)\n", - "print(f\"Global Intensity Threshold: {global_intensity_threshold}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Peak Detection" + "library_peak_list = library_dir / \"glycan_peaklist_KL.csv\"\n", + "library_peak_df = pd.read_csv(library_peak_list)\n", + "\n", + "# visualize the top few master peaks defined\n", + "library_peak_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Rolling Window Method" + "## 3. Peak Analysis\n", + "\n", + "### 3.1. Define prominence thresholds\n", + "\n", + "Although various peaks can be identified across a run's spectra, the vast majority of them will be too small to indicate any meaningful signal. To address this, a prominence-based peak filtering method is used (see https://www.mathworks.com/help/signal/ug/prominence.html for a good definition of prominence).\n", + "\n", + "The first step is to use a rolling window-based approach to extract the prominence thresholds to use for peak filtering." ] }, { @@ -294,7 +265,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Plot Intensities" + "Visualizes how the thresholds you chose affect the peak candidates identified." ] }, { @@ -317,7 +288,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Signal Extraction" + "### 3.2. Extract and filter the m/z peaks\n", + "\n", + "Once you're happy with the prominence thresholds defined in `log_int_percentile`, run the following cell to identify the m/z peaks, which does the following:\n", + "\n", + "1. A traditional local maxima-based approach applies the first filter\n", + "2. For the remaining candidates, the `log_int_percentile` prominence thresholds apply a second filter to remove insignificant peaks" ] }, { @@ -330,16 +306,16 @@ "source": [ "peak_candidate_idxs, peak_candidates = extraction.signal_extraction(\n", " total_mass_df=total_mass_df, log_int_percentile=log_int_percentile\n", - ")" + ")\n", + "\n", + "print(f\"Candiate Peak Count: {len(peak_candidates)}\")" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "print(f\"Candiate Peak Count: {len(peak_candidates)}\")" + "Visualize the discovered peaks." ] }, { @@ -362,7 +338,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Get Peak Widths" + "### 3.3. Compute peak widths\n", + "\n", + "For each peak, compute the corresponding width at 10% of the height defined from the peak's base. This will be necessary for coordinate integration (WIP)." ] }, { @@ -385,16 +363,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Save Peak Spectra" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "save_peak_spectra_debug = True" + "### 3.4. Save Peak Spectra\n", + "\n", + "Define the m/z value of each peak, along with their corresponding lower and upper m/z bounds.\n", + "\n", + "* `save_peak_spectra_debug`: whether to save the corresponding peak spectra graphs to the `debug` folder. We highly recommend leaving this as `True`." ] }, { @@ -405,6 +378,8 @@ }, "outputs": [], "source": [ + "save_peak_spectra_debug = True\n", + "\n", "panel_df = extraction.peak_spectra(\n", " total_mass_df=total_mass_df,\n", " peak_df=peak_df,\n", @@ -415,25 +390,18 @@ " r_ips_r=r_ips_r,\n", " save_peak_spectra_debug=save_peak_spectra_debug,\n", " debug_dir=debug_dir,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "panel_df" + ")\n", + "\n", + "display(panel_df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Integrate Coordinates\n", + "## 4. Coordinate Integration\n", "\n", - "Generate the images and save them in an *xarray*, where the dimensions are: Image (indexed by peak value), $x$, and $y$." + "Once peaks have been identified, we need a way of mapping this information back to the slide. Across a coordinate's m/z spectrum, if it is also an identified peak from step 3, we store the corresponding intensity at that coordinate." ] }, { @@ -447,22 +415,14 @@ "image_data = extraction.coordinate_integration(peak_df=peak_df, imz_data=imz_data)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "image_data" - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Histogram preview of the Intensities of a given Peak\n", + "For QC purposes, visualize the intensity distribution around a desired peak\n", "\n", - "Set a value for `desired_peak_hist` (ideally something from your library) and it'll find the nearest peak, and display a histogram of the intensities of the image with `bin_count` bins." + "* `desired_peak_hist`: the peak around where you want to visualize corresponding intensities (ideally something from your library)\n", + "* `bin_count`: number of bins to use for the histogram" ] }, { @@ -474,18 +434,19 @@ "outputs": [], "source": [ "desired_peak_hist = 1809.639659\n", - "bin_count = 40" + "bin_count = 40\n", + "\n", + "_ = image_data.sel(peak=[desired_peak_hist], method=\"nearest\").plot.hist(bins=bin_count)" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], + "cell_type": "markdown", + "metadata": {}, "source": [ - "image_data.sel(peak=[desired_peak_hist], method=\"nearest\").plot.hist(bins=bin_count)" + "Save the integrated intensity images per peak to the `extracted` folder. For every peak, a float32 and int32 image will be saved to the `float` and `int` subdirectories respectively.\n", + "\n", + "* The `float` images should be used for quantitative downstresam analysis\n", + "* The `int` images are saved for visualization, as they're more compatible with most image viewers" ] }, { @@ -503,14 +464,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Match Glycan Library with Extracted Peaks" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Constants" + "## 5. Glycan Library Matching\n", + "\n", + "While the filtered peaks provide meaningful information, they're not particularly useful without knowing what glycan encompasses them. The master glycan library list (`library_peak_df`) defines the glycans of interest as well as the m/z value they're centered at. In this way, peak values can be mapped to their corresponding glycan within a tolerance range." ] }, { @@ -521,20 +477,16 @@ }, "outputs": [], "source": [ - "ppm = 100" + "matched_peaks_df = extraction.library_matching(\n", + " image_xr=image_data, library_peak_df=library_peak_df, ppm=50, extraction_dir=extraction_dir\n", + ")" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], + "cell_type": "markdown", + "metadata": {}, "source": [ - "matched_peaks_df = extraction.library_matching(\n", - " image_xr=image_data, library_peak_df=library_peak_df, ppm=ppm, extraction_dir=extraction_dir\n", - ")" + "As with the original peak images, the library matched intensity images are also saved, this time to the `output/extracted/library_matched` folder. There will likewise be `float` and `int` subdirectories containing float32 and int32 representations." ] }, { @@ -554,7 +506,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Core Naming and Cropping" + "## 6. Core Naming and Cropping" ] }, { @@ -697,7 +649,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.6" + "version": "3.11.10" }, "vscode": { "interpreter": {