Merge feature-branch into main, accepting changes from feature-branch

PyEED · May 16, 2024 · d51826a
2 parents 2251ca2 + 98fdfed

Showing 22 changed files with 595 additions and 24 deletions.
diff --git a/docs/examples/alignment.ipynb b/docs/examples/alignment.ipynb
@@ -0,0 +1,160 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Align sequences"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "8b73afa8e8b444578f622d239c439673",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Output()"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
+       "</pre>\n"
+      ],
+      "text/plain": [
+       "\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "import json\n",
+    "from pyeed.core import ProteinRecord\n",
+    "\n",
+    "\n",
+    "# load accession ids from json file\n",
+    "with open(\"ids.json\", \"r\") as f:\n",
+    "    ids = json.load(f)\n",
+    "\n",
+    "sequences = ProteinRecord.get_ids(ids)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Multiple Sequence Alignment\n",
+    "\n",
+    "A multiple sequence alignment is computed by creating an `MSA` object from a list of `ProteinRecord` objects and calling its `clustalo` method. The `clustalo` method requires the PyEED Docker Service to be running. It returns an `AlignmentResult` containing the input `sequences` and the `aligned_sequences`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "99f840a5cc0a4441a7d936127815ab36",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Output()"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ Alignment completed\n"
+     ]
+    }
+   ],
+   "source": [
+    "from pyeed.align import MSA\n",
+    "\n",
+    "alignment = MSA(sequences=sequences).clustalo()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create an HMM profile\n",
+    "\n",
+    "To create a hidden Markov model (HMM) profile, use the `HMM` class. The class builds the model from an alignment, passed as `alignment`. To check whether a sequence belongs to the profile, use the `search` method. This method takes a `ProteinRecord` object and returns an `HMMResult` object containing the `sequence` and the `score` of the sequence against the profile."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyeed.align import HMM\n",
+    "\n",
+    "model = HMM(name=\"random profile\", alignment=alignment)\n",
+    "hits = model.search(sequence=sequences[0])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "pye",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/examples/basics.ipynb b/docs/examples/basics.ipynb
@@ -0,0 +1,153 @@
+{
+    "cells": [
+        {
+            "cell_type": "markdown",
+            "metadata": {},
+            "source": [
+                "# Get rich sequence information\n",
+                "\n",
+                "## Acquire sequence information based on accession id(s)\n",
+                "\n",
+                "**Single accession ID**\n",
+                "\n",
+                "Single sequences can be retrieved using the `get_id` function. The function takes an accession id as input and returns the sequence as a `ProteinRecord` object.  \n",
+                "The `ProteinRecord` object contains the sequence as a string along with additional information, such as the `Organism` and the `Region` or `Site` annotations of the sequence.\n"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 2,
+            "metadata": {},
+            "outputs": [],
+            "source": [
+                "from pyeed.core import ProteinRecord\n",
+                "\n",
+                "matHM = ProteinRecord.get_id(\"MBP1912539.1\")"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "metadata": {},
+            "source": [
+                "**Multiple accession IDs**\n",
+                "\n",
+                "To load multiple sequences at once, the `get_ids` function can be used. The function takes a list of accession IDs as input and returns a list of `ProteinRecord` objects."
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": null,
+            "metadata": {},
+            "outputs": [],
+            "source": [
+                "import json\n",
+                "\n",
+                "# Load the saved ids from json\n",
+                "with open(\"ids.json\", \"r\") as f:\n",
+                "    ids = json.load(f)\n",
+                "\n",
+                "# Get the protein info for each id\n",
+                "proteins = ProteinRecord.get_ids(ids)"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "metadata": {},
+            "source": [
+                "## Search for similar sequences with BLAST\n",
+                "\n",
+                "The `ncbi_blast` method can be used to perform a BLAST search on the NCBI server. The method can be applied to a `ProteinRecord` object and returns a list of `ProteinRecord` objects that represent the hits of the BLAST search.\n",
+                "The search can be customized with `n_hits` (number of hits), `e_value` (E-value), `db` (database to query), `matrix` (substitution matrix), and `identity` (identity required to accept a hit).\n",
+                "\n",
+                "<div class=\"admonition warning\">\n",
+                "    <p class=\"admonition-title\">NCBI BLAST service might be slow</p>\n",
+                "    <p>Because of how NCBI handles requests to its BLAST API, the service is quite slow. During peak working hours, a single search might take more than 15 minutes.</p>\n",
+                "</div>"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": null,
+            "metadata": {},
+            "outputs": [],
+            "source": [
+                "blast_results = matHM.ncbi_blast(\n",
+                "    n_hits=100,\n",
+                "    e_value=0.05,\n",
+                "    db=\"swissprot\",\n",
+                "    matrix=\"BLOSUM62\",\n",
+                "    identity=0.5,\n",
+                ")"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "metadata": {},
+            "source": [
+                "## Inspect objects\n",
+                "\n",
+                "Every `pyeed` object has a rich string representation: passing it to `print` displays all the information available for the object. This is useful for inspecting the object and its attributes."
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 2,
+            "metadata": {},
+            "outputs": [
+                {
+                    "name": "stdout",
+                    "output_type": "stream",
+                    "text": [
+                        "\u001b[4mProteinRecord\u001b[0m\n",
+                        "├── \u001b[94mid\u001b[0m = MBP1912539.1\n",
+                        "├── \u001b[94mname\u001b[0m = S-adenosylmethionine synthetase\n",
+                        "├── \u001b[94morganism\u001b[0m\n",
+                        "│   └── \u001b[4mOrganism\u001b[0m\n",
+                        "│       ├── \u001b[94mid\u001b[0m = ec01bd4b-490f-4908-aa3c-f8435295e9ef\n",
+                        "│       ├── \u001b[94mtaxonomy_id\u001b[0m = 49900\n",
+                        "│       ├── \u001b[94mname\u001b[0m = Thermococcus stetteri\n",
+                        "│       ├── \u001b[94mdomain\u001b[0m = Archaea\n",
+                        "│       ├── \u001b[94mphylum\u001b[0m = Euryarchaeota\n",
+                        "│       ├── \u001b[94mtax_class\u001b[0m = Thermococci\n",
+                        "│       ├── \u001b[94morder\u001b[0m = Thermococcales\n",
+                        "│       ├── \u001b[94mfamily\u001b[0m = Thermococcaceae\n",
+                        "│       └── \u001b[94mgenus\u001b[0m = Thermococcus\n",
+                        "├── \u001b[94msequence\u001b[0m = MLMAEKIRNIVVEEMVRTPVEMQQVELVERKGIGHPDSIADGIAEAVSRALSREYMKRYGIILHHNTDQVEVVGGRAYPQFGGGEVIKPIYILLSGRAVEMVDREFFPVHEVAIKAAKDYLKKAVRHLDIENHVVIDSRIGQGSVDLVGVFNKAKKNPIPLANDTSFGVGYAPLSETERIVLETEKYLNSDEFKKKWPAVGEDIKVMGLRKGDEIDLTIAAAIVDSEVDNPDDYMAVKEAIYEAAKEIVESHTQRPTNIYVNTADDPKEGIYYITVTGTSAEAGDDGSVGRGNRVNGLITPNRHMSMEAAAGKNPVSHVGKIYNILSMLIANDIAEQIEGVEEVYVRILSQIGKPIDEPLVASVQIIPKKGYSIDVLQKPAYEIADEWLANITKIQKMILEDKINVF\n",
+                        "├── \u001b[94mcoding_sequence\u001b[0m\n",
+                        "│   └── 0\n",
+                        "│       └── \u001b[4mRegion\u001b[0m\n",
+                        "│           ├── \u001b[94mid\u001b[0m = JAGGKB010000004.1\n",
+                        "│           ├── \u001b[94mstart\u001b[0m = 39572\n",
+                        "│           └── \u001b[94mend\u001b[0m = 40795\n",
+                        "└── \u001b[94mec_number\u001b[0m = 2.5.1.6\n",
+                        "\n"
+                    ]
+                }
+            ],
+            "source": [
+                "print(matHM)"
+            ]
+        }
+    ],
+    "metadata": {
+        "kernelspec": {
+            "display_name": "Python 3 (ipykernel)",
+            "language": "python",
+            "name": "python3"
+        },
+        "language_info": {
+            "codemirror_mode": {
+                "name": "ipython",
+                "version": 3
+            },
+            "file_extension": ".py",
+            "mimetype": "text/x-python",
+            "name": "python",
+            "nbconvert_exporter": "python",
+            "pygments_lexer": "ipython3",
+            "version": "3.11.5"
+        }
+    },
+    "nbformat": 4,
+    "nbformat_minor": 4
+}
diff --git a/docs/examples/ids.json b/docs/examples/ids.json
@@ -0,0 +1,26 @@
+[
+    "A1RSD7",
+    "A8MD44",
+    "P0CW62",
+    "A6VHQ4",
+    "A9A923",
+    "B6YUL1",
+    "Q5V2S5",
+    "Q4JAL1",
+    "B1YC36",
+    "P0CW63",
+    "Q980S9",
+    "Q3IQF5",
+    "Q9V1P7",
+    "Q8PWS4",
+    "Q5JF22",
+    "Q8TU57",
+    "C5A4B7",
+    "B0R5A8",
+    "P26498",
+    "O67275",
+    "A7I771",
+    "Q976F3",
+    "A3MY01",
+    "Q58605"
+]
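Note on `ids.json`: both notebooks load this file with `json.load` and pass the list straight to `ProteinRecord.get_ids`. The entries look like UniProtKB accessions, so a small pre-check could catch typos or duplicates before any network request is made. The sketch below is not part of this commit. The helper name `load_accession_ids` is hypothetical, and it assumes the file contains only UniProtKB-style accessions, which would reject NCBI identifiers such as the `MBP1912539.1` used in `basics.ipynb`.

```python
import json
import re
from pathlib import Path

# UniProtKB accession format (6 or 10 characters), as published by UniProt.
# Assumption: ids.json is meant to hold only this kind of identifier.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def load_accession_ids(path):
    """Load a JSON list of accession IDs (hypothetical helper).

    Raises ValueError on malformed entries and drops duplicates while
    keeping the original order.
    """
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(raw, list):
        raise ValueError("expected a JSON list of accession IDs")

    seen = set()
    ids = []
    for entry in raw:
        # fullmatch ensures the whole string is an accession, not just a prefix
        if not isinstance(entry, str) or not UNIPROT_ACCESSION.fullmatch(entry):
            raise ValueError(f"not a UniProtKB-style accession: {entry!r}")
        if entry not in seen:
            seen.add(entry)
            ids.append(entry)
    return ids
```

In the notebooks, `ids = load_accession_ids("ids.json")` would then stand in for the `open` / `json.load` pair, with the result passed to `ProteinRecord.get_ids(ids)` as before.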
diff --git a/docs/index.md b/docs/index.md
@@ -4,7 +4,7 @@
 
     The API is currently under construction and is subject to change.
 
-## What is PyEED?
+## 🤔 What is PyEED?
 
 `pyeed` is a Python toolkit that allows easy creation, annotation, and analysis of sequence data. All functionalities are based on a data model, which integrates all information on a given nucleotide or protein sequence in a single object. This allows the bundling of all information on a given sequence, making it available in all creation, annotation, and analysis steps. The entire system is generic and applies to various research scenarios.  
 `pyeed` is designed to enable object-oriented programming for bioinformatics. 
@@ -13,6 +13,6 @@
 
 The data structure of `pyeed` is based on a [data model](https://github.com/PyEED/pyeed/blob/main/specifications/data_model.md)(1), describing the relation between all attributes of a sequence. These attributes include the sequence, the organism, and annotations of the sequence such as sites and regions within the sequence. Furthermore, the information is marked with annotations, marking the origin of the information. 
 
-## 🛠️ Tools
+## 🧰 Tools
 
 `pyeed` implements common tools for clustering, aligning, and visualizing sequences. CLI tools such as `Clustal Omega` are implemented as a Docker Service, allowing easy installation and usage of these tools.