Merge feature-branch into main, accepting changes from feature-branch
haeussma committed May 16, 2024
2 parents 2251ca2 + 98fdfed commit d51826a
Showing 22 changed files with 595 additions and 24 deletions.
160 changes: 160 additions & 0 deletions docs/examples/alignment.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Align sequences"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8b73afa8e8b444578f622d239c439673",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import json\n",
"from pyeed.core import ProteinRecord\n",
"\n",
"\n",
"# load accession ids from json file\n",
"with open(\"ids.json\", \"r\") as f:\n",
"    ids = json.load(f)\n",
"\n",
"sequences = ProteinRecord.get_ids(ids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple Sequence Alignment\n",
"\n",
"To calculate a multiple sequence alignment, create an `MSA` object from a list of `ProteinRecord` objects and call its `clustalo` method. The PyEED Docker Service must be running for `clustalo` to work. The method returns an `AlignmentResult` containing all input `sequences` and the `aligned_sequences`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "99f840a5cc0a4441a7d936127815ab36",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Alignment completed\n"
]
}
],
"source": [
"from pyeed.align import MSA\n",
"\n",
"alignment = MSA(sequences=sequences).clustalo()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an HMM profile\n",
"\n",
"To create a hidden Markov model (HMM) profile, use the `HMM` class. It is constructed from an alignment, such as the one returned by `clustalo`. To check whether a sequence belongs to the profile, use the `search` method. It takes a `ProteinRecord` object and returns an `HMMResult` object containing the `sequence` and its `score` against the profile."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from pyeed.align import HMM\n",
"\n",
"model = HMM(name=\"random profile\", alignment=alignment)\n",
"hits = model.search(sequence=sequences[0])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "pye",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
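The alignment notebook above states that `clustalo` returns an `AlignmentResult` holding the input `sequences` and the `aligned_sequences`. The sketch below does not call `pyeed`. It assumes the aligned sequences have already been extracted as plain gapped strings (the notebook does not show how to do that) and illustrates two checks that apply to any alignment: all rows share one width, and pairwise identity is counted only over columns without gaps. The function names `check_alignment` and `pairwise_identity` are hypothetical, and the rows are toy data.

```python
def check_alignment(rows: list[str]) -> int:
    """Return the alignment width, raising if the rows differ in length."""
    if not rows:
        raise ValueError("alignment is empty")
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError(f"rows have different lengths: {sorted(widths)}")
    return widths.pop()


def pairwise_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap in either row
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


# Toy rows standing in for aligned sequences; not real PyEED output.
rows = ["MLMA-EKIR", "MLMAEEK-R", "M-MAEEKIR"]
width = check_alignment(rows)
identity = pairwise_identity(rows[0], rows[1])
```

Ignoring gapped columns is only one of several common identity definitions; dividing by the full alignment width or by the shorter sequence length gives different numbers, so the convention should be stated wherever such a value is reported.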
153 changes: 153 additions & 0 deletions docs/examples/basics.ipynb
@@ -0,0 +1,153 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Get rich sequence information\n",
"\n",
"## Acquire sequence information based on accession ID(s)\n",
"\n",
"**Single accession ID**\n",
"\n",
"A single sequence can be retrieved with the `get_id` function. It takes an accession ID as input and returns the sequence as a `ProteinRecord` object. \n",
"The `ProteinRecord` object contains the sequence as a string, along with additional information such as the `Organism` and any `Region` or `Site` annotations of the sequence.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from pyeed.core import ProteinRecord\n",
"\n",
"matHM = ProteinRecord.get_id(\"MBP1912539.1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Multiple accession IDs**\n",
"\n",
"To load multiple sequences at once, the `get_ids` function can be used. The function takes a list of accession IDs as input and returns a list of `ProteinRecord` objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Load the saved ids from json\n",
"with open(\"ids.json\", \"r\") as f:\n",
"    ids = json.load(f)\n",
"\n",
"# Get the protein info for each id\n",
"proteins = ProteinRecord.get_ids(ids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Search for similar sequences with BLAST\n",
"\n",
"The `ncbi_blast` method performs a BLAST search on the NCBI server. It is called on a `ProteinRecord` object and returns a list of `ProteinRecord` objects representing the hits of the search.\n",
"The search can be customized with the `n_hits`, `e_value`, `db`, `matrix`, and `identity` arguments, which set the number of hits, the E-value, the database to query, the substitution matrix, and the identity required to accept a hit, respectively.\n",
"\n",
"<div class=\"admonition warning\">\n",
" <p class=\"admonition-title\">NCBI BLAST service might be slow</p>\n",
" <p>Due to the way NCBI handles requests to its BLAST API the service is quite slow. During peak working hours a single search might take more than 15 min.</p>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"blast_results = matHM.ncbi_blast(\n",
"    n_hits=100,\n",
"    e_value=0.05,\n",
"    db=\"swissprot\",\n",
"    matrix=\"BLOSUM62\",\n",
"    identity=0.5,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspect objects\n",
"\n",
"Each `pyeed` object has a rich printed representation: passing it to `print` displays all the information available for the object, which is useful for inspecting the object and its attributes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[4mProteinRecord\u001b[0m\n",
"├── \u001b[94mid\u001b[0m = MBP1912539.1\n",
"├── \u001b[94mname\u001b[0m = S-adenosylmethionine synthetase\n",
"├── \u001b[94morganism\u001b[0m\n",
"│ └── \u001b[4mOrganism\u001b[0m\n",
"│ ├── \u001b[94mid\u001b[0m = ec01bd4b-490f-4908-aa3c-f8435295e9ef\n",
"│ ├── \u001b[94mtaxonomy_id\u001b[0m = 49900\n",
"│ ├── \u001b[94mname\u001b[0m = Thermococcus stetteri\n",
"│ ├── \u001b[94mdomain\u001b[0m = Archaea\n",
"│ ├── \u001b[94mphylum\u001b[0m = Euryarchaeota\n",
"│ ├── \u001b[94mtax_class\u001b[0m = Thermococci\n",
"│ ├── \u001b[94morder\u001b[0m = Thermococcales\n",
"│ ├── \u001b[94mfamily\u001b[0m = Thermococcaceae\n",
"│ └── \u001b[94mgenus\u001b[0m = Thermococcus\n",
"├── \u001b[94msequence\u001b[0m = MLMAEKIRNIVVEEMVRTPVEMQQVELVERKGIGHPDSIADGIAEAVSRALSREYMKRYGIILHHNTDQVEVVGGRAYPQFGGGEVIKPIYILLSGRAVEMVDREFFPVHEVAIKAAKDYLKKAVRHLDIENHVVIDSRIGQGSVDLVGVFNKAKKNPIPLANDTSFGVGYAPLSETERIVLETEKYLNSDEFKKKWPAVGEDIKVMGLRKGDEIDLTIAAAIVDSEVDNPDDYMAVKEAIYEAAKEIVESHTQRPTNIYVNTADDPKEGIYYITVTGTSAEAGDDGSVGRGNRVNGLITPNRHMSMEAAAGKNPVSHVGKIYNILSMLIANDIAEQIEGVEEVYVRILSQIGKPIDEPLVASVQIIPKKGYSIDVLQKPAYEIADEWLANITKIQKMILEDKINVF\n",
"├── \u001b[94mcoding_sequence\u001b[0m\n",
"│ └── 0\n",
"│ └── \u001b[4mRegion\u001b[0m\n",
"│ ├── \u001b[94mid\u001b[0m = JAGGKB010000004.1\n",
"│ ├── \u001b[94mstart\u001b[0m = 39572\n",
"│ └── \u001b[94mend\u001b[0m = 40795\n",
"└── \u001b[94mec_number\u001b[0m = 2.5.1.6\n",
"\n"
]
}
],
"source": [
"print(matHM)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
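Because a single NCBI BLAST search can take many minutes according to the notebook above, it may be worth validating the search arguments locally before submitting. The following is a minimal sketch under stated assumptions: `BlastParams` is a hypothetical helper, not part of `pyeed`; its field names and defaults mirror the keyword arguments used in the notebook, and the range checks (for example, that `identity` is a fraction between 0 and 1) are assumptions inferred from the example values.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BlastParams:
    """Hypothetical container; fields mirror the keyword arguments in the notebook."""

    n_hits: int = 100
    e_value: float = 0.05
    db: str = "swissprot"
    matrix: str = "BLOSUM62"
    identity: float = 0.5

    def __post_init__(self) -> None:
        # Assumed valid ranges; adjust if the library documents different ones.
        if self.n_hits < 1:
            raise ValueError("n_hits must be at least 1")
        if self.e_value <= 0:
            raise ValueError("e_value must be positive")
        if not 0.0 <= self.identity <= 1.0:
            raise ValueError("identity must be a fraction between 0 and 1")


params = BlastParams(n_hits=50, identity=0.8)
kwargs = asdict(params)

# With pyeed installed and a record loaded as in the notebook:
# blast_results = matHM.ncbi_blast(**kwargs)
```

Keeping the parameters in one frozen object also makes it straightforward to record exactly which settings produced a given set of hits.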
26 changes: 26 additions & 0 deletions docs/examples/ids.json
@@ -0,0 +1,26 @@
[
"A1RSD7",
"A8MD44",
"P0CW62",
"A6VHQ4",
"A9A923",
"B6YUL1",
"Q5V2S5",
"Q4JAL1",
"B1YC36",
"P0CW63",
"Q980S9",
"Q3IQF5",
"Q9V1P7",
"Q8PWS4",
"Q5JF22",
"Q8TU57",
"C5A4B7",
"B0R5A8",
"P26498",
"O67275",
"A7I771",
"Q976F3",
"A3MY01",
"Q58605"
]
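Both notebooks read this file with `json.load` and pass the result straight to `ProteinRecord.get_ids`. A small loader that checks the file's shape first can catch a malformed list before any request is made. This is only a sketch: `load_accession_ids` is a hypothetical helper, and dropping duplicates is a design choice of this example, not something `pyeed` is known to require.

```python
import json
from pathlib import Path


def load_accession_ids(path: str) -> list[str]:
    """Read a JSON array of accession IDs, dropping duplicates but keeping order."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of accession IDs")
    seen: set[str] = set()
    ids: list[str] = []
    for item in data:
        if not isinstance(item, str) or not item.strip():
            raise ValueError(f"invalid accession ID: {item!r}")
        accession = item.strip()
        if accession not in seen:
            seen.add(accession)
            ids.append(accession)
    return ids


# Usage, assuming ids.json sits next to the notebook and pyeed is installed:
# ids = load_accession_ids("ids.json")
# sequences = ProteinRecord.get_ids(ids)
```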
4 changes: 2 additions & 2 deletions docs/index.md
@@ -4,7 +4,7 @@

The API is currently under construction and is subject to change.

## What is PyEED?
## 🤔 What is PyEED?

`pyeed` is a Python toolkit that allows easy creation, annotation, and analysis of sequence data. All functionalities are based on a data model, which integrates all information on a given nucleotide or protein sequence in a single object. This allows the bundling of all information on a given sequence, making it available in all creation, annotation, and analysis steps. The entire system is generic and applies to various research scenarios.
`pyeed` is designed to enable object-oriented programming for bioinformatics.
@@ -13,6 +13,6 @@

The data structure of `pyeed` is based on a [data model](https://github.com/PyEED/pyeed/blob/main/specifications/data_model.md)(1), describing the relation between all attributes of a sequence. These attributes include the sequence, the organism, and annotations of the sequence such as sites and regions within the sequence. Furthermore, each piece of information is annotated with its origin.
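
To make the idea of a single object bundling sequence, organism, and annotations concrete, here is an illustrative sketch using plain dataclasses. These classes are stand-ins written for this example and are not `pyeed`'s actual classes. The field names and values are taken from the `print(matHM)` output in `docs/examples/basics.ipynb`, with the sequence truncated for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Organism:
    taxonomy_id: int
    name: str
    domain: Optional[str] = None


@dataclass
class Region:
    id: str
    start: int
    end: int


@dataclass
class ProteinSketch:
    """Illustrative stand-in, not pyeed's ProteinRecord."""

    id: str
    name: str
    sequence: str
    organism: Optional[Organism] = None
    coding_sequence: list[Region] = field(default_factory=list)
    ec_number: Optional[str] = None


record = ProteinSketch(
    id="MBP1912539.1",
    name="S-adenosylmethionine synthetase",
    sequence="MLMAEKIRNIVVEEMVRTPVEMQQVELVERKG",  # truncated for the example
    organism=Organism(taxonomy_id=49900, name="Thermococcus stetteri", domain="Archaea"),
    coding_sequence=[Region(id="JAGGKB010000004.1", start=39572, end=40795)],
    ec_number="2.5.1.6",
)
```

Because every annotation hangs off the same record, any later analysis step can reach, for instance, `record.organism.name` without a separate lookup.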

## 🛠️ Tools
## 🧰 Tools

`pyeed` implements common tools for clustering, aligning, and visualizing sequences. CLI tools such as `Clustal Omega` are implemented as a Docker Service, allowing easy installation and usage of these tools.