major changes in mztab for large datasets #43
make diann2mztab optional for large-scale datasets
Walkthrough
The pull request introduces two main changes: updating the module version number and adding a new command-line option to control mzTab conversion.
Sequence Diagram
sequenceDiagram
participant User
participant diann2mztab Function
User->>diann2mztab Function: Call with enable_diann2mztab flag
alt Flag is True
diann2mztab Function->>mzTab: Convert to mzTab
else Flag is False
diann2mztab Function->>Other Formats: Convert to MSstats/Triqler
end
Actionable comments posted: 1
🔭 Outside diff range comments (1)
quantmsutils/diann/diann2mztab.py (1)
Line range hint 1392-1394: Optimize protein coverage calculation for large datasets
The protein coverage calculation could be optimized by using vectorized operations and reducing memory usage.
- for acc in acc_to_ids:
-     matches = fasta_df[fasta_df["id"].str.contains(acc)]["id"]
+ # Vectorize the operation using pandas str.extract
+ pattern = "|".join(map(re.escape, acc_to_ids.keys()))
+ matches = fasta_df["id"].str.extract(f"({pattern})", expand=False)
+ acc_to_fasta_ids = dict(zip(matches.dropna(), fasta_df.loc[matches.dropna().index, "id"]))
Also applies to: 1401-1403
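The vectorized idea can be sketched in isolation. This is a minimal illustration rather than the code in diann2mztab.py: the function name `map_accessions_to_fasta_ids` and the sample data are hypothetical, and the sketch assumes each FASTA identifier contains at most one accession of interest. Unlike the `dict(zip(...))` form in the suggestion, it keeps every matching identifier per accession instead of only the last one.

```python
import re

import pandas as pd


def map_accessions_to_fasta_ids(fasta_df: pd.DataFrame, accessions) -> dict:
    """Map each accession to the FASTA ids that contain it (hypothetical helper)."""
    # Longest accessions first, so a longer accession wins over one that is its prefix.
    ordered = sorted(set(accessions), key=len, reverse=True)
    if not ordered:
        return {}
    pattern = "(" + "|".join(re.escape(acc) for acc in ordered) + ")"
    # One regex pass over the column instead of one str.contains call per accession.
    hits = fasta_df["id"].str.extract(pattern, expand=False)
    mask = hits.notna()
    # Group the matching ids by the accession that was found in them.
    return fasta_df.loc[mask, "id"].groupby(hits[mask]).agg(list).to_dict()


# Hypothetical sample data, for illustration only.
fasta_df = pd.DataFrame(
    {"id": ["sp|P12345|PROT1_HUMAN", "sp|Q67890|PROT2_HUMAN", "tr|A0A000|X_HUMAN"]}
)
acc_to_fasta_ids = map_accessions_to_fasta_ids(fasta_df, ["P12345", "Q67890", "Z99999"])
```

Note that `str.extract` records only the first match per row, so an identifier that contains several accessions is assigned to one of them; the original loop with `str.contains` does not have that limitation.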
🧹 Nitpick comments (3)
quantmsutils/diann/diann2mztab.py (3)
48-48: Improve the click option description
The click option could benefit from a more descriptive help message to explain its purpose and default value.
-@click.option("--enable_diann2mztab", "-e", is_flag=True)
+@click.option("--enable_diann2mztab", "-e", is_flag=True, help="Enable conversion to mzTab format. Disabled by default to improve performance for large datasets.")
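As a rough illustration of how such a flag behaves, the sketch below wires the option into a stand-alone click command. The command name `convert` and the echoed messages are hypothetical placeholders and do not reproduce the real diann2mztab command; only the option declaration follows the suggestion above.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--enable_diann2mztab",
    "-e",
    is_flag=True,
    help="Enable conversion to mzTab format. Disabled by default to improve "
    "performance for large datasets.",
)
def convert(enable_diann2mztab: bool) -> None:
    """Hypothetical stand-in command; only the flag handling is shown."""
    if enable_diann2mztab:
        click.echo("converting to mzTab")
    else:
        click.echo("mzTab conversion skipped")


# Exercise the command in-process: without the flag, then with it.
runner = CliRunner()
default_run = runner.invoke(convert, [])
flag_run = runner.invoke(convert, ["-e"])
help_run = runner.invoke(convert, ["--help"])
```

Because `is_flag=True` defaults to `False`, omitting the option skips the mzTab step, which matches the opt-in behavior described for large datasets.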
Line range hint 1024-1025: Consider extracting file finding logic into a separate function
The file finding logic in __find_info could be moved to a utility function, as it might be useful in other contexts.
+def find_file_by_pattern(directory: Path, pattern: str) -> Path:
+    """Find a file matching the given pattern in the directory.
+
+    Args:
+        directory: Directory to search in
+        pattern: File pattern to match
+
+    Returns:
+        Path to the matched file
+
+    Raises:
+        ValueError: If no file or multiple files are found
+    """
+    files = list(Path(directory).rglob(pattern))
+    if not files:
+        raise ValueError(f"Could not find file matching pattern {pattern} in {directory}")
+    if len(files) > 1:
+        raise ValueError(f"Found multiple files matching pattern {pattern} in {directory}: {files}")
+    return files[0]
+
 def __find_info(directory, n):
-    files = list(Path(directory).rglob(f"{n}_ms_info.parquet"))
-    if not files:
-        raise ValueError(f"Could not find {n} info file in {directory}")
-    if len(files) > 1:
-        raise ValueError(f"Found multiple {n} info files in {directory}: {files}")
-    return files[0]
+    return find_file_by_pattern(directory, f"{n}_ms_info.parquet")
Also applies to: 1031-1032
Line range hint 1024-1032: Add input validation for directory parameter
The __find_info function should validate the input directory parameter.
 def __find_info(directory, n):
+    if not isinstance(directory, (str, Path)):
+        raise TypeError(f"directory must be a string or Path, got {type(directory)}")
+    directory = Path(directory)
+    if not directory.exists():
+        raise ValueError(f"Directory {directory} does not exist")
     files = list(Path(directory).rglob(f"{n}_ms_info.parquet"))
     if not files:
         raise ValueError(f"Could not find {n} info file in {directory}")
     if len(files) > 1:
         raise ValueError(f"Found multiple {n} info files in {directory}: {files}")
     return files[0]
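The two nitpicks above (a reusable finder plus input validation) can be combined and tried out against a temporary directory. This is a sketch under assumptions: it uses `is_dir()` rather than the suggested `exists()`, which is a slightly stricter check, and the file name `sample1_ms_info.parquet` is made up for the demonstration.

```python
import tempfile
from pathlib import Path


def find_file_by_pattern(directory, pattern: str) -> Path:
    """Return the single file under `directory` matching `pattern` (recursive)."""
    if not isinstance(directory, (str, Path)):
        raise TypeError(f"directory must be a string or Path, got {type(directory)}")
    directory = Path(directory)
    if not directory.is_dir():
        raise ValueError(f"Directory {directory} does not exist")
    files = list(directory.rglob(pattern))
    if not files:
        raise ValueError(f"Could not find file matching pattern {pattern} in {directory}")
    if len(files) > 1:
        raise ValueError(
            f"Found multiple files matching pattern {pattern} in {directory}: {files}"
        )
    return files[0]


# Demonstration on a throwaway directory tree with one nested, empty file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "nested" / "sample1_ms_info.parquet").touch()
    found_name = find_file_by_pattern(root, "sample1_ms_info.parquet").name
```

Raising on multiple matches, as the original `__find_info` does, keeps an ambiguous directory layout from silently selecting the wrong file.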
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
quantmsutils/__init__.py (1 hunks)
quantmsutils/diann/diann2mztab.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- quantmsutils/__init__.py
🔇 Additional comments (1)
quantmsutils/diann/diann2mztab.py (1)
182-192
: LGTM! Well-structured conditional block
The mzTab conversion is properly encapsulated within the conditional block, allowing users to skip this conversion for better performance with large datasets.
diann_directory.convert_to_mztab(
report=report,
f_table=f_table,
charge=charge,
missed_cleavages=missed_cleavages,
dia_params=dia_params,
out=mztab_out,
🛠️ Refactor suggestion
Improve error handling in convert_to_mztab
The function should handle potential errors during file operations and data processing.
- diann_directory.convert_to_mztab(
- report=report,
- f_table=f_table,
- charge=charge,
- missed_cleavages=missed_cleavages,
- dia_params=dia_params,
- out=mztab_out,
+ try:
+ diann_directory.convert_to_mztab(
+ report=report,
+ f_table=f_table,
+ charge=charge,
+ missed_cleavages=missed_cleavages,
+ dia_params=dia_params,
+ out=mztab_out,
+ )
+ except Exception as e:
+ logger.error(f"Failed to convert to mzTab format: {str(e)}")
+ raise
Committable suggestion skipped: line range outside the PR's diff.
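The log-and-re-raise pattern from this suggestion can be sketched independently of the real conversion code. The wrapper `run_conversion`, the logger name, and `failing_convert` are hypothetical names used only for illustration; the sketch uses `logger.exception`, which records the traceback as well as the message, in place of the suggested `logger.error`.

```python
import logging

logger = logging.getLogger("diann2mztab_sketch")  # hypothetical logger name


def run_conversion(convert, **kwargs):
    """Call `convert`, log any failure with its traceback, then re-raise it."""
    try:
        return convert(**kwargs)
    except Exception:
        # The caller still sees the original exception; the log keeps the context.
        logger.exception("Failed to convert to mzTab format")
        raise


def failing_convert(out):
    """Hypothetical stand-in that simulates a write failure."""
    raise OSError(f"cannot write {out}")


try:
    run_conversion(failing_convert, out="result.mztab")
except OSError as err:
    caught_message = str(err)
```

Re-raising rather than swallowing the exception means the command still exits with an error, so a pipeline step that depends on the mzTab output does not continue with a missing file.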
Summary by CodeRabbit
New Features
Added the --enable_diann2mztab command-line option for more flexible mzTab conversion control.
Documentation
Updated the documentation of the diann2mztab function to include the new parameter.