feat: support classification #95
Conversation
…into revision-bg
Reviewer's Guide by Sourcery

This pull request introduces support for classification tasks in the existing machine learning pipeline. It includes significant changes to the model architecture, benchmarking process, and data handling. The changes are implemented across multiple files, with major updates to the core functionality in src/mattext/models/benchmark.py, src/mattext/main.py, src/mattext/models/finetune.py, src/mattext/models/predict.py, and src/mattext/models/score.py. New configuration files and data preparation scripts have also been added to support the classification tasks.

File-Level Changes
Hey @n0w0f - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider refactoring common code patterns across the new classification and benchmark files to reduce duplication and improve maintainability.
- Improve consistency in the use of type hints and docstrings throughout the new code to enhance readability and maintainability.
- Review the configuration files for different models and representations, and consider creating a more modular structure to reduce repetition in the YAML files.
Here's what I looked at during the review
- 🟡 General issues: 4 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 3 issues found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
    return prediction_df, prediction_ids

def evaluate(self, true_labels: List[int]) -> dict:
suggestion: Consider incorporating true labels into _prepare_datasets method
The evaluate method requires true_labels as a separate argument. Consider loading true labels along with the test data in the _prepare_datasets method for better encapsulation and consistency.
def evaluate(self, test_data: pd.DataFrame) -> dict:
    true_labels = test_data['label'].tolist()
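A slightly fuller sketch of that suggestion (a hypothetical shape, not the PR's code; it assumes predict() returns the per-class probability dataframe shown elsewhere in this diff and that the test dataframe carries a 'label' column):

from typing import List

import numpy as np
import pandas as pd


class InferenceClassification:
    # ... existing __init__, _prepare_datasets and predict() ...

    def evaluate(self, test_data: pd.DataFrame) -> dict:
        # Take the labels from the same dataframe the test split was built from,
        # instead of passing them in as a separate argument.
        true_labels: List[int] = test_data["label"].tolist()
        probabilities, _ = self.predict()
        predicted = np.argmax(probabilities.values, axis=1)
        accuracy = float((predicted == np.array(true_labels)).mean())
        return {"accuracy": accuracy}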
@sourcery-ai review
Hey @n0w0f - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider refactoring the new classification classes to reduce code duplication with existing regression classes.
- Evaluate the possibility of separating classification logic more distinctly from the existing benchmark and inference code to improve maintainability.
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 2 issues found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
@@ -22,6 +22,9 @@ def __init__(self):
    def run_task(self, run: list, task_cfg: DictConfig, local_rank=None) -> None:
issue (complexity): Consider refactoring to reduce duplication and improve maintainability.
The recent changes have increased the complexity of the code, primarily due to the added conditional logic and duplication in the run_task method. The introduction of the "classification" condition has increased the number of branches, making the logic harder to follow. Additionally, the run_classification method is very similar to run_benchmarking, leading to duplicated logic that can complicate maintenance. To address these issues, consider refactoring to reduce duplication and improve maintainability. For example, you could use a helper method to consolidate common logic and a dictionary to map task names to methods, simplifying the run_task logic and making it easier to extend or modify. This approach would centralize task execution logic, improving maintainability and reducing complexity.
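A sketch of the suggested dispatch-table approach (class name, handler names, and the config key used to pick the task are illustrative assumptions, not the PR's actual code):

from omegaconf import DictConfig


class TaskRunner:
    def __init__(self):
        # Map task names to handler methods instead of branching in run_task.
        self._handlers = {
            "benchmark": self._run_benchmarking,
            "classification": self._run_classification,
        }

    def run_task(self, run: list, task_cfg: DictConfig, local_rank=None) -> None:
        task = task_cfg.get("task", None)  # hypothetical config key
        handler = self._handlers.get(task)
        if handler is None:
            print(f"Unknown task: {task}")
            return
        handler(run, task_cfg, local_rank)

    def _run_benchmarking(self, run, task_cfg, local_rank):
        self._execute(run, task_cfg, local_rank, task_type="regression")

    def _run_classification(self, run, task_cfg, local_rank):
        self._execute(run, task_cfg, local_rank, task_type="classification")

    def _execute(self, run, task_cfg, local_rank, task_type):
        # The finetune/predict/score steps currently duplicated between
        # run_benchmarking and run_classification would be shared here.
        ...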
src/mattext/models/benchmark.py (outdated)
    name=exp_name,
)
fold_name = fold_key_namer(i)
print("-------------------------")
issue (code-quality): Extract duplicate code into method (extract-duplicate-method)
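A minimal illustration of that extract-duplicate-method refactor, based on the quoted per-fold setup; the helper name, the wandb.init call, and the import path of fold_key_namer are assumptions:

import wandb

# Assumed import path for the project's existing fold-naming helper.
from mattext.models.utils import fold_key_namer


def _init_fold_run(exp_name: str, fold_index: int) -> str:
    """Shared per-fold setup (in the codebase this would likely be a method)."""
    wandb.init(name=exp_name)  # remaining wandb arguments omitted in this sketch
    fold_name = fold_key_namer(fold_index)
    print("-------------------------")
    return fold_name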
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
As you and Sourcery said, there is a lot of duplication, and I'm unsure if we should merge it in the current form.
    return None


def process_entry_test_matbench(entry: List, timeout: int) -> dict:
we should also try to revise functions like this one later - e.g. timeout seems not to be used
batch_iterator = (data[i:i + batch_size] for i in range(0, len(data), batch_size))

for i, batch_data in enumerate(batch_iterator, start=1):
why do you actually start at 1 and not 0?
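For reference, the only observable difference is the value of the batch counter i; a standalone illustration (toy data, not the PR's):

data = list(range(10))
batch_size = 4
batch_iterator = (data[i:i + batch_size] for i in range(0, len(data), batch_size))

# start=1 makes the counter run 1, 2, 3 (convenient for "batch 1 of N" logging);
# with the default start=0 it would run 0, 1, 2. The batch contents are the same.
for i, batch_data in enumerate(batch_iterator, start=1):
    print(f"batch {i}: {batch_data}")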
Is there anything particular you have concerns about, @n0w0f?
I wanted to just confirm that the metrics are correctly computed.
@kjappelbaum I agree, we need not merge this. I did not want to break the codebase, hence the duplications; I can clean up after the sprint.
I'd also consider moving to something like unsloth for fine-tuning, as it is much faster.
Another point we had mentioned on Zulip was to add
@sourcery-ai review
Hey @n0w0f - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 4 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
print(
    f"Error occurred during inference for finetuned checkpoint '{exp_name}': {str(e)}"
)
if isinstance(e, (ValueError, TypeError)):
suggestion: Improve granularity of exception handling
Consider handling more specific exceptions and providing more detailed error messages or logging. This will make debugging easier and provide more informative feedback.
if isinstance(e, (ValueError, TypeError, RuntimeError, AttributeError)):
    logger.error(f"Critical error during inference: {e}")
    raise
elif isinstance(e, (FileNotFoundError, IOError)):
    logger.warning(f"File-related error during inference: {e}")
else:
    logger.exception("Unexpected error during inference")
Could be a different PR; the logger is not used yet.
That's a valid point. If the logger isn't currently being used in the codebase, introducing it might be better suited for a separate PR to keep changes focused and manageable. For now, you could consider adding more specific exception handling without the logger, and then plan to integrate logging in a future update. This way, you can still improve the granularity of exception handling in this PR without introducing new dependencies. Let me know if you need any help with that!
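A sketch of what more specific handling without a logger could look like (the wrapper name is hypothetical; it keeps the PR's print-based reporting):

import traceback


def run_inference_safely(run_fn, exp_name: str):
    # run_fn: a zero-argument callable that performs inference for one checkpoint.
    try:
        return run_fn()
    except (ValueError, TypeError, RuntimeError) as e:
        # Problems with the checkpoint or its inputs: report and fail loudly.
        print(f"Critical error during inference for '{exp_name}': {e}")
        raise
    except (FileNotFoundError, OSError) as e:
        # Missing or unreadable files: report and skip this checkpoint.
        print(f"File-related error during inference for '{exp_name}': {e}")
        return None
    except Exception as e:
        # Anything else: keep the traceback around for debugging.
        traceback.print_exc()
        print(f"Unexpected error during inference for '{exp_name}': {e}")
        return None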
else:
    print(f"Unknown task: {task}")


def _run_experiment(
suggestion: Break down _run_experiment into smaller methods
The _run_experiment method is quite long and handles multiple responsibilities. Consider breaking it down into smaller, more focused methods to improve readability and maintainability.

def _run_experiment(self, task_cfg: DictConfig):
    return self._execute_experiment_steps(task_cfg)

def _execute_experiment_steps(self, task_cfg: DictConfig):
    # Move the existing content of _run_experiment here
    # and break it down into smaller methods as needed
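For concreteness, one possible split into smaller steps (step names are hypothetical; the real contents of _run_experiment are not shown in this diff):

from omegaconf import DictConfig


def _run_experiment(self, task_cfg: DictConfig, local_rank=None) -> dict:
    # Each step below would absorb one responsibility of the current method.
    train_ds, test_ds = self._prepare_experiment_data(task_cfg)
    checkpoint = self._finetune_model(task_cfg, train_ds, local_rank)
    predictions = self._run_inference(task_cfg, checkpoint, test_ds)
    return self._score_predictions(task_cfg, predictions)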
    probabilities, columns=[f"class_{i}" for i in range(self.num_labels)]
)


def evaluate(self, true_labels: List[int]) -> dict:
suggestion: Break down evaluate method into smaller functions
The evaluate method in InferenceClassification is quite long and performs multiple operations. Consider breaking it down into smaller, more focused methods for each evaluation metric or step.

def evaluate(self, true_labels: List[int]) -> dict:
    predictions = self._get_predictions()
    metrics = self._calculate_metrics(predictions, true_labels)
    return metrics

def _get_predictions(self) -> np.ndarray:
    predictions, _ = self.predict()
    return np.argmax(predictions.values, axis=1)

def _calculate_metrics(self, predictions: np.ndarray, true_labels: List[int]) -> dict:
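A possible body for the truncated _calculate_metrics helper (a hedged sketch using scikit-learn; the PR may compute a different metric set):

from typing import List

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def _calculate_metrics(self, predictions: np.ndarray, true_labels: List[int]) -> dict:
    # Standard multi-class metrics from hard predictions; probability-based
    # metrics such as ROC-AUC would need the raw class probabilities instead.
    return {
        "accuracy": accuracy_score(true_labels, predictions),
        "precision_macro": precision_score(true_labels, predictions, average="macro", zero_division=0),
        "recall_macro": recall_score(true_labels, predictions, average="macro", zero_division=0),
        "f1_macro": f1_score(true_labels, predictions, average="macro"),
    }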
Refactored to incorporate the comments. Can you @sourcery-ai review?
There is a lot of duplication, kept for now so as to avoid any breakage at this moment. We might have to refactor later. We could also choose not to merge it, but it would be good to review the code.
Summary by Sourcery
Add support for classification tasks by introducing new classes and methods, refactor benchmarking and task execution logic for better modularity, and enhance model fine-tuning and inference processes with abstract base classes. Include new configuration files and scripts for data preparation and model setup.
New Features:
Enhancements:
Documentation:
Chores: