feat: Add LogisticalRegressionPredictor - rendering type predictor for adaptive crawling #930

Merged
Pijukatel merged 16 commits into master from logistical-regression-predictor
Jan 29, 2025

Conversation

Pijukatel
Contributor

Description

Add LogisticalRegressionPredictor, which is going to be the default RenderingTypePredictor for AdaptivePlaywrightCrawler.

Issues

Split from: #249

@github-actions github-actions bot added this to the 106th sprint - Tooling team milestone Jan 23, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jan 23, 2025
@Pijukatel Pijukatel added enhancement New feature or request. and removed tested Temporary label used only programatically for some analytics. labels Jan 23, 2025
@Pijukatel Pijukatel marked this pull request as ready for review January 23, 2025 08:55
@Pijukatel Pijukatel requested review from janbuchar and vdusek January 23, 2025 08:55
@B4nan
Member

B4nan commented Jan 23, 2025

LogisticalRegressionPredictor

Isn't that a bit of a... weird name? :]

Collaborator

@janbuchar janbuchar left a comment

This seems sound to me; I just have some minor comments. Also, please fix the typos in the PR title 🙂

# Increased detection_probability_recommendation
prediction = predictor.predict(Request.from_url(url='http://www.aaa.com/some/stuffa', label=label))
assert prediction.rendering_type == 'static'
assert prediction.detection_probability_recommendation == detection_ratio * 4
Collaborator

These exact checks are probably too implementation-dependent. Could we just check that the recommendation is, like, higher than detection_ratio * 2 for the first couple of results so that we don't have to update the tests each time we fine-tune some constant by a tiny bit?

Contributor Author

This test just describes how it actually behaves. Is it a requirement or an implementation detail? Impossible to say without actually having the requirements. I think it acts nicely as documentation of the current behavior and is very easy to change if needed.
If we fine-tune some constant and this test fails, it will be very nice feedback showing what those constants actually influence, and we can then make an informed decision about whether we want to change the test or not.

Collaborator

Not a hill I want to die on, but this particular example seems like an overspecified test. Almost to the point where, if it breaks, you just adjust it without a second thought, which undermines the value of such a test.
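
For illustration, the looser check proposed in this thread could look roughly like the sketch below. It only asserts that the recommendation exceeds detection_ratio * 2 instead of pinning the exact multiplier. FakePrediction and assert_recommendation_increased are hypothetical names invented for this example; they are not part of the PR, and the real predictor's result object may differ.

```python
from dataclasses import dataclass


@dataclass
class FakePrediction:
    """Hypothetical stand-in for the predictor's result object (assumption)."""

    rendering_type: str
    detection_probability_recommendation: float


def assert_recommendation_increased(
    prediction: FakePrediction,
    detection_ratio: float,
    min_factor: float = 2.0,
) -> None:
    """Check that the recommendation grew, without pinning the exact constant."""
    assert prediction.detection_probability_recommendation > detection_ratio * min_factor


# Illustrative values only; the real test would obtain the prediction
# from predictor.predict(...) as in the snippet under review.
detection_ratio = 0.1
prediction = FakePrediction(
    rendering_type='static',
    detection_probability_recommendation=detection_ratio * 4,
)

assert prediction.rendering_type == 'static'
assert_recommendation_increased(prediction, detection_ratio)
```

A check like this keeps passing when a constant is tuned slightly, at the cost of no longer documenting the exact current behavior, which is the trade-off debated above.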

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Jan 23, 2025
@Pijukatel Pijukatel changed the title feat: Add LogisticalRegressionPredictor - rendering type predictor for adaptype crawling feat: Add LogisticalRegressionPredictor - rendering type predictor for adaptive crawling Jan 23, 2025
@Pijukatel
Contributor Author

LogisticalRegressionPredictor

I thought about naming my future kid like this. But maybe it is not such a great name after all. It is an implementation of the abstract RenderingTypePredictor; this one will stay internal and will be used only in AdaptivePwCrawler as the default version of the predictor. Maybe I should call it DefaultRenderingTypePredictor?

Collaborator

@vdusek vdusek left a comment

Nice! And thanks for taking this out of the AdaptiveCrawler PR.

@Pijukatel Pijukatel marked this pull request as draft January 23, 2025 15:03
@Pijukatel Pijukatel marked this pull request as ready for review January 24, 2025 08:41
@Pijukatel Pijukatel requested a review from vdusek January 24, 2025 08:42
Comment on lines +9 to +10
from jaro import jaro_winkler_metric
from sklearn.linear_model import LogisticRegression
Collaborator

We should wrap this in a way that the exception tells the user to install the adaptive-playwright extra. But since this is a private subpackage, maybe it can wait until the adaptive playwright functionality is made public.
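
One possible shape for such a wrapper is sketched below. It is a minimal illustration, not what the PR implements: import_optional is a hypothetical helper, and the package name crawlee and the exact extra name in the install hint are assumptions based on this comment.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, extra: str, package: str = 'crawlee') -> ModuleType:
    """Import a module, or raise an ImportError that names the extra to install.

    Hypothetical helper; `package` and `extra` are assumptions for illustration.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Module '{module_name}' is required for this feature. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc


# Works like a normal import when the module is available.
json_module = import_optional('json', extra='adaptive-playwright')

# In the predictor module, the optional dependencies could then be loaded as:
#   jaro = import_optional('jaro', extra='adaptive-playwright')
#   linear_model = import_optional('sklearn.linear_model', extra='adaptive-playwright')
```

Chaining with `from exc` keeps the original traceback, so the underlying import failure stays visible next to the install hint.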

Collaborator

@vdusek vdusek left a comment

Just a few nits, LGTM otherwise.

@Pijukatel Pijukatel merged commit 8440499 into master Jan 29, 2025
23 checks passed
@Pijukatel Pijukatel deleted the logistical-regression-predictor branch January 29, 2025 16:53
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
4 participants