Improved captcha handling #43

Merged
merged 6 commits into from
Apr 5, 2021


johndoe-dev00
Contributor

I had some trouble with the login process and the captcha.

  • I got more consistent results by loading the login page first.
  • uBlock needs to be disabled during the first login; otherwise the captcha won't load correctly.
  • Added a command-line option to disable uBlock.
  • Added cookie-banner detection and confirmation.
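
A minimal sketch of how such a command-line option might be wired up with argparse. Only the flag name --no-ublock comes from this thread (it is named further down); the parser description, help text, and the helper should_load_ublock are hypothetical and not taken from the PR's actual code.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the --no-ublock flag name appears in this PR.
    parser = argparse.ArgumentParser(description="scraper options (illustrative)")
    parser.add_argument(
        "--no-ublock",
        action="store_true",
        default=False,
        help="start the browser without the uBlock extension",
    )
    return parser


def should_load_ublock(argv=None):
    # argparse exposes --no-ublock as the attribute args.no_ublock.
    args = build_parser().parse_args(argv)
    return not args.no_ublock
```

With this shape, uBlock stays enabled by default and is skipped only when the switch is passed explicitly, which matches the troubleshooting use described later in the thread.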

@rocketinventor
Contributor

@johndoe-dev00 In your testing (after the changes you made), did you find that the captcha still showed up? If so, did the page actually go away after you solved the captcha?

Also, why did you set the maximum time to solve the captcha to one minute? Is there a specific reason it can't be longer?

@johndoe-dev00
Contributor Author

@rocketinventor
My changes do not prevent the captcha from showing up.
At the beginning of my testing, the captcha showed up frequently. After a while it became less frequent.
Currently it does not show up at all anymore, even after deleting the cookie file.
Maybe Cloudflare whitelisted my IP or something.
When the captcha does show up, you need to solve it manually.
After solving it, you are redirected back to Blinkist and the scraper continues its work (once the Blinkist logo is detected).
As posted by albert in #42, the captcha fails to load correctly and you cannot proceed if uBlock is enabled.
Hence the new command-line switch '--no-ublock'.

Why a 60-second wait time?
60 seconds should be plenty to solve the captcha.
If someone is not watching the CLI output, I don't want them to wait 10 minutes before timing out.
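
The wait described above (continue once the Blinkist logo is detected, give up after 60 seconds) could be sketched as a simple polling loop. This is an illustration under assumptions, not the PR's actual implementation: the name wait_until, its parameters, and the half-second poll interval are hypothetical.

```python
import time


def wait_until(condition, timeout=60.0, poll_interval=0.5):
    """Poll condition() until it is truthy or `timeout` seconds have elapsed.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# In the scraper, `condition` would be a callable that checks whether the
# Blinkist logo is present on the current page. How that check is written
# (which element or selector identifies the logo) is not stated in this thread.
```

Making the timeout a parameter would let the 60-second default be raised for users who need longer, without changing the loop itself.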

@rocketinventor
Contributor

rocketinventor commented Mar 29, 2021

If the only reason that uBlock needs to be disabled is to solve the captcha, then you can easily add it to the whitelist (the captcha was being intentionally blocked before):

At the bottom of the bin/ublock/ublock-settings.txt file, there should be a line like this:
www.blinkist.com hcaptcha.com * block

Change it to look like this:
www.blinkist.com hcaptcha.com * allow
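
For anyone who prefers to script that change, here is a small sketch that flips the rule in the settings text. The two rule strings are the ones quoted above; the function name allow_hcaptcha is hypothetical, and the sketch assumes the rule sits on a line of its own.

```python
def allow_hcaptcha(settings_text):
    """Return settings_text with the hCaptcha rule switched from block to allow."""
    blocked = "www.blinkist.com hcaptcha.com * block"
    allowed = "www.blinkist.com hcaptcha.com * allow"
    lines = [
        allowed if line.strip() == blocked else line
        for line in settings_text.splitlines()
    ]
    result = "\n".join(lines)
    # splitlines() drops a trailing newline, so restore it if it was there.
    if settings_text.endswith("\n"):
        result += "\n"
    return result


# Possible usage, with the path mentioned above:
#   from pathlib import Path
#   path = Path("bin/ublock/ublock-settings.txt")
#   path.write_text(allow_hcaptcha(path.read_text()))
```

Text without the rule is returned unchanged, so running it twice is harmless.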

@johndoe-dev00
Copy link
Contributor Author

@rocketinventor I changed ublock-settings.txt to allow hcaptcha.com. It seems to work quite well.
I still kept the CLI switch --no-ublock in place, as I find it quite useful for troubleshooting.

FYI: Switching between 'from seleniumwire import webdriver' and 'from selenium import webdriver' (with the latter, the book audio scrape does not work) seems to trigger the captchas. Convenient for testing :)

@leoncvlt leoncvlt merged commit 9febcab into leoncvlt:master Apr 5, 2021
@johndoe-dev00 johndoe-dev00 deleted the captcha-handling branch April 7, 2021 13:44
3 participants