Enhancing google-maps-scraper: Stopping, Resuming, and File Size #43
Replies: 5 comments
-
Brilliant!
Your Thoughts?
-
Thank you for your helpful reply. I realized a better approach is to run the scraping code in a Docker container, as you mentioned, and let it run to completion without worrying about high CPU usage on my PC. Instead of stopping and restarting the container, I can just leave it running continuously and check on progress with "docker logs -f google-maps-scraper_bot-1_1". As for the large JSON files, the size isn't caused by too many fields; I'm simply scraping a lot of data. I deal with the large files by splitting the JSON in Notepad++ before loading it into other programs.

Overall, running everything in Docker containers lets me scrape without concerns about resource usage: the scraping can run non-stop while I use Docker commands to monitor the logs and split the output files afterwards. Your advice helped me find a better way to structure my scraping using Docker.
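For anyone else dealing with the same thing, a small Python script can do the splitting as well. This is only a rough sketch that assumes all.json holds a single top-level JSON array; the chunk size and output file names are just examples, not something the scraper produces itself:

```python
# Rough sketch: split a large all.json (assumed to be one top-level JSON
# array) into smaller files that are easier to open in other programs.
import json

CHUNK_SIZE = 5000  # records per output file; adjust as needed

with open("all.json", "r", encoding="utf-8") as f:
    records = json.load(f)

for i in range(0, len(records), CHUNK_SIZE):
    part = records[i:i + CHUNK_SIZE]
    out_name = f"all_part_{i // CHUNK_SIZE + 1}.json"
    with open(out_name, "w", encoding="utf-8") as out:
        json.dump(part, out, ensure_ascii=False, indent=2)
    print(f"wrote {out_name} ({len(part)} records)")
```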
-
Great,
-
Totally feel you on the hassle of restarting the google-maps-scraper every time the command window closes. A "pause and resume" function would be a lifesaver, especially to avoid redoing work or messing things up.

For managing those large scrapes without overwriting or losing data, maybe giving Crawlbase a shot could help, especially with its smart handling of requests and data management. It could smooth out some of those bumps, like managing where you left off or dealing with large datasets.

Hoping to see these tweaks in the future. It'd make the scraper way more flexible and user-friendly. Cheers to making cool tools even cooler!
-
Thanks for the suggestion.
Beta Was this translation helpful? Give feedback.
-
Dear Omkar,
I wanted to provide some feedback on the google-maps-scraper project. When the scraping process is stopped by closing the command window, the scraper starts over from the beginning when run again. It would be helpful to have a "stop and resume later" function, so it can pick up where it left off instead of starting over.
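To illustrate the idea, here is a rough sketch of how a resume checkpoint could work, assuming the scraper works through a list of queries. The function names and file names below are hypothetical placeholders, not the scraper's actual API:

```python
# Hypothetical "stop and resume later" checkpoint: finished queries are
# recorded in a small file and skipped on the next run. scrape_query() is
# only a stand-in for the real scraping step.
import json
import os

CHECKPOINT_FILE = "checkpoint.json"

def load_done_queries():
    # Queries finished in earlier runs, or an empty set on the first run.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "r", encoding="utf-8") as f:
            return set(json.load(f))
    return set()

def mark_query_done(done, query):
    done.add(query)
    with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)

def scrape_query(query):
    # Placeholder: the real scraper would fetch Google Maps results here.
    return [{"query": query, "name": "example place"}]

def run(queries):
    done = load_done_queries()
    for query in queries:
        if query in done:
            continue  # already finished before the interruption
        results = scrape_query(query)
        print(f"{query}: {len(results)} results")  # real code would save them
        mark_query_done(done, query)

if __name__ == "__main__":
    run(["restaurants in delhi", "cafes in mumbai"])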
Right now, I handle this by manually removing the already-scraped data from the config.py file before resuming. However, this doesn't help with the all.json file, which overwrites all of the previously scraped data on the next run. It would be great if there were a way to resume scraping without overwriting the all.json file.
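For the all.json problem, something along these lines could merge new results into the existing file instead of replacing it. Again just a sketch that assumes all.json contains a single JSON array; deduplicating on a "link" field is my assumption, and any stable field would do:

```python
# Sketch: append new records to all.json instead of overwriting it.
import json
import os

def merge_into_all_json(new_records, path="all.json", key="link"):
    existing = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = json.load(f)
    seen = {record.get(key) for record in existing}
    for record in new_records:
        if record.get(key) not in seen:  # skip places scraped in earlier runs
            existing.append(record)
            seen.add(record.get(key))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(existing, f, ensure_ascii=False, indent=2)
```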
Also, when running the scraper for a long time, the all.json file can get very large (e.g. over 100MB). This makes the file difficult to use in some situations. Adding a way to export the data in chunks (e.g. 30MB or 50MB) or letting the user configure the chunk size would help with this issue.
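As a rough illustration of the chunked export, the writer could start a new file once the current one would pass a configurable size limit. This is only a sketch; the 50MB default and the all_1.json, all_2.json naming are examples, not anything the scraper does today:

```python
# Sketch: write records into size-capped files (all_1.json, all_2.json, ...),
# starting a new file once the configured byte limit would be exceeded.
# Size accounting is approximate but close enough for chunking.
import json

def write_in_chunks(records, prefix="all", max_bytes=50 * 1024 * 1024):
    chunk, chunk_bytes, index = [], 2, 1  # 2 bytes for the surrounding "[]"
    for record in records:
        size = len(json.dumps(record, ensure_ascii=False).encode("utf-8")) + 2
        if chunk and chunk_bytes + size > max_bytes:
            _flush(chunk, f"{prefix}_{index}.json")
            chunk, chunk_bytes, index = [], 2, index + 1
        chunk.append(record)
        chunk_bytes += size
    if chunk:
        _flush(chunk, f"{prefix}_{index}.json")

def _flush(chunk, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(chunk, f, ensure_ascii=False)
```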
I hope these suggestions are helpful for improving the scraper's functionality! Please let me know if any part of this feedback needs clarification or expansion. I'm happy to provide more details on my use cases and pain points. Looking forward to seeing how this great project evolves over time.
Best regards,