Enhancing google-maps-scraper: Stopping, Resuming, and File Size #43
Replies: 5 comments
-
Brilliant!
Your Thoughts?
-
Thank you for your helpful reply. I realized a better approach is to run the scraping code in a Docker container, as you mentioned, and let it run to completion without worrying about high CPU usage on my PC. Instead of stopping and restarting the container, I can just leave it running continuously and check on progress with "docker logs -f google-maps-scraper_bot-1_1". As for the large JSON files, the size isn't caused by too many fields; I'm simply scraping a lot of data. I deal with the large files by splitting the JSON in Notepad++ before loading it into other programs.

Overall, running everything in Docker containers lets me scrape without concerns about resource usage: the scraping can run non-stop while I use Docker commands to monitor the logs and split the output files afterwards. Your advice helped me find a better way to structure my scraping using Docker.
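For anyone else dealing with the same thing, a small Python script can do the splitting as well. This is only a rough sketch that assumes all.json holds a single top-level JSON array; the chunk size and output file names are just examples, not something the scraper produces itself:

```python
# Rough sketch: split a large all.json (assumed to be one top-level JSON
# array) into smaller files that are easier to open in other programs.
import json

CHUNK_SIZE = 5000  # records per output file; adjust as needed

with open("all.json", "r", encoding="utf-8") as f:
    records = json.load(f)

for i in range(0, len(records), CHUNK_SIZE):
    part = records[i:i + CHUNK_SIZE]
    out_name = f"all_part_{i // CHUNK_SIZE + 1}.json"
    with open(out_name, "w", encoding="utf-8") as out:
        json.dump(part, out, ensure_ascii=False, indent=2)
    print(f"wrote {out_name} ({len(part)} records)")
```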
-
Great,
-
Totally feel you on the hassle of restarting the google-maps-scraper every time the command window closes. A "pause and resume" function would be a lifesaver, especially to avoid redoing work or messing things up.

For managing those large scrapes without overwriting or losing data, maybe giving Crawlbase a shot could help, especially with its smart handling of requests and data management. It could smooth out some of those bumps, like managing where you left off or dealing with large datasets.

Hoping to see these tweaks in the future. It'd make the scraper way more flexible and user-friendly. Cheers to making cool tools even cooler!
-
Thanks for the suggestion.
Beta Was this translation helpful? Give feedback.
-
Dear Omkar,
I wanted to provide some feedback on the google-maps-scraper project. When the scraping process is stopped by closing the command window, the scraper starts over from the beginning when run again. It would be helpful to have a "stop and resume later" function, so it can pick up where it left off instead of starting over.
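To illustrate the idea, here is a rough sketch of how a resume checkpoint could work, assuming the scraper works through a list of queries. The function names and file names below are hypothetical placeholders, not the scraper's actual API:

```python
# Hypothetical "stop and resume later" checkpoint: finished queries are
# recorded in a small file and skipped on the next run. scrape_query() is
# only a stand-in for the real scraping step.
import json
import os

CHECKPOINT_FILE = "checkpoint.json"

def load_done_queries():
    # Queries finished in earlier runs, or an empty set on the first run.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "r", encoding="utf-8") as f:
            return set(json.load(f))
    return set()

def mark_query_done(done, query):
    done.add(query)
    with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)

def scrape_query(query):
    # Placeholder: the real scraper would fetch Google Maps results here.
    return [{"query": query, "name": "example place"}]

def run(queries):
    done = load_done_queries()
    for query in queries:
        if query in done:
            continue  # already finished before the interruption
        results = scrape_query(query)
        print(f"{query}: {len(results)} results")  # real code would save them
        mark_query_done(done, query)

if __name__ == "__main__":
    run(["restaurants in delhi", "cafes in mumbai"])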
Right now, I handle this by manually removing the already-scraped data from the config.py file before resuming. However, this doesn't help with the all.json file, which overwrites all of the previously scraped data on the next run. It would be great if there were a way to resume scraping without overwriting the all.json file.
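For the all.json problem, something along these lines could merge new results into the existing file instead of replacing it. Again just a sketch that assumes all.json contains a single JSON array; deduplicating on a "link" field is my assumption, and any stable field would do:

```python
# Sketch: append new records to all.json instead of overwriting it.
import json
import os

def merge_into_all_json(new_records, path="all.json", key="link"):
    existing = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = json.load(f)
    seen = {record.get(key) for record in existing}
    for record in new_records:
        if record.get(key) not in seen:  # skip places scraped in earlier runs
            existing.append(record)
            seen.add(record.get(key))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(existing, f, ensure_ascii=False, indent=2)
```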
Also, when running the scraper for a long time, the all.json file can get very large (e.g. over 100MB). This makes the file difficult to use in some situations. Adding a way to export the data in chunks (e.g. 30MB or 50MB) or letting the user configure the chunk size would help with this issue.
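As a rough illustration of the chunked export, the writer could start a new file once the current one would pass a configurable size limit. This is only a sketch; the 50MB default and the all_1.json, all_2.json naming are examples, not anything the scraper does today:

```python
# Sketch: write records into size-capped files (all_1.json, all_2.json, ...),
# starting a new file once the configured byte limit would be exceeded.
# Size accounting is approximate but close enough for chunking.
import json

def write_in_chunks(records, prefix="all", max_bytes=50 * 1024 * 1024):
    chunk, chunk_bytes, index = [], 2, 1  # 2 bytes for the surrounding "[]"
    for record in records:
        size = len(json.dumps(record, ensure_ascii=False).encode("utf-8")) + 2
        if chunk and chunk_bytes + size > max_bytes:
            _flush(chunk, f"{prefix}_{index}.json")
            chunk, chunk_bytes, index = [], 2, index + 1
        chunk.append(record)
        chunk_bytes += size
    if chunk:
        _flush(chunk, f"{prefix}_{index}.json")

def _flush(chunk, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(chunk, f, ensure_ascii=False)
```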
I hope these suggestions are helpful for improving the scraper's functionality! Please let me know if any part of this feedback needs clarification or expansion. I'm happy to provide more details on my use cases and pain points. Looking forward to seeing how this great project evolves over time.
Best regards,