Autocrawl is a web crawler that checks the status of URLs on a domain. It streams the results in real-time and supports depth-limited crawling and concurrency management. Follow the steps below to install and run the app locally or using Docker.
You can run Autocrawl either using Docker or locally. Choose the method that best suits your needs.
- Pull the Docker image:

  ```bash
  docker pull vande012/autocrawl:latest
  ```

- Run the container:

  ```bash
  docker run -p 3000:3000 vande012/autocrawl:latest
  ```

- Open your browser and go to `http://localhost:3000` to use the app.
To install and run the app locally on Mac or PC, follow these steps:
Ensure you have the following installed:
- Node.js (version 14 or higher)
- npm (comes with Node.js) or an alternative package manager like `yarn`, `pnpm`, or `bun`
- Git
- Clone the repository: open a terminal (Mac/Linux) or command prompt (Windows) and run:

  ```bash
  git clone https://github.com/vande012/autocrawl.git
  ```

- Navigate into the project directory:

  ```bash
  cd autocrawl
  ```
- Install dependencies using your preferred package manager:

  ```bash
  npm install
  ```

  Or if using an alternative package manager:

  ```bash
  yarn install
  pnpm install
  bun install
  ```
- Run the development server:

  ```bash
  npm run dev
  ```

  Or with an alternative package manager:

  ```bash
  yarn dev
  pnpm dev
  bun dev
  ```
- Open the app in your browser: once the server is running, go to `http://localhost:3000`.

You should now see the app running!

You can begin editing the app by modifying `app/page.tsx`; the changes auto-update as you save.
To learn more about Next.js, take a look at the following resources:
- [Next.js Documentation](https://nextjs.org/docs) – learn about Next.js features and API.
- [Learn Next.js](https://nextjs.org/learn) – an interactive Next.js tutorial.
To rebuild and redeploy the Docker image after making changes:

- Save and commit changes: ensure all your code changes are saved and committed to your project.

- Navigate to the project directory: open a terminal and `cd` to your project's root directory.

- Build a new Docker image:

  ```bash
  docker build -t vande012/autocrawl:latest .
  ```

  This builds a new image tagged as `latest`.
- Push to Docker Hub (optional, but recommended for distribution):

  ```bash
  docker push vande012/autocrawl:latest
  ```

- Run a new container:

  ```bash
  docker run -p 3000:3000 vande012/autocrawl:latest
  ```

  This runs a container from your new image, mapping port 3000.
- Verify: open a web browser and go to `http://localhost:3000` to check that the app is running with your updates.

- Cleanup (optional): to remove old containers and images:

  ```bash
  docker container prune   # Removes stopped containers
  docker image prune       # Removes unused images
  ```
Remember to update the README or documentation if there are any changes in functionality or usage.
The API route implements the web-facing side of the crawler using Next.js API routes. It provides a streaming endpoint that crawls web pages, checks their status, and optionally looks for missing alt text on images or searches for specific terms.
- Accepts parameters: `url`, `checkAltText`, and `searchTerm`
- Creates a `ReadableStream` to stream results back to the client
- Utilizes the `Crawler` class to perform the crawling operation
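For orientation, here is a minimal sketch of what a streaming route handler along these lines can look like. The file path (`app/api/crawl/route.ts`), the use of query parameters, and the `Crawler` constructor and callback signature shown here are assumptions for illustration, not the project's exact code.

```typescript
// app/api/crawl/route.ts (hypothetical path) -- minimal sketch of a streaming route handler
import { NextRequest } from "next/server";
import { Crawler } from "./crawler"; // assumed location and API of the Crawler class

export async function GET(request: NextRequest) {
  const { searchParams } = new URL(request.url);
  const url = searchParams.get("url");
  const checkAltText = searchParams.get("checkAltText") === "true";
  const searchTerm = searchParams.get("searchTerm") ?? undefined;

  if (!url) {
    return new Response("Missing 'url' parameter", { status: 400 });
  }

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      // The crawler reports progress through a callback; each update is
      // serialized as one JSON line and pushed to the client immediately.
      const crawler = new Crawler(url, { checkAltText, searchTerm });
      await crawler.crawl((update) => {
        controller.enqueue(encoder.encode(JSON.stringify(update) + "\n"));
      });
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "application/x-ndjson" },
  });
}
```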
The `Crawler` class is the heart of the crawling functionality, responsible for:
- Fetching and parsing robots.txt
- Managing a queue of URLs to crawl
- Concurrent crawling of pages
- Checking page status, alt text, and search terms
- Sending updates on crawl progress
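The skeleton below sketches how those responsibilities can fit together in a class. The method names, options object, and update shape are illustrative assumptions rather than the actual implementation; robots.txt parsing and link extraction are only indicated in comments.

```typescript
// Illustrative skeleton of a Crawler class -- names and structure are assumptions
import pLimit from "p-limit";

type CrawlUpdate = { url: string; status: number; depth: number };

export class Crawler {
  private queue: { url: string; depth: number }[] = [];
  private visited = new Set<string>();
  private limit = pLimit(10); // concurrency cap

  constructor(
    private startUrl: string,
    private options: { checkAltText?: boolean; searchTerm?: string } = {}
  ) {}

  async crawl(onUpdate: (update: CrawlUpdate) => void): Promise<void> {
    // 1. Fetch and parse robots.txt for the start URL's origin (elided here).
    // 2. Seed the queue with the start URL at depth 0.
    this.queue.push({ url: this.startUrl, depth: 0 });

    while (this.queue.length > 0) {
      const batch = this.queue.splice(0, this.queue.length);
      await Promise.all(
        batch.map((item) =>
          this.limit(async () => {
            if (this.visited.has(item.url)) return;
            this.visited.add(item.url);
            // Fetch the page, record its status, optionally check alt text and
            // search terms, then extract same-domain links and enqueue them at
            // depth + 1 as long as the maximum depth has not been reached.
            const res = await fetch(item.url);
            onUpdate({ url: item.url, status: res.status, depth: item.depth });
          })
        )
      );
    }
  }
}
```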
- Streaming Results: Uses `ReadableStream` to send results in real time.
- Concurrent Requests: Utilizes `pLimit` to manage concurrent requests (default: 10).
- Depth-Limited Crawling: Implements a maximum depth (default: 5) to prevent infinite crawling.
- Robots.txt Compliance: Fetches and respects robots.txt rules.
- URL Normalization: Normalizes URLs to prevent duplicate crawling.
- Batched Updates: Buffers updates and sends them in batches to reduce overhead.
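To make the streaming and batching behaviour concrete, here is one way a browser client could consume such an endpoint; the endpoint path and the newline-delimited JSON format are assumptions for the sake of the example.

```typescript
// Hypothetical client-side consumer of the streaming crawl endpoint
async function consumeCrawl(target: string): Promise<void> {
  const res = await fetch(`/api/crawl?url=${encodeURIComponent(target)}`);
  if (!res.body) throw new Error("No response body");

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffered = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });

    // Each update arrives as one JSON line; process complete lines only
    // and keep any trailing partial line in the buffer.
    const lines = buffered.split("\n");
    buffered = lines.pop() ?? "";
    for (const line of lines) {
      if (line.trim()) console.log("crawl update:", JSON.parse(line));
    }
  }
}
```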
- Concurrent Requests: Managed by `pLimit`, set to 10 by default. Adjust `CONCURRENT_REQUESTS` based on server capacity.
- Robots.txt Compliance: Helps avoid overloading servers and respects site owners' wishes.
- URL Filtering: Implements `isValidUrl` to quickly filter out unnecessary URLs.
- Depth Limiting: Prevents excessive crawling with `MAX_DEPTH`.
- Batched Updates: Reduces the number of messages sent to the client.
- URL Normalization: Prevents recrawling of the same page with slight URL differences.
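As an illustration of the filtering and normalization steps, helpers along these lines would do the job; the exact rules the project applies may differ.

```typescript
// Illustrative helpers -- the project's actual normalization and filtering rules may differ
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = "";                                // drop fragments (#section)
  url.hostname = url.hostname.toLowerCase();    // hostnames are case-insensitive
  // Treat trailing-slash variants of the same path as one page.
  if (url.pathname.endsWith("/") && url.pathname !== "/") {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

function isValidUrl(raw: string, baseHostname: string): boolean {
  try {
    const url = new URL(raw);
    const isHttp = url.protocol === "http:" || url.protocol === "https:";
    const sameDomain = url.hostname === baseHostname;
    const isAsset = /\.(png|jpe?g|gif|svg|css|js|pdf|zip)$/i.test(url.pathname);
    return isHttp && sameDomain && !isAsset;
  } catch {
    return false; // not a parseable URL
  }
}
```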
Key constants that can be adjusted:
- `DEFAULT_USER_AGENT`: The user agent string used for requests.
- `CONCURRENT_REQUESTS`: Number of concurrent requests allowed.
- `MAX_DEPTH`: Maximum depth for crawling.
- `BATCH_SIZE`: Number of updates to buffer before sending.
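For reference, the constants could be declared along these lines; the concurrency and depth values come from the defaults stated above, while the user agent string and batch size shown here are placeholders.

```typescript
// Tunable crawler constants -- the user agent and batch size shown are placeholders
const DEFAULT_USER_AGENT = "AutocrawlBot/1.0 (+https://github.com/vande012/autocrawl)";
const CONCURRENT_REQUESTS = 10; // default stated above; adjust to match server capacity
const MAX_DEPTH = 5;            // default stated above; deeper crawls cover more pages but take longer
const BATCH_SIZE = 20;          // placeholder: number of updates buffered before a batch is flushed
```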