Autocrawl

Autocrawl is a web crawler that checks the status of URLs on a domain. It streams results in real time and supports depth-limited crawling and concurrency management. Follow the steps below to install and run the app with Docker or locally.

Getting Started

You can run Autocrawl either using Docker or locally. Choose the method that best suits your needs.

Option 1: Running with Docker

  1. Pull the Docker image:

    docker pull vande012/autocrawl:latest
  2. Run the container:

    docker run -p 3000:3000 vande012/autocrawl:latest
  3. Open your browser and go to http://localhost:3000 to use the app.

Option 2: Running Locally

To install and run the app locally on Mac or PC, follow these steps:

Prerequisites

Ensure you have the following installed:

  • Node.js (version 14 or higher)
  • npm (comes with Node.js) or an alternative package manager like yarn, pnpm, or bun
  • Git

Installation Steps

  1. Clone the repository: Open a terminal (Mac/Linux) or command prompt (Windows), and clone the repository:

    git clone https://github.com/vande012/autocrawl.git
  2. Navigate into the project directory:

    cd autocrawl
  3. Install dependencies using your preferred package manager:

    npm install

    Or if using an alternative package manager:

    yarn install
    pnpm install
    bun install
  4. Run the development server:

    npm run dev

    Or with an alternative package manager:

    yarn dev
    pnpm dev
    bun dev
  5. Open the app in your browser: Once the server is running, go to:

    http://localhost:3000
    

You should now see your app running!

Start Editing

You can begin editing the app by modifying the file app/page.tsx. The page auto-updates as you save changes.
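
For orientation, a minimal page component is sketched below. This is an illustrative placeholder only; the actual contents of app/page.tsx (the crawler UI) will differ.

    // app/page.tsx - illustrative placeholder; the real file renders the crawler UI.
    export default function Home() {
      return (
        <main>
          <h1>Autocrawl</h1>
          <p>Enter a URL to start crawling.</p>
        </main>
      );
    }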

Learn More

To learn more about Next.js, take a look at the following resources:

  • Next.js Documentation (https://nextjs.org/docs) - learn about Next.js features and APIs.
  • Learn Next.js (https://nextjs.org/learn) - an interactive Next.js tutorial.

Updating and Running Your Docker Image

  1. Save and Commit Changes: Ensure all your code changes are saved and committed to your project.

  2. Navigate to Project Directory: Open a terminal and cd to your project's root directory.

  3. Build New Docker Image:

    docker build -t vande012/autocrawl:latest .

    This builds a new image tagged as 'latest'.

  4. Push to Docker Hub (optional, but recommended for distribution):

    docker push vande012/autocrawl:latest
  5. Run New Container:

    docker run -p 3000:3000 vande012/autocrawl:latest

    This runs a container from your new image, mapping port 3000.

  6. Verify: Open a web browser and go to http://localhost:3000 to check if your app is running with the updates.

  7. Cleanup (optional): To remove old containers and images:

    docker container prune  # Removes stopped containers
    docker image prune      # Removes unused images

Remember to update the README or documentation if there are any changes in functionality or usage.

route.ts Documentation

Overview

This file implements a web crawler using Next.js API routes. It provides a streaming API endpoint that crawls web pages, checks their status, and optionally looks for missing alt text on images or searches for specific terms.

Key Components

POST Handler

  • Accepts parameters: url, checkAltText, and searchTerm
  • Creates a ReadableStream to stream results back to the client
  • Utilizes the Crawler class to perform the crawling operation
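
A rough sketch of that handler shape is shown below. Only the parameters (url, checkAltText, searchTerm) and the use of ReadableStream come from this document; the message format and response headers are assumptions for illustration.

    // Sketch of a streaming POST handler; details differ from the real route.ts.
    export async function POST(request: Request) {
      const { url, checkAltText, searchTerm } = await request.json();
      const encoder = new TextEncoder();

      const stream = new ReadableStream({
        async start(controller) {
          // Each progress update is written to the stream as one JSON line,
          // so the client can render results as they arrive.
          const send = (update: unknown) =>
            controller.enqueue(encoder.encode(JSON.stringify(update) + "\n"));

          send({ type: "start", url, checkAltText, searchTerm });
          // The real handler presumably hands a callback like `send` to the
          // Crawler class described below, closing the stream when it finishes.
          controller.close();
        },
      });

      return new Response(stream, {
        headers: { "Content-Type": "application/x-ndjson" },
      });
    }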

Crawler Class

The heart of the crawling functionality, responsible for:

  • Fetching and parsing robots.txt
  • Managing a queue of URLs to crawl
  • Concurrent crawling of pages
  • Checking page status, alt text, and search terms
  • Sending updates on crawl progress
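
A skeleton of what such a class might look like is sketched below. Only the class name and its responsibilities come from this document; the field and method names are illustrative.

    // Illustrative skeleton; the real Crawler in route.ts differs in detail.
    class Crawler {
      private queue: { url: string; depth: number }[] = [];
      private visited = new Set<string>();

      constructor(
        private startUrl: string,
        private onUpdate: (update: unknown) => void,
      ) {}

      async crawl(): Promise<void> {
        // 1. Fetch and parse robots.txt for the start domain (omitted here).
        // 2. Seed the queue with the start URL at depth 0.
        this.queue.push({ url: this.startUrl, depth: 0 });

        while (this.queue.length > 0) {
          const { url, depth } = this.queue.shift()!;
          if (this.visited.has(url)) continue;
          this.visited.add(url);

          const res = await fetch(url);
          // 3. Check alt text / search terms in the HTML, report progress,
          //    and enqueue same-domain links at depth + 1, up to the max depth.
          this.onUpdate({ url, status: res.status, depth });
        }
      }
    }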

Key Features

  1. Streaming Results: Uses ReadableStream to send results in real time.
  2. Concurrent Requests: Utilizes pLimit to manage concurrent requests (default: 10).
  3. Depth-Limited Crawling: Implements a maximum depth (default: 5) to prevent infinite crawling.
  4. Robots.txt Compliance: Fetches and respects robots.txt rules.
  5. URL Normalization: Normalizes URLs to prevent duplicate crawling (see the sketch after this list).
  6. Batched Updates: Buffers updates and sends them in batches to reduce overhead.
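
URL normalization (feature 5), for example, typically strips fragments and trailing slashes so minor URL variants map to a single queue entry. The sketch below assumes that behavior; the exact rules in route.ts may differ.

    // Assumed normalization rules, for illustration only.
    function normalizeUrl(raw: string): string {
      const url = new URL(raw);
      url.hash = ""; // drop #fragments
      // Treat "/path" and "/path/" as the same page.
      if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
        url.pathname = url.pathname.slice(0, -1);
      }
      return url.toString();
    }

    // normalizeUrl("https://example.com/docs/#intro") === "https://example.com/docs"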

Performance Considerations

Server Requests

  • Concurrent Requests: Managed by pLimit, set to 10 by default. Adjust CONCURRENT_REQUESTS based on server capacity (a sketch of the pattern follows this list).
  • Robots.txt Compliance: Helps avoid overloading servers and respects site owners' wishes.
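
The concurrency cap works by wrapping each request with pLimit (from the p-limit package), roughly as sketched below. The helper name checkAll is illustrative.

    import pLimit from "p-limit";

    const CONCURRENT_REQUESTS = 10;            // adjust for your server capacity
    const limit = pLimit(CONCURRENT_REQUESTS);

    // Each URL check is wrapped in limit(), so at most 10 fetches run at once.
    async function checkAll(urls: string[]) {
      return Promise.all(
        urls.map((url) =>
          limit(async () => {
            const res = await fetch(url);
            return { url, status: res.status };
          }),
        ),
      );
    }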

Speed Optimizations

  • URL Filtering: Implements isValidUrl to quickly filter out unnecessary URLs (see the sketch after this list).
  • Depth Limiting: Prevents excessive crawling with MAX_DEPTH.
  • Batched Updates: Reduces the number of messages sent to the client.
  • URL Normalization: Prevents recrawling of the same page with slight URL differences.
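
The kind of filtering isValidUrl performs might look like the sketch below; the exact checks in route.ts may differ.

    // Illustrative filter: keep same-origin http(s) pages, skip obvious assets.
    function isValidUrl(href: string, baseOrigin: string): boolean {
      let url: URL;
      try {
        url = new URL(href, baseOrigin);
      } catch {
        return false; // malformed href
      }
      if (url.protocol !== "http:" && url.protocol !== "https:") return false;
      if (url.origin !== baseOrigin) return false; // stay on the crawled domain
      // Skip common non-HTML assets to save requests.
      return !/\.(png|jpe?g|gif|svg|css|js|pdf|zip)$/i.test(url.pathname);
    }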

Configuration

Key constants that can be adjusted:

  • DEFAULT_USER_AGENT: The user agent string used for requests.
  • CONCURRENT_REQUESTS: Number of concurrent requests allowed.
  • MAX_DEPTH: Maximum depth for crawling.
  • BATCH_SIZE: Number of updates to buffer before sending.
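
A sketch of how these might be defined is shown below. The values for CONCURRENT_REQUESTS and MAX_DEPTH match the defaults stated above; the user-agent string and BATCH_SIZE value are placeholders.

    const DEFAULT_USER_AGENT = "AutocrawlBot/1.0"; // placeholder string
    const CONCURRENT_REQUESTS = 10;  // parallel requests allowed at once
    const MAX_DEPTH = 5;             // link hops from the start URL
    const BATCH_SIZE = 50;           // placeholder: updates buffered per batch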

About

A Simple Web Crawler for Devs and SEOs
