HTML

HTML Crawler

By default, the HTML Crawler indexes every page whose response header type is `text/html`. In addition, the link must contain one of ``, `.html`, `.htm` or `.php`.
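
The default rule above can be sketched as a small check. This is only an illustration, not the crawler's actual code: the function name `is_indexable` and the constant `ALLOWED_EXTENSIONS` are hypothetical. Reading the empty entry in the list as "no file extension" is an assumption, as is matching on the extension of the URL path.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: the empty entry in the documented list means "no file extension".
ALLOWED_EXTENSIONS = {"", ".html", ".htm", ".php"}


def is_indexable(url: str, content_type: str) -> bool:
    """Return True if a response passes the default rule described above."""
    # Drop header parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/html":
        return False
    # Only the path is inspected; query string and fragment are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

For example, `is_indexable("https://example.com/index.php?id=3", "text/html; charset=utf-8")` passes the check, while a URL ending in `.pdf` or a response typed `application/pdf` does not.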

Custom Content Tags

| Tag | Example | Description |
| --- | --- | --- |
| CRAWL_IGNORE | `<!-- [CRAWL_IGNORE] -->Ignore this<!-- [/CRAWL_IGNORE] -->` | Excludes the enclosed content from indexing. |
| CRAWL_FULL_IGNORE | `<!-- [CRAWL_FULL_IGNORE] -->` | Excludes the whole page from the index. Keep in mind that links found inside the ignored page are still added to the index. |
| CRAWL_GROUP | `<!-- [CRAWL_GROUP]api[/CRAWL_GROUP] -->` | Tells the crawler which group or section the current page belongs to. You can then group your results by the `group` field. |
| CRAWL_TITLE | `<!-- [CRAWL_TITLE]My Title[/CRAWL_TITLE] -->` | Ensures the crawler always uses your custom title for the page. |
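
To make the tag syntax above concrete, here is a minimal sketch of how such comment tags could be read from a page. It is not the crawler's own parser: the function `read_crawl_tags`, the regular expressions, and the shape of the returned dictionary are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns that mirror the tag syntax shown in the table.
IGNORE_RE = re.compile(
    r"<!--\s*\[CRAWL_IGNORE\]\s*-->.*?<!--\s*\[/CRAWL_IGNORE\]\s*-->", re.DOTALL
)
FULL_IGNORE_RE = re.compile(r"<!--\s*\[CRAWL_FULL_IGNORE\]\s*-->")
GROUP_RE = re.compile(r"<!--\s*\[CRAWL_GROUP\](.*?)\[/CRAWL_GROUP\]\s*-->", re.DOTALL)
TITLE_RE = re.compile(r"<!--\s*\[CRAWL_TITLE\](.*?)\[/CRAWL_TITLE\]\s*-->", re.DOTALL)


def read_crawl_tags(html: str) -> dict:
    """Collect the custom content tags found in an HTML string."""
    group = GROUP_RE.search(html)
    title = TITLE_RE.search(html)
    return {
        # True when the page asks to be left out of the index entirely.
        "full_ignore": FULL_IGNORE_RE.search(html) is not None,
        "group": group.group(1).strip() if group else None,
        "title": title.group(1).strip() if title else None,
        # Page content with every CRAWL_IGNORE block removed.
        "content": IGNORE_RE.sub("", html),
    }


SAMPLE_PAGE = (
    "<!-- [CRAWL_TITLE]My Title[/CRAWL_TITLE] -->"
    "<!-- [CRAWL_GROUP]api[/CRAWL_GROUP] -->"
    "<p>Keep this</p>"
    "<!-- [CRAWL_IGNORE] -->Ignore this<!-- [/CRAWL_IGNORE] -->"
)
tags = read_crawl_tags(SAMPLE_PAGE)
```

With `SAMPLE_PAGE`, `tags["title"]` is `"My Title"`, `tags["group"]` is `"api"`, and `tags["content"]` no longer contains the text wrapped in CRAWL_IGNORE.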

