If a site uses URLs such as items/123/detail, the item ID 123 in the path should be recognized as an argument (as opposed to a unique page), so that the requests-per-page limit can be applied to it. Otherwise, spidering an application with a large database behind it takes forever.
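One possible heuristic, sketched below purely for illustration (names and the numeric-segment rule are assumptions, not the project's actual code), is to normalize numeric path segments into a placeholder before counting requests, so that all items/&lt;id&gt;/detail URLs share one per-page budget:

```python
import re
from collections import Counter

# Hypothetical heuristic: treat purely numeric path segments as arguments,
# so items/123/detail and items/456/detail map to the same page pattern.
ARG_SEGMENT = re.compile(r"^\d+$")

def page_pattern(path: str) -> str:
    segments = path.strip("/").split("/")
    normalized = ["{arg}" if ARG_SEGMENT.match(s) else s for s in segments]
    return "/" + "/".join(normalized)

# Requests are then counted per pattern, so the limit covers the whole
# family of item pages instead of each ID separately.
request_counts: Counter = Counter()

def should_request(path: str, limit: int = 50) -> bool:
    pattern = page_pattern(path)
    if request_counts[pattern] >= limit:
        return False
    request_counts[pattern] += 1
    return True
```

With that grouping, items/123/detail and items/456/detail count against the same budget, while genuinely distinct static pages keep their own.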
If such auto-detection is not feasible, an alternative approach would be to build a site map and randomly pick a path from the tree to check, with a limit on the total number of checks.
As a quick hack, I tried randomly picking from a list of discovered URLs, but such a list quickly becomes dominated by pages reachable from the first few picks, so the checked pages may not be representative of the whole site.
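As a rough sketch of the tree-based picking idea from the previous comment (structure and names are hypothetical, not the project's implementation): build the site map as a trie of path segments and descend it choosing a random child at each level, so one heavily populated branch cannot crowd out the rest of the site the way a flat URL list does.

```python
import random

# Hypothetical trie of path segments representing the site map.
class Node:
    def __init__(self) -> None:
        self.children: dict = {}

    def insert(self, path: str) -> None:
        node = self
        for segment in path.strip("/").split("/"):
            node = node.children.setdefault(segment, Node())

    def random_path(self) -> str:
        # Descend the tree, picking a uniformly random child at each level,
        # so a single large branch does not dominate the sample.
        node, segments = self, []
        while node.children:
            segment = random.choice(list(node.children))
            segments.append(segment)
            node = node.children[segment]
        return "/" + "/".join(segments)

def pick_checks(root: Node, max_checks: int) -> list:
    # Stop after a fixed total number of checks, as proposed above.
    return [root.random_path() for _ in range(max_checks)]
```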
The per-page query limit (see #18) could become a per-node limit instead, so it would apply to inner nodes as well. Such a limit can be set high enough that it wouldn't be hit when there are only static subpaths. Then we wouldn't have to guess at the meaning of path segments, which simplifies things a lot.
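A hedged sketch of what a per-node limit might look like, building on the same kind of hypothetical path trie as above: every node along a request's path is charged, and the request is refused once any node on the path has exhausted its budget.

```python
# Hypothetical per-node counters: the limit applies to inner nodes such as
# /items as well as to leaf pages.
PER_NODE_LIMIT = 100  # assumed value, high enough for purely static subpaths

class LimitedNode:
    def __init__(self) -> None:
        self.children: dict = {}
        self.request_count = 0

def charge_request(root: LimitedNode, path: str) -> bool:
    node, visited = root, [root]
    for segment in path.strip("/").split("/"):
        node = node.children.setdefault(segment, LimitedNode())
        visited.append(node)
    # Refuse the request if any node on the path has exhausted its budget.
    if any(n.request_count >= PER_NODE_LIMIT for n in visited):
        return False
    for n in visited:
        n.request_count += 1
    return True
```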
The similarity between queries and inner nodes doesn't have to end there: we could also generate synthetic requests for inner nodes and check whether we get either an OK (2xx) or a client error (4xx) result.
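A minimal sketch of such a probe, using only the standard library and assuming a fully resolved URL for the inner node; anything other than a 2xx or 4xx response would be treated as a failed check.

```python
import urllib.error
import urllib.request

# Hypothetical probe for an inner node: issue a synthetic request and
# accept either a success (2xx) or a client error (4xx) as well-behaved.
def probe_inner_node(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        # 4xx means the node exists but rejects the synthetic arguments,
        # which is still an acceptable outcome; 5xx is not.
        return 400 <= err.code < 500
    except urllib.error.URLError:
        # Network-level failures count as failures of the check.
        return False
```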