If a site uses URLs such as items/123/detail, the item ID 123 in the path should be recognized as an argument (as opposed to a unique page), so that the requests-per-page limit can be applied to it. Otherwise, spidering an application with a large database behind it takes forever.
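One possible heuristic, sketched below purely for illustration (names and the numeric-segment rule are assumptions, not the project's actual code), is to normalize numeric path segments into a placeholder before counting requests, so that all items/&lt;id&gt;/detail URLs share one per-page budget:

```python
import re
from collections import Counter

# Hypothetical heuristic: treat purely numeric path segments as arguments,
# so items/123/detail and items/456/detail map to the same page pattern.
ARG_SEGMENT = re.compile(r"^\d+$")

def page_pattern(path: str) -> str:
    segments = path.strip("/").split("/")
    normalized = ["{arg}" if ARG_SEGMENT.match(s) else s for s in segments]
    return "/" + "/".join(normalized)

# Requests are then counted per pattern, so the limit covers the whole
# family of item pages instead of each ID separately.
request_counts: Counter = Counter()

def should_request(path: str, limit: int = 50) -> bool:
    pattern = page_pattern(path)
    if request_counts[pattern] >= limit:
        return False
    request_counts[pattern] += 1
    return True
```

With that grouping, items/123/detail and items/456/detail count against the same budget, while genuinely distinct static pages keep their own.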
If such auto-detection is not feasible, an alternative approach would be to build a site map and randomly pick a path from the tree to check, with a limit on the total number of checks.
As a quick hack, I tried randomly picking from a list of discovered URLs, but such a list quickly becomes dominated by pages reachable from the first few picks, so the checked pages may not be representative of the whole site.
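As a rough sketch of the tree-based picking idea from the previous comment (structure and names are hypothetical, not the project's implementation): build the site map as a trie of path segments and descend it choosing a random child at each level, so one heavily populated branch cannot crowd out the rest of the site the way a flat URL list does.

```python
import random

# Hypothetical trie of path segments representing the site map.
class Node:
    def __init__(self) -> None:
        self.children: dict = {}

    def insert(self, path: str) -> None:
        node = self
        for segment in path.strip("/").split("/"):
            node = node.children.setdefault(segment, Node())

    def random_path(self) -> str:
        # Descend the tree, picking a uniformly random child at each level,
        # so a single large branch does not dominate the sample.
        node, segments = self, []
        while node.children:
            segment = random.choice(list(node.children))
            segments.append(segment)
            node = node.children[segment]
        return "/" + "/".join(segments)

def pick_checks(root: Node, max_checks: int) -> list:
    # Stop after a fixed total number of checks, as proposed above.
    return [root.random_path() for _ in range(max_checks)]
```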
The per-page query limit (see #18) could become a per-node limit instead, so it would apply to inner nodes as well. Such a limit can be set high enough that it wouldn't be hit when there are only static subpaths. Then we wouldn't have to guess at the meaning of path segments, which simplifies things a lot.
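A hedged sketch of what a per-node limit might look like, building on the same kind of hypothetical path trie as above: every node along a request's path is charged, and the request is refused once any node on the path has exhausted its budget.

```python
# Hypothetical per-node counters: the limit applies to inner nodes such as
# /items as well as to leaf pages.
PER_NODE_LIMIT = 100  # assumed value, high enough for purely static subpaths

class LimitedNode:
    def __init__(self) -> None:
        self.children: dict = {}
        self.request_count = 0

def charge_request(root: LimitedNode, path: str) -> bool:
    node, visited = root, [root]
    for segment in path.strip("/").split("/"):
        node = node.children.setdefault(segment, LimitedNode())
        visited.append(node)
    # Refuse the request if any node on the path has exhausted its budget.
    if any(n.request_count >= PER_NODE_LIMIT for n in visited):
        return False
    for n in visited:
        n.request_count += 1
    return True
```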
The similarity between queries and inner nodes doesn't have to end there: we could also generate synthetic requests for inner nodes and check whether we get either an OK (2xx) or a client error (4xx) result.
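A minimal sketch of such a probe, using only the standard library and assuming a fully resolved URL for the inner node; anything other than a 2xx or 4xx response would be treated as a failed check.

```python
import urllib.error
import urllib.request

# Hypothetical probe for an inner node: issue a synthetic request and
# accept either a success (2xx) or a client error (4xx) as well-behaved.
def probe_inner_node(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        # 4xx means the node exists but rejects the synthetic arguments,
        # which is still an acceptable outcome; 5xx is not.
        return 400 <= err.code < 500
    except urllib.error.URLError:
        # Network-level failures count as failures of the check.
        return False
```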