provide robots.txt for api.dandiarchive.org disabling all compliant bots? #1272

Closed
yarikoptic opened this issue Sep 8, 2022 · 2 comments
Labels
DX: Affects developer experience; maintenance: Action to maintain the system (neither a bugfix nor an enhancement); security

Comments

@yarikoptic
Member

I do not think there is any point in allowing bots to traverse api.dandiarchive.org in its entirety. Today I spotted entries such as the following in the logs:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep petalbot 20220908-*.log | head -n 1
20220908-0301.log:2022-09-08T07:20:20.557120+00:00 app[web.1]: 10.1.62.48 - - [08/Sep/2022:07:20:20 +0000] "GET /api/dandisets/000021/versions/ HTTP/1.1" 200 12478 "https://api.dandiarchive.org/api/dandisets/000021/versions/draft/assets/paths/" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

There are not that many of these requests, but the bots do slowly traverse the website, and they even seem to ask for robots.txt, which we do not provide. Here are today's hits so far, showing the most popular "paths" on the server, presumably from self-announced bots:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep bot 20220908-*.log | awk '{print $9;}' | sort | uniq -c | sort -n | nl | tail
    93        3 /static/drf-yasg/swagger-ui-dist/swagger-ui-standalone-preset.js
    94        3 /swagger/?format=openapi
    95        4 /api/dandisets/?page=1&page_size=1&ordering=-modified&draft=true
    96        4 /api/stats/
    97        4 /swagger/
    98        8 /
    99       10 /api/info/
   100       10 /robots.txt
   101       13 dyno=web.1
   102       18

and the same for PetalBot:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep petalbot 20220908-*.log | awk '{print $9;}' | sort | uniq -c | sort -n | nl | tail
    68        2 /api/dandisets/000238/users/
    69        2 /api/dandisets/000238/versions/
    70        2 /api/dandisets/000238/versions/draft/assets/paths/?path_prefix=&page=1&page_size=15
    71        2 /api/dandisets/000238/versions/draft/info/
    72        2 /api/dandisets/000239/versions/draft/assets/?page=3
    73        2 /api/dandisets/000295/versions/
    74        2 /api/zarr/?page=10
    75        2 /api/zarr/?page=36
    76        4 /api/dandisets/?page=1&page_size=1&ordering=-modified&draft=true
    77        8 /api/info/
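
The tallies above take the ninth whitespace-separated field as the path, which works for access-log lines shaped like the PetalBot sample but not for lines with a different layout; the `dyno=web.1` and empty entries in the first tally are presumably such lines. As a rough sketch only, the same count could be done in Python by matching the quoted request line and the trailing user agent instead of a field position. The regular expression assumes the layout of the sample line shown earlier, and `count_bot_paths` is a hypothetical helper name, not part of any DANDI tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample log line in this issue:
#   "GET /path HTTP/1.1" 200 12478 "referer" "user agent"
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_paths(lines, needle="bot"):
    """Count requested paths for entries whose user agent contains `needle`.

    Lines that do not match the assumed access-log layout are skipped,
    so they cannot pollute the tally with unrelated fields.
    """
    counts = Counter()
    for line in lines:
        match = ACCESS_RE.search(line)
        if match and needle.lower() in match["agent"].lower():
            counts[match["path"]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        '2022-09-08T07:20:20.557120+00:00 app[web.1]: 10.1.62.48 - - '
        '[08/Sep/2022:07:20:20 +0000] '
        '"GET /api/dandisets/000021/versions/ HTTP/1.1" 200 12478 '
        '"https://api.dandiarchive.org/api/dandisets/000021/versions/draft/assets/paths/" '
        '"Mozilla/5.0 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"'
    )
    for path, hits in count_bot_paths([sample]).most_common():
        print(hits, path)
```

Because the match is made on the user agent rather than on the whole line, a request for `/robots.txt` from a regular browser would not be counted as bot traffic, unlike with a plain `grep bot`.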
@yarikoptic added the DX label (Affects developer experience) on Sep 8, 2022
@waxlamp added the maintenance label (Action to maintain the system (neither a bugfix nor an enhancement)) on Mar 2, 2023
@jwodder
Member

jwodder commented Nov 23, 2024

The following robots.txt should disable all robots:

User-agent: *
Disallow: /
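
One way to sanity-check these two lines before deploying them is to feed them to Python's standard-library `urllib.robotparser` and ask whether a given crawler may fetch some of the paths seen in the logs. This is only a sketch: `is_allowed` is a hypothetical helper name, and the check models what a compliant crawler would conclude, not what a non-compliant one actually does.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt proposed in this comment.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant bot with this user agent may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Paths taken from the log excerpts earlier in this issue.
    for path in ("/", "/api/info/", "/api/dandisets/000021/versions/", "/swagger/"):
        url = "https://api.dandiarchive.org" + path
        print(path, "allowed" if is_allowed("PetalBot", url) else "disallowed")
```

With the rules above, every path should be reported as disallowed for any user agent, since `Disallow: /` is a prefix of every URL path on the host.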

@jjnesbitt
Member

Closed via #2084 (not deployed to prod yet).


4 participants