Description
I was running some fairly simple data retrieval in this Notebook (see the Wayback section) and discovered that I had been completely blocked from accessing web.archive.org! Luckily I remembered the Internet Archive's #wayback-researchers Slack channel, where I got this response.
Hi edsu - I found your /cdx requests from 4:29UTC. Those requests are limited to an average of 60/min. Over that and we start sending 429s. If 429s are ignored for more than a minute we block the IP at the firewall (no connection) for 1 hour, which is what happened to you. Subsequent 429s over a given period will double that time each occurrence. If you can keeping your api request < 60/minute you will prevent this from happening.
I thought that the openwayback module's defaults would have prevented me from going over the 60 requests per minute (one per second), and that wayback's support for handling 429 responses would have backed off quickly enough. I suspect that the goalposts on the server side have moved recently, because code that worked a month or so ago has stopped working (resulting in the block).
I was able to get around this by using a custom WaybackSession where I set search_calls_per_second to 0.5, but I suspect 1.0 would probably work better. Maybe the default could be moved down from 1.5 to 1.0?
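To make the arithmetic concrete: 1.5 calls per second works out to 90 requests per minute, which is over the 60/min average quoted above, while 1.0 lands exactly on it and 0.5 leaves headroom. Below is a minimal, standard-library sketch of what a calls-per-second throttle does. The `Throttle` class and its `clock`/`sleep` parameters are hypothetical names for illustration only; this is not how wayback implements its rate limiting.

```python
import time


class Throttle:
    """Hypothetical helper: space calls so they never exceed `calls_per_second`."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        # Minimum gap between two calls, e.g. 0.5 calls/s -> 2.0 seconds.
        self.min_interval = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def calls_per_minute(self):
        """Upper bound on requests per minute at this rate."""
        return 60.0 / self.min_interval

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` before each CDX request keeps the request rate at or below the configured value. A limiter like this only bounds the steady-state rate; it does nothing by itself once the server has started returning 429s.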
Also, perhaps there needs to be some logic to make sure to wait a minute when encountering a 429 as well?
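As a rough sketch of that idea, assuming a fixed 60-second pause is enough to stay clear of the "429s ignored for more than a minute" rule described above: `call_with_429_pause` and its parameters are hypothetical names, and `request` stands in for whatever actually performs the HTTP call.

```python
import time


def call_with_429_pause(request, pause_seconds=60.0, max_attempts=5, sleep=time.sleep):
    """Hypothetical helper: retry `request()` after a long pause on HTTP 429.

    `request` is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = request()
        if status != 429:
            return status, body
        # Rate limited: stop sending requests entirely for a while, rather
        # than retrying after a short exponential backoff.
        if attempt < max_attempts:
            sleep(pause_seconds)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The design choice here is that a 429 halts all traffic for the full pause instead of being treated like an ordinary transient error; whether 60 seconds is the right value is an assumption based only on the Slack message.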