search_calls_per_second needs to be dialed down #137

@edsu

I was running some fairly simple data retrieval in this Notebook (see the Wayback section) and discovered that I had been completely blocked from accessing web.archive.org! Luckily I remembered the Internet Archive's #wayback-researchers Slack channel, where I got this response:

Hi edsu - I found your /cdx requests from 4:29 UTC. Those requests are limited to an average of 60/min. Over that and we start sending 429s. If 429s are ignored for more than a minute we block the IP at the firewall (no connection) for 1 hour, which is what happened to you. Subsequent 429s over a given period will double that time each occurrence. If you can keep your API requests < 60/minute you will prevent this from happening.

I thought that the wayback module's defaults would have prevented me from going over 60 requests per minute (one per second), and that wayback's support for handling 429 responses would have backed off sufficiently fast. I suspect that the goalposts on the server side have moved recently, because I had code that worked a month or so ago and has now stopped working (resulting in the block).

I was able to get around this by using a custom WaybackSession where I set search_calls_per_second to 0.5, but I suspect 1.0 would probably work better. Maybe the default could be moved down from 1.5 to 1.0?
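To make the arithmetic concrete, here is a minimal sketch of what a calls-per-second setting implies against a 60 requests/minute limit. It uses only the standard library and is not wayback's actual implementation; `RateLimiter` and `requests_per_minute` are hypothetical names introduced for illustration.

```python
import time


def requests_per_minute(calls_per_second):
    """Convert a calls-per-second setting into requests per minute."""
    return calls_per_second * 60


class RateLimiter:
    """Hypothetical limiter: enforce a minimum gap between consecutive calls."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 1.5 calls/second is above the 60/minute limit; 1.0 sits exactly on it.
for rate in (1.5, 1.0, 0.5):
    print(rate, "calls/s ->", requests_per_minute(rate), "requests/min")
```

Calling `limiter.wait()` before each CDX request would space requests at least `1 / calls_per_second` seconds apart. Since the quoted limit is an average of 60/min, a setting slightly below 1.0 leaves some headroom.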

Also, perhaps there needs to be some logic to make sure to wait a minute when encountering a 429 as well?
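One possible shape for that logic, as a rough sketch only: pause for a full minute whenever a 429 comes back, and give up after a few attempts instead of hammering the server. The function name, the `(status, body)` return convention, and the default of three attempts are all assumptions for illustration, not anything wayback currently does.

```python
import time


def call_with_429_cooldown(make_request, cooldown_seconds=60, max_attempts=3,
                           sleep=time.sleep):
    """Call make_request() and wait cooldown_seconds after each 429 response.

    make_request is a hypothetical callable returning (status_code, body).
    Raises RuntimeError if every attempt is answered with a 429.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = make_request()
        if status != 429:
            return status, body
        # Back off for a full minute before retrying, except after the last try.
        if attempt < max_attempts:
            sleep(cooldown_seconds)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Waiting a full minute matches the quoted policy, where ignoring 429s for more than a minute is what triggers the firewall block.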

Labels: enhancement