Member

@Mr0grog Mr0grog commented Jan 23, 2025

This is pretty ugly, but it's a working first cut at the problem and fixes #663. This is suddenly more important since IA seems to be under a lot of load with people checking up on Trump-related website changes, and I'm seeing a lot more import-related failures on things that work fine on the processing side and fail to re-download in the DB importer. This will short-circuit that issue.

This ideally would go in a separate worker component, but we just don’t have the right data available from outside WaybackRecordsWorker right now. This would also ideally be tested more, but 🤷

Member Author

Mr0grog commented Jan 24, 2025

Turns out this is uploading data with an invalid binary/octet-stream media type. We should use the Content-Type header from the memento or, if not present, the sniffed media type (or we should just use the sniffed type?).
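
A rough sketch of the fallback order described above: prefer the `Content-Type` header, then the sniffed type, then a generic default. The function name `choose_media_type`, the loose validation regex, and the `application/octet-stream` default are illustrative assumptions, not code from this PR.

```python
import re

# Loose "type/subtype" check; an assumption for illustration, not a full parser.
MEDIA_TYPE_PATTERN = re.compile(r"^[a-z0-9!#$&^_.+-]+/[a-z0-9!#$&^_.+-]+$")
# Assumed last-resort default when neither source gives a usable type.
DEFAULT_MEDIA_TYPE = "application/octet-stream"


def choose_media_type(headers, sniffed_type=None):
    """Pick a media type: Content-Type header first, then the sniffed type."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    header_value = normalized.get("content-type", "")
    # Drop parameters such as "; charset=utf-8".
    candidate = header_value.split(";", 1)[0].strip().lower()
    if MEDIA_TYPE_PATTERN.match(candidate):
        return candidate
    # Fall back to whatever content sniffing produced, if it looks valid.
    if sniffed_type:
        sniffed = sniffed_type.strip().lower()
        if MEDIA_TYPE_PATTERN.match(sniffed):
            return sniffed
    return DEFAULT_MEDIA_TYPE
```

Whether the header or the sniffed type should win is still the open question here; swapping the two checks would give the "just use the sniffed type" variant.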

A quick look at cloudpathlib suggests this might be hard. We could configure gzip encoding and a standard ACL for all uploads, but we’d have to construct a custom client for each upload now (I don’t think this is high overhead, but… I dunno?). We may need to drop down to boto3 and drop cloudpathlib (for this use case at least).
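
For reference, a minimal sketch of what the boto3 route might look like, with the media type, gzip encoding, and ACL set per object. The helper name `upload_body` and the `"private"` ACL default are placeholders (the ACL this project would actually want isn't specified here); `put_object` and its `ContentType`, `ContentEncoding`, and `ACL` parameters are standard boto3 S3 client API.

```python
import gzip


def upload_body(client, bucket, key, body, media_type, acl="private"):
    """Gzip `body` and upload it with explicit per-object metadata.

    `client` is expected to behave like `boto3.client("s3")`.
    The `acl` default is a placeholder assumption, not this project's setting.
    """
    compressed = gzip.compress(body)
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=compressed,
        ContentType=media_type,   # per-upload media type
        ContentEncoding="gzip",   # tells consumers the body is gzipped
        ACL=acl,
    )
    # Return the stored size, which can be handy for logging.
    return len(compressed)
```

Passing the client in means one shared client can serve every upload, rather than constructing a custom one each time.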

Also while I’m here, we should keep a list of seen hashes in memory so we don’t waste effort checking S3 if we encounter something a second time.
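
Something like this could work for the in-memory list, as a sketch only. The class `SeenHashes` and the `exists_remotely` callback (standing in for whatever S3 existence check the importer uses) are hypothetical names, not existing code.

```python
class SeenHashes:
    """Remember hashes known to be in storage to avoid repeat remote checks."""

    def __init__(self, exists_remotely):
        # `exists_remotely` is a hypothetical callable: hash -> bool.
        self._exists_remotely = exists_remotely
        self._seen = set()

    def needs_upload(self, body_hash):
        """Return True only if the body is not yet known to be stored."""
        if body_hash in self._seen:
            return False
        if self._exists_remotely(body_hash):
            self._seen.add(body_hash)
            return False
        return True

    def mark_uploaded(self, body_hash):
        """Record a hash after its upload has succeeded."""
        self._seen.add(body_hash)
```

A hash is only recorded once it is confirmed in storage or the upload has succeeded, so a failed upload gets retried the next time the same body shows up.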

@Mr0grog Mr0grog marked this pull request as draft January 24, 2025 04:21
@Mr0grog Mr0grog marked this pull request as ready for review January 28, 2025 21:23
@Mr0grog Mr0grog merged commit 5d21b4a into main Jan 28, 2025
4 checks passed
@Mr0grog Mr0grog deleted the 663-upload-before-you-upload-please branch January 28, 2025 21:30
Mr0grog added a commit that referenced this pull request Jan 28, 2025

Development

Successfully merging this pull request may close these issues.

Import script should upload bodies directly to S3
