
Conversation

@asadurski
Member

Adds a custom cache that stores raw AutoExtract responses in Scrapy Cloud collections.
Adds settings:

  • AUTOEXTRACT_CACHE_COLLECTION
  • DEV_PROJECT

Adds docs.

(Bonus: minor fixes to docs).
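As an illustration, the two new settings might be set in a project's settings.py roughly like this (values are placeholders; see the docs added in this PR for the exact semantics):

# settings.py -- illustrative values only
AUTOEXTRACT_CACHE_COLLECTION = "autoextract_cache"  # Scrapy Cloud collection that stores the raw AutoExtract responses
DEV_PROJECT = 123456  # assumed: Scrapy Cloud project ID to use when running outside Scrapy Cloud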

@codecov

codecov bot commented Nov 30, 2021

Codecov Report

Merging #30 (6c0d2fe) into master (5baa342) will decrease coverage by 3.11%.
The diff coverage is 54.16%.

@@            Coverage Diff             @@
##           master      #30      +/-   ##
==========================================
- Coverage   85.24%   82.12%   -3.12%     
==========================================
  Files           9        9              
  Lines         488      526      +38     
==========================================
+ Hits          416      432      +16     
- Misses         72       94      +22     
Impacted Files                    Coverage Δ
scrapy_autoextract/utils.py       36.36% <16.66%> (-23.64%) ⬇️
scrapy_autoextract/cache.py       60.27% <50.00%> (-4.02%) ⬇️
scrapy_autoextract/providers.py   91.30% <83.33%> (-1.46%) ⬇️

)

def close(self):
    pass
Contributor


Shouldn't we be closing the SC client here?

Member Author


That's right. Thanks!
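For context, a minimal sketch of what the agreed fix could look like; the client attribute name (self._sc_client) is assumed for illustration and is not taken from the actual diff:

def close(self):
    # Release the ScrapinghubClient's underlying HTTP session when the cache is closed.
    if self._sc_client is not None:
        self._sc_client.close()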

Contributor

@ivanprado left a comment


@asadurski it looks great. I left a minor comment. The approach in general makes sense to me.

If I understand it correctly, it will create a collection using the job ID, so the collection will start empty, and then for every request there will be one lookup against the cache even though nothing will be found there (because the request is new). This can be inefficient and can make crawling slower for no reason.

At the same time, it would be nice to have a reusable cache backed by collections. I imagine that all jobs for a particular spider could use this cache, and that would speed it up a lot!

So what I propose is to be more flexible in the configuration:

  • A write-only flag, so that responses are saved but no lookups are made against the cache.
  • A way to customize the name of the collection to be used, so that it can be reused across jobs (see the sketch below).
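
To make the proposal concrete, the configuration could look roughly like this (AUTOEXTRACT_CACHE_COLLECTION is the setting this PR adds; the write-only flag name is hypothetical):

# settings.py -- illustrative sketch of the proposed options
AUTOEXTRACT_CACHE_COLLECTION = "books_spider_cache"  # fixed name, shared by all jobs of the same spider
AUTOEXTRACT_CACHE_WRITE_ONLY = True  # hypothetical flag: store responses, skip cache lookups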


@classmethod
def fingerprint(cls, request: Request) -> str:
    return request.url
Member


Hey! What's the reason only the URL is used as a fingerprint?

Member Author


Agreed, that probably wasn't optimal. Introduced the same fingerprint mechanism as in AutoExtractCache.
It was like this for simplicity: those cached items are later retrieved by our QA tool, MATT, and compared with job items. Now I just have to modify the retrieval in MATT to read the item contents instead of the collection key.
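
For illustration, a fingerprint that covers more than the URL could hash the serialized request, so that different page types or extraction parameters for the same URL don't collide. This is a sketch under the assumption that the AutoExtract Request exposes an as_dict() serializer; it is not necessarily the exact AutoExtractCache code:

import hashlib
import json

@classmethod
def fingerprint(cls, request: Request) -> str:
    # Serialize the whole request (URL, page type, extra params) and hash it,
    # so the collection key stays short and stable.
    payload = json.dumps(request.as_dict(), sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()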

@ivanprado
Contributor

Hi @asadurski. I'm moving the cache mechanism to scrapy-poet here: scrapinghub/scrapy-poet#55. In the future, it might be a good idea to remove the cache from here entirely and put everything there. Meanwhile, we can keep the one here as well.
