fix(robots2policy): handle multiple user agents under one block #925
Conversation
INPUT: https://lwn.net/robots.txt

OLD:

```
json@parser anubis % robots2policy -input https://lwn.net/robots.txt
- action: CHALLENGE
  expression: path.startsWith("/Search")
  name: robots-txt-policy-disallow-1
- action: CHALLENGE
  expression: path.startsWith("/ml")
  name: robots-txt-policy-disallow-2
- action: DENY
  expression: userAgent.contains("ChatGLM-Spider")
  name: robots-txt-policy-blacklist-3
```

NEW:

```
json@parser anubis % go run ./cmd/robots2policy -input https://lwn.net/robots.txt
- action: CHALLENGE
  expression: path.startsWith("/Search")
  name: robots-txt-policy-disallow-1
- action: CHALLENGE
  expression: path.startsWith("/ml")
  name: robots-txt-policy-disallow-2
- action: DENY
  expression:
    any:
      - userAgent.contains("CCBot")
      - userAgent.contains("MJ12bot")
      - userAgent.contains("Mail.RU_Bot")
      - userAgent.contains("Mail.RU_Bot/2.0")
      - userAgent.contains("MegaIndex")
      - userAgent.contains("MegaIndex.ru")
      - userAgent.contains("trendkite-akashic-crawler")
      - userAgent.contains("Jooblebot")
      - userAgent.contains("HTTrack")
      - userAgent.contains("yacybot")
      - userAgent.contains("PetalBot")
      - userAgent.contains("GPTBot")
      - userAgent.contains("SemrushBot")
      - userAgent.contains("AhrefsBot")
      - userAgent.contains("Qwantbot")
      - userAgent.contains("meta-externalagent")
      - userAgent.contains("ChatGLM-Spider")
  name: robots-txt-policy-blacklist-3
```
Pull Request Overview
This PR fixes the robots2policy tool to properly handle multiple consecutive user agents in robots.txt files by grouping them into any: expressions instead of only processing the last one. Previously, the tool would lose all but the final user agent when multiple consecutive user-agent directives appeared before rules like disallow or crawl-delay.
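For example, in a robots.txt block like the following (a minimal hypothetical sketch, not LWN's actual file), all three consecutive user-agent lines share the single `Disallow` rule, but the old parser kept only the last one:

```
User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGLM-Spider
Disallow: /
```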
- Refactored the robots.txt parser to accumulate multiple user agents before processing directives
- Updated the rule generation logic to create `any:` expressions for multiple user agents
- Added comprehensive test coverage for the consecutive user agent scenario
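The accumulation step described above can be sketched in Go. This is an illustrative reconstruction under assumed names, not the PR's actual code: `group`, `parseGroups`, and the field names are invented for the example. The key idea is collecting a run of `User-agent:` lines into one group and only closing the group when a rule directive appears:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// group holds a run of consecutive user agents and the rules that apply to all of them.
type group struct {
	agents    []string
	disallows []string
}

// parseGroups accumulates consecutive User-agent lines into a single group,
// so the rules that follow apply to every listed agent rather than only the last one.
func parseGroups(input string) []group {
	var groups []group
	var cur *group
	collecting := false // true while we are still reading User-agent lines

	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)

		switch key {
		case "user-agent":
			if !collecting {
				// a User-agent after a run of rules starts a new group
				groups = append(groups, group{})
				cur = &groups[len(groups)-1]
				collecting = true
			}
			cur.agents = append(cur.agents, val)
		case "disallow":
			if cur != nil {
				cur.disallows = append(cur.disallows, val)
				collecting = false // the next User-agent line opens a new group
			}
		}
	}
	return groups
}

func main() {
	robots := "User-agent: CCBot\nUser-agent: GPTBot\nDisallow: /\n"
	for _, g := range parseGroups(robots) {
		fmt.Printf("agents=%v disallows=%v\n", g.agents, g.disallows)
	}
}
```

A downstream generator would then emit a single `userAgent.contains(...)` expression for a one-agent group and an `any:` list for a multi-agent group.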
Reviewed Changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/docs/CHANGELOG.md | Documents the bug fix for proper grouping of consecutive user agents |
| cmd/robots2policy/testdata/simple.json | Reorders JSON fields (action moved to end) |
| cmd/robots2policy/testdata/consecutive.yaml | New test expectation file showing proper any: grouping for multiple user agents |
| cmd/robots2policy/testdata/consecutive.robots.txt | New test input file with multiple consecutive user agent scenarios |
| cmd/robots2policy/robots2policy_test.go | Adds test case for consecutive user agents functionality |
| cmd/robots2policy/main.go | Core implementation changes to properly accumulate and group multiple user agents |
Closes #857
Checklist:
- [Unreleased] section of docs/docs/CHANGELOG.md
- `npm run test:integration` (unsupported on Windows, please use WSL)