Conversation

@JasonLovesDoggo
Member

Closes #857

Checklist:

  • Added a description of the changes to the [Unreleased] section of docs/docs/CHANGELOG.md
  • Added test cases to the relevant parts of the codebase
  • Ran integration tests `npm run test:integration` (unsupported on Windows, please use WSL)

@JasonLovesDoggo
Member Author

INPUT: https://lwn.net/robots.txt

OLD
json@parser anubis % robots2policy -input https://lwn.net/robots.txt
- action: CHALLENGE
  expression: path.startsWith("/Search")
  name: robots-txt-policy-disallow-1
- action: CHALLENGE
  expression: path.startsWith("/ml")
  name: robots-txt-policy-disallow-2
- action: DENY
  expression: userAgent.contains("ChatGLM-Spider")
  name: robots-txt-policy-blacklist-3
NEW
json@parser anubis % go run ./cmd/robots2policy -input https://lwn.net/robots.txt
- action: CHALLENGE
  expression: path.startsWith("/Search")
  name: robots-txt-policy-disallow-1
- action: CHALLENGE
  expression: path.startsWith("/ml")
  name: robots-txt-policy-disallow-2
- action: DENY
  expression:
    any:
    - userAgent.contains("CCBot")
    - userAgent.contains("MJ12bot")
    - userAgent.contains("Mail.RU_Bot")
    - userAgent.contains("Mail.RU_Bot/2.0")
    - userAgent.contains("MegaIndex")
    - userAgent.contains("MegaIndex.ru")
    - userAgent.contains("trendkite-akashic-crawler")
    - userAgent.contains("Jooblebot")
    - userAgent.contains("HTTrack")
    - userAgent.contains("yacybot")
    - userAgent.contains("PetalBot")
    - userAgent.contains("GPTBot")
    - userAgent.contains("SemrushBot")
    - userAgent.contains("AhrefsBot")
    - userAgent.contains("Qwantbot")
    - userAgent.contains("meta-externalagent")
    - userAgent.contains("ChatGLM-Spider")
  name: robots-txt-policy-blacklist-3
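The grouping behavior shown in the NEW output can be sketched as a small accumulator: consecutive User-agent lines are collected until the first directive appears, and the whole group is then attached to the rules that follow. This is a minimal illustrative sketch, not the tool's actual implementation in cmd/robots2policy/main.go; the names `ruleGroup` and `parseRobots` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ruleGroup pairs the accumulated user agents with the paths they disallow.
type ruleGroup struct {
	Agents    []string
	Disallows []string
}

// parseRobots groups consecutive User-agent lines so that the directives that
// follow apply to all of them, not just the last one (the bug this PR fixes).
func parseRobots(input string) []ruleGroup {
	var groups []ruleGroup
	var agents []string
	inDirectives := false // true once a rule has been seen after the agent lines

	for _, line := range strings.Split(input, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)

		switch key {
		case "user-agent":
			if inDirectives {
				// A new block starts; reset the accumulator.
				agents = nil
				inDirectives = false
			}
			agents = append(agents, val)
		case "disallow", "crawl-delay":
			if !inDirectives {
				// First directive after the agent lines: emit one group
				// carrying every accumulated agent.
				groups = append(groups, ruleGroup{Agents: append([]string(nil), agents...)})
				inDirectives = true
			}
			if key == "disallow" && val != "" {
				g := &groups[len(groups)-1]
				g.Disallows = append(g.Disallows, val)
			}
		}
	}
	return groups
}

func main() {
	robots := "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n"
	for _, g := range parseRobots(robots) {
		fmt.Printf("agents=%v disallows=%v\n", g.Agents, g.Disallows)
	}
}
```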

@JasonLovesDoggo JasonLovesDoggo marked this pull request as ready for review September 7, 2025 02:14
Contributor

Copilot AI left a comment
Pull Request Overview

This PR fixes the robots2policy tool to properly handle multiple consecutive user agents in robots.txt files by grouping them into any: expressions instead of only processing the last one. Previously, the tool would lose all but the final user agent when multiple consecutive user-agent directives appeared before rules like disallow or crawl-delay.

  • Refactored the robots.txt parser to accumulate multiple user agents before processing directives
  • Updated the rule generation logic to create appropriate any: expressions for multiple user agents
  • Added comprehensive test coverage for the consecutive user agent scenario
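The rule-generation step described above can be illustrated with a hedged sketch: a block with one user agent yields a single CEL-style contains() clause, while a block with several yields the any: list seen in the NEW output. The function name and return shape here are illustrative assumptions, not the tool's actual API.

```go
package main

import "fmt"

// expressionFor builds the policy expression for one robots.txt block.
// A single agent yields a plain string clause; multiple agents yield an
// "any:" list, mirroring the grouped output this PR introduces.
func expressionFor(agents []string) any {
	clauses := make([]string, 0, len(agents))
	for _, a := range agents {
		clauses = append(clauses, fmt.Sprintf("userAgent.contains(%q)", a))
	}
	if len(clauses) == 1 {
		return clauses[0]
	}
	return map[string][]string{"any": clauses}
}

func main() {
	fmt.Println(expressionFor([]string{"ChatGLM-Spider"}))
	fmt.Println(expressionFor([]string{"GPTBot", "CCBot"}))
}
```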

Reviewed Changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

  • docs/docs/CHANGELOG.md: Documents the bug fix for proper grouping of consecutive user agents
  • cmd/robots2policy/testdata/simple.json: Reorders JSON fields (action moved to end)
  • cmd/robots2policy/testdata/consecutive.yaml: New expected-output file showing proper any: grouping for multiple user agents
  • cmd/robots2policy/testdata/consecutive.robots.txt: New test input file with multiple consecutive user agent scenarios
  • cmd/robots2policy/robots2policy_test.go: Adds a test case for the consecutive user agents functionality
  • cmd/robots2policy/main.go: Core implementation changes to properly accumulate and group multiple user agents

@JasonLovesDoggo JasonLovesDoggo changed the title from "fix: handle multiple user agents" to "fix(robots2policy): handle multiple user agents under one block" on Sep 7, 2025
@JasonLovesDoggo JasonLovesDoggo merged commit 82099d9 into main Sep 7, 2025
13 checks passed

Development

Successfully merging this pull request may close these issues.

bug(robots2policy): Handle multiple User-agents
