hubspotdev/devrel-labs-llms-generator

HubSpot DevRel Labs LLMs.txt File Generator

This repo houses the code for the HubSpot DevRel Labs LLMs.txt File Generator, a project that generates llms.txt files from website content. It is built on the HubSpot Projects framework, so you can install and use the app in your own HubSpot account.

Requirements

There are a few things that must be set up before you can make use of this getting started project.

Usage

This repo houses both a local Node.js version and a Vercel serverless function version. The Node.js version is located in the local-node-version folder, and the Vercel serverless function is located in the vercel-serverless folder.

The repo also contains a HubSpot Projects version 2025.2 project that you can upload to your HubSpot account. It uses static auth, so it can be installed in a single portal.

The HubSpot CLI lets you run this project locally so that you can test and iterate quickly. To get started, run this HubSpot CLI command in the hubspot-project directory and follow the prompts:

hs project dev

Setup

Upload the project to your HubSpot account either using the HubSpot CLI or by using the HubSpot developer MCP server. For more information on how to use the HubSpot project, please refer to the HubSpot Project README.md file.

Then deploy the Vercel serverless function to Vercel, retrieve its API endpoint, and update the endpoint in your app settings TSX file (VERCEL_CONFIG.endpoint in LLMsGeneratorSettingsPage.tsx). For more information on how to use the Vercel serverless function, please refer to the Vercel Serverless Function README.md file.

To run the local version, navigate to the local-node-version folder and run the following commands:

npm run build
node dist/index.js <url-to-process> <optional-url-limit>

The output is written to a .txt file in the output directory. Each run creates a new file rather than overwriting the previous one, using the naming convention llms-1.txt, llms-2.txt, and so on. For more information on how to use the local version, please refer to the Local Node Version README.md file.
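
The numbered naming convention can be sketched as follows. This is a minimal illustration, not the repo's actual code: the helper name nextOutputFileName is hypothetical, and it assumes the next number is one higher than the highest existing llms-N.txt file.

```typescript
// Hypothetical helper (not taken from the repo): picks the next "llms-N.txt"
// name given the file names already present in the output directory.
function nextOutputFileName(existingFiles: string[]): string {
  const pattern = /^llms-(\d+)\.txt$/;
  let highest = 0;
  for (const name of existingFiles) {
    const match = pattern.exec(name);
    if (match) {
      highest = Math.max(highest, parseInt(match[1], 10));
    }
  }
  // Assumption: numbering starts at 1 and continues from the highest number found.
  return `llms-${highest + 1}.txt`;
}

// Two earlier runs exist, so this prints "llms-3.txt".
console.log(nextOutputFileName(["llms-1.txt", "llms-2.txt", "notes.md"]));
```

In the real script the list of existing files would come from reading the output directory.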

Architecture: How the HubSpot App Settings Page Frontend Connects to the Vercel Serverless Function Backend

This project uses a distributed architecture where a HubSpot UI Extension communicates with a Vercel serverless function to process web content and generate llms.txt files.

┌───────────────────────────────────────────────────────────────────┐ 
│                         HubSpot Account                           │
│  ┌────────────────────────────────────────────────────────────┐   │
│  │  UI Extension (App Settings Page)                          │   │
│  │  - User inputs sitemap/single URL                          │   │
│  │  - Selects processing mode (sitemap/single URL)            │   │
│  │  - Sets URL limit (optional)                               │   │
│  └──────────────────────┬─────────────────────────────────────┘   │
│                         │                                         │
│                         │ 1. hubspot.fetch() POST request         │
│                         │    {mode, url, urlLimit, portalId}      │
│                         ▼                                         │
│              ┌─────────────────────────┐                          │
│              │  Vercel Serverless API  │                          │
│              │  /api/sitemap-processor │                          │
│              └──────────┬──────────────┘                          │
│                         │                                         │
│                         │ 2. Process Request                      │
│                         │    - Validate payload                   │
│                         │    - Fetch sitemap/URL                  │
│                         ▼    - Extract content                    │
│              ┌─────────────────────────┐                          │
│              │  Content Processing     │                          │
│              │  - Fetch URLs           │                          │
│              │  - Parse HTML (cheerio) │                          │
│              │  - Extract text content │                          │
│              └──────────┬──────────────┘                          │
│                         │                                         │
│                         │ 3. Upload to HubSpot                    │
│                         │    POST `/files/v3/files`               │
│                         ▼    (Bearer Token Auth)                  │
│              ┌─────────────────────────┐                          │
│              │  HubSpot Files API      │                          │
│              │  - Creates llms-*.txt   │                          │
│              │  - Stores in /llm-gen.. │                          │
│              └──────────┬──────────────┘                          │
│                         │                                         │
│                         │ 4. Return Response                      │
│                         │    {success, fileUrl, ...}              │
│                         ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐   │
│  │  UI Extension - Display Results                            │   │
│  │  - Shows success/error alert                               │   │
│  │  - Displays link to file in File Manager                   │   │
│  │  - Shows processing stats (URLs processed, time elapsed)   │   │
│  └────────────────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────────┘

Data Flow Details

1. User Interaction (HubSpot UI Extension)

  • Component: LLMsGeneratorSettingsPage.tsx
  • User selects mode: sitemap or single_url
  • Enters target URL and optional URL limit
  • Clicks "Generate LLM File" button

2. API Request

  • Uses hubspot.fetch() to make CORS-compliant POST request
  • Endpoint: Configured in VERCEL_CONFIG.endpoint
  • Payload includes: {mode, url, urlLimit, portalId}
  • Portal ID automatically retrieved from context.portal.id
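
As a rough sketch of the request step, the payload could be assembled and validated like this before it is handed to hubspot.fetch(). The field names and mode values come from the list above; the GeneratorRequest type, the buildRequest helper, and the validation rules are assumptions made for illustration.

```typescript
// Mode values and field names follow the data flow description;
// the types and the helper below are illustrative, not the repo's code.
type ProcessingMode = "sitemap" | "single_url";

interface GeneratorRequest {
  mode: ProcessingMode;
  url: string;
  urlLimit?: number;
  portalId: number;
}

// Hypothetical helper: validates user input and builds the POST body.
function buildRequest(
  mode: ProcessingMode,
  url: string,
  portalId: number,
  urlLimit?: number
): GeneratorRequest {
  const trimmed = url.trim();
  if (!/^https?:\/\//i.test(trimmed)) {
    throw new Error("URL must start with http:// or https://");
  }
  const request: GeneratorRequest = { mode, url: trimmed, portalId };
  if (urlLimit !== undefined) {
    // Assumption: the limit, when given, must be a positive whole number.
    if (!(urlLimit > 0) || Math.floor(urlLimit) !== urlLimit) {
      throw new Error("URL limit must be a positive whole number");
    }
    request.urlLimit = urlLimit;
  }
  return request;
}

// Inside the UI extension, the object would then be sent roughly like this
// (shown as a comment because hubspot.fetch is only available in that runtime):
//   await hubspot.fetch(VERCEL_CONFIG.endpoint, { method: "POST", body: request });
console.log(buildRequest("sitemap", "https://example.com/sitemap.xml", 12345, 25));
```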

3. Vercel Serverless Processing

  • Function: /api/sitemap-processor.ts
  • Validates request and extracts parameters
  • Fetches sitemap XML or single URL
  • Iterates through URLs extracting text content
  • Implements rate limiting and timeout protection (14 min max)
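
The processing step can be sketched as below, assuming a regex-based read of the sitemap's loc entries and a simple elapsed-time check for the 14-minute cap. The function names are hypothetical, the loop is synchronous to keep the sketch short (the real function would await network calls), and rate limiting is not shown.

```typescript
// Hypothetical helper: pulls <loc> entries out of sitemap XML.
// A regex keeps the sketch short; it is not a full XML parser.
function extractSitemapUrls(xml: string, urlLimit?: number): string[] {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    if (urlLimit !== undefined && urls.length >= urlLimit) break;
    urls.push(match[1]);
  }
  return urls;
}

// The 14-minute cap from the list above, in milliseconds.
const MAX_RUNTIME_MS = 14 * 60 * 1000;

// Processes URLs until the list is exhausted or the time budget is used up.
// The clock is injectable so the cut-off can be exercised without waiting.
function processWithinBudget(
  urls: string[],
  processOne: (url: string) => void,
  now: () => number = Date.now
): number {
  const startedAt = now();
  let processed = 0;
  for (const url of urls) {
    if (now() - startedAt >= MAX_RUNTIME_MS) break; // stop before the platform limit
    processOne(url);
    processed++;
  }
  return processed;
}

const sampleXml =
  "<urlset><url><loc>https://example.com/a</loc></url>" +
  "<url><loc>https://example.com/b</loc></url></urlset>";
console.log(extractSitemapUrls(sampleXml, 1)); // only the first URL, because of the limit
```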

4. Content Extraction

  • Removes scripts, styles, and non-content elements using regex
  • Extracts clean text from main content areas using regex
  • Formats with page headers and metadata using regex
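
A minimal sketch of regex-based extraction along these lines is shown below. The patterns, the function names, and the page header layout are assumptions for illustration; the project's actual expressions and output format may differ.

```typescript
// Hypothetical helper: strips non-content markup and returns plain text.
function extractText(html: string): string {
  return html
    // Drop script, style, and noscript blocks together with their contents.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Drop all remaining tags, keeping the text between them.
    .replace(/<[^>]+>/g, " ")
    // Decode two common entities (a real implementation would handle more).
    .replace(/&nbsp;/gi, " ")
    .replace(/&amp;/gi, "&")
    // Collapse runs of whitespace.
    .replace(/\s+/g, " ")
    .trim();
}

// Assumed header layout: a title line, the source URL, then the page text.
function formatPage(url: string, title: string, text: string): string {
  return `## ${title}\nURL: ${url}\n\n${text}\n`;
}

const html =
  "<html><head><style>p{color:red}</style><script>alert(1)</script></head>" +
  "<body><h1>Hello</h1><p>World &amp; more</p></body></html>";
console.log(formatPage("https://example.com/", "Example", extractText(html)));
```

Regex stripping is fragile on malformed or deeply nested HTML, which is one reason a parser such as cheerio is often preferred for this step.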

5. HubSpot File Upload

  • Service: hubspot-uploader.ts
  • API: HubSpot Files API v3 (/files/v3/files)
  • Authentication: Bearer token from HUBSPOT_ACCESS_TOKEN env var
  • Uploads to /llm-generator folder
  • Sets file as PUBLIC_INDEXABLE
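
The upload step could look roughly like the sketch below, which builds the multipart form for the Files API and posts it with a Bearer token. The function names are hypothetical, the https://api.hubapi.com base URL and the url field read from the response are assumptions not stated above, and the code relies on the global fetch, FormData, and Blob available in Node.js 18 and later.

```typescript
// Hypothetical helper: assembles the multipart form for POST /files/v3/files.
function buildUploadForm(fileName: string, content: string): FormData {
  const form = new FormData();
  form.append("file", new Blob([content], { type: "text/plain" }), fileName);
  form.append("fileName", fileName);
  form.append("folderPath", "/llm-generator");
  // The Files API takes upload options as a JSON string.
  form.append("options", JSON.stringify({ access: "PUBLIC_INDEXABLE" }));
  return form;
}

// Sketch of the upload call. The access token would come from the
// HUBSPOT_ACCESS_TOKEN environment variable in the serverless function.
async function uploadToHubSpot(
  fileName: string,
  content: string,
  accessToken: string
): Promise<string | undefined> {
  // Assumption: the standard HubSpot API host.
  const response = await fetch("https://api.hubapi.com/files/v3/files", {
    method: "POST",
    headers: { Authorization: `Bearer ${accessToken}` },
    body: buildUploadForm(fileName, content),
  });
  if (!response.ok) {
    throw new Error(`HubSpot upload failed with status ${response.status}`);
  }
  // Assumption: the created file object exposes its public URL as "url".
  const file = (await response.json()) as { url?: string };
  return file.url;
}

const form = buildUploadForm("llms-1.txt", "# Example Site\n");
console.log(form.get("folderPath"), form.get("options"));
```

The Content-Type header is left unset on purpose so that fetch can add the multipart boundary itself.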

6. Response & Display

  • Returns JSON with: {success, fileUrl, processedUrls, totalUrls, siteName, timeElapsed}
  • UI displays success alert with clickable link
  • Link opens File Manager directly to the generated file
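
For illustration, the response could be typed and turned into an alert message as sketched here. The field names come from the list above; the field types, the unit of timeElapsed, and the describeResult helper are assumptions.

```typescript
// Field names follow the response described above; the types are assumed.
interface GeneratorResponse {
  success: boolean;
  fileUrl?: string;
  processedUrls?: number;
  totalUrls?: number;
  siteName?: string;
  timeElapsed?: number; // assumption: seconds
}

// Hypothetical helper: builds the text shown in the success or error alert.
function describeResult(result: GeneratorResponse): string {
  if (!result.success || !result.fileUrl) {
    return "Generation failed. Check the URL and try again.";
  }
  const site = result.siteName ?? "site";
  const processed = result.processedUrls ?? 0;
  const total = result.totalUrls ?? 0;
  const seconds = result.timeElapsed ?? 0;
  return `Generated llms.txt for ${site}: ${processed} of ${total} URLs processed in ${seconds}s.`;
}

console.log(
  describeResult({
    success: true,
    fileUrl: "https://example.com/llms-1.txt",
    processedUrls: 8,
    totalUrls: 10,
    siteName: "Example",
    timeElapsed: 12,
  })
);
```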

Key Configuration Requirements

  • HubSpot Project: Must update VERCEL_CONFIG.endpoint in LLMsGeneratorSettingsPage.tsx
  • Vercel Function: Must set HUBSPOT_ACCESS_TOKEN environment variable
  • CORS: Vercel function configured to allow https://app.hubspot.com origin
  • Timeouts: The function's 14-minute processing cap keeps it within Vercel's 15-minute serverless limit
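
As one possible shape for the CORS requirement, the function could compute its response headers from the request origin as sketched below. The helper name and the exact set of allowed methods and headers are assumptions; only the allowed origin comes from the list above.

```typescript
// The only origin allowed, per the configuration requirements above.
const ALLOWED_ORIGIN = "https://app.hubspot.com";

// Hypothetical helper: returns CORS headers for an allowed origin,
// or no headers at all so the browser blocks any other origin.
function corsHeaders(requestOrigin: string | undefined): Record<string, string> {
  if (requestOrigin !== ALLOWED_ORIGIN) {
    return {};
  }
  return {
    "Access-Control-Allow-Origin": ALLOWED_ORIGIN,
    // Assumption: the endpoint only needs POST plus the preflight OPTIONS.
    "Access-Control-Allow-Methods": "POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
    Vary: "Origin",
  };
}

console.log(corsHeaders("https://app.hubspot.com"));
```

In a serverless handler these headers would be set on every response, with OPTIONS preflight requests answered before any processing starts.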

Using GitHub Actions

This project uses GitHub Actions to deploy automatically to your HubSpot account whenever you push to the main branch. The workflow is defined in .github/workflows/main.yml.

In your GitHub repository, create two secrets:

  • HUBSPOT_ACCOUNT_ID: the ID of your HubSpot account.
  • HUBSPOT_PERSONAL_ACCESS_KEY: your personal access key.

Dependencies

This project relies on a serverless function that processes the sitemap or URL, parses the content, and generates the llms.txt file. The function then uses the HubSpot Files API to upload the file to the HubSpot File Manager. To call the function, the app uses hubspot.fetch to make a POST request to the function's API endpoint.

The local version runs on Node.js. It uses the axios library to fetch each URL's content and the cheerio library to parse it. It also uses the built-in Node.js modules fs (to write the content to a .txt file), path (to create the output directory), and url (to normalize URLs), along with the __dirname variable to get the current directory.

Why Use a Third-party Backend Service?

This project uses a third-party backend service for several strategic advantages:

Extended Processing Time

  • Better processing: Depending on the backend service you choose, your processing time can be extended, so users can generate comprehensive llms.txt files from entire website sitemaps in a single operation

Independent Scaling & Performance

  • Resource Isolation: The backend scales independently from your HubSpot account, preventing heavy processing from impacting other HubSpot operations
  • Optimized Environment: Third-party platforms are specifically designed for compute-intensive tasks like web scraping and content parsing
  • Concurrent Processing: Handle multiple requests simultaneously without affecting HubSpot app responsiveness

Full Node.js Ecosystem Access

  • Library Freedom: Use any npm package without HubSpot's runtime restrictions (e.g., cheerio, axios, advanced XML parsers)
  • Native APIs: Direct access to Node.js APIs and filesystem operations during processing
  • Flexibility: Easily integrate with other services, databases, or APIs as needed

Simplified Development & Deployment

  • Independent Versioning: Update and deploy backend logic without re-uploading your entire HubSpot Project
  • Environment Variables: Securely manage sensitive credentials (like HUBSPOT_ACCESS_TOKEN) using your backend's environment variable system
  • Easy Testing: Test backend logic locally or in staging environments before affecting production HubSpot apps
  • CI/CD Integration: Leverage your backend's automatic deployments from Git repositories

Cost Optimization

  • Pay-per-Use: Only pay for compute resources when processing requests, rather than maintaining always-on infrastructure
  • Reduced HubSpot Limits: Offload processing to avoid hitting HubSpot's API rate limits or execution quotas

Future Extensibility

  • Provider Flexibility: The same architecture pattern works with AWS Lambda, Google Cloud Functions, or any serverless platform
  • Feature Growth: Easily add new capabilities like PDF generation, image processing, or database storage without HubSpot constraints
  • Multi-Cloud: Deploy to multiple providers for redundancy or geographic distribution

When to Use This Pattern

This third-party backend pattern is ideal for HubSpot Projects that require:

  • Long-running operations (>10 seconds)
  • Heavy data processing or web scraping
  • Access to specialized npm packages
  • Independent scaling from HubSpot
  • Frequent backend updates without HubSpot redeployment

Contributions

This project is made with ❤️ by the HubSpot DevRel Team and is part of the HubSpot DevRel Labs initiative.

To contribute to this project, please create a new branch and submit a pull request. A member of the HubSpot DevRel Team will review the pull request and merge it into the main branch.
