wrap braintrust to get llm usage data #637

kamath · 2025-04-05T02:21:27Z

why

we now have LLM metrics in braintrust!

what changed

Edited Stagehand Default Error to forward the stack trace/message as well
Moved LLM Client evals into separate directory (need to re-add on CI)
Made a unit CI test for testing core features in act
Refactored Evals to use Braintrust AI proxy and thereby support pretty much any LLM by name
Used Gemini AI SDK for Gemini models since their OpenAI proxy is super finicky with structured outputs
Wrapped LLM Client in Braintrust to get LLM metrics data
^ in order to do that I had to edit all evals to take in a Stagehand object instead of calling initStagehand in each eval
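
The wrapping idea, in isolation, is a thin decorator around the client's completion call that records token usage for every request. The sketch below is a generic illustration and not Braintrust's API; `LLMClientLike`, `withUsageMetrics`, and the `usage` field names are assumptions made for the example.

```typescript
// Hypothetical shapes for illustration only; not Stagehand's or Braintrust's types.
interface Completion {
  text: string;
  usage: { promptTokens: number; completionTokens: number };
}

interface LLMClientLike {
  complete(prompt: string): Promise<Completion>;
}

interface UsageRecord {
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
}

// Returns a client with the same interface that reports usage after each call.
function withUsageMetrics(
  client: LLMClientLike,
  record: (entry: UsageRecord) => void,
): LLMClientLike {
  return {
    async complete(prompt: string): Promise<Completion> {
      const startedAt = Date.now();
      const completion = await client.complete(prompt);
      // Report usage only after the underlying call has succeeded.
      record({
        promptTokens: completion.usage.promptTokens,
        completionTokens: completion.usage.completionTokens,
        latencyMs: Date.now() - startedAt,
      });
      return completion;
    },
  };
}
```

Because the wrapper keeps the client's interface, it can be built once by the eval runner and handed to whatever consumes the client, which is consistent with moving initialization out of the individual evals.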

test plan

this is it

changeset-bot · 2025-04-05T02:21:30Z

🦋 Changeset detected

Latest commit: 142a8af

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@browserbasehq/stagehand	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

kamath · 2025-04-05T02:22:09Z

types/stagehandErrors.ts

+  constructor(error?: unknown) {
+    if (error instanceof Error || error instanceof StagehandError) {
+      super(
+        `\nHey! We're sorry you ran into an error. \nIf you need help, please open a Github issue or reach out to us on Slack: https://stagehand.dev/slack\n\nFull error:\n${error.message}`,


helps us see the stacktrace when StagehandDefaultError is thrown

kamath · 2025-04-05T02:22:31Z

stagehand.config.ts

@@ -3,7 +3,7 @@ import dotenv from "dotenv";
 dotenv.config();

 const StagehandConfig: ConstructorParams = {
-  verbose: 1 /* Verbosity level for logging: 0 = silent, 1 = info, 2 = all */,
+  verbose: 2 /* Verbosity level for logging: 0 = silent, 1 = info, 2 = all */,


can change back but nice to have

kamath · 2025-04-06T03:47:05Z

.changeset/stupid-ghosts-smash.md

+"@browserbasehq/stagehand": patch
+---
+
+Fix: forward along the stack trace in StagehandDefaultError


now it's throwing the error multiple times, need a fast follow PR to only throw StagehandDefaultError once

kamath · 2025-04-06T03:47:47Z

evals/llm_clients/hn_aisdk.ts

@@ -1,19 +1,22 @@
+import { Stagehand } from "@/dist";


Currently not running these in CI. Will add in a fast-follow PR

what is the run command for these? what is the benefit in taking them out of the tasks directory? Also make sure you remove the step from CI, otherwise it will keep failing

kamath · 2025-04-06T03:49:44Z

evals/args.ts

@@ -66,6 +66,8 @@ const DEFAULT_EVAL_CATEGORIES = process.env.EVAL_CATEGORIES
      "regression_llm_providers",
      "regression_text_extract",
      "regression_dom_extract",
+      "llm_clients",
+      "unit",


currently running neither in CI, will add in a fast-follow PR
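
For context, the hunk above extends the fallback list that applies when `EVAL_CATEGORIES` is not set. A minimal sketch of that fallback, assuming the environment variable holds a comma-separated list (the parsing is not visible in the diff); `resolveCategories` is a hypothetical helper, and only the entries visible in the hunk are listed:

```typescript
// Only the categories visible in the diff hunk; the real default list is longer.
const DEFAULT_CATEGORIES: readonly string[] = [
  "regression_llm_providers",
  "regression_text_extract",
  "regression_dom_extract",
  "llm_clients",
  "unit",
];

// Hypothetical helper: assumes a comma-separated value such as "unit,llm_clients".
function resolveCategories(envValue: string | undefined): string[] {
  if (!envValue || envValue.trim() === "") {
    // Unset or blank: fall back to a copy of the defaults.
    return DEFAULT_CATEGORIES.slice();
  }
  return envValue
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}
```

A caller would pass `process.env.EVAL_CATEGORIES` as the argument. Since the new categories are in the defaults but not yet in CI, a CI job would have to name them explicitly to run them.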

evals/evals.config.json

greptile-apps

PR Summary

This PR centralizes Stagehand initialization to support Braintrust LLM metrics and standardizes error logging across evaluations.

/evals/evals.config.json: Reassigned evaluation categories and removed obsolete tasks.
/evals/initStagehand.ts: Removed modelName support; now requires a pre-initialized llmClient.
/evals/index.eval.ts: Wrapped LLM client initialization with Braintrust proxy and unified error forwarding.
/evals/tasks/*: All tasks now accept an externally provided stagehand instance (plus debugUrl/sessionUrl), eliminating internal initStagehand calls.
/evals/logger.ts & /lib/StagehandPage.ts: Improved error logging with full stack trace and message forwarding.
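
The task-signature change summarized above can be sketched as follows. This is a simplified illustration under stated assumptions: `StagehandLike`, `TaskInput`, `TaskResult`, `runTask`, and the `success` field are hypothetical stand-ins, not the repository's actual types.

```typescript
// Hypothetical minimal shapes; the real Stagehand class and eval types are richer.
interface StagehandLike {
  act(instruction: string): Promise<void>;
  close(): Promise<void>;
}

interface TaskInput {
  stagehand: StagehandLike;
  debugUrl: string;
  sessionUrl: string;
}

interface TaskResult {
  success: boolean;
  debugUrl: string;
  sessionUrl: string;
}

type EvalTask = (input: TaskInput) => Promise<TaskResult>;

// A task no longer creates its own instance; it only uses what it is given.
const pressEnter: EvalTask = async ({ stagehand, debugUrl, sessionUrl }) => {
  await stagehand.act("press enter");
  return { success: true, debugUrl, sessionUrl };
};

// The runner owns setup and teardown, so a wrapped LLM client can be attached once.
async function runTask(
  task: EvalTask,
  init: () => Promise<TaskInput>,
): Promise<TaskResult> {
  const input = await init();
  try {
    return await task(input);
  } finally {
    await input.stagehand.close();
  }
}
```

Keeping initialization in the runner means every task sees the same instrumented client without each task file having to know about the instrumentation.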

88 file(s) reviewed, 1 comment(s)

greptile-apps · 2025-04-06T04:01:48Z

types/stagehandErrors.ts

+  constructor(error?: unknown) {
+    if (error instanceof Error || error instanceof StagehandError) {
+      super(
+        `\nHey! We're sorry you ran into an error. \nIf you need help, please open a Github issue or reach out to us on Slack: https://stagehand.dev/slack\n\nFull error:\n${error.message}`,
+      );
+    }
  }


logic: Ensure the constructor always calls super, even when error is undefined or not an instance of Error.

Suggested change

-  constructor(error?: unknown) {
-    if (error instanceof Error || error instanceof StagehandError) {
-      super(
-        `\nHey! We're sorry you ran into an error. \nIf you need help, please open a Github issue or reach out to us on Slack: https://stagehand.dev/slack\n\nFull error:\n${error.message}`,
-      );
-    }
-  }
+  constructor(error?: unknown) {
+    if (error instanceof Error || error instanceof StagehandError) {
+      super(
+        `\nHey! We're sorry you ran into an error. \nIf you need help, please open a Github issue or reach out to us on Slack: https://stagehand.dev/slack\n\nFull error:\n${error.message}`,
+      );
+    } else {
+      super('An unknown error occurred. If you need help, please open a Github issue or reach out to us on Slack: https://stagehand.dev/slack');
+    }
+  }
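
An alternative to branching around `super` is to compute the message first and call `super` exactly once. A minimal sketch, assuming `StagehandError` is a plain `Error` subclass (its real definition is not shown in this thread); `buildMessage` and `HELP_TEXT` are hypothetical names introduced for the example:

```typescript
// Assumed stand-in for the real base class, which is not shown in this thread.
class StagehandError extends Error {
  constructor(message: string) {
    super(message);
  }
}

const HELP_TEXT =
  "If you need help, please open a Github issue or reach out to us on Slack: https://stagehand.dev/slack";

// Hypothetical helper: always returns a string, whatever value was thrown.
function buildMessage(error?: unknown): string {
  if (error instanceof Error) {
    return `\nHey! We're sorry you ran into an error. \n${HELP_TEXT}\n\nFull error:\n${error.message}`;
  }
  return `An unknown error occurred. ${HELP_TEXT}`;
}

class StagehandDefaultError extends StagehandError {
  constructor(error?: unknown) {
    // One unconditional super call, so construction cannot skip initialization.
    super(buildMessage(error));
  }
}
```

If `StagehandError` does extend `Error`, as assumed here, the second `instanceof` check in the original condition is redundant, which is why the sketch tests only `error instanceof Error`.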

seanmcguire12 · 2025-04-07T17:41:20Z

evals/initStagehand.ts

+  sessionUrl: string;
+  useTextExtract: boolean;
+  stagehandConfig: ConstructorParams;
+};


Move StagehandInitResult to types/evals.ts

evals/taskConfig.ts

evals/tasks/allrecipes.ts

seanmcguire12 · 2025-04-07T17:52:18Z

evals/tasks/expect_act_timeout_global.ts

why delete?

Co-authored-by: Sean McGuire <[email protected]>

kamath commented Apr 5, 2025

kamath force-pushed the anirudh/wrap-braintrust branch from cd6e068 to 3bc4a7c Compare April 5, 2025 05:53

kamath changed the title from "[wip] wrap braintrust to get llm usage data" to "wrap braintrust to get llm usage data" Apr 6, 2025

kamath commented Apr 6, 2025

kamath marked this pull request as ready for review April 6, 2025 03:52

greptile-apps bot reviewed Apr 6, 2025

seanmcguire12 reviewed Apr 7, 2025

evals/taskConfig.ts (outdated, resolved)

seanmcguire12 reviewed Apr 7, 2025

evals/tasks/allrecipes.ts (outdated, resolved)

seanmcguire12 reviewed Apr 7, 2025

evals/tasks/expect_act_timeout_global.ts (outdated)

kamath and others added 12 commits April 10, 2025 09:29

a400a7d temp
4270b1a temp
2dfabd7 custom openai
b900711 unit and evals
be5449a changeset
70ed067 fix evals config
1a3d739 fix evals config
a1e6320 Update evals/taskConfig.ts (Co-authored-by: Sean McGuire <[email protected]>)
367d9d8 rebase
7da5f54 temp
9a23a75 all eval tasks
78a21b1 address comments and remove hn from ci

kamath force-pushed the anirudh/wrap-braintrust branch from 2801646 to 78a21b1 Compare April 10, 2025 17:21

kamath and others added 6 commits April 10, 2025 11:14

e2c0df3 press enter
879d3e9 dont use braintrust ai proxy
7e71bf0 fix amazon eval
0b09e13 remove WRITE_FILE check
e67606e revert amazon to act category
5cab2d5 unify regression evals

seanmcguire12 added 4 commits April 10, 2025 14:20

591b64d update CI
d25a6cb fix job naming
6f8ecb7 wrap in try catch
f5aef1c fix yml

seanmcguire12 added 2 commits April 10, 2025 17:47

426116d update other amazon eval
993b8ca vanta_h experimental

seanmcguire12 added the text-extract label Apr 11, 2025

c7781e8 Merge branch 'main' into anirudh/wrap-braintrust

seanmcguire12 self-requested a review April 11, 2025 01:39

seanmcguire12 approved these changes Apr 11, 2025

142a8af add text_extract eval category to CI

seanmcguire12 added the targeted-extract label ("These changes pertain to targeted extract") Apr 11, 2025

kamath merged commit 944bbbf into main Apr 11, 2025
14 of 26 checks passed

github-actions bot mentioned this pull request Apr 11, 2025

Version Packages #632

Merged
