autoevals

AutoEvals is a tool to quickly and easily evaluate AI model outputs.

Quickstart

npm install autoevals

Example

Use AutoEvals to model-grade an example LLM completion using the factuality prompt.

import { Factuality } from "autoevals";

(async () => {
  // The question posed to the model, the model's answer, and the reference answer.
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  // Factuality is model-graded: the scorer asks an LLM to compare output with expected.
  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();

Functions

AnswerCorrectness

AnswerCorrectness(args): Score | Promise<Score>

Measures answer correctness compared to ground truth using a weighted average of factuality and semantic similarity.

Parameters

Name  Type
args  ScorerArgs<string, { context?: string | string[] ; input?: string ; model?: string } & { maxTokens?: number ; temperature?: number } & OpenAIAuth & { answerSimilarity?: Scorer<string, {}> ; answerSimilarityWeight?: number ; factualityWeight?: number }>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
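
The weighted average that AnswerCorrectness describes can be illustrated with a minimal sketch. The helper `weightedAverage` is hypothetical and not part of autoevals; it only borrows the parameter names `factualityWeight` and `answerSimilarityWeight` from the signature above, and the weights in the example are illustrative, not documented defaults.

```typescript
// Hypothetical helper (not the library's implementation): combines two
// sub-scores, each assumed to lie in [0, 1], into one weighted average.
function weightedAverage(
  factuality: number,
  similarity: number,
  factualityWeight: number,
  answerSimilarityWeight: number,
): number {
  const totalWeight = factualityWeight + answerSimilarityWeight;
  if (totalWeight <= 0) {
    throw new Error("The weights must sum to a positive number");
  }
  return (
    (factualityWeight * factuality + answerSimilarityWeight * similarity) /
    totalWeight
  );
}

// Illustrative weights only: 0.75 for factuality, 0.25 for similarity.
const combined = weightedAverage(1.0, 0.5, 0.75, 0.25);
console.log(combined); // 0.875
```

Dividing by the total weight keeps the result in the same range as the inputs even when the weights do not sum to 1.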


AnswerRelevancy

AnswerRelevancy(args): Score | Promise<Score>

Scores the relevancy of the generated answer to the given question. Answers with incomplete, redundant or unnecessary information are penalized.

Parameters

Name  Type
args  ScorerArgs<string, { context?: string | string[] ; input?: string ; model?: string } & { maxTokens?: number ; temperature?: number } & OpenAIAuth & { strictness?: number } & RagasEmbeddingModelArgs>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


AnswerSimilarity

AnswerSimilarity(args): Score | Promise<Score>

Scores the semantic similarity between the generated answer and ground truth.

Parameters

Name  Type
args  ScorerArgs<string, RagasArgs>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


Battle

Battle(args): Score | Promise<Score>

Test whether an output follows the instructions better than the original (expected) value.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{ instructions: string }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


ClosedQA

ClosedQA(args): Score | Promise<Score>

Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{ criteria: any ; input: string }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


ContextEntityRecall

ContextEntityRecall(args): Score | Promise<Score>

Estimates context recall by counting true positives (TP) and false negatives (FN), using the annotated answer and the retrieved context.

Parameters

Name  Type
args  ScorerArgs<string, { context?: string | string[] ; input?: string ; model?: string } & { maxTokens?: number ; temperature?: number } & OpenAIAuth & { pairwiseScorer?: Scorer<string, {}> }>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


ContextPrecision

ContextPrecision(args): Score | Promise<Score>

Parameters

Name  Type
args  ScorerArgs<string, RagasArgs>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


ContextRecall

ContextRecall(args): Score | Promise<Score>

Parameters

Name  Type
args  ScorerArgs<string, RagasArgs>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


ContextRelevancy

ContextRelevancy(args): Score | Promise<Score>

Parameters

Name  Type
args  ScorerArgs<string, RagasArgs>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


EmbeddingSimilarity

EmbeddingSimilarity(args): Score | Promise<Score>

A scorer that uses cosine similarity to compare two strings.

Parameters

Name  Type
args  ScorerArgs<string, { expectedMin?: number ; model?: string ; prefix?: string } & OpenAIAuth>

Returns

Score | Promise<Score>

A score between 0 and 1, where 1 is a perfect match.

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
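
For readers unfamiliar with cosine similarity, the sketch below computes it for two already-embedded vectors. It is an illustration only: `cosineSimilarity` is a hypothetical helper, the vectors are made up, and the step that turns strings into embeddings (and any rescaling the library applies, for example via `expectedMin`) is not shown.

```typescript
// Hypothetical helper: cosine similarity of two equal-length vectors.
// Returns 0 when either vector has zero length, to avoid dividing by zero.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up "embeddings" for illustration.
console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal)
```

Raw cosine similarity ranges from -1 to 1, so a scorer that reports values between 0 and 1 has to map or clamp the result in some way.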


ExactMatch

ExactMatch(args): Score | Promise<Score>

A simple scorer that tests whether two values are equal. If the value is an object or array, it will be JSON-serialized and the strings compared for equality.

Parameters

Name  Type
args  Object

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
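
The comparison ExactMatch describes (serialize objects and arrays to JSON, then compare strings) can be sketched as follows. Both function names are hypothetical and this is not the library's code; in particular, how the library treats key order or non-string primitives is an assumption here.

```typescript
// Hypothetical helper: objects and arrays become JSON text, everything
// else is converted with String().
function serializeForComparison(value: unknown): string {
  return typeof value === "object" && value !== null
    ? JSON.stringify(value)
    : String(value);
}

// Returns 1 when both values serialize to the same string, otherwise 0.
function exactMatchScore(output: unknown, expected: unknown): number {
  return serializeForComparison(output) === serializeForComparison(expected)
    ? 1
    : 0;
}

console.log(exactMatchScore({ a: 1 }, { a: 1 })); // 1
console.log(exactMatchScore("yes", "no")); // 0
```

In this sketch, two objects with the same keys in a different order serialize differently and therefore do not match.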


Factuality

Factuality(args): Score | Promise<Score>

Test whether an output is factual, compared to an original (expected) value.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{ expected?: string ; input: string ; output: string }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


Faithfulness

Faithfulness(args): Score | Promise<Score>

Measures factual consistency of the generated answer with the given context.

Parameters

Name  Type
args  ScorerArgs<string, RagasArgs>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


Humor

Humor(args): Score | Promise<Score>

Test whether an output is funny.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{}>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


JSONDiff

JSONDiff(args): Score | Promise<Score>

A simple scorer that compares JSON objects, using a customizable comparison method for strings (defaults to Levenshtein) and numbers (defaults to NumericDiff).

Parameters

Name  Type
args  ScorerArgs<any, { numberScorer?: Scorer<number, {}> ; stringScorer?: Scorer<string, {}> }>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


LLMClassifierFromSpec

LLMClassifierFromSpec<RenderArgs>(name, spec): Scorer<any, LLMClassifierArgs<RenderArgs>>

Type parameters

Name
RenderArgs

Parameters

Name                Type
name                string
spec                Object
spec.choice_scores  Record<string, number>
spec.model?         string
spec.prompt         string
spec.temperature?   number
spec.use_cot?       boolean

Returns

Scorer<any, LLMClassifierArgs<RenderArgs>>

Defined in

autoevals/js/llm.ts:251


LLMClassifierFromSpecFile

LLMClassifierFromSpecFile<RenderArgs>(name, templateName): Scorer<any, LLMClassifierArgs<RenderArgs>>

Type parameters

Name
RenderArgs

Parameters

Name          Type
name          string
templateName  "battle" | "closed_q_a" | "factuality" | "humor" | "possible" | "security" | "sql" | "summary" | "translation"

Returns

Scorer<any, LLMClassifierArgs<RenderArgs>>

Defined in

autoevals/js/llm.ts:265


LLMClassifierFromTemplate

LLMClassifierFromTemplate<RenderArgs>(«destructured»): Scorer<string, LLMClassifierArgs<RenderArgs>>

Type parameters

Name
RenderArgs

Parameters

Name              Type
«destructured»    Object
› choiceScores    Record<string, number>
› model?          string
› name            string
› promptTemplate  string
› temperature?    number
› useCoT?         boolean

Returns

Scorer<string, LLMClassifierArgs<RenderArgs>>

Defined in

autoevals/js/llm.ts:195


Levenshtein

Levenshtein(args): Score | Promise<Score>

A simple scorer that uses the Levenshtein distance to compare two strings.

Parameters

Name  Type
args  Object

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
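
As a rough illustration of a Levenshtein-based scorer, the sketch below computes the edit distance with the standard dynamic-programming recurrence and turns it into a score. Dividing by the longer string's length is an assumption made for this example, and both function names are hypothetical.

```typescript
// Classic edit distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function levenshteinDistance(a: string, b: string): number {
  // prev[j] is the distance between the processed prefix of `a`
  // and the first j characters of `b`.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 for identical strings, 0 when every character differs.
function levenshteinScore(output: string, expected: string): number {
  const maxLength = Math.max(output.length, expected.length);
  if (maxLength === 0) {
    return 1; // two empty strings are identical
  }
  return 1 - levenshteinDistance(output, expected) / maxLength;
}

console.log(levenshteinDistance("kitten", "sitting")); // 3
console.log(levenshteinScore("abc", "abc")); // 1
```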


LevenshteinScorer

LevenshteinScorer(args): Score | Promise<Score>

Parameters

Name  Type
args  Object

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


ListContains

ListContains(args): Score | Promise<Score>

A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.

Parameters

Name  Type
args  ScorerArgs<string[], { allowExtraEntities?: boolean ; pairwiseScorer?: Scorer<string, {}> }>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
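
The matching idea behind ListContains can be sketched on small lists. This example replaces Linear Sum Assignment with a brute-force search over all one-to-one pairings, which gives the same optimum but only scales to short lists. The pairwise scorer, the function names, and the normalization by the longer list's length are all assumptions for illustration.

```typescript
type PairScorer = (a: string, b: string) => number;

// Hypothetical pairwise scorer: 1 for a case-insensitive match, otherwise 0.
const sameIgnoringCase: PairScorer = (a, b) =>
  a.toLowerCase() === b.toLowerCase() ? 1 : 0;

// Tries every one-to-one pairing of output items with expected items and
// returns the highest total similarity found (exponential time).
function bestAssignmentTotal(
  output: string[],
  expected: string[],
  scorer: PairScorer,
): number {
  const used: boolean[] = new Array<boolean>(expected.length).fill(false);
  const search = (i: number): number => {
    if (i === output.length) {
      return 0;
    }
    // Option 1: leave output[i] unmatched.
    let best = search(i + 1);
    // Option 2: pair output[i] with any expected item not yet taken.
    for (let j = 0; j < expected.length; j++) {
      if (used[j]) continue;
      used[j] = true;
      best = Math.max(best, scorer(output[i], expected[j]) + search(i + 1));
      used[j] = false;
    }
    return best;
  };
  return search(0);
}

// Assumed normalization: divide by the length of the longer list.
function listOverlapScore(
  output: string[],
  expected: string[],
  scorer: PairScorer = sameIgnoringCase,
): number {
  const denominator = Math.max(output.length, expected.length);
  if (denominator === 0) {
    return 1; // two empty lists overlap perfectly
  }
  return bestAssignmentTotal(output, expected, scorer) / denominator;
}

console.log(listOverlapScore(["a", "b"], ["B", "A"])); // 1
```

Because the pairing is optimized globally, the order of items in either list does not affect the score.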


Moderation

Moderation(args): Score | Promise<Score>

A scorer that uses OpenAI's moderation API to determine whether an AI response contains any flagged content.

Parameters

Name  Type
args  ScorerArgs<string, { threshold?: number } & OpenAIAuth>

Returns

Score | Promise<Score>

A score between 0 and 1, where 1 means content passed all moderation checks.

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
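
One way a `threshold` could turn per-category moderation scores into a pass/fail result is sketched below. This is an assumption about the general approach, not the library's code: the `CategoryScores` shape, the category names, and the function `moderationScore` are hypothetical, and no API call is made.

```typescript
// Hypothetical input shape: one score in [0, 1] per moderation category.
type CategoryScores = Record<string, number>;

// Returns 0 if any category score exceeds the threshold, otherwise 1.
function moderationScore(
  categoryScores: CategoryScores,
  threshold: number,
): number {
  const flagged = Object.values(categoryScores).some(
    (score) => score > threshold,
  );
  return flagged ? 0 : 1;
}

// Made-up category scores for illustration.
console.log(moderationScore({ hate: 0.01, violence: 0.02 }, 0.5)); // 1
console.log(moderationScore({ hate: 0.9, violence: 0.02 }, 0.5)); // 0
```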


NumericDiff

NumericDiff(args): Score | Promise<Score>

A simple scorer that compares numbers by normalizing their difference.

Parameters

Name  Type
args  Object

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
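
"Normalizing the difference" can be done in several ways. The sketch below shows one plausible normalization, dividing the absolute difference by the sum of the absolute values; whether NumericDiff uses exactly this formula is an assumption, and `numericDiffScore` is a hypothetical name.

```typescript
// Assumed normalization: 1 - |a - b| / (|a| + |b|), with two zeros
// treated as a perfect match to avoid dividing by zero.
function numericDiffScore(output: number, expected: number): number {
  const scale = Math.abs(output) + Math.abs(expected);
  if (scale === 0) {
    return 1;
  }
  return 1 - Math.abs(output - expected) / scale;
}

console.log(numericDiffScore(5, 5)); // 1
console.log(numericDiffScore(1, 3)); // 0.5
console.log(numericDiffScore(1, -1)); // 0
```

With this choice the score stays between 0 and 1 for any pair of finite numbers, and it does not depend on the unit of measurement.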


OpenAIClassifier

OpenAIClassifier<RenderArgs, Output>(args): Promise<Score>

Type parameters

Name
RenderArgs
Output

Parameters

Name  Type
args  ScorerArgs<Output, OpenAIClassifierArgs<RenderArgs>>

Returns

Promise<Score>

Defined in

autoevals/js/llm.ts:84


Possible

Possible(args): Score | Promise<Score>

Test whether an output is a possible solution to the challenge posed in the input.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{ input: string }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


Security

Security(args): Score | Promise<Score>

Test whether an output is malicious.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{}>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


Sql

Sql(args): Score | Promise<Score>

Test whether a SQL query is semantically the same as a reference (expected) query.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{ input: string }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


Summary

Summary(args): Score | Promise<Score>

Test whether an output is a better summary of the input than the original (expected) value.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{ input: string }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


Translation

Translation(args): Score | Promise<Score>

Test whether an output translates the input into the specified language as well as an expert (expected) translation does.

Parameters

Name  Type
args  ScorerArgs<string, LLMClassifierArgs<{ input: string ; language: string }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21


ValidJSON

ValidJSON(args): Score | Promise<Score>

A binary scorer that evaluates the validity of JSON output, optionally validating against a JSON Schema definition (see https://json-schema.org/learn/getting-started-step-by-step#create).

Parameters

Name  Type
args  ScorerArgs<string, { schema?: any }>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
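
The binary part of a JSON validity check can be sketched with `JSON.parse` alone. This is a minimal illustration under stated assumptions: `validJsonScore` is a hypothetical name, schema validation is left out because it would need a separate JSON Schema validator, and the library may apply stricter rules about which parsed values count as valid.

```typescript
// Returns 1 when the text parses as JSON, otherwise 0.
function validJsonScore(output: string): number {
  try {
    JSON.parse(output);
    return 1;
  } catch (_err) {
    return 0;
  }
}

console.log(validJsonScore('{"name": "autoevals"}')); // 1
console.log(validJsonScore("{name: autoevals}")); // 0 (unquoted keys are not valid JSON)
```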


buildClassificationTools

buildClassificationTools(useCoT, choiceStrings): ChatCompletionTool[]

Parameters

Name           Type
useCoT         boolean
choiceStrings  string[]

Returns

ChatCompletionTool[]

Defined in

autoevals/js/llm.ts:50


makePartial

makePartial<Output, Extra>(fn, name?): ScorerWithPartial<Output, Extra>

Type parameters

Name
Output
Extra

Parameters

Name   Type
fn     Scorer<Output, Extra>
name?  string

Returns

ScorerWithPartial<Output, Extra>

Defined in

autoevals/js/partial.ts:11


normalizeValue

normalizeValue(value, maybeObject): string

Parameters

Name         Type
value        unknown
maybeObject  boolean

Returns

string

Defined in

autoevals/js/value.ts:29

Type Aliases

LLMArgs

Ƭ LLMArgs: { maxTokens?: number ; temperature?: number } & OpenAIAuth

Defined in

autoevals/js/llm.ts:19


LLMClassifierArgs

Ƭ LLMClassifierArgs<RenderArgs>: { model?: string ; useCoT?: boolean } & LLMArgs & RenderArgs

Type parameters

Name
RenderArgs

Defined in

autoevals/js/llm.ts:189


ModelGradedSpec

Ƭ ModelGradedSpec: z.infer<typeof modelGradedSpecSchema>

Defined in

autoevals/js/templates.ts:22


OpenAIClassifierArgs

Ƭ OpenAIClassifierArgs<RenderArgs>: { cache?: ChatCache ; choiceScores: Record<string, number> ; classificationTools: ChatCompletionTool[] ; messages: ChatCompletionMessageParam[] ; model: string ; name: string } & LLMArgs & RenderArgs

Type parameters

Name
RenderArgs

Defined in

autoevals/js/llm.ts:74

Variables

DEFAULT_MODEL

Const DEFAULT_MODEL: "gpt-4o"

Defined in

autoevals/js/llm.ts:24


Evaluators

Const Evaluators: { label: string ; methods: AutoevalMethod[] }[]

Defined in

autoevals/js/manifest.ts:37


modelGradedSpecSchema

Const modelGradedSpecSchema: ZodObject<{ choice_scores: ZodRecord<ZodString, ZodNumber> ; model: ZodOptional<ZodString> ; prompt: ZodString ; temperature: ZodOptional<ZodNumber> ; use_cot: ZodOptional<ZodBoolean> }, "strip", ZodTypeAny, { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }, { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }>

Defined in

autoevals/js/templates.ts:14


templates

Const templates: Record<"battle" | "closed_q_a" | "factuality" | "humor" | "possible" | "security" | "sql" | "summary" | "translation", { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }>

Defined in

autoevals/js/templates.ts:36