Local scorers are only available for the Weave Python SDK; they are not yet available for the Weave TypeScript SDK. To use Weave scorers in TypeScript, see function-based scorers.
Installation
To use Weave's predefined scorers you need to install some additional dependencies. Many of the predefined scorers use an LLM as a judge; by default, scorers use `openai/gpt-4o` and `openai/text-embedding-3-small`. If you want to experiment with other providers, you can update the `model_id` field to use a different model. See the supported models here.
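For example, to use an Anthropic model as the judge, point `model_id` at it. A minimal sketch; the scorer and the exact model name below are illustrative, not prescriptive:

```python
from weave.scorers import SummarizationScorer

# Swap the default judge for an Anthropic model.
# The model string is an example; use any model your provider setup supports.
summarization_scorer = SummarizationScorer(
    model_id="anthropic/claude-3-5-sonnet-20240620",
)
```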
HallucinationFreeScorer
This scorer checks if your AI system's output includes any hallucinations based on the input data.
- Customize the `system_prompt` and `user_prompt` fields of the scorer to define what "hallucination" means for you.
- The `score` method expects an input column named `context`. If your dataset uses a different name, use the `column_map` attribute to map `context` to the dataset column (see the example below).
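A minimal sketch of the scorer inside an evaluation, assuming the grounding text lives in a hypothetical `news_article` column that is remapped to `context` via `column_map` (the project name and toy model are placeholders):

```python
import asyncio

import weave
from weave.scorers import HallucinationFreeScorer

weave.init("hallucination-demo")  # placeholder project name

@weave.op()
def answer(question: str, news_article: str) -> str:
    # Stand-in for a real model call.
    return "The defendant was found not guilty."

# score() looks for a "context" column; column_map points it at "news_article".
scorer = HallucinationFreeScorer(column_map={"context": "news_article"})

evaluation = weave.Evaluation(
    dataset=[{"question": "What was the verdict?", "news_article": "The jury acquitted the defendant on all charges."}],
    scorers=[scorer],
)
asyncio.run(evaluation.evaluate(answer))
```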
SummarizationScorer
Use an LLM to compare a summary to the original text and evaluate the quality of the summary.
- Entity Density: Checks the ratio of unique entities (such as names, places, or things) mentioned in the summary to the total word count of the summary, to estimate the summary's "information density". Entities are extracted with an LLM. This is similar to how entity density is used in the Chain of Density paper (https://arxiv.org/abs/2309.04269).
- Quality Grading: An LLM evaluator grades the summary as `poor`, `ok`, or `excellent`. These grades are then mapped to scores (0.0 for `poor`, 0.5 for `ok`, and 1.0 for `excellent`) for aggregate performance evaluation.
- Adjust `summarization_evaluation_system_prompt` and `summarization_evaluation_prompt` to tailor the evaluation process.
- The scorer uses litellm internally.
- The `score` method expects the original text (the one being summarized) to be present in the `input` column. Use `column_map` if your dataset uses a different name (see the example below).
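A hedged sketch of evaluating a toy summarizer with this scorer; the original text sits in the default `input` column, so no `column_map` is needed (the project name and summarizer are placeholders):

```python
import asyncio

import weave
from weave.scorers import SummarizationScorer

weave.init("summarization-demo")  # placeholder project name

@weave.op()
def summarize(input: str) -> str:
    # Stand-in for a real summarization model.
    return input.split(".")[0] + "."

scorer = SummarizationScorer(model_id="openai/gpt-4o")

evaluation = weave.Evaluation(
    dataset=[{"input": "Weave is a toolkit for tracking and evaluating LLM applications. It ships with a set of predefined scorers."}],
    scorers=[scorer],
)
asyncio.run(evaluation.evaluate(summarize))
```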
OpenAIModerationScorer
The OpenAIModerationScorer uses OpenAI's Moderation API to check if the AI system's output contains disallowed content, such as hate speech or explicit material.
- Sends the AI's output to the OpenAI Moderation endpoint and returns a structured response indicating if the content is flagged.
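A minimal sketch; depending on your Weave version the scorer may also accept or require an explicit OpenAI client, and an `OPENAI_API_KEY` is assumed to be set in the environment:

```python
import asyncio

import weave
from weave.scorers import OpenAIModerationScorer

weave.init("moderation-demo")  # placeholder project name

@weave.op()
def chatbot(input: str) -> str:
    return "Here is a friendly, harmless reply."  # stand-in model

evaluation = weave.Evaluation(
    dataset=[{"input": "Tell me something nice."}],
    scorers=[OpenAIModerationScorer()],
)
asyncio.run(evaluation.evaluate(chatbot))
```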
EmbeddingSimilarityScorer
The EmbeddingSimilarityScorer computes the cosine similarity between the embeddings of the AI system's output and a target text from your dataset. It is useful for measuring how similar the AI's output is to a reference text.
- `threshold` (float): The minimum cosine similarity score (between -1 and 1) needed to consider the two texts similar (defaults to `0.5`).
Here is an example of using the EmbeddingSimilarityScorer in the context of an evaluation.
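The sketch below assumes the reference text lives in a `target` column; remap it with `column_map` if your dataset uses a different name (the project name and toy model are placeholders):

```python
import asyncio

import weave
from weave.scorers import EmbeddingSimilarityScorer

weave.init("similarity-demo")  # placeholder project name

@weave.op()
def answer(input: str) -> str:
    return "Paris is the capital of France."  # stand-in model

similarity_scorer = EmbeddingSimilarityScorer(threshold=0.6)

evaluation = weave.Evaluation(
    dataset=[{"input": "What is the capital of France?", "target": "Paris"}],
    scorers=[similarity_scorer],
)
asyncio.run(evaluation.evaluate(answer))
```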
ValidJSONScorer
The ValidJSONScorer checks whether the AI system's output is valid JSON. This scorer is useful when you expect the output to be in JSON format and need to verify its validity.
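A minimal sketch; the scorer only inspects the model output, so no extra dataset columns are needed (names below are placeholders):

```python
import asyncio

import weave
from weave.scorers import ValidJSONScorer

weave.init("json-demo")  # placeholder project name

@weave.op()
def generate_record(input: str) -> str:
    return '{"name": "Alice", "age": 30}'  # stand-in model

evaluation = weave.Evaluation(
    dataset=[{"input": "Return a user record as JSON."}],
    scorers=[ValidJSONScorer()],
)
asyncio.run(evaluation.evaluate(generate_record))
```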
ValidXMLScorer
The ValidXMLScorer checks whether the AI system's output is valid XML. It is useful when expecting XML-formatted outputs.
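Usage mirrors the JSON scorer above; a brief sketch with placeholder names:

```python
import asyncio

import weave
from weave.scorers import ValidXMLScorer

weave.init("xml-demo")  # placeholder project name

@weave.op()
def generate_record(input: str) -> str:
    return "<user><name>Alice</name><age>30</age></user>"  # stand-in model

evaluation = weave.Evaluation(
    dataset=[{"input": "Return a user record as XML."}],
    scorers=[ValidXMLScorer()],
)
asyncio.run(evaluation.evaluate(generate_record))
```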
PydanticScorer
The PydanticScorer validates the AI system's output against a Pydantic model to ensure it adheres to a specified schema or data structure.
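A hedged sketch, assuming the scorer takes the Pydantic model class via a `model` argument:

```python
import asyncio

import weave
from pydantic import BaseModel
from weave.scorers import PydanticScorer

weave.init("pydantic-demo")  # placeholder project name

class User(BaseModel):
    name: str
    age: int

@weave.op()
def generate_user(input: str) -> str:
    return '{"name": "Alice", "age": 30}'  # stand-in model

evaluation = weave.Evaluation(
    dataset=[{"input": "Return a user record as JSON."}],
    scorers=[PydanticScorer(model=User)],
)
asyncio.run(evaluation.evaluate(generate_user))
```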
RAGAS - ContextEntityRecallScorer
The ContextEntityRecallScorer estimates context recall by extracting entities from both the AI system's output and the provided context, then computing the recall score. It is based on the RAGAS evaluation library.
- Uses an LLM to extract unique entities from the output and context and calculates recall.
- Recall indicates the proportion of important entities from the context that are captured in the output.
- Returns a dictionary with the recall score.
- Expects a `context` column in your dataset. Use the `column_map` attribute if the column name is different (see the example below).
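A brief sketch, assuming the retrieved passage is stored in a `context` column (the project name and toy RAG model are placeholders):

```python
import asyncio

import weave
from weave.scorers import ContextEntityRecallScorer

weave.init("entity-recall-demo")  # placeholder project name

@weave.op()
def rag_answer(question: str, context: str) -> str:
    return "Marie Curie won Nobel Prizes in Physics and Chemistry."  # stand-in model

evaluation = weave.Evaluation(
    dataset=[{
        "question": "What did Marie Curie win?",
        "context": "Marie Curie was awarded the Nobel Prize in Physics (1903) and in Chemistry (1911).",
    }],
    scorers=[ContextEntityRecallScorer()],
)
asyncio.run(evaluation.evaluate(rag_answer))
```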
RAGAS - ContextRelevancyScorer
The ContextRelevancyScorer evaluates the relevancy of the provided context to the AI system's output. It is based on the RAGAS evaluation library.
- Uses an LLM to rate the relevancy of the context to the output on a scale from 0 to 1.
- Returns a dictionary with the `relevancy_score`.
- Expects a `context` column in your dataset. Use the `column_map` attribute if the column name is different.
- Customize the `relevancy_prompt` to define how relevancy is assessed (see the example below).
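A brief sketch along the same lines. The custom `relevancy_prompt` and its `{question}`/`{context}` placeholders are assumptions about the template format; adapt them to your Weave version:

```python
import asyncio

import weave
from weave.scorers import ContextRelevancyScorer

weave.init("relevancy-demo")  # placeholder project name

@weave.op()
def rag_answer(question: str, context: str) -> str:
    return "The Eiffel Tower is in Paris."  # stand-in model

# Assumed prompt template; the placeholder names are illustrative.
relevancy_prompt = """
Given the following question and context, rate the relevancy of the context
to the question on a scale from 0 to 1.

Question: {question}
Context: {context}
Relevancy Score:
"""

relevancy_scorer = ContextRelevancyScorer(relevancy_prompt=relevancy_prompt)

evaluation = weave.Evaluation(
    dataset=[{
        "question": "Where is the Eiffel Tower?",
        "context": "The Eiffel Tower is a landmark in Paris, France.",
    }],
    scorers=[relevancy_scorer],
)
asyncio.run(evaluation.evaluate(rag_answer))
```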