FlowHunt CLI Toolkit: Open Source Flow Evaluation with LLM as a Judge

FlowHunt’s new open-source CLI toolkit enables comprehensive flow evaluation with LLM as a Judge, providing detailed reporting and automated quality assessment for AI workflows.

We’re excited to announce the release of the FlowHunt CLI Toolkit – our new open-source command-line tool designed to revolutionize how developers evaluate and test AI flows. This powerful toolkit brings enterprise-grade flow evaluation capabilities to the open-source community, complete with advanced reporting and our innovative “LLM as a Judge” implementation.

Introducing the FlowHunt CLI Toolkit

The FlowHunt CLI Toolkit represents a significant step forward in AI workflow testing and evaluation. Available now on GitHub, this open-source toolkit provides developers with comprehensive tools for:

  • Flow Evaluation: Automated testing and evaluation of AI workflows
  • Advanced Reporting: Detailed analysis with correct/incorrect result breakdowns
  • LLM as a Judge: Sophisticated AI-powered evaluation using our own FlowHunt platform
  • Performance Metrics: Comprehensive insights into flow behavior and accuracy

The toolkit embodies our commitment to transparency and community-driven development, making advanced AI evaluation techniques accessible to developers worldwide.

Figure: FlowHunt CLI Toolkit overview

The Power of LLM as a Judge

One of the most innovative features of our CLI toolkit is the “LLM as a Judge” implementation. This approach uses artificial intelligence to evaluate the quality and correctness of AI-generated responses – essentially having AI judge AI performance with sophisticated reasoning capabilities.

How We Built LLM as a Judge with FlowHunt

What makes our implementation unique is that we used FlowHunt itself to create the evaluation flow. This meta-approach demonstrates the power and flexibility of our platform while providing a robust evaluation system. The LLM as a Judge flow consists of several interconnected components:

1. Prompt Template: Crafts the evaluation prompt with specific criteria
2. Structured Output Generator: Processes the evaluation using an LLM
3. Data Parser: Formats the structured output for reporting
4. Chat Output: Presents the final evaluation results
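
To make the hand-offs concrete, here is a rough, purely illustrative Python analogy of how these four stages pass data to each other. None of these functions are FlowHunt APIs; in the real flow, each stage is a visual component configured in the builder:

# Purely illustrative analogy of the four-stage judge flow; these are
# hypothetical helper functions, not FlowHunt components or APIs.
def build_prompt(reference: str, answer: str) -> str:
    # Prompt Template: fill the evaluation prompt with both texts.
    return f"REFERENCE:\n{reference}\n\nANSWER:\n{answer}"

def judge_with_llm(prompt: str) -> dict:
    # Structured Output Generator: an LLM call would go here; this stub
    # returns a canned verdict so the sketch stays self-contained.
    return {"total_rating": 4, "correctness": "correct",
            "reasoning": "Same facts, different wording."}

def parse_verdict(verdict: dict) -> str:
    # Data Parser: turn the structured output into report-ready text.
    return f"{verdict['correctness']} ({verdict['total_rating']}/4): {verdict['reasoning']}"

# Chat Output: present the final result.
print(parse_verdict(judge_with_llm(build_prompt("Paris", "The capital is Paris."))))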

The Evaluation Prompt

At the heart of our LLM as a Judge system is a carefully crafted prompt that ensures consistent and reliable evaluations. Here’s the core prompt template we use:

You will be given an ANSWER and REFERENCE pair.
Your task is to provide the following:
1. a 'total_rating' score: how close the ANSWER is to the REFERENCE
2. a binary label 'correctness', either 'correct' or 'incorrect', indicating whether the ANSWER is correct
3. a 'reasoning' field that explains your choice of rating and the correctness/incorrectness of the ANSWER

An ANSWER is correct when it matches the REFERENCE in all facts and details, even if worded differently. The ANSWER is incorrect if it contradicts the REFERENCE or changes or omits details. It is OK if the ANSWER includes more details than the REFERENCE.

'total_rating' is a scale of 1 to 4, where 1 means the ANSWER does not match the REFERENCE at all, and 4 means the ANSWER matches the REFERENCE in all facts and details even if worded differently.

Here is the scale you should use to build your answer:
1: The ANSWER contradicts the REFERENCE completely, adds unsupported claims, or changes or omits details
2: The ANSWER addresses the same topic, but the details are omitted or changed completely compared to the REFERENCE
3: The ANSWER's references are not completely correct, but the details are reasonably close to those mentioned in the REFERENCE. It is OK if the ANSWER adds details beyond the REFERENCE.
4: The ANSWER matches the REFERENCE in all facts and details, even if worded differently. It is OK if the ANSWER adds details beyond the REFERENCE. If sources are given in the REFERENCE, exactly the same sources must also be mentioned in the ANSWER.

REFERENCE
===
{target_response}
===

ANSWER
===
{actual_response}
===

This prompt ensures that our LLM judge provides:

  • Numerical scoring (1-4 scale) for quantitative analysis
  • Binary correctness classification for clear pass/fail metrics
  • Detailed reasoning for transparency and debugging
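
For illustration, a single judge verdict can be pictured as a small structured record. The example below is hypothetical and only mirrors the three fields named in the prompt; the toolkit's actual serialization may differ:

# Hypothetical example of one judge verdict, mirroring the three fields
# requested in the evaluation prompt (not the toolkit's exact schema).
verdict = {
    "total_rating": 4,               # 1-4 scale defined in the prompt
    "correctness": "correct",        # binary label: 'correct' or 'incorrect'
    "reasoning": (
        "The ANSWER states the same facts and details as the REFERENCE, "
        "only worded differently, so it earns the top rating."
    ),
}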

Flow Architecture: How It All Works Together

Our LLM as a Judge flow demonstrates sophisticated AI workflow design using FlowHunt’s visual flow builder. Here’s how the components work together:

1. Input Processing

The flow begins with a Chat Input component that receives the evaluation request containing both the actual response and the reference answer.

2. Prompt Construction

The Prompt Template component dynamically constructs the evaluation prompt by:

  • Inserting the reference answer into the {target_response} placeholder
  • Inserting the actual response into the {actual_response} placeholder
  • Applying the comprehensive evaluation criteria
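
Conceptually, this is plain string templating. A minimal Python sketch, assuming the prompt shown earlier is stored in a variable (the real substitution happens inside the visual Prompt Template component):

# Minimal sketch of the substitution performed by the Prompt Template
# component (illustrative Python only, not FlowHunt's API).
EVALUATION_PROMPT = """You will be given an ANSWER and REFERENCE pair.
... (evaluation criteria as shown above) ...

REFERENCE
===
{target_response}
===

ANSWER
===
{actual_response}
===
"""

prompt = EVALUATION_PROMPT.format(
    target_response="Paris is the capital of France.",    # reference answer
    actual_response="The capital of France is Paris.",    # flow's actual output
)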

3. AI Evaluation

The Structured Output Generator processes the prompt using a selected LLM and generates structured output containing:

  • total_rating: Numerical score from 1-4
  • correctness: Binary correct/incorrect classification
  • reasoning: Detailed explanation of the evaluation

4. Output Formatting

The Parse Data component formats the structured output into a readable format, and the Chat Output component presents the final evaluation results.

Advanced Evaluation Capabilities

The LLM as a Judge system provides several advanced capabilities that make it particularly effective for AI flow evaluation:

Nuanced Understanding

Unlike simple string matching, our LLM judge understands:

  • Semantic equivalence: Recognizing when different wordings convey the same meaning
  • Factual accuracy: Identifying contradictions or omissions in details
  • Completeness: Evaluating whether answers contain all necessary information

Flexible Scoring

The 4-point rating scale provides granular evaluation:

  • Score 4: Perfect semantic match with all facts preserved
  • Score 3: Close match with minor discrepancies; added details are acceptable
  • Score 2: Same topic but significant detail changes or omissions
  • Score 1: Complete contradiction or major factual errors

Transparent Reasoning

Each evaluation includes detailed reasoning, making it possible to:

  • Understand why specific scores were assigned
  • Debug flow performance issues
  • Improve prompt engineering based on evaluation feedback

Comprehensive Reporting Features

The CLI toolkit generates detailed reports that provide actionable insights into flow performance:

Correctness Analysis

  • Binary classification of all responses as correct or incorrect
  • Percentage accuracy across test cases
  • Identification of common failure patterns

Rating Distribution

  • Statistical analysis of rating scores (1-4 scale)
  • Average performance metrics
  • Variance analysis to identify consistency issues

Detailed Reasoning Logs

  • Complete reasoning for each evaluation
  • Categorization of common issues
  • Recommendations for flow improvements
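
As a rough picture of the arithmetic behind such a report, the sketch below aggregates a handful of hypothetical verdicts with the standard library; the toolkit's own report format may differ:

# Sketch of the aggregation behind a report: accuracy, mean rating,
# and rating distribution. Verdict fields follow the judge prompt above.
from collections import Counter
from statistics import mean, pstdev

verdicts = [
    {"total_rating": 4, "correctness": "correct"},
    {"total_rating": 2, "correctness": "incorrect"},
    {"total_rating": 3, "correctness": "correct"},
]

ratings = [v["total_rating"] for v in verdicts]
accuracy = sum(v["correctness"] == "correct" for v in verdicts) / len(verdicts)

print(f"accuracy: {accuracy:.0%}")                      # e.g. 67%
print(f"mean rating: {mean(ratings):.2f} ± {pstdev(ratings):.2f}")
print("rating distribution:", dict(Counter(ratings)))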

Getting Started with the FlowHunt CLI Toolkit

Ready to start evaluating your AI flows with professional-grade tools? Here’s how to get started:

Quick Installation

One-Line Installation (Recommended) for macOS and Linux:

curl -sSL https://raw.githubusercontent.com/yasha-dev1/flowhunt-toolkit/main/install.sh | bash

This will automatically:

  • ✅ Install all dependencies
  • ✅ Download and install FlowHunt Toolkit
  • ✅ Add flowhunt command to your PATH
  • ✅ Set up everything automatically

Manual Installation:

# Clone the repository
git clone https://github.com/yasha-dev1/flowhunt-toolkit.git
cd flowhunt-toolkit

# Install with pip
pip install -e .

Verify Installation:

flowhunt --help
flowhunt --version

Quick Start Guide

1. Authentication First, authenticate with your FlowHunt API:

flowhunt auth

2. List Your Flows

flowhunt flows list

3. Evaluate a Flow Create a CSV file with your test data:

flow_input,expected_output
"What is 2+2?","4"
"What is the capital of France?","Paris"

Run evaluation with LLM as a Judge:

flowhunt evaluate your-flow-id path/to/test-data.csv --judge-flow-id your-judge-flow-id

4. Batch Execute Flows

flowhunt batch-run your-flow-id input.csv --output-dir results/

Advanced Evaluation Features

The evaluation system provides comprehensive analysis:

flowhunt evaluate FLOW_ID TEST_DATA.csv \
  --judge-flow-id JUDGE_FLOW_ID \
  --output-dir eval_results/ \
  --batch-size 10 \
  --verbose

Features include:

  • 📊 Comprehensive statistics (mean, median, std, quartiles)
  • 📈 Score distribution analysis
  • 📋 Automated CSV result export
  • 🎯 Pass/fail rate calculation
  • 🔍 Error tracking and reporting
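
Because results can be exported as CSV, they are easy to post-process. The snippet below is a sketch built on assumptions: the file name and the 'correctness' column are hypothetical, so adjust them to whatever the toolkit actually writes into your --output-dir:

# Hypothetical post-processing of an exported results CSV; the file name
# and the 'correctness' column are assumptions, not the guaranteed format.
import csv
from collections import Counter

with open("eval_results/results.csv", newline="") as f:
    rows = list(csv.DictReader(f))

labels = Counter(row.get("correctness", "unknown") for row in rows)
total = len(rows) or 1
print(f"correct: {labels['correct']}/{total} ({labels['correct'] / total:.0%})")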

Integration with FlowHunt Platform

The CLI toolkit seamlessly integrates with the FlowHunt platform, allowing you to:

  • Evaluate flows built in the FlowHunt visual editor
  • Access advanced LLM models for evaluation
  • Use your existing judge flows for automated evaluation
  • Export results for further analysis

The Future of AI Flow Evaluation

The release of our CLI toolkit represents more than just a new tool – it’s a vision for the future of AI development where:

Quality is Measurable: Advanced evaluation techniques make AI performance quantifiable and comparable.

Testing is Automated: Comprehensive testing frameworks reduce manual effort and improve reliability.

Transparency is Standard: Detailed reasoning and reporting make AI behavior understandable and debuggable.

Community Drives Innovation: Open-source tools enable collaborative improvement and knowledge sharing.

Open Source Commitment

By open-sourcing the FlowHunt CLI Toolkit, we’re demonstrating our commitment to:

  • Community Development: Enabling developers worldwide to contribute and improve the toolkit
  • Transparency: Making our evaluation methodologies open and auditable
  • Accessibility: Providing enterprise-grade tools to developers regardless of budget
  • Innovation: Fostering collaborative development of new evaluation techniques

Conclusion

The FlowHunt CLI Toolkit with LLM as a Judge represents a significant advancement in AI flow evaluation capabilities. By combining sophisticated evaluation logic with comprehensive reporting and open-source accessibility, we’re empowering developers to build better, more reliable AI systems.

The meta-approach of using FlowHunt to evaluate FlowHunt flows demonstrates the maturity and flexibility of our platform while providing a powerful tool for the broader AI development community.

Whether you’re building simple chatbots or complex multi-agent systems, the FlowHunt CLI Toolkit provides the evaluation infrastructure you need to ensure quality, reliability, and continuous improvement.

Ready to elevate your AI flow evaluation? Visit our GitHub repository to get started with the FlowHunt CLI Toolkit today, and experience the power of LLM as a Judge for yourself.

The future of AI development is here – and it’s open source.

Frequently asked questions

What is the FlowHunt CLI Toolkit?

The FlowHunt CLI Toolkit is an open-source command-line tool for evaluating AI flows with comprehensive reporting capabilities. It includes features like LLM as a Judge evaluation, correct/incorrect result analysis, and detailed performance metrics.

How does LLM as a Judge work in FlowHunt?

LLM as a Judge uses a sophisticated AI flow built within FlowHunt to evaluate other flows. It compares actual responses against reference answers, providing ratings, correctness assessments, and detailed reasoning for each evaluation.

Where can I access the FlowHunt CLI Toolkit?

The FlowHunt CLI Toolkit is open-source and available on GitHub at https://github.com/yasha-dev1/flowhunt-toolkit. You can clone, contribute, and use it freely for your AI flow evaluation needs.

What kind of reports does the CLI toolkit generate?

The toolkit generates comprehensive reports including correct/incorrect result breakdowns, LLM as a Judge evaluations with ratings and reasoning, performance metrics, and detailed analysis of flow behavior across different test cases.

Can I use the LLM as a Judge flow for my own evaluations?

Yes! The LLM as a Judge flow is built using FlowHunt's platform and can be adapted for various evaluation scenarios. You can modify the prompt template and evaluation criteria to suit your specific use cases.

Yasha is a talented software developer specializing in Python, Java, and machine learning. Yasha writes technical articles on AI, prompt engineering, and chatbot development.

Yasha Boroumand
CTO, FlowHunt
