# ⛓️🛠️ ChainForge
An open-source visual programming environment for battle-testing prompts to LLMs.
ChainForge is a data flow prompt engineering environment for analyzing and evaluating LLM responses. It is geared towards early-stage, quick-and-dirty exploration of prompts, chat responses, and response quality that goes beyond ad-hoc chatting with individual LLMs. With ChainForge, you can:
- Query multiple LLMs at once to test prompt ideas and variations quickly and effectively.
- Compare response quality across prompt permutations, across models, and across model settings to choose the best prompt and model for your use case.
- Set up evaluation metrics (scoring functions) and immediately visualize results across prompts, prompt parameters, models, and model settings.
- Hold multiple conversations at once across template parameters and chat models. Template not just prompts, but also follow-up chat messages, and inspect and evaluate outputs at each turn of a chat conversation.
Read the docs to learn more. ChainForge comes with a number of example evaluation flows to give you a sense of what's possible, including 188 example flows generated from benchmarks in OpenAI evals.
This is an open beta of ChainForge. We support the model providers OpenAI, HuggingFace, Anthropic, Google PaLM2, Azure OpenAI endpoints, and the Dalai-hosted models Alpaca and Llama. You can change the exact model and individual model settings. Visualization nodes support numeric and boolean evaluation metrics. Try it and let us know what you think! :)
ChainForge is built on ReactFlow and Flask.
## Table of Contents
- Documentation
- Installation
- Example Experiments
- Share with Others
- Features (see the docs for more comprehensive info)
- Development and How to Cite
## Installation
You can install ChainForge locally, or try it out on the web at https://chainforge.ai/play/. The web version of ChainForge has a limited feature set. In a locally installed version you can load API keys automatically from environment variables, write Python code to evaluate LLM responses, or query locally-run Alpaca/Llama models hosted via Dalai.
To install ChainForge on your machine, make sure you have Python 3.8 or higher, then run

```bash
pip install chainforge
```

Once installed, run

```bash
chainforge serve
```
Open `localhost:8000` in Google Chrome, Firefox, Microsoft Edge, or Brave.
You can set your API keys by clicking the Settings icon in the top-right corner. If you prefer not to worry about this every time you open ChainForge, we recommend saving your OpenAI, Anthropic, and/or Google PaLM API keys to your local environment. For more details, see the Installation Guide.
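As a convenience before launching the server, you can check which keys are present in your environment. This is a minimal sketch, not ChainForge's own code; the variable names below are the conventional ones for these providers, and the exact names ChainForge reads are listed in the Installation Guide.

```python
import os

# Check for provider API keys in the environment before running `chainforge serve`.
# The variable names here are conventional/illustrative, not an authoritative list.
expected_keys = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
missing = [k for k in expected_keys if k not in os.environ]
if missing:
    print("Consider setting these before running `chainforge serve`:", ", ".join(missing))
```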
## Example Experiments
We've prepared a couple of example flows to give you a sense of what's possible with ChainForge.
Click the "Example Flows" button on the top-right corner and select one. Here is a basic comparison example, plotting the length of responses across different models and arguments for the prompt parameter `{game}`:
You can also conduct ground truth evaluations using Tabular Data nodes. For instance, we can compare each LLM's ability to answer math problems by comparing each response to the expected answer:
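Conceptually, a ground-truth evaluation like this compares each LLM response against the expected answer from the table and scores it pass/fail. The toy sketch below illustrates the idea with made-up data (it does not use ChainForge's internals):

```python
# Rows as a Tabular Data node might hold them: a question and its expected answer.
rows = [
    {"question": "What is 12 * 12?", "expected": "144"},
    {"question": "What is 7 + 8?", "expected": "15"},
]

# Hypothetical LLM outputs, standing in for real model responses.
responses = {"What is 12 * 12?": "144", "What is 7 + 8?": "14"}

# Exact-match scoring: True if the response equals the expected answer.
scores = [responses[r["question"]].strip() == r["expected"] for r in rows]
accuracy = sum(scores) / len(scores)
print(f"accuracy: {accuracy:.0%}")  # prints "accuracy: 50%"
```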
### Compare responses across models and prompts
Compare across models and prompt variables with an interactive response inspector, including a formatted table and exportable data:
## Share with Others
The web version of ChainForge (https://chainforge.ai/play/) includes a Share button.
Simply click Share to generate a unique link for your flow and copy it to your clipboard:
For instance, here's an experiment I made that tries to get an LLM to reveal a secret key: https://chainforge.ai/play/?f=28puvwc788bog
> **Note**
> To prevent abuse, you can only share up to 10 flows at a time, and each flow must be under 5 MB after compression. If you share more than 10 flows, the oldest link will break, so make sure to Export important flows to `cforge` files, and use Share only to pass data ephemerally.
For finer details about the features of specific nodes, check out the Node Guide.
## Features
A key goal of ChainForge is facilitating comparison and evaluation of prompts and models. Basic features are:
- Prompt permutations: Set up a prompt template and feed it variations of input variables. ChainForge will prompt all selected LLMs with all possible permutations of the input prompt, so that you can get a better sense of prompt quality. You can also chain prompt templates at arbitrary depth (e.g., to compare templates).
- Chat turns: Go beyond single prompts and template follow-up chat messages, just as you template prompts. You can test how the wording of the user's query might change an LLM's output, or compare the quality of later responses across multiple chat models (or the same chat model with different settings!).
- Model settings: Change the settings of supported models, and compare across settings. For instance, you can measure the impact of a system message on ChatGPT by adding several ChatGPT models, changing individual settings, and nicknaming each one. ChainForge will send out queries to each version of the model.
- Evaluation nodes: Probe LLM responses in a chain and test them (classically) for some desired behavior. At a basic level, evaluations are written as Python scripts. We plan to add preset evaluator nodes for common use cases in the near future (e.g., named-entity recognition). Note that you can also chain LLM responses into prompt templates to help evaluate outputs cheaply before more extensive evaluation methods.
- Visualization nodes: Visualize evaluation results on plots like grouped box-and-whisker (for numeric metrics) and histograms (for boolean metrics). Currently we only support numeric and boolean metrics. We aim to provide users more control and options for plotting in the future.
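The first and fourth ideas above can be sketched in plain Python: fill a prompt template with every permutation of its input variables, then score each response with a small evaluator function, as an Evaluation node would. This is a conceptual illustration, not ChainForge's internal code; the template, variables, and `evaluate` function are made up, and the responses are stand-ins for real LLM calls.

```python
import itertools

# (1) Prompt permutations: one prompt per element of the Cartesian product
# of the template's input variables.
template = "Translate '{word}' into {language}."
variables = {
    "word": ["hello", "goodbye"],
    "language": ["French", "Spanish"],
}
keys = list(variables)
prompts = [
    template.format(**dict(zip(keys, combo)))
    for combo in itertools.product(*(variables[k] for k in keys))
]

# (2) A toy evaluator: here we just measure response length. In ChainForge,
# a Python evaluator node would score each real LLM response instead.
def evaluate(response: str) -> int:
    return len(response)

fake_responses = {p: f"response to: {p}" for p in prompts}  # stand-ins for LLM calls
scores = {p: evaluate(r) for p, r in fake_responses.items()}
```

With 2 words and 2 languages, this produces 4 prompts; ChainForge queries every selected model with each one, so the number of queries grows multiplicatively with the variables and models you add.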
Taken together, these features let you easily:
- Compare across prompts and prompt parameters: Choose the best set of prompts that maximizes your eval target metrics (e.g., lowest code error rate). Or, see how changing parameters in a prompt template affects the quality of responses.
- Compare across models: Compare responses for every prompt across models and different model settings.
We've also found that some users simply want to use ChainForge to make tons of parametrized queries to LLMs (e.g., chaining prompt templates into prompt templates), possibly score them, and then output the results to a spreadsheet (Excel `xlsx`). To do this, attach an Inspect node to the output of a Prompt node and click `Export Data`.
For more specific details, see the Node Guide.
## Development
ChainForge was created by Ian Arawjo, a postdoctoral scholar in Harvard HCI's Glassman Lab with support from the Harvard HCI community. Collaborators include PhD students Priyan Vaithilingam and Chelse Swoopes, Harvard undergraduate Sean Yang, and faculty members Elena Glassman and Martin Wattenberg.
This work was partially funded by the NSF grants IIS-2107391, IIS-2040880, and IIS-1955699. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
We provide ongoing releases of this tool in the hopes that others find it useful for their projects.
## Inspiration and Links
ChainForge is meant to be general-purpose, and is not developed for a specific API or LLM back-end. Our ultimate goal is integration into other tools for the systematic evaluation and auditing of LLMs. We hope to help others who are developing prompt-analysis flows for LLMs, or otherwise auditing LLM outputs. This project was inspired by our own use case, but also shares some camaraderie with two related (closed-source) research projects, both led by Sherry Wu:
- "PromptChainer: Chaining Large Language Model Prompts through Visual Programming" (Wu et al., CHI ’22 LBW) Video
- "AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts" (Wu et al., CHI ’22)
Unlike these projects, we are focusing on supporting evaluation across prompts, prompt parameters, and models.
## How to collaborate?
We welcome open-source collaborators. If you want to report a bug or request a feature, open an Issue. We also encourage users to implement the requested feature / bug fix and submit a Pull Request.
## Cite Us
If you use ChainForge for research purposes, or build upon the source code, we ask that you cite our arXiv pre-print in any related publications. The BibTeX you can use is:
```bibtex
@misc{arawjo2023chainforge,
      title={ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing},
      author={Ian Arawjo and Chelse Swoopes and Priyan Vaithilingam and Martin Wattenberg and Elena Glassman},
      year={2023},
      eprint={2309.09128},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```
## License
ChainForge is released under the MIT License.