# ⛓️🛠️ ChainForge
**An open-source visual programming environment for battle-testing prompts to LLMs.**
<img width="1615" alt="Screen Shot 2023-05-17 at 2 45 17 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/96aecea7-cf05-4064-8f83-20a524449af7">
ChainForge is a data-flow prompt engineering environment for analyzing and evaluating LLM responses. Just as Jupyter Notebooks are geared towards early-stage exploration of code and data, ChainForge is geared towards quick-and-dirty exploration of prompts and response quality, going beyond ad-hoc chatting with individual LLMs. With ChainForge, you can:
- Query multiple LLMs at once to sketch prompt ideas and test variations quickly and effectively.
- Compare response quality across prompt variations and across models to choose the best prompt and model for your use case.
- Set up an evaluation metric (scoring function) and immediately visualize results across prompts, prompt parameters, and models.
**This is an open alpha of ChainForge.** We currently support GPT-3.5, GPT-4, Claude, and Alpaca 7B (through [Dalai](https://github.com/cocktailpeanut/dalai)) at default settings. Try it and let us know what you think! :)

ChainForge is built on [ReactFlow](https://reactflow.dev) and [Flask](https://flask.palletsprojects.com/en/2.3.x/).
# Installation
To get started with the ChainForge alpha, see the [Installation Guide](https://github.com/ianarawjo/ChainForge/blob/main/GUIDE.md). In the near future, we will upload ChainForge to PyPI as an official package.
## Example evaluation flows
We've prepared a couple of example flows to give you a sense of what's possible with ChainForge.
Import them, then:
- Run any Prompt node(s) to query the LLM(s),
- Run any Evaluator nodes to score responses.
Note that right now, **exporting a ChainForge flow does not save cached responses.**
# Features
A key goal of ChainForge is facilitating **comparison** and **evaluation** of prompts and models, and (in the near future) prompt chains. Basic features are:
- **Prompt permutations**: Set up a prompt template and feed it variations of input variables. ChainForge will prompt all selected LLMs with all possible permutations of the input prompt, so that you can get a better sense of prompt quality. You can also chain prompt templates at arbitrary depth (e.g., to compare templates). (A conceptual sketch of permutation expansion follows this list.)
- **Evaluation nodes**: Probe LLM responses in a chain and test them (classically) for some desired behavior. At a basic level, evaluation is Python-script-based; we plan to add preset evaluator nodes for common use cases (e.g., named-entity recognition) in the near future. Note that you can also chain LLM responses into prompt templates to help evaluate outputs cheaply, before running more extensive evaluation methods. (A sketch of a simple scoring function appears after the comparison list below.)
- **Visualization nodes**: Visualize evaluation results on plots such as box-and-whisker charts and 3D scatterplots.
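
To make the permutation idea concrete, here is a minimal sketch in plain Python (not ChainForge's internal API) of how a template with two input variables fans out into every combination, each of which is then sent to every selected model:

```python
# Conceptual sketch only: how a prompt template with two input variables
# expands into all permutations, each sent to every selected model.
from itertools import product

template = "What is a good name for a {adjective} company that makes {product}?"
variables = {
    "adjective": ["serious", "playful"],
    "product": ["colorful socks", "3D printers"],
}
models = ["gpt-3.5", "gpt-4", "claude", "alpaca-7b"]  # the models listed above

# Every combination of variable values fills the template once...
permutations = [
    template.format(**dict(zip(variables, combo)))
    for combo in product(*variables.values())
]

# ...and each filled-in prompt goes to every selected model:
# 2 adjectives x 2 products x 4 models = 16 queries in this example.
for model in models:
    for prompt in permutations:
        print(f"[{model}] {prompt}")
```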
Taken together, these three features let you easily:
- **Compare across prompts and prompt parameters**: Choose the set of prompts that maximizes your target evaluation metrics (e.g., lowest code error rate), or see how changing parameters in a prompt template affects the quality of responses.
- **Compare across models**: Compare responses for every prompt across models.
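
As a flavor of what an Evaluator node's scoring function might look like, here is a hedged sketch in plain Python: a metric in the spirit of the "lowest code error rate" example above, checking whether each generated code response parses. The `evaluate` signature and the response format are assumptions for illustration, not ChainForge's documented interface.

```python
# Illustrative only: a simple scoring function over LLM code responses.
# The evaluate() signature and the response format are assumptions for this
# sketch, not ChainForge's documented evaluator interface.
import ast

def evaluate(response: str) -> bool:
    """Return True if the response parses as valid Python code."""
    try:
        ast.parse(response)
        return True
    except SyntaxError:
        return False

# Score a batch of responses per model and report a code error rate.
responses_by_model = {
    "gpt-4": ["print('hello world')", "def f(:  # truncated output"],
    "claude": ["x = [1, 2, 3]", "for i in range(3): print(i)"],
}
for model, responses in responses_by_model.items():
    error_rate = 1 - sum(map(evaluate, responses)) / len(responses)
    print(f"{model}: code error rate = {error_rate:.0%}")
```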
# Development
ChainForge is being developed by research scientists at Harvard University in the [Harvard HCI](https://hci.seas.harvard.edu) group:
- [Ian Arawjo](http://ianarawjo.com/index.html)
- [Priyan Vaithilingam](https://priyan.info)
- [Elena Glassman](https://glassmanlab.seas.harvard.edu/glassman.html)
We provide ongoing releases of this tool in the hopes that others find it useful for their projects.
## Planned Features
- **Model settings**: Change settings for individual models, so you can test the same model across different settings.
- **Compare across response batches**: Run an evaluator over all N responses generated for each prompt, to measure factors like variability or parseability (e.g., how many code outputs pass a basic smell test?)
- **System prompts**: Ability to change the system prompt for models that support it (e.g., ChatGPT). Try out different system prompts and compare response quality.
- **Collapse nodes**: Nodes should be collapsible, to save screen space.
- **LMQL and Microsoft guidance nodes**: Support for prompt pipelines that involve LMQL and {{guidance}} code, especially inspecting masked response variables.
- **AI assistance for prompt engineering**: Spur creative ideas and quickly iterate on variations of prompts through interaction with GPT-4.
- **Compare fine-tuned to base models**: Beyond comparing between different models like Alpaca and ChatGPT, support comparison between versions of the same model (e.g., a base model and a fine-tuned one). Help users detect where fine-tuning resulted in 'breaking changes' elsewhere.
- **Export to code**: In the future, export prompts and (potentially) chains using a programming API like LangChain.
- **Dark mode**: A dark mode theme.
- **Compare across chains**: If a prompt P is used *across* chains C1, C2, etc., how does changing it affect all downstream events?
See a feature you'd like that isn't here? Open an [Issue](https://github.com/ianarawjo/ChainForge/issues).
## Inspiration and Links
ChainForge is meant to be general-purpose, and is not developed for a specific API or LLM back-end. Our ultimate goal is integration into other tools for the systematic evaluation and auditing of LLMs. We hope to help others who are developing prompt-analysis flows for LLMs, or otherwise auditing LLM outputs. This project was inspired by our own use case, but also shares some camaraderie with two related (closed-source) research projects, both led by [Sherry Wu](https://www.cs.cmu.edu/~sherryw/):
- "PromptChainer: Chaining Large Language Model Prompts through Visual Programming" (Wu et al., CHI 22 LBW) [Video](https://www.youtube.com/watch?v=p6MA8q19uo0)
- "AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts" (Wu et al., CHI 22)
Unlike these projects, we are focusing on supporting evaluation across prompts, prompt parameters, and models.
## How to collaborate?
We are looking for open-source collaborators. At the moment, the best way to contribute is simply to implement a requested feature or bug fix and submit a Pull Request. If you want to report a bug or request a feature, open an [Issue](https://github.com/ianarawjo/ChainForge/issues).
# License
ChainForge is released under the MIT License.