Load OpenAI evals as example flows (#74)

* Add OpenAI Evals tab to Example Flows pane.

* Add OpenAI evals examples (preconverted).

* Set unique IDs for each oaievals cforge file.

* Use contenteditable divs in tables to improve performance.

* Update eval code to use json.loads instead of eval().

* Fix bug with $s in templates.

* Update package info and point oaievals to main branch.

* Make column headers use contenteditable p tags.

* Add requests to dependency list.

* Rebuild react and update package version.
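One bullet above replaces eval() with json.loads when parsing eval data. A minimal sketch of why that matters (the strings here are illustrative, not the actual ChainForge data):

```python
import json

# json.loads parses JSON data only; eval() would execute arbitrary
# Python, so a malicious string could run code on the server.
data = json.loads('{"answer": 42}')
print(data["answer"])  # 42

# A payload that eval() would happily execute is rejected by json.loads:
try:
    json.loads('__import__("os").system("echo pwned")')
except json.JSONDecodeError:
    print("rejected: not valid JSON")
```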
This commit is contained in:
ianarawjo 2023-06-15 15:41:58 -04:00 committed by GitHub
parent 655e1e6312
commit 1d08507c93
211 changed files with 901 additions and 93 deletions

.gitignore vendored

@@ -2,6 +2,7 @@
chainforge/react-server/node_modules
__pycache__
chainforge/cache
chainforge/examples/oaievals/
# package build folders (sdist)
chainforge.egg-info/

@@ -968,6 +968,73 @@ def fetchExampleFlow():
    ret.headers.add('Access-Control-Allow-Origin', '*')
    return ret

@app.route('/app/fetchOpenAIEval', methods=['POST'])
def fetchOpenAIEval():
    """
    Fetches a preconverted OpenAI eval as a .cforge JSON file.

    First checks whether the eval is already cached: once downloaded, evals
    are stored in the examples/ folder of the package, under an oaievals
    subdirectory. If the eval is not in the cache, it is downloaded from the
    ChainForge repository on GitHub.

    POST'd data should be in the form:
    {
        'name': <str>  # The name of the eval to grab (without the .cforge extension)
    }
    """
    # Verify POST'd data
    data = request.get_json()
    if 'name' not in data:
        return jsonify({'error': 'Missing "name" parameter to fetchOpenAIEval.'})
    evalname = data['name']

    # Verify that the 'examples' directory exists:
    if not os.path.isdir(EXAMPLES_DIR):
        dirpath = os.path.dirname(os.path.realpath(__file__))
        return jsonify({'error': f'Could not find an examples/ directory at path {dirpath}'})

    # Check if an oaievals subdirectory exists; if so, check for the file; if not, create it:
    oaievals_cache_dir = os.path.join(EXAMPLES_DIR, "oaievals")
    if os.path.isdir(oaievals_cache_dir):
        filepath = os.path.join(oaievals_cache_dir, evalname + '.cforge')
        if os.path.isfile(filepath):
            # File was already downloaded. Load it from the cache:
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    filedata = json.load(f)
            except Exception as e:
                return jsonify({'error': f"Error parsing OpenAI evals flow at {filepath}: {str(e)}"})
            ret = jsonify({'data': filedata})
            ret.headers.add('Access-Control-Allow-Origin', '*')
            return ret
        # File was not yet downloaded; fall through to fetch it.
    else:
        # Directory does not exist yet; create it:
        try:
            os.mkdir(oaievals_cache_dir)
        except Exception as e:
            return jsonify({'error': f"Error creating a new directory 'oaievals' at filepath {oaievals_cache_dir}: {str(e)}"})

    # Download the preconverted OpenAI eval from the main branch of the ChainForge repository
    import requests
    _url = f"https://raw.githubusercontent.com/ianarawjo/ChainForge/main/chainforge/oaievals/{evalname}.cforge"
    response = requests.get(_url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the response as JSON
        filedata = response.json()
        # Store it in the cache:
        with open(os.path.join(oaievals_cache_dir, evalname + '.cforge'), 'w', encoding='utf8') as f:
            json.dump(filedata, f)
    else:
        print("Error:", response.status_code)
        return jsonify({'error': f"Error downloading OpenAI evals flow from {_url}: status code {response.status_code}"})

    ret = jsonify({'data': filedata})
    ret.headers.add('Access-Control-Allow-Origin', '*')
    return ret


def run_server(host="", port=8000, cmd_args=None):
    if cmd_args is not None and cmd_args.dummy_responses:
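The endpoint's request payload and cache layout above can be sketched from the client side as follows (a minimal sketch; the helper names are illustrative, not part of the server code):

```python
import json
import os

def make_payload(evalname):
    # POST body for /app/fetchOpenAIEval: the eval name, without extension.
    return json.dumps({"name": evalname})

def cached_flow_path(examples_dir, evalname):
    # Where the server caches a downloaded flow after its first fetch.
    return os.path.join(examples_dir, "oaievals", evalname + ".cforge")

print(make_payload("algebra-word-problems"))
print(cached_flow_path("examples", "algebra-word-problems"))
```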

File diff suppressed because one or more lines are too long

@@ -0,0 +1,29 @@
# Preconverted OpenAI evals
The ChainForge flows in this directory were derived from a subset of OpenAI's evals registry: https://github.com/openai/evals
These files are _not_ included in the PyPI chainforge package, but rather fetched from GitHub on an as-needed basis.
This is to avoid requiring users to install OpenAI evals (which requires Git LFS, Python 3.9+, and a large number of dependencies).
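The on-demand fetch described above pulls each flow from raw.githubusercontent.com; a sketch of the URL pattern, matching the server code in this commit (the helper name is illustrative):

```python
def eval_url(name):
    # Raw-GitHub URL for a preconverted eval flow on the main branch.
    return ("https://raw.githubusercontent.com/ianarawjo/ChainForge/"
            f"main/chainforge/oaievals/{name}.cforge")

print(eval_url("algebra-word-problems"))
# https://raw.githubusercontent.com/ianarawjo/ChainForge/main/chainforge/oaievals/algebra-word-problems.cforge
```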
OpenAI evals is under the MIT License:
MIT License
Copyright (c) 2023 OpenAI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

File diff suppressed because one or more lines are too long

@@ -0,0 +1 @@
{"flow": {"nodes": [{"width": 312, "height": 311, "id": "prompt-algebra-word-problems", "type": "prompt", "data": {"prompt": "{prompt}", "n": 1, "llms": [{"key": "aa3c0f03-22bd-416e-af4d-4bf5c4278c99", "settings": {"system_msg": "Answer the following question with a single number and no additional text. You are a helpful assistant.", "temperature": 1, "functions": [], "function_call": "", "top_p": 1, "stop": [], "presence_penalty": 0, "frequency_penalty": 0}, "name": "GPT3.5", "emoji": "\ud83d\ude42", "model": "gpt-3.5-turbo", "base_model": "gpt-3.5-turbo", "temp": 1, "formData": {"shortname": "GPT3.5", "model": "gpt-3.5-turbo", "system_msg": "Answer the following question with a single number and no additional text. You are a helpful assistant.", "temperature": 1, "functions": "", "function_call": "", "top_p": 1, "stop": "", "presence_penalty": 0, "frequency_penalty": 0}}]}, "position": {"x": 448, "y": 224}, "selected": false, "positionAbsolute": {"x": 448, "y": 224}, "dragging": false}, {"width": 333, "height": 182, "id": "eval-algebra-word-problems", "type": "evaluator", "data": {"code": "def evaluate(response):\n\tideal = response.meta['Ideal']\n\treturn response.text.startswith(ideal)"}, "position": {"x": 820, "y": 150}, "positionAbsolute": {"x": 820, "y": 150}}, {"width": 228, "height": 196, "id": "vis-algebra-word-problems", "type": "vis", "data": {"input": "eval-algebra-word-problems"}, "position": {"x": 1200, "y": 250}, "positionAbsolute": {"x": 1200, "y": 250}}, {"width": 302, "height": 260, "id": "inspect-algebra-word-problems", "type": "inspect", "data": {"input": "prompt-algebra-word-problems"}, "position": {"x": 820, "y": 400}, "positionAbsolute": {"x": 820, "y": 400}}, {"width": 368, "height": 191, "id": "table-algebra-word-problems", "type": "table", "data": {"rows": [{"prompt": "If it takes 5 machines 5 minutes to make 5 devices, how long would it take 100 machines to make 100 devices?", "ideal": "5"}, {"prompt": "What is the sum of 60000, 5000, 
400, and 3, with the third value multiplied by 5 before performing the operation?", "ideal": "67003"}, {"prompt": "If the sum of the smallest and largest of three consecutive even numbers is 28, what is the value of the second largest number in the series?", "ideal": "14"}, {"prompt": "John is trying to fill a 16 oz. bottle with water. If John fills the bottle at 1 oz per second and the bottle leaks .2 oz per second, how long would it take for John to fill the bottle?", "ideal": "20"}, {"prompt": "Annie is training for a marathon. She has a weekly training routine, training for five hours a day on some days and 3 hours a day on the other days. She trains a total of 27 hours in a seven day week. On how many days does she train for five hours?", "ideal": "3"}, {"prompt": "At the start of the year the ratio of boys to girls in a class is 2 : 1. But now, half a year later, four boys have left the class and there are two new girls. The ratio of boys to girls is now 4 : 3. How many students are there altogether now?", "ideal": "28"}], "columns": [{"key": "prompt", "header": "Prompt"}, {"key": "ideal", "header": "Ideal"}]}, "position": {"x": 32, "y": 240}, "selected": true, "positionAbsolute": {"x": 32, "y": 240}, "dragging": false}], "edges": [{"source": "prompt-algebra-word-problems", "sourceHandle": "prompt", "target": "eval-algebra-word-problems", "targetHandle": "responseBatch", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-eval-1686756357355responseBatch"}, {"source": "prompt-algebra-word-problems", "sourceHandle": "prompt", "target": "inspect-algebra-word-problems", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-inspect-1686756357355input"}, {"source": "eval-algebra-word-problems", "sourceHandle": "output", "target": "vis-algebra-word-problems", 
"targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-eval-1686756357355output-vis-1686756357355input"}, {"source": "table-algebra-word-problems", "sourceHandle": "Prompt", "target": "prompt-algebra-word-problems", "targetHandle": "prompt", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-table-1686756385002Prompt-prompt-1686756357355prompt"}], "viewport": {"x": 144, "y": 37, "zoom": 1}}, "cache": {"eval-1686756357355.json": {}, "inspect-1686756357355.json": {}, "prompt-1686756357355.json": {}, "table-1686756385002.json": {}, "vis-1686756357355.json": {}}}

File diff suppressed because one or more lines are too long

@@ -0,0 +1,191 @@
{
"adultery_state_laws": "This evaluation checks the model's ability to accurately answer true or false questions about adultery laws in various states.",
"rock-climbing": null,
"chess": "Test the model's ability to play chess",
"banking77": null,
"shape-in-shape": "Test the model's ability to check whether a given shape will fit within another shape.",
"hebrew-rhyme": "Composite task that involves translation and rhyming.",
"syllables_long_words": null,
"find_country_from_svg": "Test the model's ability to distinguish a country based on its svg shape (from wikimedia svg file).",
"crepe": null,
"belarusian-proverbs": "Test the model's ability to complete proverbs in belarusian language",
"french-part-of-speech": "Test the model's knowledge of what part of speech a given word can have in French, using data from fr.wiktionary.org (as of 2023-05-20)",
"internal_representations": null,
"python_list_comprehension": "Test model's ability to understand a basic usage of python's list comprehension syntax.",
"belarusian-syllable-count": "Test the model's ability to count syllables in Belarusian words.",
"mandaliof-table": "Test the model's ability to determine which atom has the largest atomic number.",
"tracking-shuffled-objects": null,
"squares-gpt": "Test the model's ability to solve basic geometric reasoning questions.",
"logic-statements": null,
"financial-derivatives": "Testing the model's ability to answer derivative questions correctly.",
"vigenere": "Test the model's ability to perform the simple Vigenere character operation.",
"map-electronic-component-part-to-fact": null,
"rare-and-loanwords-dutch-lexicon": "Test the model's ability to distinguish between existing Dutch words, including rare words and loanwords.",
"sort-numeric": "Tests performance sorting different comma-separated values under different circumstances (integers/decimals, positives/negatives, as well as currency-formatted values).",
"russian-lexicon": "Test the model's ability to distinguish between existing Russian words.",
"dutch-lexicon": "Test the model's ability to distinguish between existing and often misspelled and hallucinated Dutch words.",
"matrix-mult-rows": "Test the model's mathematical ability to infer what is needed to multiply two matrices.",
"solve-for-variable": "Multiple-choice questions about solving a mathematical equation for a variable.",
"moral_exceptQA": "This eval tests the model's ability to align with human intuition on when it is acceptable to break an established moral norm.",
"turkish_characters": "Eval that checks ability to identify non-english characters in a Turkish text.",
"find-thirukkural": "Accurately finds the correct Thirukkural in Tamil which the user asks for in English.",
"passing-balls": "Tests the model's ability to correctly determine the last player holding a ball after a sequence of passes.",
"ordered-history-events": null,
"building_floorplan": null,
"japanese-national-medical-exam01": null,
"lat_long_identify": null,
"norwegian-lexicon": "Test the model's ability to distinguish old Norwegian words.",
"german-part-of-speech": "Test the model's knowledge of what part of speech a given word can have in German, using data from de.wiktionary.org (as of 2023-05-20)",
"italian-rhyme": "Composite task that involves translation and rhyming.",
"swedish_sat": "Test the model's ability to answer questions from the Swedish h\u00f6gskoleprovet, kind of like the SATs in the US. The 30 questions are from the spring test 2023 verbal part, test number 3.",
"utility_price_parsing": null,
"korean-consonant-vowel-combination": "Evaluating the model's ability to accurately combine Korean consonants and vowels to form Hangul character.",
"mate-in-one": "Find the checkmating move for various board positions",
"french-lexicon": "Test the model's ability to distinguish between existing French words.",
"swedish-spelling": "Test the model's ability to identify misspelled Swedish words.",
"hindi_words": null,
"arithmetical_puzzles": "Test the model's ability to solve complex arithmetical puzzles stated in natural language.",
"body-movement": "Test the model's ability to understand human body movement",
"afrikaans-lexicon": "Test the model's ability to distinguish between existing Afrikaans words.",
"cricket_situations": "Tests the model's ability to apply the rules of cricket to different situations",
"2d_movement": "Test the model's ability to keep track of position and orientation in a 2D environment.",
"korean_spelling": null,
"hebrew-bible": "Simple questions on the bible, similar to preliminary questions in the international yearly bible contest in Israel.",
"isosceles-right-triangle": null,
"medmcqa": null,
"multi-step-equations": null,
"islands": "Tests the model's ability to name the prefecture of a given remote Japanese island.",
"escher-sentences": null,
"track_objects": "Test the model's ability to track objects after being moved around",
"shopping_discount_comparison": "Test the model's ability to compare discounts and select the best one",
"test-comp-sci": "Testing the model's ability to answer multiple choice computer science questions correctly.",
"ph_calculation": "Test the model's ability to apply basic mathematics to chemistry problems.",
"job_listing_title_for_a_caregiver_in_japan": "Test to identify the job listing title for a caregiver in Japan.",
"poker_analysis": "Examine the model's capacity to strategize & make probabilistic reasoning within the framework of poker.",
"algebra-word-problems": "Test the model's ability to perform basic algebra word problems",
"music_theory_scale_modes": "Test the model's ability to identify which western music scale a series of 8 notes belongs to",
"belarusian-grammar": "Test the model's ability to distinguish between grammatically well-formed and ungrammatical Belarusian sentences.",
"svg_understanding": "Test visual understanding of SVG files.",
"linear-equations": null,
"japanese_driving_license": "Test the model's ability to correctly answer Japanese Driving licence exam.",
"first-letters": null,
"ambiguous-sentences": "Test pairs of sentences that differ in only one or two words and contain an ambiguity that is resolved in opposite ways in the two sentences, requiring world knowledge and reasoning to resolve.",
"japanese-itpassport-exam01": "source from IT\u30d1\u30b9\u30dd\u30fc\u30c8\u8a66\u9a13 \u4ee4\u548c5\u5e74\u5ea6\u5206(IT Passport Examination for FY2023) in https://www3.jitec.ipa.go.jp/JitesCbt/html/openinfo/questions.html",
"logiqa": null,
"chinese_zodiac": null,
"food": null,
"simple_physics_engine": "Test the model's ability to reason about and simulate a simplified physics model in a 2d environment.",
"countries": null,
"which-is-heavier": "Test the model's ability to determine which of two quantities is heavier when the heavier quantity is made up of lighter objects (and vice versa).",
"color_theory_complementary": "Test the model's ability to accurately recognize complementary colors in the color theory.",
"fcc_amateur_extra": "Multiple choice questions (with answers) about from the US FCC Amateur Radio License question pool.",
"multistep-word-problems": "Test the model's ability to solve complex, multistep math word problems",
"emoji-riddle": "Test the model's ability to solve emoji riddles.",
"list_comparison_missing_name": "Test the model's ability to determine which name is present in list 1 but not in list 2. List 1 is formatted 'First Last' while list two is formatted 'Last First'. Lists are between 20-35 names long.",
"newsology": "Ask the model to pick a fruit after being told the provided list contains vegetables, and vice versa (pick a vegetable from a basket of fruit).",
"portuguese-syllable-count": "Evaluates how many syllables a given word has.",
"south-african-bands": "Test the model's ability to understand that we are providing the name of a South African band, find the supplied band, and if the band has a lead vocalist provide the stage name or real name of the vocalist.",
"numeral-type-comparisons": "Evaluate the LLM's ability to compare similar or identical numerals across formats in arithmetic and linguistic contexts",
"rot13": "Test the model's ability to perform the simple ROT13 character level operation.",
"music-theory-chord-notes": "Test the model's ability to spell out the notes in a given chord name",
"russian-english-homonym-context-resolution": null,
"number-reading": "Test the model's ability to translate Chinese written numbers into Arabic numerals.",
"simple-knowledge-mongolian": "Test the model's ability to understand simple world knowledge in mongolian language cyrillic and latin variants",
"reverse-polish-notation": "Test the model's ability to parse expression and create reverse polish notation.",
"dna-melting-calculation": "Test the model's ability to solve DNA melting temperature problems.",
"born-first": "Test the model's ability to determine who was born first.",
"tetris": "Tests the model's spatial awareness by rotating Tetris blocks. Tests all 7 classic Tetris blocks and performs clockwise and counterclockwise rotations from different starting points.",
"pure_korean": "Evaluates whether the model can identify pure Korean words.",
"determinant": null,
"split_chinese_characters": null,
"syntax-check": "Test the model's ability to determine programming language from a snippet.",
"balance-chemical-equation": null,
"seating_arrangements": "Test the model's spatial reasoning ability using seating arrangement questions with limited solution sets.",
"test_japanese_units": "In Japan, when counting things, the unit changes depending on the type. Tests the model's use of these complex counting units.",
"day-of-week-from-date": null,
"points-on-line": "Tests the model's ability to calculate three points (start, center, end) on a line.",
"regex-match": null,
"find-letter": null,
"greek-vocabulary": null,
"asl-classifiers": "Test the model's ability to understand the usage of ASL classifiers.",
"number-pattern": null,
"kanji-idioms": "Test the model's ability to recognize kanji idioms.",
"missing-operators": "Example eval that checks sampled text matches the expected output.",
"unsolvable_questions": null,
"portuguese-sarcasm": "An evaluation on sarcasm detection in Portuguese sentences",
"swap-words": null,
"hebrew-same-noun-gender": "Do these hebrew nouns have the same grammatical gender?",
"heart-disease": "Test model's capability of predicting the presence of heart disease.",
"last-word-nth": "Test the model's ability to tell what the last word of a sentence is, but by asking it indirectly based on its index.",
"italian-new-words": "Test the model's ability to distinguish Italian words that have recently entered the language.",
"irony": "Tests the ability to identify one of three types of irony, situational, verbal, or dramatic.",
"geometry_puzzle": "Assesses the model's performance in solving spatial and geometrical puzzles that require imagination, logic, and pattern recognition.",
"nepali-song-singer": "Test the model's ability to understand English transliteration of Nepali phrase and provide us the singer of that particular title.",
"canto_wu_pronunciation": "Test the model's knowledge of Cantonese and Wu Chinese pronunciation in a zero-shot setting",
"test_japanese_radical": "In Japan, the radical changes depending on the type of kanji. Tests the model's reading of various radicals.",
"invert_word_wise": "Logically, inverting strings twice just results in the original string again. The LLMs find it very difficult to deduce it, and somehow (at least up to GPT-3.5) mix things up.",
"unified-patch": null,
"imperial_date_to_string": null,
"count_token_freq_dna": "Test the model's ability to count the occurrence of a specific nucleotide (A, T, G, or C) within provided DNA sequences.",
"chess-piece-count": "Test the model's ability to understand chess moves, rules and theory",
"cube-pack": null,
"finnish-rhyme": "Composite task that involves translation and rhyming.",
"historical-kana-orthography-reading": "Test the model's ability to read historical kana orthography.",
"count_intersections_polynomial": "Test the model's ability to count the intersections between the x-axis and a polynomial of third degree, with simple inputs that humans would be able to do in their head.",
"bitwise": "Test the model's ability to simulate a simple bitwise operating machine",
"shared-borders": "Test the model's ability to list the countries that share a land border with a given pair of countries. This tests the model's ability to intersect sets known within its weights.",
"atpl_exams": null,
"invoice_due_date_leap_day_adjustment": null,
"european-date-format-challenge": "This performance evaluation examines the model's ability to reasonably assume that a date in a text follows the DD/MM/YYYY format when a subsequent date in the text is invalid for the MM/DD/YYYY format (e.g., 27/2/2024).",
"infiniteloop-match": "Test the model's ability to recognize if a piece of code can get into a state where it would run forever.",
"counterfactual-reasoning": "Example eval that uses fuzzy matching to score completions.",
"polish-syllable-count": null,
"brazilian_laws": null,
"bulgarian-lexicon": "Test the model's ability to distinguish between existing and hallucinated Bulgarian words.",
"compare-countries-area": "Test the model's ability to determine which country has the largest area.",
"pattern_identification": null,
"japanese_populer_video_game_title_and_the_publisher": "Test the model's ability to identify the publisher of popular Japanese video games.",
"belarusian-synonyms": "Test the model's ability to classify if the Belarusian words are synonyms or not.",
"spanish_feminine_noun_masculine_article": "In Spanish there are a number of nouns like \"agua\" which are feminine but use the masculine article: \"El agua\" is correct and \"La agua\" is incorrect",
"chinese_tang_poetries": "Evaluate the model's ability to identify the correct author of Chinese Tang poetries.",
"japanese_number_reading": "Test the model's ability to translate Japanese written numbers into Arabic numerals.",
"resistor-ohm-calculator": "Test the model's ability to calculate resistance (in ohms) of a resistor, given color of each band",
"gol": "Robust test. Evaluate model's ability to determine the next state in a simple game of life board",
"json_patch_object": "Test the model's ability to create minimal, correct JSON Patches for nested objects.",
"finance": "Test the model's ability to understand financial concepts and do math.",
"reverse-string": "Test the model's ability to reverse complex and simple strings.",
"tempo_to_measure_count": "Test the model's ability to calculate the number of measures in a song, based on the tempo of each note and the corresponding time signature of the piece.",
"directions": "Eval that tests the model's ability to keep track of direction after a series of turns",
"hindi_shuddha": null,
"diagrammatic_logic": null,
"polish-lexicon": "Test the model's ability to distinguish between existing and hallucinated Polish words.",
"numbers_game": "Test the model's ability to solve permutation questions",
"wkt_understanding": "Test understanding of Multipolygon WKT (Well-Known Text) representation of vector geometry objects (https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry).",
"japanese-national-medical-exam02": null,
"three-pt-mapping": "Test the model's ability to calculate gene positions given a three-point cross using the laws of genetics",
"indonesian_numbers": null,
"russian-rhyme": "Composite task that involves translation and rhyming.",
"taxes": null,
"crontab": null,
"date-booking": null,
"stats-tests": null,
"belarusian-russian-translation": "Test the model's ability to recover Belarusian sentences by translating into Russian and back.",
"date-calculator": null,
"jee-math": null,
"korean_yaminjeongeum": "Yamin-Jeongeum is a leetspeak for Korean. Tests the model's ability to translate it into proper Korean.",
"belarusian-lexicon": "Test the model's ability to distinguish between existing and hallucinated Belarusian words.",
"guess-the-singer": "Test the model's ability to predict singer by the first 10 words of the song",
"russian_medical": null,
"probability_questions": "A collection of probability questions that ChatGPT fails. Let's see if GPT-4 can do better.",
"aime_evaluation": "Test the model's ability to solve math problems from the AIME competition.",
"vintage_phone_keyboard_decode": "Tests the model's ability to use the correspondence between letters and numbers on a vintage mobile phone keyboard to distinguish and analyze relationships within groups composed of English letters and numbers.",
"formal-grammar-to-regex": null,
"largest_country": "Determining the largest country by the area from the list",
"comprehensive-graph-reasoning": "Test the model's ability to identify the number of rings and clusters, and the shortest path between two random nodes in undirected, weighted graphs.",
"rhetorical-devices": "Evaluate model's understanding of rhetorical device usage in sentences",
"word_vector_over_reliance": "Example eval that checks sampled text matches the expected output.",
"beam-analysis": "Test the model's ability to solve beam analysis questions",
"partially_solved_crossword_clues": null,
"physics-interaction": "Test the model's ability to predict the direction in which an object is likely to fall towards.",
"next-val-series": "Test the model's ability to predict the next value in a series."
}

File diff suppressed because one or more lines are too long

@@ -0,0 +1 @@
{"flow": {"nodes": [{"width": 312, "height": 311, "id": "prompt-counterfactual-reasoning", "type": "prompt", "data": {"prompt": "{prompt}", "n": 1, "llms": [{"key": "aa3c0f03-22bd-416e-af4d-4bf5c4278c99", "settings": {"system_msg": "You are a helpful assistant.", "temperature": 1, "functions": [], "function_call": "", "top_p": 1, "stop": [], "presence_penalty": 0, "frequency_penalty": 0}, "name": "GPT3.5", "emoji": "\ud83d\ude42", "model": "gpt-3.5-turbo", "base_model": "gpt-3.5-turbo", "temp": 1, "formData": {"shortname": "GPT3.5", "model": "gpt-3.5-turbo", "system_msg": "You are a helpful assistant.", "temperature": 1, "functions": "", "function_call": "", "top_p": 1, "stop": "", "presence_penalty": 0, "frequency_penalty": 0}}]}, "position": {"x": 448, "y": 224}, "selected": false, "positionAbsolute": {"x": 448, "y": 224}, "dragging": false}, {"width": 333, "height": 182, "id": "eval-counterfactual-reasoning", "type": "evaluator", "data": {"code": "def evaluate(response):\n\ttxt = response.text\n\tideal = response.meta['Ideal']\n\treturn txt in ideal or ideal in txt"}, "position": {"x": 820, "y": 150}, "positionAbsolute": {"x": 820, "y": 150}}, {"width": 228, "height": 196, "id": "vis-counterfactual-reasoning", "type": "vis", "data": {"input": "eval-counterfactual-reasoning"}, "position": {"x": 1200, "y": 250}, "positionAbsolute": {"x": 1200, "y": 250}}, {"width": 302, "height": 260, "id": "inspect-counterfactual-reasoning", "type": "inspect", "data": {"input": "prompt-counterfactual-reasoning"}, "position": {"x": 820, "y": 400}, "positionAbsolute": {"x": 820, "y": 400}}, {"width": 368, "height": 191, "id": "table-counterfactual-reasoning", "type": "table", "data": {"rows": [{"prompt": "If the sky flies in a bird, then what does the ground run on?", "ideal": "human"}, {"prompt": "If a song sings a bird, then what does a book read?", "ideal": "human"}, {"prompt": "If the river swims in a fish, then what does a bone chew?", "ideal": "dog"}, {"prompt": "If the earth 
flows on the river, then what does the sky hangs in?", "ideal": "the sun"}, {"prompt": "If the windmill blows the wind, then what does the grass moisten?", "ideal": "the rain"}, {"prompt": "If the lock unlocks the key, then what does the sheath go in?", "ideal": "the knife or the sword"}, {"prompt": "If the moon means the day, then what does the sun means?", "ideal": "the night"}, {"prompt": "If the black color means bad things, then what does the white color means?", "ideal": "good things"}, {"prompt": "If the black color means a low position, then what does the white color means?", "ideal": "a high position"}, {"prompt": "If the ice feels hot, then what does the fire feel?", "ideal": "cold"}, {"prompt": "If the moon is bigger than the earth, then who is bigger between the earth and the sun?", "ideal": "the earth"}, {"prompt": "If the moon is a cubic object, then what is the shape of the sun?", "ideal": "cube"}, {"prompt": "If chinese food matches Beijing, then what does american food match?", "ideal": "washington"}, {"prompt": "If 1 is less than 2, then is 3 bigger than 4?", "ideal": "yes"}, {"prompt": "If one matches eno, then what does two match?", "ideal": "owt"}], "columns": [{"key": "prompt", "header": "Prompt"}, {"key": "ideal", "header": "Ideal"}]}, "position": {"x": 32, "y": 240}, "selected": true, "positionAbsolute": {"x": 32, "y": 240}, "dragging": false}], "edges": [{"source": "prompt-counterfactual-reasoning", "sourceHandle": "prompt", "target": "eval-counterfactual-reasoning", "targetHandle": "responseBatch", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-eval-1686756357355responseBatch"}, {"source": "prompt-counterfactual-reasoning", "sourceHandle": "prompt", "target": "inspect-counterfactual-reasoning", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": 
"reactflow__edge-prompt-1686756357355prompt-inspect-1686756357355input"}, {"source": "eval-counterfactual-reasoning", "sourceHandle": "output", "target": "vis-counterfactual-reasoning", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-eval-1686756357355output-vis-1686756357355input"}, {"source": "table-counterfactual-reasoning", "sourceHandle": "Prompt", "target": "prompt-counterfactual-reasoning", "targetHandle": "prompt", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-table-1686756385002Prompt-prompt-1686756357355prompt"}], "viewport": {"x": 144, "y": 37, "zoom": 1}}, "cache": {"eval-1686756357355.json": {}, "inspect-1686756357355.json": {}, "prompt-1686756357355.json": {}, "table-1686756385002.json": {}, "vis-1686756357355.json": {}}}
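The evaluator embedded in the flow above scores a response with a lenient, bidirectional substring match against the table's Ideal column. A minimal standalone sketch of that logic, assuming a hypothetical `Resp` dataclass as a stand-in for ChainForge's response object:

```python
from dataclasses import dataclass, field

@dataclass
class Resp:
    """Hypothetical stand-in for ChainForge's response object."""
    text: str
    meta: dict = field(default_factory=dict)

def evaluate(response):
    # Pass if either string contains the other, so a full-sentence
    # answer that merely contains the ideal word still scores True.
    txt = response.text
    ideal = response.meta['Ideal']
    return txt in ideal or ideal in txt

print(evaluate(Resp("The fire would feel cold.", {"Ideal": "cold"})))  # True
print(evaluate(Resp("warm", {"Ideal": "cold"})))                       # False
```

Note the looseness of this check: an empty response passes trivially, since the empty string is a substring of any ideal answer.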

File diff suppressed because one or more lines are too long


@@ -0,0 +1 @@
{"flow": {"nodes": [{"width": 312, "height": 311, "id": "prompt-crontab", "type": "prompt", "data": {"prompt": "{prompt}", "n": 1, "llms": [{"key": "aa3c0f03-22bd-416e-af4d-4bf5c4278c99", "settings": {"system_msg": "Generate a cron expression with 5 fields from the given description. Output the cron expression ONLY and make your answer as short as possible.", "temperature": 1, "functions": [], "function_call": "", "top_p": 1, "stop": [], "presence_penalty": 0, "frequency_penalty": 0}, "name": "GPT3.5", "emoji": "\ud83d\ude42", "model": "gpt-3.5-turbo", "base_model": "gpt-3.5-turbo", "temp": 1, "formData": {"shortname": "GPT3.5", "model": "gpt-3.5-turbo", "system_msg": "Generate a cron expression with 5 fields from the given description. Output the cron expression ONLY and make your answer as short as possible.", "temperature": 1, "functions": "", "function_call": "", "top_p": 1, "stop": "", "presence_penalty": 0, "frequency_penalty": 0}}]}, "position": {"x": 448, "y": 224}, "selected": false, "positionAbsolute": {"x": 448, "y": 224}, "dragging": false}, {"width": 333, "height": 182, "id": "eval-crontab", "type": "evaluator", "data": {"code": "def evaluate(response):\n\tideal = response.meta['Ideal']\n\treturn response.text.startswith(ideal)"}, "position": {"x": 820, "y": 150}, "positionAbsolute": {"x": 820, "y": 150}}, {"width": 228, "height": 196, "id": "vis-crontab", "type": "vis", "data": {"input": "eval-crontab"}, "position": {"x": 1200, "y": 250}, "positionAbsolute": {"x": 1200, "y": 250}}, {"width": 302, "height": 260, "id": "inspect-crontab", "type": "inspect", "data": {"input": "prompt-crontab"}, "position": {"x": 820, "y": 400}, "positionAbsolute": {"x": 820, "y": 400}}, {"width": 368, "height": 191, "id": "table-crontab", "type": "table", "data": {"rows": [{"prompt": "At 04:05.", "ideal": "5 4 * * *"}, {"prompt": "At 00:05 in August.", "ideal": "5 0 * 8 *"}, {"prompt": "At 14:15 on day-of-month 1.", "ideal": "15 14 1 * *"}, {"prompt": "At 22:00 on every 
day-of-week from Monday through Friday.", "ideal": "0 22 * * 1-5"}, {"prompt": "At minute 23 past every 2nd hour from 0 through 20.", "ideal": "23 0-20/2 * * *"}, {"prompt": "At minute 0 past hour 0 and 12 on day-of-month 1 in every 2nd month.", "ideal": "0 0,12 1 */2 *"}, {"prompt": "At 04:00 on every day-of-month from 8 through 14.", "ideal": "0 4 8-14 * *"}, {"prompt": "At 00:00 on day-of-month 1 and 15 and on Wednesday.", "ideal": "0 0 1,15 * 3"}, {"prompt": "At 04:05 on Sunday.", "ideal": "5 4 * * 0"}, {"prompt": "At every minute.", "ideal": "* * * * *"}, {"prompt": "At every 2nd minute.", "ideal": "*/2 * * * *"}, {"prompt": "At every 2nd minute from 1 through 59.", "ideal": "1-59/2 * * * *"}, {"prompt": "At every 3rd minute.", "ideal": "*/3 * * * *"}, {"prompt": "At every 4th minute.", "ideal": "*/4 * * * *"}, {"prompt": "At every 5th minute.", "ideal": "*/5 * * * *"}, {"prompt": "At minute 30.", "ideal": "30 * * * *"}, {"prompt": "At every 30th minute.", "ideal": "*/30 * * * *"}, {"prompt": "At 00:00 on day-of-month 1 in January.", "ideal": "0 0 1 1 *"}, {"prompt": "At 00:00 on day-of-month 1 in every 6th month.", "ideal": "0 0 1 */6 *"}, {"prompt": "At 00:00 on Saturday and Sunday.", "ideal": "0 0 * * 6,0"}, {"prompt": "At 00:00 on day-of-month 1 in every 3rd month.", "ideal": "0 0 1 */3 *"}], "columns": [{"key": "prompt", "header": "Prompt"}, {"key": "ideal", "header": "Ideal"}]}, "position": {"x": 32, "y": 240}, "selected": true, "positionAbsolute": {"x": 32, "y": 240}, "dragging": false}], "edges": [{"source": "prompt-crontab", "sourceHandle": "prompt", "target": "eval-crontab", "targetHandle": "responseBatch", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-eval-1686756357355responseBatch"}, {"source": "prompt-crontab", "sourceHandle": "prompt", "target": "inspect-crontab", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", 
"width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-inspect-1686756357355input"}, {"source": "eval-crontab", "sourceHandle": "output", "target": "vis-crontab", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-eval-1686756357355output-vis-1686756357355input"}, {"source": "table-crontab", "sourceHandle": "Prompt", "target": "prompt-crontab", "targetHandle": "prompt", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-table-1686756385002Prompt-prompt-1686756357355prompt"}], "viewport": {"x": 144, "y": 37, "zoom": 1}}, "cache": {"eval-1686756357355.json": {}, "inspect-1686756357355.json": {}, "prompt-1686756357355.json": {}, "table-1686756385002.json": {}, "vis-1686756357355.json": {}}}
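The crontab flow's evaluator is stricter than a substring match: it requires the response to begin with the ideal cron expression, so trailing explanation is tolerated but any preamble fails. A sketch under the same hypothetical `Resp` stand-in:

```python
from dataclasses import dataclass, field

@dataclass
class Resp:
    """Hypothetical stand-in for ChainForge's response object."""
    text: str
    meta: dict = field(default_factory=dict)

def evaluate(response):
    # Prefix match: text after the cron expression is tolerated, but a
    # leading preamble ("Sure! The expression is ...") fails the check.
    ideal = response.meta['Ideal']
    return response.text.startswith(ideal)

print(evaluate(Resp("5 4 * * * (daily at 04:05)", {"Ideal": "5 4 * * *"})))  # True
print(evaluate(Resp("The answer is 5 4 * * *", {"Ideal": "5 4 * * *"})))     # False
```

This pairs with the system message above, which instructs the model to output the cron expression only, keeping the expected answer at the start of the response.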



@@ -0,0 +1 @@
{"flow": {"nodes": [{"width": 312, "height": 311, "id": "prompt-finnish-rhyme", "type": "prompt", "data": {"prompt": "{prompt}", "n": 1, "llms": [{"key": "aa3c0f03-22bd-416e-af4d-4bf5c4278c99", "settings": {"system_msg": "For each pair of words, determine whether their Finnish translations rhyme. If they do, output the pair of rhyming words in Finnish. If not, output NONE.", "temperature": 1, "functions": [], "function_call": "", "top_p": 1, "stop": [], "presence_penalty": 0, "frequency_penalty": 0}, "name": "GPT3.5", "emoji": "\ud83d\ude42", "model": "gpt-3.5-turbo", "base_model": "gpt-3.5-turbo", "temp": 1, "formData": {"shortname": "GPT3.5", "model": "gpt-3.5-turbo", "system_msg": "For each pair of words, determine whether their Finnish translations rhyme. If they do, output the pair of rhyming words in Finnish. If not, output NONE.", "temperature": 1, "functions": "", "function_call": "", "top_p": 1, "stop": "", "presence_penalty": 0, "frequency_penalty": 0}}]}, "position": {"x": 448, "y": 224}, "selected": false, "positionAbsolute": {"x": 448, "y": 224}, "dragging": false}, {"width": 333, "height": 182, "id": "eval-finnish-rhyme", "type": "evaluator", "data": {"code": "def evaluate(response):\n\ttxt = response.text\n\tideal = response.meta['Ideal']\n\treturn txt in ideal or ideal in txt"}, "position": {"x": 820, "y": 150}, "positionAbsolute": {"x": 820, "y": 150}}, {"width": 228, "height": 196, "id": "vis-finnish-rhyme", "type": "vis", "data": {"input": "eval-finnish-rhyme"}, "position": {"x": 1200, "y": 250}, "positionAbsolute": {"x": 1200, "y": 250}}, {"width": 302, "height": 260, "id": "inspect-finnish-rhyme", "type": "inspect", "data": {"input": "prompt-finnish-rhyme"}, "position": {"x": 820, "y": 400}, "positionAbsolute": {"x": 820, "y": 400}}, {"width": 368, "height": 191, "id": "table-finnish-rhyme", "type": "table", "data": {"rows": [{"prompt": "boat, Go!", "ideal": "vene, Mene!"}, {"prompt": "snow, rubber", "ideal": "lumi, kumi"}, {"prompt": "car, 
tasteless", "ideal": "NONE"}, {"prompt": "flower, sock", "ideal": "kukka, sukka"}, {"prompt": "carpet, motto", "ideal": "NONE"}, {"prompt": "moth, butter", "ideal": "koi, voi"}, {"prompt": "ass (animal), glass", "ideal": "NONE"}, {"prompt": "meringue, spirit", "ideal": "marenki, henki"}, {"prompt": "marble, basket", "ideal": "marmori, kori"}, {"prompt": "sea, blood", "ideal": "meri, veri"}, {"prompt": "sea, pal", "ideal": "meri, kaveri"}, {"prompt": "wall, hay", "ideal": "sein\u00e4, hein\u00e4"}, {"prompt": "bottle, ball", "ideal": "NONE"}, {"prompt": "human being, delicious", "ideal": "ihminen, herkullinen"}, {"prompt": "pill, blood", "ideal": "pilleri, veri"}, {"prompt": "book, Norway", "ideal": "NONE"}, {"prompt": "slave, Norway", "ideal": "orja, Norja"}, {"prompt": "climate, mast", "ideal": "ilmasto, masto"}, {"prompt": "room, machine", "ideal": "NONE"}, {"prompt": "sabbath, watt", "ideal": "sapatti, watti"}], "columns": [{"key": "prompt", "header": "Prompt"}, {"key": "ideal", "header": "Ideal"}]}, "position": {"x": 32, "y": 240}, "selected": true, "positionAbsolute": {"x": 32, "y": 240}, "dragging": false}], "edges": [{"source": "prompt-finnish-rhyme", "sourceHandle": "prompt", "target": "eval-finnish-rhyme", "targetHandle": "responseBatch", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-eval-1686756357355responseBatch"}, {"source": "prompt-finnish-rhyme", "sourceHandle": "prompt", "target": "inspect-finnish-rhyme", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-inspect-1686756357355input"}, {"source": "eval-finnish-rhyme", "sourceHandle": "output", "target": "vis-finnish-rhyme", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": 
"reactflow__edge-eval-1686756357355output-vis-1686756357355input"}, {"source": "table-finnish-rhyme", "sourceHandle": "Prompt", "target": "prompt-finnish-rhyme", "targetHandle": "prompt", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-table-1686756385002Prompt-prompt-1686756357355prompt"}], "viewport": {"x": 144, "y": 37, "zoom": 1}}, "cache": {"eval-1686756357355.json": {}, "inspect-1686756357355.json": {}, "prompt-1686756357355.json": {}, "table-1686756385002.json": {}, "vis-1686756357355.json": {}}}

Some files were not shown because too many files have changed in this diff.