ChainForge/chainforge/oaievals/stats-tests.cforge
ianarawjo b33397930b
TypeScript backend, HuggingFace models, JavaScript evaluators, Comment Nodes, and more (#81)
* Beginning to convert Python backend to Typescript

* Change all fetch() calls to fetch_from_backend switcher

* wip converting query.py to query.ts

* wip started utils.js conversion. Tested that OpenAI API call works

* more progress on converting utils.py to Typescript

* jest tests for query, utils, template.ts. Confirmed PromptPipeline works.

* wip converting queryLLM in flask_app to TS

* Tested queryLLM and StorageCache compressed saving/loading

* wip execute() in backend.ts

* Added execute() and tested w concrete func. Need to test eval()

* Added craco for optional webpack config. Config'd for TypeScript with Node.js packages browserify'd

* Execute JS code on iframe sandbox

* Tested and working JS Evaluator execution.

* wip swapping backends

* Tested TypeScript backendgit status! :) woot

* Added fetchEnvironAPIKeys to Flask server to fetch os.environ keys when running locally

* Route Anthropic calls through Flask when running locally

* Added info button to Eval nodes. Rebuilt react

* Edits to info modal on Eval node

* Remove/error out on Python eval nodes when not running locally.

* Check browser compat and display error if not supported

* Changed all example flows to use JS. Bug fix in query.ts

* Refactored to LLMProvider to streamline model additions

* Added HuggingFace models API

* Added back Dalai call support, routing through Flask

* Remove flask app calls and socketio server that are no longer used

* Added Comment Nodes. Rebuilt react.

* Fix PaLM temp=0 build, update package vers and rebuild react
2023-06-30 15:11:20 -04:00

1 line
14 KiB
Plaintext

{"flow": {"nodes": [{"width": 312, "height": 311, "id": "prompt-stats-tests", "type": "prompt", "data": {"prompt": "{prompt}", "n": 1, "llms": [{"key": "aa3c0f03-22bd-416e-af4d-4bf5c4278c99", "settings": {"system_msg": "TASK: Read the provided research situation and determine which statistical test is most appropriate for the situation from the list of answer choices. Respond only with one of the letters 'A', 'B', 'C', or 'D', and nothing else.", "temperature": 1, "functions": [], "function_call": "", "top_p": 1, "stop": [], "presence_penalty": 0, "frequency_penalty": 0}, "name": "GPT3.5", "emoji": "\ud83d\ude42", "model": "gpt-3.5-turbo", "base_model": "gpt-3.5-turbo", "temp": 1, "formData": {"shortname": "GPT3.5", "model": "gpt-3.5-turbo", "system_msg": "TASK: Read the provided research situation and determine which statistical test is most appropriate for the situation from the list of answer choices. Respond only with one of the letters 'A', 'B', 'C', or 'D', and nothing else.", "temperature": 1, "functions": "", "function_call": "", "top_p": 1, "stop": "", "presence_penalty": 0, "frequency_penalty": 0}}]}, "position": {"x": 448, "y": 224}, "selected": false, "positionAbsolute": {"x": 448, "y": 224}, "dragging": false}, {"width": 333, "height": 182, "id": "eval-stats-tests", "type": "evaluator", "data": {"code": "function evaluate(response) {\n\tlet ideal = response.meta['Ideal'];\n\treturn response.text.startsWith(ideal);\n}", "language": "javascript"}, "position": {"x": 820, "y": 150}, "positionAbsolute": {"x": 820, "y": 150}}, {"width": 228, "height": 196, "id": "vis-stats-tests", "type": "vis", "data": {"input": "eval-stats-tests"}, "position": {"x": 1200, "y": 250}, "positionAbsolute": {"x": 1200, "y": 250}}, {"width": 302, "height": 260, "id": "inspect-stats-tests", "type": "inspect", "data": {"input": "prompt-stats-tests"}, "position": {"x": 820, "y": 400}, "positionAbsolute": {"x": 820, "y": 400}}, {"width": 423, "height": 417, "id": "table-stats-tests", "type": "table", "data": {"rows": [{"prompt": "Situation: 'Bogton Council decide to see whether performance-related pay would improve morale amongst their lavatory cleaners. Each month, twenty lavatory cleaners are paid on the basis of the length of the bristles on their lavatory brush (on the assumption that the harder they have worked, the shorter their bristles will be). Another twenty are paid their usual near-subsistence-level wages, regardless of how hard they work. After 6 months, each worker is asked to rate how happy they are in their job, using a seven-point scale. Which test would you use to see if performance-related pay has affected workers' morale?' Answer Choices: A: Friedman's test B: Mann-Whitney test C: Wilcoxon test D: independent-measures t-test", "ideal": "B"}, {"prompt": "Situation: 'An experimenter wants to know whether experience affects how well shop-keepers can identify children who ask for cigarettes but are under the legal age for purchasing them. Each of 30 tobacconists is shown a random sequence of 40 photographs of young faces, and asked to decide whether each face is younger or older than the legal age for buying cigarettes. (Half of the faces are aged above the legal age, and half below). The experimenter records the number of correct decisions per participant, and also asks each shop-keeper how long they have been selling cigarettes. (These latter data turn out to be heavily skewed). Which test should the experimenter use to decide whether experience leads to better age-estimation in this group?' Answer Choices: A: Pearson's r B: Spearman's rho C: Kruskal-Wallis test D: Friedman's test", "ideal": "B"}, {"prompt": "Situation: 'It's often said that you're hungry again soon after a Chinese meal. An experimenter puts this to the test. There are four conditions, and each participant does each one, on a different day of the week (order of conditions is counterbalanced across participants). In the first, participants eat an Indian takeaway; in the second, they eat a pizza; in the third, they eat a Chinese takeaway; and in the fourth, they eat a Kentyukky Flayed Chicken takeaway. All the meals are equated for bulk of contents and calorific value. The dependent variable is the loudness of each participants' stomach rumblings (in decibels), measured one hour after they have eaten the meal. These measurements are normally distributed, but much more variable for the \"KFC\" condition than the others. Which test should be used to decide whether there is a difference between these meals in terms of how quickly people get hungry again after eating them?' Answer Choices: A: Friedman's test B: repeated-measures t-test C: one-way independent-measures ANOVA D: Friedman's test", "ideal": "A"}, {"prompt": "Situation: 'Some TV viewers complain to the BBC that Jeremy Clarkson's programme \"Top Gear\" is a bad influence on young drivers, given that it extols the virtues of laddishness, speeding and high performance cars. To determine whether there is any foundation to these claims, a researcher uses a speed camera to measure the speeds of 400 drivers on an A-road, the morning before the programme is transmitted. He follows this procedure again, the morning afterwards. Each car is photographed, so that the experimenter can select only those drivers who travelled that route on both occasions, and hence whose speeds were measured twice. The experimenter subtracts each driver's first speed reading from their second, to get a \"difference score\": a positive score means a driver drove faster on the second occasion, and a negative score means they drove more slowly. The selected drivers were then contacted and asked whether or not they had watched \"Top Gear\" that week. Which test would you use to see whether drivers who watched \"Top Gear\" drove faster the following morning than drivers who did not watch it?' Answer Choices: A: Friedman's test B: repeated-measures t-test C: Pearson's r D: independent-measures t-test", "ideal": "D"}, {"prompt": "Situation: 'A researcher is interested in factors affecting reproductive success in Homo canarywharfensis, an obscure species of proto-human that inhabits high-altitude habitats in a region of south-east London. Once she has acclimatised them to her presence, she traps a hundred of the males and records the price of their suits. She then releases them back into the wild and follows them for a fortnight, recording how many females each one mates with. Is there a relationship between wealth (as reflected in suit price) and reproductive success (as reflected in how many females each male mates with?) The data for reproductive success are heavily skewed, since most of the males attract no females.' Answer Choices: A: Mann-Whitney test B: repeated-measures t-test C: Spearman's rho D: independent-measures t-test", "ideal": "C"}, {"prompt": "Situation: 'The local Sussex ale, Harvey's Best bitter, is reputed to be imbued with truly magical medicinal properties, as well as having an especially delicious flavour, a unique golden colour and a beautiful yeasty head. To investigate its effects, a researcher asks four groups of cyclists to cycle up Ditchling Beacon (the highest point on the South Downs). One group drink no Harvey's beforehand; another group drink one pint of Best each; a third group drink two pints each; and a fourth group drink four pints each. The dependent variable is how fast each cyclist gets from the bottom of the Beacon to the top. Which test would you use to see if drinking Harvey's affects the cyclists' speed of ascent?' Answer Choices: A: Pearson's r B: repeated-measures t-test C: one-way independent-measures ANOVA D: independent-measures t-test", "ideal": "C"}, {"prompt": "Situation: 'It is said that every time someone prints off an email, a penguin dies. To put this to the test, a researcher flies to the South Pole and repeatedly counts the number of penguins, as her colleague at Sussex prints out his emails one at a time. Which test would you use to see if there is a relationship between printing off emails and penguin mortality?' Answer Choices: A: Wilcoxon test B: Kruskal-Wallis test C: Pearson's r D: Friedman's test", "ideal": "C"}, {"prompt": "Situation: 'A researcher investigates four different methods for coping with extreme stress. Each person attempts to assemble an IKEA flat-pack wardrobe (the stress-induction phase of the study), and is then allocated randomly to one of four groups. Those in the first group practise yoga for twenty minutes; those in the second group engage in deep breathing for a similar amount of time; those in the third group spend twenty minutes in a Harvey's pub, drinking Best bitter; and those in the fourth group simply scream at the top of their voice for twenty minutes. Each participant then provides a rating on a 0-10 scale of how stressed they feel. Which test would you use to determine whether the four methods differ in their effectiveness for relieving stress?' Answer Choices: A: Kruskal-Wallis test B: Wilcoxon test C: Mann-Whitney test D: Spearman's rho", "ideal": "A"}, {"prompt": "Situation: 'To determine whether young chldren find \"Dr. Who\" scary, a researcher asks the parents of thirty six-year olds to keep a record of how many nightmares each child has on Saturday night (after watching \"Dr. Who\") and Sunday night (after watching \"Songs of Praise\"). Which test would you use to see if watching \"Dr. Who\" is associated with more nightmares than watching \"Songs of Praise\"?' Answer Choices: A: Chi-Square test of association B: repeated-measures t-test C: Spearman's rho D: one-way independent-measures ANOVA", "ideal": "B"}, {"prompt": "Situation: 'To determine whether young chldren find \"Dr. Who\" scary, a researcher asks the parents of thirty six-year olds to rate how frightened they think their child is on Saturday night (after watching \"Dr. Who\") and Sunday night (after watching \"Songs of Praise\"). Which test would you use to see if parents think their children are more frightened by watching \"Dr. Who\" than by watching \"Songs of Praise\"?' Answer Choices: A: Wilcoxon test B: independent-measures t-test C: Wilcoxon test D: Spearman's rho", "ideal": "A"}, {"prompt": "Situation: '200 men and 150 women are asked to decide which one of the following features is most important to them when they choose a new car: price, performance, safety level, roominess, or colour. Which test would you use to see if men and women differ in their preferences?' Answer Choices: A: Kruskal-Wallis test B: Chi-Square test of association C: Chi-Square test of association D: repeated-measures t-test", "ideal": "B"}, {"prompt": "Situation: 'An experimenter investigates the accuracy of fortune-tellers' predictions. She asks each of fifteen fortune-tellers, and each of twenty students, to make ten specific predictions about what will happen to her in the next month. She then records, for each of these people, how many of these predictions come true. Which test should she use to see if the fortune-tellers are more accurate in their predictions than the students?' Answer Choices: A: Chi-Square test of association B: Kruskal-Wallis test C: independent-measures t-test D: one-way independent-measures ANOVA", "ideal": "C"}, {"prompt": "Situation: 'The experimenter from the previous study returns to each of the participants and tells them that none of their predictions came true. She then asks each of the participants to estimate their level of psychic ability on a seven-point scale. Which test should the experimenter use to determine whether this negative feedback about their performance affects the fortune-tellers and students differently?' Answer Choices: A: repeated-measures t-test B: Friedman's test C: Mann-Whitney test D: independent-measures t-test", "ideal": "C"}, {"prompt": "Situation: 'A study looks at the effectiveness of TV adverts in relation to their position in the adbreak between programmes. There are three conditions. All participants see the same advert, for \"Churn Flakes\", but for one group the advert comes at the start of the adbreak; for a second group, it comes in the middle; and for the third group it comes at the end, just before the next program begins. A week later, each participant returns to the lab and sees a sequence of photographs of breakfast cereal boxes, including the box for \"Churn Flakes\". Their task is to rate each cereal in terms of how much they like it, using a seven-point scale.' Answer Choices: A: independent-measures t-test B: Friedman's test C: Kruskal-Wallis test D: Chi-Square test of association", "ideal": "C"}, {"prompt": "Situation: 'While sales of traditional classical music CD's are falling, \"cross-over\" classical performers who sacrifice their integrity for money by producing populist versions of tunes like \"Nessun Dorma\" are big business. The CD sales of twenty opera singers are examined: ten of these singers are rated as \"ugly\" by a panel of independent judges, and twenty are rated as \"highly attractive\". Is the success of these perfomers related to their physical attractiveness?' Answer Choices: A: one-way independent-measures ANOVA B: Spearman's rho C: Wilcoxon test D: independent-measures t-test", "ideal": "D"}], "columns": [{"key": "prompt", "header": "Prompt"}, {"key": "ideal", "header": "Ideal"}]}, "position": {"x": -16, "y": 160}, "selected": false, "positionAbsolute": {"x": -16, "y": 160}, "dragging": false}], "edges": [{"source": "prompt-stats-tests", "sourceHandle": "prompt", "target": "eval-stats-tests", "targetHandle": "responseBatch", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-eval-1686756357355responseBatch"}, {"source": "prompt-stats-tests", "sourceHandle": "prompt", "target": "inspect-stats-tests", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-prompt-1686756357355prompt-inspect-1686756357355input"}, {"source": "eval-stats-tests", "sourceHandle": "output", "target": "vis-stats-tests", "targetHandle": "input", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-eval-1686756357355output-vis-1686756357355input"}, {"source": "table-stats-tests", "sourceHandle": "Prompt", "target": "prompt-stats-tests", "targetHandle": "prompt", "interactionWidth": 100, "markerEnd": {"type": "arrow", "width": "22px", "height": "22px"}, "id": "reactflow__edge-table-1686756385002Prompt-prompt-1686756357355prompt"}], "viewport": {"x": 144, "y": 37, "zoom": 1}}, "cache": {"eval-1686756357355.json": {}, "inspect-1686756357355.json": {}, "prompt-1686756357355.json": {}, "table-1686756385002.json": {}, "vis-1686756357355.json": {}}}