diff --git a/docs/benchmark.md b/docs/benchmark.md
index a79f116b6c..8e41eda6dd 100644
--- a/docs/benchmark.md
+++ b/docs/benchmark.md
@@ -11,10 +11,10 @@ directory exists.
 
 The `helm-run` provides several flags that can be used to test that the configuration and scenario are working correctly without actually sending requests to the model
 
     # Just load the config file
-    helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --skip-instances
+    helm-run --conf src/helm/benchmark/presentation/run_entries_small.conf --max-eval-instances 10 --suite v1 --skip-instances
 
     # Create the instances and the requests, but don't send requests to the model
-    helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --dry-run
+    helm-run --conf src/helm/benchmark/presentation/run_entries_small.conf --max-eval-instances 10 --suite v1 --dry-run
 
 ## Estimating Token Usage
diff --git a/docs/get_helm_rank.md b/docs/get_helm_rank.md
index bd7a828c50..f56e217543 100644
--- a/docs/get_helm_rank.md
+++ b/docs/get_helm_rank.md
@@ -44,14 +44,14 @@ export MODEL_TO_RUN=huggingface/gpt2
 That's it, run the following to get the config file:
 
 ```bash
-wget https://raw.githubusercontent.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_specs_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_specs_$EXAMPLES_PER_SCENARIO.conf
+wget https://raw.githubusercontent.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_entries_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_entries_$EXAMPLES_PER_SCENARIO.conf
 ```
 
 and this one to run the benchmark (will take some time in the first time since all the data has to be prepared):
 
 ```bash
 helm-run \
---conf-paths run_specs_$EXAMPLES_PER_SCENARIO.conf \
+--conf-paths run_entries_$EXAMPLES_PER_SCENARIO.conf \
 --suite $LEADERBOARD_VERSION \
 --max-eval-instances $EXAMPLES_PER_SCENARIO \
 --models-to-run $MODEL_TO_RUN \
diff --git a/docs/heim.md b/docs/heim.md
index 562949ad0c..f5aaba19bf 100644
--- a/docs/heim.md
+++ b/docs/heim.md
@@ -6,11 +6,11 @@ To run HEIM, follow these steps:
    [Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) against the
    [MS-COCO scenario](https://github.com/stanford-crfm/heim/blob/main/src/helm/benchmark/scenarios/image_generation/mscoco_scenario.py), run:
 
 ```
-echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_specs.conf
+echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_entries.conf
 ```
 2. Run the benchmark with certain number of instances (e.g., 10 instances):
-`helm-run --conf-paths run_specs.conf --suite heim_v1 --max-eval-instances 10`
+`helm-run --conf-paths run_entries.conf --suite heim_v1 --max-eval-instances 10`
 Examples of run specs configuration files can be found [here](https://github.com/stanford-crfm/helm/tree/main/src/helm/benchmark/presentation).
-We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs_heim.conf)
+We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_entries_heim.conf)
 to produce results of the paper.
diff --git a/docs/quick_start.md b/docs/quick_start.md
index c6a7f83c9e..3a1fb2f561 100644
--- a/docs/quick_start.md
+++ b/docs/quick_start.md
@@ -4,10 +4,10 @@ Run the following:
 
 ```
 # Create a run specs configuration
-echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > run_specs.conf
+echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > run_entries.conf
 
 # Run benchmark
-helm-run --conf-paths run_specs.conf --suite v1 --max-eval-instances 10
+helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10
 
 # Summarize benchmark results
 helm-summarize --suite v1
diff --git a/docs/tutorial.md b/docs/tutorial.md
index 5483fccd3f..c000825d9a 100644
--- a/docs/tutorial.md
+++ b/docs/tutorial.md
@@ -10,7 +10,7 @@ We will run two runs using the `mmlu` scenario on the `openai/gpt2` model. The `
 
 To run this benchmark using the HELM command-line tools, we need to specify **run spec descriptions** that describes the desired runs. For this example, the run spec descriptions are `mmlu:subject=anatomy,model=openai/gpt2` (for anatomy) and `mmlu:subject=philosophy,model=openai/gpt2` (for philosophy).
 
-Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_specs.conf` with the following contents:
+Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_entries.conf` with the following contents:
 
 ```
 entries: [
@@ -22,7 +22,7 @@ entries: [
 We will now use `helm-run` to execute the runs that have been specified in this run spec configuration file. Run this command:
 
 ```
-helm-run --conf-paths run_specs.conf --suite v1 --max-eval-instances 10
+helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10
 ```
 
 The meaning of the additional arguments are as follows:
@@ -45,7 +45,7 @@ Each output sub-directory will contain several JSON files that were generated du
 
 - `per_instance_stats.json` contains a serialized list of `PerInstanceStats`, which contains the statistics produced for the metrics for each instance (i.e. input).
 - `stats.json` contains a serialized list of `PerInstanceStats`, which contains the statistics produced for the metrics, aggregated across all instances (i.e. inputs).
 
-`helm-run` provides additional arguments that can be used to filter out `--models-to-run`, `--groups-to-run` and `--priority`. It can be convenient to create a large `run_specs.conf` file containing every run spec description of interest, and then use these flags to filter down the RunSpecs to actually run. As an example, the main `run_specs.conf` file used for the HELM benchmarking paper can be found [here](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs.conf).
+`helm-run` provides additional arguments that can be used to filter out `--models-to-run`, `--groups-to-run` and `--priority`. It can be convenient to create a large `run_entries.conf` file containing every run spec description of interest, and then use these flags to filter down the RunSpecs to actually run. As an example, the main `run_specs.conf` file used for the HELM benchmarking paper can be found [here](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs.conf).
 
 **Using model or model_deployment:** Some models have several deployments (for exmaple `eleutherai/gpt-j-6b` is deployed under `huggingface/gpt-j-6b`, `gooseai/gpt-j-6b` and `together/gpt-j-6b`). Since the results can differ depending on the deployment, we provide a way to specify the deployment instead of the model. Instead of using `model=eleutherai/gpt-g-6b`, use `model_deployment=huggingface/gpt-j-6b`. If you do not, a deployment will be arbitrarily chosen. This can still be used for models that have a single deployment and is a good practice to follow to avoid any ambiguity.
diff --git a/scripts/helm-run-all.sh b/scripts/helm-run-all.sh
index 92c2896989..10c9146f19 100644
--- a/scripts/helm-run-all.sh
+++ b/scripts/helm-run-all.sh
@@ -67,7 +67,7 @@ do
     logfile="${logfile// /_}"  # Replace spaces
 
     # Override with passed-in CLI arguments
-    # By default, the command will run the RunSpecs listed in src/helm/benchmark/presentation/run_specs.conf
+    # By default, the command will run the RunSpecs listed in src/helm/benchmark/presentation/run_entries.conf
     # and output results to `benchmark_output/runs/`.
     execute "helm-run --models-to-run $model $* &> $logfile.log &"
 done
diff --git a/scripts/verify_reproducibility.py b/scripts/verify_reproducibility.py
index c1eee8475a..1b59e73db2 100644
--- a/scripts/verify_reproducibility.py
+++ b/scripts/verify_reproducibility.py
@@ -126,7 +126,7 @@ def verify_reproducibility(
         "--conf-path",
         type=str,
         help="Where to read RunSpecs to run from",
-        default="src/helm/benchmark/presentation/run_specs.conf",
+        default="src/helm/benchmark/presentation/run_entries.conf",
     )
     parser.add_argument(
         "--models-to-run",
diff --git a/src/helm/benchmark/presentation/run_specs.conf b/src/helm/benchmark/presentation/run_entries.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs.conf
rename to src/helm/benchmark/presentation/run_entries.conf
diff --git a/src/helm/benchmark/presentation/run_specs_big_bench_lite.conf b/src/helm/benchmark/presentation/run_entries_big_bench_lite.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_big_bench_lite.conf
rename to src/helm/benchmark/presentation/run_entries_big_bench_lite.conf
diff --git a/src/helm/benchmark/presentation/run_specs_biomedical.conf b/src/helm/benchmark/presentation/run_entries_biomedical.conf
similarity index 97%
rename from src/helm/benchmark/presentation/run_specs_biomedical.conf
rename to src/helm/benchmark/presentation/run_entries_biomedical.conf
index b0d2fab3e8..4181e3dd8d 100644
--- a/src/helm/benchmark/presentation/run_specs_biomedical.conf
+++ b/src/helm/benchmark/presentation/run_entries_biomedical.conf
@@ -1,5 +1,5 @@
 # Biomedical RunSpecs
-# helm-run --suite biomed --conf-path src/helm/benchmark/presentation/run_specs_biomedical.conf -m 1000
+# helm-run --suite biomed --conf-path src/helm/benchmark/presentation/run_entries_biomedical.conf -m 1000
 
 entries: [
   ######################################################### NLU ######################################################
diff --git a/src/helm/benchmark/presentation/run_specs_cleva_v1.conf b/src/helm/benchmark/presentation/run_entries_cleva_v1.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_cleva_v1.conf
rename to src/helm/benchmark/presentation/run_entries_cleva_v1.conf
diff --git a/src/helm/benchmark/presentation/run_specs_core_scenarios_10.conf b/src/helm/benchmark/presentation/run_entries_core_scenarios_10.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_core_scenarios_10.conf
rename to src/helm/benchmark/presentation/run_entries_core_scenarios_10.conf
diff --git a/src/helm/benchmark/presentation/run_specs_core_scenarios_100.conf b/src/helm/benchmark/presentation/run_entries_core_scenarios_100.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_core_scenarios_100.conf
rename to src/helm/benchmark/presentation/run_entries_core_scenarios_100.conf
diff --git a/src/helm/benchmark/presentation/run_specs_core_scenarios_1000.conf b/src/helm/benchmark/presentation/run_entries_core_scenarios_1000.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_core_scenarios_1000.conf
rename to src/helm/benchmark/presentation/run_entries_core_scenarios_1000.conf
diff --git a/src/helm/benchmark/presentation/run_specs_core_scenarios_20.conf b/src/helm/benchmark/presentation/run_entries_core_scenarios_20.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_core_scenarios_20.conf
rename to src/helm/benchmark/presentation/run_entries_core_scenarios_20.conf
diff --git a/src/helm/benchmark/presentation/run_specs_core_scenarios_50.conf b/src/helm/benchmark/presentation/run_entries_core_scenarios_50.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_core_scenarios_50.conf
rename to src/helm/benchmark/presentation/run_entries_core_scenarios_50.conf
diff --git a/src/helm/benchmark/presentation/run_specs_core_scenarios_all.conf b/src/helm/benchmark/presentation/run_entries_core_scenarios_all.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_core_scenarios_all.conf
rename to src/helm/benchmark/presentation/run_entries_core_scenarios_all.conf
diff --git a/src/helm/benchmark/presentation/run_specs_dec2023.conf b/src/helm/benchmark/presentation/run_entries_dec2023.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_dec2023.conf
rename to src/helm/benchmark/presentation/run_entries_dec2023.conf
diff --git a/src/helm/benchmark/presentation/run_specs_decodingtrust.conf b/src/helm/benchmark/presentation/run_entries_decodingtrust.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_decodingtrust.conf
rename to src/helm/benchmark/presentation/run_entries_decodingtrust.conf
diff --git a/src/helm/benchmark/presentation/run_specs_extra.conf b/src/helm/benchmark/presentation/run_entries_extra.conf
similarity index 99%
rename from src/helm/benchmark/presentation/run_specs_extra.conf
rename to src/helm/benchmark/presentation/run_entries_extra.conf
index 9568075aba..def7121da5 100644
--- a/src/helm/benchmark/presentation/run_specs_extra.conf
+++ b/src/helm/benchmark/presentation/run_entries_extra.conf
@@ -32,7 +32,7 @@ entries: [
   {description: "truthful_qa:model=ablation_full_functionality_text,task=mc_single,method=multiple_choice_separate_original", priority: 2, groups: ["ablation_multiple_choice"]}
   {description: "truthful_qa:model=ablation_full_functionality_text,task=mc_single,method=multiple_choice_separate_calibrated", priority: 2, groups: ["ablation_multiple_choice"]}
 
-  # MMLU priorities follow the main `run_specs.conf` with 2 -> 1 and 4 -> 3
+  # MMLU priorities follow the main `run_entries.conf` with 2 -> 1 and 4 -> 3
   {description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_joint", priority: 1, groups: ["ablation_multiple_choice"]}
   {description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_separate_original", priority: 1, groups: ["ablation_multiple_choice"]}
   {description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_separate_calibrated", priority: 1, groups: ["ablation_multiple_choice"]}
@@ -357,14 +357,14 @@ entries: [
   # {description: "raft:subset=tai_safety_research,model=ablation_text,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
   # {description: "raft:subset=terms_of_service,model=ablation_text,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
 
-  # mmlu (only subjects with priority <= 2 in run_specs.conf)
+  # mmlu (only subjects with priority <= 2 in run_entries.conf)
   # {description: "mmlu:model=ablation_text,subject=abstract_algebra,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
   # {description: "mmlu:model=ablation_text,subject=college_chemistry,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
   # {description: "mmlu:model=ablation_text,subject=computer_security,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
   # {description: "mmlu:model=ablation_text,subject=econometrics,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
   # {description: "mmlu:model=ablation_text,subject=us_foreign_policy,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
 
-  # civil comments (only subjects with priority <= 2 in run_specs.conf)
+  # civil comments (only subjects with priority <= 2 in run_entries.conf)
   # {description: "civil_comments:model=ablation_text,demographic=all,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
   # {description: "civil_comments:model=ablation_text,demographic=male,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
   # {description: "civil_comments:model=ablation_text,demographic=female,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
diff --git a/src/helm/benchmark/presentation/run_specs_gpu.conf b/src/helm/benchmark/presentation/run_entries_gpu.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_gpu.conf
rename to src/helm/benchmark/presentation/run_entries_gpu.conf
diff --git a/src/helm/benchmark/presentation/run_specs_heim.conf b/src/helm/benchmark/presentation/run_entries_heim.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_heim.conf
rename to src/helm/benchmark/presentation/run_entries_heim.conf
diff --git a/src/helm/benchmark/presentation/run_specs_heim_debug.conf b/src/helm/benchmark/presentation/run_entries_heim_debug.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_heim_debug.conf
rename to src/helm/benchmark/presentation/run_entries_heim_debug.conf
diff --git a/src/helm/benchmark/presentation/run_specs_heim_human.conf b/src/helm/benchmark/presentation/run_entries_heim_human.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_heim_human.conf
rename to src/helm/benchmark/presentation/run_entries_heim_human.conf
diff --git a/src/helm/benchmark/presentation/run_specs_heim_human_eval.conf b/src/helm/benchmark/presentation/run_entries_heim_human_eval.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_heim_human_eval.conf
rename to src/helm/benchmark/presentation/run_entries_heim_human_eval.conf
diff --git a/src/helm/benchmark/presentation/run_specs_image2structure.conf b/src/helm/benchmark/presentation/run_entries_image2structure.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_image2structure.conf
rename to src/helm/benchmark/presentation/run_entries_image2structure.conf
diff --git a/src/helm/benchmark/presentation/run_specs_interactive_qa.conf b/src/helm/benchmark/presentation/run_entries_interactive_qa.conf
similarity index 92%
rename from src/helm/benchmark/presentation/run_specs_interactive_qa.conf
rename to src/helm/benchmark/presentation/run_entries_interactive_qa.conf
index cc641dda97..3cfa8b8f09 100644
--- a/src/helm/benchmark/presentation/run_specs_interactive_qa.conf
+++ b/src/helm/benchmark/presentation/run_entries_interactive_qa.conf
@@ -1,6 +1,6 @@
 # MMLU subjects used for InteractiveQA. Run:
 # helm-run --priority 1 --suite interactive_qa_mmlu --num-threads 1 --num-train-trials 3
-# --conf-path src/helm/benchmark/presentation/run_specs_interactive_qa.conf --max-eval-instances 10
+# --conf-path src/helm/benchmark/presentation/run_entries_interactive_qa.conf --max-eval-instances 10
 
 entries: [
   {description: "interactive_qa_mmlu:model=interactive_qa,subject=college_chemistry", priority: 1}
diff --git a/src/helm/benchmark/presentation/run_specs_lite.conf b/src/helm/benchmark/presentation/run_entries_lite.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_lite.conf
rename to src/helm/benchmark/presentation/run_entries_lite.conf
diff --git a/src/helm/benchmark/presentation/run_specs_opinions_qa_ai21_default.conf b/src/helm/benchmark/presentation/run_entries_opinions_qa_ai21_default.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_opinions_qa_ai21_default.conf
rename to src/helm/benchmark/presentation/run_entries_opinions_qa_ai21_default.conf
diff --git a/src/helm/benchmark/presentation/run_specs_opinions_qa_ai21_steer.conf b/src/helm/benchmark/presentation/run_entries_opinions_qa_ai21_steer.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_opinions_qa_ai21_steer.conf
rename to src/helm/benchmark/presentation/run_entries_opinions_qa_ai21_steer.conf
diff --git a/src/helm/benchmark/presentation/run_specs_opinions_qa_openai_default.conf b/src/helm/benchmark/presentation/run_entries_opinions_qa_openai_default.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_opinions_qa_openai_default.conf
rename to src/helm/benchmark/presentation/run_entries_opinions_qa_openai_default.conf
diff --git a/src/helm/benchmark/presentation/run_specs_opinions_qa_openai_steer.conf b/src/helm/benchmark/presentation/run_entries_opinions_qa_openai_steer.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_opinions_qa_openai_steer.conf
rename to src/helm/benchmark/presentation/run_entries_opinions_qa_openai_steer.conf
diff --git a/src/helm/benchmark/presentation/run_specs_small.conf b/src/helm/benchmark/presentation/run_entries_small.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_small.conf
rename to src/helm/benchmark/presentation/run_entries_small.conf
diff --git a/src/helm/benchmark/presentation/run_specs_tiny.conf b/src/helm/benchmark/presentation/run_entries_tiny.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_tiny.conf
rename to src/helm/benchmark/presentation/run_entries_tiny.conf
diff --git a/src/helm/benchmark/presentation/run_specs_vhelm.conf b/src/helm/benchmark/presentation/run_entries_vhelm.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_vhelm.conf
rename to src/helm/benchmark/presentation/run_entries_vhelm.conf
diff --git a/src/helm/benchmark/presentation/run_specs_vhelm_lite.conf b/src/helm/benchmark/presentation/run_entries_vhelm_lite.conf
similarity index 100%
rename from src/helm/benchmark/presentation/run_specs_vhelm_lite.conf
rename to src/helm/benchmark/presentation/run_entries_vhelm_lite.conf
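The bulk of this patch is a mechanical `run_specs*` → `run_entries*` rename of the conf files under `src/helm/benchmark/presentation/`. As a hypothetical sketch (the file names below are illustrative stand-ins, and this is not how the patch was necessarily produced), such a rename can be derived with bash parameter expansion; in a real checkout one would glob `src/helm/benchmark/presentation/run_specs*.conf` and use `git mv` instead of `echo` so git records the renames:

```shell
#!/usr/bin/env bash
# Dry run: print the old -> new name for each conf file.
# Replace `echo` with `git mv "$f" "${f/run_specs/run_entries}"` to apply.
for f in run_specs_small.conf run_specs_heim.conf run_specs_lite.conf; do
  echo "$f -> ${f/run_specs/run_entries}"
done
```

Note that doc references would still need a separate pass (e.g. `sed` over `docs/` and `scripts/`), and not blindly: the tutorial intentionally keeps one historical `run_specs.conf` link for the paper.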