Rename run_specs*.conf to run_entries*.conf #2430

Merged · 2 commits · Apr 3, 2024
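The rename itself is mechanical. For readers who want to reproduce it in a local checkout, a rough shell sketch (hypothetical; these are not the commands recorded in this PR, and the tutorial deliberately keeps one historical `run_specs.conf` link that a blanket substitution would clobber):

```bash
# Hypothetical reproduction of this rename in a local checkout.
for f in src/helm/benchmark/presentation/run_specs*.conf; do
  git mv "$f" "${f/run_specs/run_entries}"   # git mv keeps the files' history
done
# Illustrative sweep over references in docs and scripts; review the hits first,
# since an intentional historical `run_specs.conf` link remains in the tutorial.
grep -rl 'run_specs' docs scripts | xargs sed -i 's/run_specs/run_entries/g'
```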
docs/benchmark.md · 2 additions & 2 deletions
@@ -11,10 +11,10 @@ directory exists.
The `helm-run` command provides several flags that can be used to test that the configuration and scenario are working correctly without actually sending requests to the model:

# Just load the config file
- helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --skip-instances
+ helm-run --conf src/helm/benchmark/presentation/run_entries_small.conf --max-eval-instances 10 --suite v1 --skip-instances

# Create the instances and the requests, but don't send requests to the model
- helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --dry-run
+ helm-run --conf src/helm/benchmark/presentation/run_entries_small.conf --max-eval-instances 10 --suite v1 --dry-run

## Estimating Token Usage

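For a quick smoke test without the bundled file, the same flags work with a hand-rolled conf; a minimal sketch using the entry format shown in the other docs in this PR (the file name `my_run_entries.conf` is illustrative):

```bash
# Hand-rolled conf using the entry format from quick_start.md below.
echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > my_run_entries.conf
# Parse the conf and build run specs only; no instances loaded, no requests sent.
helm-run --conf my_run_entries.conf --max-eval-instances 10 --suite v1 --skip-instances
```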
docs/get_helm_rank.md · 2 additions & 2 deletions
@@ -44,14 +44,14 @@ export MODEL_TO_RUN=huggingface/gpt2
That's it, run the following to get the config file:

```bash
- wget https://github.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_specs_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_specs_$EXAMPLES_PER_SCENARIO.conf
+ wget https://github.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_entries_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_entries_$EXAMPLES_PER_SCENARIO.conf
```

and this one to run the benchmark (the first run will take some time, since all the data has to be prepared):

```bash
helm-run \
- --conf-paths run_specs_$EXAMPLES_PER_SCENARIO.conf \
+ --conf-paths run_entries_$EXAMPLES_PER_SCENARIO.conf \
--suite $LEADERBOARD_VERSION \
--max-eval-instances $EXAMPLES_PER_SCENARIO \
--models-to-run $MODEL_TO_RUN \
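The snippet above assumes several environment variables defined earlier in that document; a sketch with illustrative values (only `MODEL_TO_RUN=huggingface/gpt2` is confirmed by the surrounding diff context, the others are plausible placeholders):

```bash
export EXAMPLES_PER_SCENARIO=10        # illustrative per-scenario instance budget
export LEADERBOARD_VERSION=v1          # illustrative suite name
export MODEL_TO_RUN=huggingface/gpt2   # value shown in the diff context above
```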
docs/heim.md · 3 additions & 3 deletions
@@ -6,11 +6,11 @@ To run HEIM, follow these steps:
[Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) against the
[MS-COCO scenario](https://github.com/stanford-crfm/heim/blob/main/src/helm/benchmark/scenarios/image_generation/mscoco_scenario.py), run:
```
- echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_specs.conf
+ echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_entries.conf
```
2. Run the benchmark with a certain number of instances (e.g., 10 instances):
- `helm-run --conf-paths run_specs.conf --suite heim_v1 --max-eval-instances 10`
+ `helm-run --conf-paths run_entries.conf --suite heim_v1 --max-eval-instances 10`

Examples of run specs configuration files can be found [here](https://github.com/stanford-crfm/helm/tree/main/src/helm/benchmark/presentation).
- We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs_heim.conf)
+ We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_entries_heim.conf)
to produce the results in the paper.
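Once the run finishes, the same summarize step shown in quick_start.md below should apply to the HEIM suite as well (hedged: this step is not part of the heim.md diff):

```bash
# Aggregate results for the suite used above (pattern from quick_start.md).
helm-summarize --suite heim_v1
```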
docs/quick_start.md · 2 additions & 2 deletions
@@ -4,10 +4,10 @@ Run the following:

```
# Create a run specs configuration
- echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > run_specs.conf
+ echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > run_entries.conf

# Run benchmark
- helm-run --conf-paths run_specs.conf --suite v1 --max-eval-instances 10
+ helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite v1
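To browse the summarized results, HELM also ships a `helm-server` command (hedged: it is not touched by this diff, and the default `benchmark_output/` location is an assumption):

```bash
# Serve the local results for inspection in a browser.
helm-server   # assumed to read from benchmark_output/ by default
```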
docs/tutorial.md · 3 additions & 3 deletions
@@ -10,7 +10,7 @@ We will run two runs using the `mmlu` scenario on the `openai/gpt2` model. The `

To run this benchmark using the HELM command-line tools, we need to specify **run spec descriptions** that describe the desired runs. For this example, the run spec descriptions are `mmlu:subject=anatomy,model=openai/gpt2` (for anatomy) and `mmlu:subject=philosophy,model=openai/gpt2` (for philosophy).

- Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_specs.conf` with the following contents:
+ Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_entries.conf` with the following contents:

```
entries: [
@@ -22,7 +22,7 @@ entries: [
We will now use `helm-run` to execute the runs that have been specified in this run spec configuration file. Run this command:

```
- helm-run --conf-paths run_specs.conf --suite v1 --max-eval-instances 10
+ helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10
```

The meanings of the additional arguments are as follows:
@@ -45,7 +45,7 @@ Each output sub-directory will contain several JSON files that were generated du
- `per_instance_stats.json` contains a serialized list of `PerInstanceStats`, which contains the statistics produced for the metrics for each instance (i.e. input).
- `stats.json` contains a serialized list of `Stat`, which contains the statistics produced for the metrics, aggregated across all instances (i.e. inputs).

- `helm-run` provides additional arguments that can be used to filter out `--models-to-run`, `--groups-to-run` and `--priority`. It can be convenient to create a large `run_specs.conf` file containing every run spec description of interest, and then use these flags to filter down the RunSpecs to actually run. As an example, the main `run_specs.conf` file used for the HELM benchmarking paper can be found [here](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs.conf).
+ `helm-run` provides additional arguments that can be used to filter out `--models-to-run`, `--groups-to-run` and `--priority`. It can be convenient to create a large `run_entries.conf` file containing every run spec description of interest, and then use these flags to filter down the RunSpecs to actually run. As an example, the main `run_specs.conf` file used for the HELM benchmarking paper can be found [here](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs.conf).

**Using model or model_deployment:** Some models have several deployments (for example, `eleutherai/gpt-j-6b` is deployed under `huggingface/gpt-j-6b`, `gooseai/gpt-j-6b` and `together/gpt-j-6b`). Since the results can differ depending on the deployment, we provide a way to specify the deployment instead of the model. Instead of using `model=eleutherai/gpt-j-6b`, use `model_deployment=huggingface/gpt-j-6b`. If you do not, a deployment will be chosen arbitrarily. This can still be used for models that have a single deployment, and it is good practice to follow to avoid any ambiguity.

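A concrete version of the `model_deployment` advice in that last paragraph, using the deployment names it lists (a sketch; the entry format comes from the tutorial itself):

```bash
# Pin a specific deployment of eleutherai/gpt-j-6b instead of the bare model name.
echo 'entries: [{description: "mmlu:subject=anatomy,model_deployment=huggingface/gpt-j-6b", priority: 1}]' > run_entries.conf
helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10
```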
scripts/helm-run-all.sh · 1 addition & 1 deletion
@@ -67,7 +67,7 @@ do
logfile="${logfile// /_}" # Replace spaces

# Override with passed-in CLI arguments
- # By default, the command will run the RunSpecs listed in src/helm/benchmark/presentation/run_specs.conf
+ # By default, the command will run the RunSpecs listed in src/helm/benchmark/presentation/run_entries.conf
# and output results to `benchmark_output/runs/<Today's date e.g., 06-28-2022>`.
execute "helm-run --models-to-run $model $* &> $logfile.log &"
done
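Given that `$*` is forwarded to `helm-run`, a hedged usage sketch for this script (the exact flags accepted depend on the rest of the script, which is not shown in this hunk):

```bash
# Extra arguments pass straight through to each helm-run invocation.
bash scripts/helm-run-all.sh --suite v1 --max-eval-instances 10
```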
scripts/verify_reproducibility.py · 1 addition & 1 deletion
@@ -126,7 +126,7 @@ def verify_reproducibility(
"--conf-path",
type=str,
help="Where to read RunSpecs to run from",
default="src/helm/benchmark/presentation/run_specs.conf",
default="src/helm/benchmark/presentation/run_entries.conf",
)
parser.add_argument(
"--models-to-run",
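Both arguments visible in this hunk suggest a straightforward invocation; a hedged sketch (other required flags, if any, are not visible in this diff):

```bash
# Check reproducibility against the default (now renamed) conf.
python scripts/verify_reproducibility.py \
  --conf-path src/helm/benchmark/presentation/run_entries.conf \
  --models-to-run openai/gpt2
```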
src/helm/benchmark/presentation/run_specs_biomedical.conf → run_entries_biomedical.conf (renamed)
@@ -1,5 +1,5 @@
# Biomedical RunSpecs
- # helm-run --suite biomed --conf-path src/helm/benchmark/presentation/run_specs_biomedical.conf -m 1000
+ # helm-run --suite biomed --conf-path src/helm/benchmark/presentation/run_entries_biomedical.conf -m 1000

entries: [
######################################################### NLU ######################################################
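A hedged expansion of that header comment's command, assuming `-m` is the short form of the `--max-eval-instances` flag used elsewhere in these docs:

```bash
helm-run --suite biomed \
  --conf-path src/helm/benchmark/presentation/run_entries_biomedical.conf \
  --max-eval-instances 1000   # assumed long form of the `-m 1000` short flag
```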
@@ -32,7 +32,7 @@ entries: [
{description: "truthful_qa:model=ablation_full_functionality_text,task=mc_single,method=multiple_choice_separate_original", priority: 2, groups: ["ablation_multiple_choice"]}
{description: "truthful_qa:model=ablation_full_functionality_text,task=mc_single,method=multiple_choice_separate_calibrated", priority: 2, groups: ["ablation_multiple_choice"]}

- # MMLU priorities follow the main `run_specs.conf` with 2 -> 1 and 4 -> 3
+ # MMLU priorities follow the main `run_entries.conf` with 2 -> 1 and 4 -> 3
{description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_joint", priority: 1, groups: ["ablation_multiple_choice"]}
{description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_separate_original", priority: 1, groups: ["ablation_multiple_choice"]}
{description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_separate_calibrated", priority: 1, groups: ["ablation_multiple_choice"]}
@@ -357,14 +357,14 @@ entries: [
# {description: "raft:subset=tai_safety_research,model=ablation_text,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "raft:subset=terms_of_service,model=ablation_text,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}

- # mmlu (only subjects with priority <= 2 in run_specs.conf)
+ # mmlu (only subjects with priority <= 2 in run_entries.conf)
# {description: "mmlu:model=ablation_text,subject=abstract_algebra,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=college_chemistry,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=computer_security,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=econometrics,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=us_foreign_policy,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}

- # civil comments (only subjects with priority <= 2 in run_specs.conf)
+ # civil comments (only subjects with priority <= 2 in run_entries.conf)
# {description: "civil_comments:model=ablation_text,demographic=all,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "civil_comments:model=ablation_text,demographic=male,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "civil_comments:model=ablation_text,demographic=female,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
src/helm/benchmark/presentation/run_specs_interactive_qa.conf → run_entries_interactive_qa.conf (renamed)
@@ -1,6 +1,6 @@
# MMLU subjects used for InteractiveQA. Run:
# helm-run --priority 1 --suite interactive_qa_mmlu --num-threads 1 --num-train-trials 3
- # --conf-path src/helm/benchmark/presentation/run_specs_interactive_qa.conf --max-eval-instances 10
+ # --conf-path src/helm/benchmark/presentation/run_entries_interactive_qa.conf --max-eval-instances 10

entries: [
{description: "interactive_qa_mmlu:model=interactive_qa,subject=college_chemistry", priority: 1}