Rename run_specs*.conf to run_entries*.conf #2430

Merged · 2 commits · Apr 3, 2024
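The rename itself is mechanical. For readers who want to reproduce it in a local checkout, a rough shell sketch (hypothetical; these are not the commands recorded in this PR, and the tutorial deliberately keeps one historical `run_specs.conf` link that a blanket substitution would clobber):

```bash
# Hypothetical reproduction of this rename in a local checkout.
for f in src/helm/benchmark/presentation/run_specs*.conf; do
  git mv "$f" "${f/run_specs/run_entries}"   # git mv keeps the files' history
done
# Illustrative sweep over references in docs and scripts; review the hits first,
# since an intentional historical `run_specs.conf` link remains in the tutorial.
grep -rl 'run_specs' docs scripts | xargs sed -i 's/run_specs/run_entries/g'
```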
docs/benchmark.md · 2 additions & 2 deletions
@@ -11,10 +11,10 @@ directory exists.
The `helm-run` command provides several flags that can be used to test that the configuration and scenario are working correctly without actually sending requests to the model:

# Just load the config file
- helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --skip-instances
+ helm-run --conf src/helm/benchmark/presentation/run_entries_small.conf --max-eval-instances 10 --suite v1 --skip-instances

# Create the instances and the requests, but don't send requests to the model
- helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --dry-run
+ helm-run --conf src/helm/benchmark/presentation/run_entries_small.conf --max-eval-instances 10 --suite v1 --dry-run

## Estimating Token Usage

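For a quick smoke test without the bundled file, the same flags work with a hand-rolled conf; a minimal sketch using the entry format shown in the other docs in this PR (the file name `my_run_entries.conf` is illustrative):

```bash
# Hand-rolled conf using the entry format from quick_start.md below.
echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > my_run_entries.conf
# Parse the conf and build run specs only; no instances loaded, no requests sent.
helm-run --conf my_run_entries.conf --max-eval-instances 10 --suite v1 --skip-instances
```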
docs/get_helm_rank.md · 2 additions & 2 deletions
@@ -44,14 +44,14 @@ export MODEL_TO_RUN=huggingface/gpt2
That's it, run the following to get the config file:

```bash
- wget https://github.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_specs_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_specs_$EXAMPLES_PER_SCENARIO.conf
+ wget https://github.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_entries_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_entries_$EXAMPLES_PER_SCENARIO.conf
```

and this one to run the benchmark (the first run will take some time, since all the data has to be prepared):

```bash
helm-run \
- --conf-paths run_specs_$EXAMPLES_PER_SCENARIO.conf \
+ --conf-paths run_entries_$EXAMPLES_PER_SCENARIO.conf \
--suite $LEADERBOARD_VERSION \
--max-eval-instances $EXAMPLES_PER_SCENARIO \
--models-to-run $MODEL_TO_RUN \
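The snippet above assumes several environment variables defined earlier in that document; a sketch with illustrative values (only `MODEL_TO_RUN=huggingface/gpt2` is confirmed by the surrounding diff context, the others are plausible placeholders):

```bash
export EXAMPLES_PER_SCENARIO=10        # illustrative per-scenario instance budget
export LEADERBOARD_VERSION=v1          # illustrative suite name
export MODEL_TO_RUN=huggingface/gpt2   # value shown in the diff context above
```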
docs/heim.md · 3 additions & 3 deletions
@@ -6,11 +6,11 @@ To run HEIM, follow these steps:
[Stable Diffusion v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) against the
[MS-COCO scenario](https://github.com/stanford-crfm/heim/blob/main/src/helm/benchmark/scenarios/image_generation/mscoco_scenario.py), run:
```
- echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_specs.conf
+ echo 'entries: [{description: "mscoco:model=huggingface/stable-diffusion-v1-4", priority: 1}]' > run_entries.conf
```
2. Run the benchmark with a certain number of instances (e.g., 10 instances):
- `helm-run --conf-paths run_specs.conf --suite heim_v1 --max-eval-instances 10`
+ `helm-run --conf-paths run_entries.conf --suite heim_v1 --max-eval-instances 10`

Examples of run specs configuration files can be found [here](https://github.com/stanford-crfm/helm/tree/main/src/helm/benchmark/presentation).
- We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs_heim.conf)
+ We used [this configuration file](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_entries_heim.conf)
to produce the results in the paper.
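Once the run finishes, the same summarize step shown in quick_start.md below should apply to the HEIM suite as well (hedged: this step is not part of the heim.md diff):

```bash
# Aggregate results for the suite used above (pattern from quick_start.md).
helm-summarize --suite heim_v1
```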
docs/quick_start.md · 2 additions & 2 deletions
@@ -4,10 +4,10 @@ Run the following:

```
# Create a run specs configuration
- echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > run_specs.conf
+ echo 'entries: [{description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1}]' > run_entries.conf

# Run benchmark
- helm-run --conf-paths run_specs.conf --suite v1 --max-eval-instances 10
+ helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite v1
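To browse the summarized results, HELM also ships a `helm-server` command (hedged: it is not touched by this diff, and the default `benchmark_output/` location is an assumption):

```bash
# Serve the local results for inspection in a browser.
helm-server   # assumed to read from benchmark_output/ by default
```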
docs/tutorial.md · 3 additions & 3 deletions
@@ -10,7 +10,7 @@ We will run two runs using the `mmlu` scenario on the `openai/gpt2` model. The `

To run this benchmark using the HELM command-line tools, we need to specify **run spec descriptions** that describe the desired runs. For this example, the run spec descriptions are `mmlu:subject=anatomy,model=openai/gpt2` (for anatomy) and `mmlu:subject=philosophy,model=openai/gpt2` (for philosophy).

- Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_specs.conf` with the following contents:
+ Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_entries.conf` with the following contents:

```
entries: [
@@ -22,7 +22,7 @@ entries: [
We will now use `helm-run` to execute the runs that have been specified in this run spec configuration file. Run this command:

```
- helm-run --conf-paths run_specs.conf --suite v1 --max-eval-instances 10
+ helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10
```

The meanings of the additional arguments are as follows:
@@ -45,7 +45,7 @@ Each output sub-directory will contain several JSON files that were generated du
- `per_instance_stats.json` contains a serialized list of `PerInstanceStats`, which contains the statistics produced for the metrics for each instance (i.e. input).
- `stats.json` contains a serialized list of `Stat`, which contains the statistics produced for the metrics, aggregated across all instances (i.e. inputs).

- `helm-run` provides additional arguments that can be used to filter out `--models-to-run`, `--groups-to-run` and `--priority`. It can be convenient to create a large `run_specs.conf` file containing every run spec description of interest, and then use these flags to filter down the RunSpecs to actually run. As an example, the main `run_specs.conf` file used for the HELM benchmarking paper can be found [here](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs.conf).
+ `helm-run` provides additional arguments that can be used to filter out `--models-to-run`, `--groups-to-run` and `--priority`. It can be convenient to create a large `run_entries.conf` file containing every run spec description of interest, and then use these flags to filter down the RunSpecs to actually run. As an example, the main `run_specs.conf` file used for the HELM benchmarking paper can be found [here](https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/presentation/run_specs.conf).

**Using model or model_deployment:** Some models have several deployments (for example, `eleutherai/gpt-j-6b` is deployed under `huggingface/gpt-j-6b`, `gooseai/gpt-j-6b` and `together/gpt-j-6b`). Since the results can differ depending on the deployment, we provide a way to specify the deployment instead of the model. Instead of using `model=eleutherai/gpt-j-6b`, use `model_deployment=huggingface/gpt-j-6b`. If you do not, a deployment will be chosen arbitrarily. This can still be used for models that have a single deployment, and it is good practice to follow to avoid any ambiguity.

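A concrete version of the `model_deployment` advice in that last paragraph, using the deployment names it lists (a sketch; the entry format comes from the tutorial itself):

```bash
# Pin a specific deployment of eleutherai/gpt-j-6b instead of the bare model name.
echo 'entries: [{description: "mmlu:subject=anatomy,model_deployment=huggingface/gpt-j-6b", priority: 1}]' > run_entries.conf
helm-run --conf-paths run_entries.conf --suite v1 --max-eval-instances 10
```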
scripts/helm-run-all.sh · 1 addition & 1 deletion
@@ -67,7 +67,7 @@ do
logfile="${logfile// /_}" # Replace spaces

# Override with passed-in CLI arguments
- # By default, the command will run the RunSpecs listed in src/helm/benchmark/presentation/run_specs.conf
+ # By default, the command will run the RunSpecs listed in src/helm/benchmark/presentation/run_entries.conf
# and output results to `benchmark_output/runs/<Today's date e.g., 06-28-2022>`.
execute "helm-run --models-to-run $model $* &> $logfile.log &"
done
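Given that `$*` is forwarded to `helm-run`, a hedged usage sketch for this script (the exact flags accepted depend on the rest of the script, which is not shown in this hunk):

```bash
# Extra arguments pass straight through to each helm-run invocation.
bash scripts/helm-run-all.sh --suite v1 --max-eval-instances 10
```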
scripts/verify_reproducibility.py · 1 addition & 1 deletion
@@ -126,7 +126,7 @@ def verify_reproducibility(
"--conf-path",
type=str,
help="Where to read RunSpecs to run from",
default="src/helm/benchmark/presentation/run_specs.conf",
default="src/helm/benchmark/presentation/run_entries.conf",
)
parser.add_argument(
"--models-to-run",
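Both arguments visible in this hunk suggest a straightforward invocation; a hedged sketch (other required flags, if any, are not visible in this diff):

```bash
# Check reproducibility against the default (now renamed) conf.
python scripts/verify_reproducibility.py \
  --conf-path src/helm/benchmark/presentation/run_entries.conf \
  --models-to-run openai/gpt2
```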
src/helm/benchmark/presentation/run_specs_biomedical.conf → run_entries_biomedical.conf (renamed)
@@ -1,5 +1,5 @@
# Biomedical RunSpecs
- # helm-run --suite biomed --conf-path src/helm/benchmark/presentation/run_specs_biomedical.conf -m 1000
+ # helm-run --suite biomed --conf-path src/helm/benchmark/presentation/run_entries_biomedical.conf -m 1000

entries: [
######################################################### NLU ######################################################
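A hedged expansion of that header comment's command, assuming `-m` is the short form of the `--max-eval-instances` flag used elsewhere in these docs:

```bash
helm-run --suite biomed \
  --conf-path src/helm/benchmark/presentation/run_entries_biomedical.conf \
  --max-eval-instances 1000   # assumed long form of the `-m 1000` short flag
```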
@@ -32,7 +32,7 @@ entries: [
{description: "truthful_qa:model=ablation_full_functionality_text,task=mc_single,method=multiple_choice_separate_original", priority: 2, groups: ["ablation_multiple_choice"]}
{description: "truthful_qa:model=ablation_full_functionality_text,task=mc_single,method=multiple_choice_separate_calibrated", priority: 2, groups: ["ablation_multiple_choice"]}

- # MMLU priorities follow the main `run_specs.conf` with 2 -> 1 and 4 -> 3
+ # MMLU priorities follow the main `run_entries.conf` with 2 -> 1 and 4 -> 3
{description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_joint", priority: 1, groups: ["ablation_multiple_choice"]}
{description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_separate_original", priority: 1, groups: ["ablation_multiple_choice"]}
{description: "mmlu:model=ablation_full_functionality_text,subject=abstract_algebra,method=multiple_choice_separate_calibrated", priority: 1, groups: ["ablation_multiple_choice"]}
@@ -357,14 +357,14 @@ entries: [
# {description: "raft:subset=tai_safety_research,model=ablation_text,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "raft:subset=terms_of_service,model=ablation_text,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}

- # mmlu (only subjects with priority <= 2 in run_specs.conf)
+ # mmlu (only subjects with priority <= 2 in run_entries.conf)
# {description: "mmlu:model=ablation_text,subject=abstract_algebra,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=college_chemistry,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=computer_security,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=econometrics,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "mmlu:model=ablation_text,subject=us_foreign_policy,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}

- # civil comments (only subjects with priority <= 2 in run_specs.conf)
+ # civil comments (only subjects with priority <= 2 in run_entries.conf)
# {description: "civil_comments:model=ablation_text,demographic=all,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "civil_comments:model=ablation_text,demographic=male,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
# {description: "civil_comments:model=ablation_text,demographic=female,data_augmentation=robustness_all", priority: 1, groups: ["robustness_individual"]}
src/helm/benchmark/presentation/run_specs_interactive_qa.conf → run_entries_interactive_qa.conf (renamed)
@@ -1,6 +1,6 @@
# MMLU subjects used for InteractiveQA. Run:
# helm-run --priority 1 --suite interactive_qa_mmlu --num-threads 1 --num-train-trials 3
- # --conf-path src/helm/benchmark/presentation/run_specs_interactive_qa.conf --max-eval-instances 10
+ # --conf-path src/helm/benchmark/presentation/run_entries_interactive_qa.conf --max-eval-instances 10

entries: [
{description: "interactive_qa_mmlu:model=interactive_qa,subject=college_chemistry", priority: 1}