Custom scripts and projects

The project.yml lets you define any custom commands and run them as part of your training, evaluation or deployment workflows. The script section defines a list of commands that are called in a subprocess, in order. This lets you execute other Python scripts or command-line tools.

Let's say you're training a spaCy pipeline, and you've written a few integration tests that load the best model produced by the training command and check that it works correctly. You can now define a test command that calls into pytest, runs your tests and uses pytest-html to export a test report:

💡 Example configuration

commands:
  - name: test
    help: 'Test the trained pipeline'
    script:
      - 'pip install pytest pytest-html'
      - 'python -m pytest ./scripts/tests --html=metrics/test-report.html'
    deps:
      - 'training/model-best'
    outputs:
      - 'metrics/test-report.html'
    no_skip: true

Adding training/model-best to the command's deps lets you ensure that the file is available. If not, Weasel will show an error and the command won't run. Setting no_skip: true means that the command will always run, even if the dependencies (the trained pipeline) haven't changed. This makes sense here, because you typically don't want to skip your tests.
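
The dependency check described above can be pictured with a small sketch. This is not Weasel's implementation: `find_missing_deps` is a hypothetical helper that only illustrates the idea of refusing to run a command when one of its listed `deps` is absent.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_deps(project_dir: Path, deps: Iterable[str]) -> List[str]:
    """Return the entries of `deps` that do not exist under `project_dir`.

    Hypothetical helper for illustration only, not part of Weasel.
    """
    return [dep for dep in deps if not (project_dir / dep).exists()]
```

A runner built on this idea would report the returned paths and stop before executing the `script` entries if the list is non-empty.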

Writing custom scripts

Your project commands can include any custom scripts – essentially, anything you can run from the command line. Here's an example of a custom script that uses typer for quick and easy command-line arguments that you can define via your project.yml:

import typer

def custom_evaluation(batch_size: int, model_path: str, data_path: str):
    # Parameters without defaults become required positional CLI arguments
    print(batch_size, model_path, data_path)

if __name__ == "__main__":
    typer.run(custom_evaluation)

ℹ️ About Typer

typer is a modern library for building Python CLIs using type hints. It's a dependency of Weasel, so it will already be pre-installed in your environment. Function arguments without a default value automatically become positional CLI arguments, while arguments with a default value become optional flags such as --batch-size. Python type hints define the value types. For instance, batch_size: int means that the value provided via the command line is converted to an integer.

In your project.yml, you can then run the script by calling python scripts/custom_evaluation.py with the function arguments. You can also use the vars section to define reusable variables that will be substituted in commands, paths and URLs. In the following example, the batch size is defined as a variable, and its value is substituted for ${vars.batch_size} in the script.

💡 Example usage of vars

vars:
  batch_size: 128

commands:
  - name: evaluate
    script:
      - 'python scripts/custom_evaluation.py ${vars.batch_size} ./training/model-best ./corpus/eval.json'
    deps:
      - 'training/model-best'
      - 'corpus/eval.json'
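
To make the substitution step concrete, here is a minimal sketch of how a `${vars.name}` placeholder could be expanded. It is an illustration under stated assumptions, not Weasel's own interpolation code: `substitute_vars` and `VAR_PATTERN` are hypothetical names, and only flat `vars` keys are handled.

```python
import re
from typing import Any, Dict

# Matches placeholders of the form ${vars.some_name}
VAR_PATTERN = re.compile(r"\$\{vars\.([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_vars(command: str, variables: Dict[str, Any]) -> str:
    """Replace each ${vars.name} placeholder with the value of `name`."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Undefined variable: vars.{name}")
        return str(variables[name])

    return VAR_PATTERN.sub(replace, command)


command = "python scripts/custom_evaluation.py ${vars.batch_size} ./training/model-best ./corpus/eval.json"
print(substitute_vars(command, {"batch_size": 128}))
# prints: python scripts/custom_evaluation.py 128 ./training/model-best ./corpus/eval.json
```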

ℹ️ Calling into Python

If any of your command scripts call into python, Weasel will take care of replacing that with your sys.executable, to make sure you're executing everything with the same Python (not some other Python installed on your system). It also normalizes references to python3, pip3 and pip.
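
The interpreter normalization can be sketched roughly as follows. This is an assumption about the general approach rather than a copy of Weasel's code: `normalize_command` is a hypothetical helper, and it only rewrites the first token of a command.

```python
import shlex
import sys
from typing import List


def normalize_command(command: str) -> List[str]:
    """Split a command and point python/pip at the current interpreter."""
    parts = shlex.split(command)
    if not parts:
        return parts
    if parts[0] in ("python", "python3"):
        # Use the interpreter that is running this process
        parts[0] = sys.executable
    elif parts[0] in ("pip", "pip3"):
        # Run pip as a module of the current interpreter
        parts = [sys.executable, "-m", "pip", *parts[1:]]
    return parts
```

Running pip through `sys.executable -m pip` ensures packages are installed into the same environment that executes the project commands.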

You can also use the env section to reference environment variables and make their values available to the commands. This can be useful for overriding settings on the command line and passing through system-level settings.

💡 Example usage of EnvVars

export GPU_ID=1
BATCH_SIZE=128 python -m weasel run evaluate

env:
  batch_size: BATCH_SIZE
  gpu_id: GPU_ID

commands:
  - name: evaluate
    script:
      - 'python scripts/custom_evaluation.py ${env.batch_size}'
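
As a rough illustration of the env section, the sketch below maps each project-level name to the value of the environment variable it references. `resolve_env` is a hypothetical helper, and falling back to an empty string for unset variables is an assumption made for this example.

```python
import os
from typing import Dict, Mapping


def resolve_env(
    env_section: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Map each project-level name to the value of its environment variable."""
    # Assumption: variables that are not set resolve to an empty string
    return {name: environ.get(var, "") for name, var in env_section.items()}


env_section = {"batch_size": "BATCH_SIZE", "gpu_id": "GPU_ID"}
print(resolve_env(env_section, {"BATCH_SIZE": "128", "GPU_ID": "1"}))
# prints: {'batch_size': '128', 'gpu_id': '1'}
```

The resolved values are what a placeholder such as ${env.batch_size} would be replaced with before the command runs.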

Documenting your project

💡 Examples

For more examples, see the projects repo.

Screenshot of auto-generated Markdown Readme

When your custom project is ready and you want to share it with others, you can use the weasel document command to auto-generate a pretty, Markdown-formatted README file based on your project's project.yml. It will list all commands, workflows and assets defined in the project and include details on how to run the project, as well as links to the relevant Weasel documentation to make it easy for others to get started using your project.

python -m weasel document --output README.md

Under the hood, hidden markers are added to identify where the auto-generated content starts and ends. This means that you can add your own custom content before or after it and re-running the document command will only update the auto-generated part. This makes it easy to keep your documentation up to date.
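
The marker-based update can be sketched as follows. The marker strings and the `update_generated` function are hypothetical placeholders chosen for this example. They are not the exact markers Weasel writes.

```python
START = "<!-- AUTO-GENERATED DOCS START -->"  # hypothetical marker text
END = "<!-- AUTO-GENERATED DOCS END -->"  # hypothetical marker text


def update_generated(existing: str, generated: str) -> str:
    """Replace the text between the markers, keeping the content around it."""
    block = f"{START}\n{generated}\n{END}"
    if START in existing and END in existing:
        before = existing.split(START)[0]
        after = existing.split(END)[1]
        return f"{before}{block}{after}"
    # No markers found: the whole file content is replaced
    return block
```

Anything written before the start marker or after the end marker survives a re-run, while the text between the two is regenerated.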

Warning

Note that the contents of an existing file will be replaced if no existing auto-generated docs are found. If you want Weasel to ignore a file and not update it, you can add the comment marker <!-- WEASEL: IGNORE --> anywhere in your markup.

Cloning from your own repo

The weasel clone command lets you customize the repo to clone from using the --repo option. It calls into git, so you'll be able to clone from any repo that you have access to, including private repos.

python -m weasel clone your_project --repo https://github.com/you/repo

At a minimum, a valid project template needs to contain a project.yml. It can also include other files, like custom scripts, a requirements.txt listing additional dependencies, a machine learning model and meta templates, or Jupyter notebooks with usage examples.

⚠️ Important note about assets

It's typically not a good idea to check large data assets, trained pipelines or other artifacts into a Git repo, and you should exclude them from your project template by adding a .gitignore. If you want to version your data and models, check out Data Version Control (DVC), which integrates with Weasel.