[IR2Vec] Adding documentation for llvm-ir2vec tool #148719

Open
wants to merge 1 commit into
base: users/svkeerthy/07-09-ir2vec_tool_enhancements
1 change: 1 addition & 0 deletions llvm/docs/CommandGuide/index.rst
@@ -27,6 +27,7 @@ Basic Commands
llvm-dis
llvm-dwarfdump
llvm-dwarfutil
llvm-ir2vec
llvm-lib
llvm-libtool-darwin
llvm-link
170 changes: 170 additions & 0 deletions llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -0,0 +1,170 @@
llvm-ir2vec - IR2Vec Embedding Generation Tool
==============================================

.. program:: llvm-ir2vec

SYNOPSIS
--------

:program:`llvm-ir2vec` [*options*] *input-file*

DESCRIPTION
-----------

:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
generates IR2Vec embeddings for LLVM IR and supports triplet generation
for vocabulary training. It provides two main operation modes:

1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
training from LLVM IR.

2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
at different granularity levels (instruction, basic block, or function).

The tool is designed to facilitate machine learning applications that work with
LLVM IR by converting the IR into numerical representations that can be used by
ML models.

.. note::

   For information about using IR2Vec programmatically within LLVM passes and
   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
   section in the MLGO documentation.

OPERATION MODES
---------------

Triplet Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~

In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
consisting of opcodes, types, and operands. These triplets can be used to train
vocabularies for embedding generation.

Example usage:

.. code-block:: bash

   llvm-ir2vec --mode=triplets input.bc -o triplets.txt

Embedding Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~

In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
generate numerical embeddings for LLVM IR at different levels of granularity.

Example usage:

.. code-block:: bash

   llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
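
Both invocations can also be assembled from a script. The Python sketch below
only builds the argument list from the flags documented on this page. The
helper name ``build_ir2vec_command`` and its parameters are illustrative
assumptions and are not part of LLVM.

```python
from typing import List, Optional

VALID_LEVELS = ("inst", "bb", "func")


def build_ir2vec_command(
    input_file: str,
    mode: str = "embeddings",
    output: str = "-",
    vocab_path: Optional[str] = None,
    level: str = "func",
    function: Optional[str] = None,
) -> List[str]:
    """Return an argv list for llvm-ir2vec (hypothetical helper, not an LLVM API)."""
    if mode not in ("triplets", "embeddings"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["llvm-ir2vec", f"--mode={mode}"]
    if mode == "embeddings":
        # Embedding-only options are added here and left out in triplet mode.
        if vocab_path is None:
            raise ValueError("embedding mode requires a vocabulary file")
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        cmd += [f"--ir2vec-vocab-path={vocab_path}", f"--level={level}"]
        if function is not None:
            cmd.append(f"--function={function}")
    # Input file and output destination are common to both modes.
    cmd += [input_file, "-o", output]
    return cmd
```

Passing the returned list to ``subprocess.run`` would execute the tool,
assuming ``llvm-ir2vec`` is available on ``PATH``.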

OPTIONS
-------

.. option:: --mode=<mode>

   Specify the operation mode. Valid values are:

   * ``triplets`` - Generate triplets for vocabulary training
   * ``embeddings`` - Generate embeddings using a trained vocabulary (default)

.. option:: --level=<level>

   Specify the embedding generation level. Valid values are:

   * ``inst`` - Generate instruction-level embeddings
   * ``bb`` - Generate basic block-level embeddings
   * ``func`` - Generate function-level embeddings (default)

.. option:: --function=<name>

   Process only the specified function instead of all functions in the module.

.. option:: --ir2vec-vocab-path=<path>

   Specify the path to the vocabulary file (required in embedding mode).
   The vocabulary file must be in JSON format and contain the trained
   vocabulary for embedding generation. See ``llvm/lib/Analysis/models``
   for pre-trained vocabulary files.

.. option:: --ir2vec-opc-weight=<weight>

   Specify the weight for opcode embeddings (default: 1.0). This controls
   the relative importance of instruction opcodes in the final embedding.

.. option:: --ir2vec-type-weight=<weight>

   Specify the weight for type embeddings (default: 0.5). This controls
   the relative importance of type information in the final embedding.

.. option:: --ir2vec-arg-weight=<weight>

   Specify the weight for argument embeddings (default: 0.2). This controls
   the relative importance of operand information in the final embedding.

.. option:: -o <filename>

   Specify the output filename. Use ``-`` to write to standard output (default).

.. option:: --help

   Print a summary of command line options.

.. note::

   ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
   ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
   mode. These options are ignored in triplet mode.
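
As a rough illustration of what the three weight options control, the sketch
below combines an opcode vector, a type vector, and operand vectors as a
weighted sum, following the formulation in the IR2Vec paper. It is a model for
intuition only: the function name and its inputs are invented for this example,
and the tool's internal computation may differ in detail.

```python
from typing import List, Sequence


def weighted_instruction_vector(
    opcode_vec: Sequence[float],
    type_vec: Sequence[float],
    operand_vecs: Sequence[Sequence[float]],
    opc_weight: float = 1.0,   # mirrors the --ir2vec-opc-weight default
    type_weight: float = 0.5,  # mirrors the --ir2vec-type-weight default
    arg_weight: float = 0.2,   # mirrors the --ir2vec-arg-weight default
) -> List[float]:
    """Illustrative weighted sum; not the tool's actual implementation."""
    # Start from the weighted opcode and type contributions.
    result = [opc_weight * o + type_weight * t for o, t in zip(opcode_vec, type_vec)]
    # Add each operand vector, scaled by the argument weight.
    for operand in operand_vecs:
        for i, value in enumerate(operand):
            result[i] += arg_weight * value
    return result
```

Under this model, raising ``arg_weight`` relative to ``opc_weight`` makes two
instructions with the same opcode but different operands less similar.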

INPUT FILE FORMAT
-----------------

:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
(``.ll``) as input. The input file should contain valid LLVM IR.

OUTPUT FORMAT
-------------

Triplet Mode Output
~~~~~~~~~~~~~~~~~~~

In triplet mode, the output contains one line per instruction, with
space-separated fields:

.. code-block:: text

   <opcode> <type> <operand1> <operand2> ...

Each line describes a single instruction: its opcode, its type, and its
operands.
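
A consumer of this output can split each line on whitespace. The following
Python sketch assumes exactly the layout shown above and keeps every field as a
string, since this page does not say whether fields are names or numeric
identifiers. ``Triplet`` and ``parse_triplet_line`` are hypothetical names used
only for illustration.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    opcode: str
    value_type: str
    operands: List[str]


def parse_triplet_line(line: str) -> Triplet:
    """Split one output line into opcode, type, and operands."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected at least an opcode and a type: {line!r}")
    # Everything after the first two fields is treated as an operand.
    return Triplet(fields[0], fields[1], fields[2:])
```

Instructions without operands yield an empty ``operands`` list.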

Embedding Mode Output
~~~~~~~~~~~~~~~~~~~~~

In embedding mode, the output format depends on the specified level:

* **Function Level**: One embedding vector per function
* **Basic Block Level**: One embedding vector per basic block, grouped by function
* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function

Each embedding is represented as a floating-point vector.

EXIT STATUS
-----------

:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.

Common failure cases include:

* Invalid or missing input file
* Missing or invalid vocabulary file (in embedding mode)
* Specified function not found in the module
* Invalid command line options
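
A calling script can rely on the exit status to detect these failures. The
sketch below is a generic Python wrapper, not something shipped with LLVM; the
name ``run_checked`` is hypothetical, and the assumption that a diagnostic
appears on standard error is not stated on this page.

```python
import subprocess
from typing import Sequence


def run_checked(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout, raising on a non-zero exit status."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    if completed.returncode != 0:
        # Assumption: any diagnostic text is written to standard error.
        raise RuntimeError(
            f"{cmd[0]} exited with status {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout
```

For example, ``run_checked(["llvm-ir2vec", "--mode=triplets", "input.bc"])``
would return the triplets written to standard output, or raise if the tool
reports a failure.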

SEE ALSO
--------

:doc:`../MLGO`

For more information about the IR2Vec algorithm and approach, see:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
12 changes: 11 additions & 1 deletion llvm/docs/MLGO.rst
@@ -468,6 +468,13 @@ The core components are:
Using IR2Vec
------------

.. note::

   This section describes how to use IR2Vec within LLVM passes. A standalone
   tool, :doc:`CommandGuide/llvm-ir2vec`, is available for generating
   embeddings and triplets from LLVM IR files. It is useful for training
   vocabularies and for generating embeddings outside of compiler passes.

For generating embeddings, first the vocabulary should be obtained. Then, the
embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.

@@ -524,6 +531,10 @@ Further Details
For more detailed information about the IR2Vec algorithm, its parameters, and
advanced usage, please refer to the original paper:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.

For information about using the :program:`llvm-ir2vec` tool to generate
embeddings and triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.

The LLVM source code for ``IR2Vec`` can also be explored to understand the
implementation details.

@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
where the ``name`` is a path fragment. We will expect to find 2 files,
``<name>.in`` (readable, data incoming from the managing process) and
``<name>.out`` (writable, the model runner sends data to the managing process)

8 changes: 2 additions & 6 deletions llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
@@ -276,12 +276,8 @@ int main(int argc, char **argv) {
"Generates embeddings for a given LLVM IR and "
"supports triplet generation for vocabulary "
"training and embedding generation.\n\n"
-"Usage:\n"
-" Triplet mode: llvm-ir2vec --mode=triplets input.bc\n"
-" Embedding mode: llvm-ir2vec --mode=embeddings "
-"--ir2vec-vocab-path=vocab.json --level=func input.bc\n"
-" Levels: --level=inst (instructions), --level=bb (basic blocks), "
-"--level=func (functions)\n");
+"See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more "
+"information.\n");

// Validate command line options
if (Mode == TripletMode && Level.getNumOccurrences() > 0)