Commit 5490291

[IR2Vec] Adding documentation for llvm-ir2vec tool
1 parent a3b518b commit 5490291

4 files changed: +184 -7 lines changed

llvm/docs/CommandGuide/index.rst

Lines changed: 1 addition & 0 deletions

@@ -27,6 +27,7 @@ Basic Commands
 llvm-dis
 llvm-dwarfdump
 llvm-dwarfutil
+llvm-ir2vec
 llvm-lib
 llvm-libtool-darwin
 llvm-link

Lines changed: 170 additions & 0 deletions

@@ -0,0 +1,170 @@
+llvm-ir2vec - IR2Vec Embedding Generation Tool
+==============================================
+
+.. program:: llvm-ir2vec
+
+SYNOPSIS
+--------
+
+:program:`llvm-ir2vec` [*options*] *input-file*
+
+DESCRIPTION
+-----------
+
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
+generates IR2Vec embeddings for LLVM IR and supports triplet generation
+for vocabulary training. It provides two main operation modes:
+
+1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+   training from LLVM IR.
+
+2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+   at different granularity levels (instruction, basic block, or function).
+
+The tool is designed to facilitate machine learning applications that work with
+LLVM IR by converting the IR into numerical representations that can be used by
+ML models.
+
+.. note::
+
+   For information about using IR2Vec programmatically within LLVM passes and
+   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+   section in the MLGO documentation.
+
+OPERATION MODES
+---------------
+
+Triplet Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
+consisting of opcodes, types, and operands. These triplets can be used to train
+vocabularies for embedding generation.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+
+Embedding Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+generate numerical embeddings for LLVM IR at different levels of granularity.
+
+Example Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+
+OPTIONS
+-------
+
+.. option:: --mode=<mode>
+
+   Specify the operation mode. Valid values are:
+
+   * ``triplets`` - Generate triplets for vocabulary training
+   * ``embeddings`` - Generate embeddings using trained vocabulary (default)
+
+.. option:: --level=<level>
+
+   Specify the embedding generation level. Valid values are:
+
+   * ``inst`` - Generate instruction-level embeddings
+   * ``bb`` - Generate basic block-level embeddings
+   * ``func`` - Generate function-level embeddings (default)
+
+.. option:: --function=<name>
+
+   Process only the specified function instead of all functions in the module.
+
+.. option:: --ir2vec-vocab-path=<path>
+
+   Specify the path to the vocabulary file (required for embedding mode).
+   The vocabulary file should be in JSON format and contain the trained
+   vocabulary for embedding generation. See `llvm/lib/Analysis/models`
+   for pre-trained vocabulary files.
+
+.. option:: --ir2vec-opc-weight=<weight>
+
+   Specify the weight for opcode embeddings (default: 1.0). This controls
+   the relative importance of instruction opcodes in the final embedding.
+
+.. option:: --ir2vec-type-weight=<weight>
+
+   Specify the weight for type embeddings (default: 0.5). This controls
+   the relative importance of type information in the final embedding.
+
+.. option:: --ir2vec-arg-weight=<weight>
+
+   Specify the weight for argument embeddings (default: 0.2). This controls
+   the relative importance of operand information in the final embedding.
+
+.. option:: -o <filename>
+
+   Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+   Print a summary of command line options.
+
+.. note::
+
+   ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
+   ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
+   mode. These options are ignored in triplet mode.
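
The interaction between ``--mode`` and the embedding-only options can be illustrated with a small wrapper script. The following is a minimal sketch, not part of the commit: the helper name ``build_ir2vec_command`` is hypothetical, and it assumes ``llvm-ir2vec`` is available on ``PATH``. It uses only the flags documented above and leaves out the embedding-only options when triplet mode is requested.

```python
from typing import List, Optional

VALID_MODES = ("triplets", "embeddings")
VALID_LEVELS = ("inst", "bb", "func")


def build_ir2vec_command(
    input_file: str,
    mode: str = "embeddings",
    output: str = "-",
    level: Optional[str] = None,
    function: Optional[str] = None,
    vocab_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for llvm-ir2vec (hypothetical helper).

    Only flags listed in the OPTIONS section are emitted. The binary name
    "llvm-ir2vec" is assumed to resolve via PATH.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    cmd = ["llvm-ir2vec", f"--mode={mode}"]

    if mode == "embeddings":
        # The vocabulary path is documented as required in embedding mode.
        if vocab_path is None:
            raise ValueError("embedding mode requires a vocabulary path")
        if level is not None and level not in VALID_LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        cmd.append(f"--ir2vec-vocab-path={vocab_path}")
        if level is not None:
            cmd.append(f"--level={level}")
        if function is not None:
            cmd.append(f"--function={function}")
    # In triplet mode, level/function/vocab_path are deliberately not passed,
    # since the documentation says they are ignored there.

    cmd += [input_file, "-o", output]
    return cmd
```

Calling ``build_ir2vec_command("input.bc", mode="triplets", output="triplets.txt", level="bb")`` yields the same argument list as the triplet usage example, with ``--level`` left out.
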
+
+INPUT FILE FORMAT
+-----------------
+
+:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
+(``.ll``) as input. The input file should contain valid LLVM IR.
+
+OUTPUT FORMAT
+-------------
+
+Triplet Mode Output
+~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, the output consists of lines containing space-separated triplets:
+
+.. code-block:: text
+
+   <opcode> <type> <operand1> <operand2> ...
+
+Each line represents the information of one instruction, with the opcode, type,
+and operands.
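
A consumer of the triplet output could split each line along the layout described above. This is a minimal sketch under stated assumptions: the names ``Triplet`` and ``parse_triplet_line`` are hypothetical, the fields are treated as opaque whitespace-separated tokens, and nothing is assumed about whether the tool prints names or numeric IDs.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    """One instruction's record: opcode, type, then zero or more operands."""
    opcode: str
    type: str
    operands: List[str]


def parse_triplet_line(line: str) -> Triplet:
    """Split one output line of the form '<opcode> <type> <operand1> ...'.

    Tokens are kept as strings; their concrete encoding is not assumed here.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected at least an opcode and a type: {line!r}")
    return Triplet(opcode=fields[0], type=fields[1], operands=fields[2:])


def parse_triplets(text: str) -> List[Triplet]:
    """Parse every non-empty line of a triplet file's contents."""
    return [parse_triplet_line(ln) for ln in text.splitlines() if ln.strip()]
```

The placeholder tokens used to exercise this sketch (for example ``"op ty a b"``) are illustrative only and do not represent real tool output.
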
+
+Embedding Mode Output
+~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, the output format depends on the specified level:
+
+* **Function Level**: One embedding vector per function
+* **Basic Block Level**: One embedding vector per basic block, grouped by function
+* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
+
+Each embedding is represented as a floating point vector.
+
+EXIT STATUS
+-----------
+
+:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
+
+Common failure cases include:
+
+* Invalid or missing input file
+* Missing or invalid vocabulary file (in embedding mode)
+* Specified function not found in the module
+* Invalid command line options
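
A script that drives the tool can rely on the exit status convention above. The sketch below is hypothetical and not part of the commit: ``run_tool`` is an illustrative name, and it assumes (without confirmation from the source) that diagnostics are written to standard error. It works for any command, so it can be tried without an LLVM build.

```python
import subprocess
from typing import Sequence


def run_tool(cmd: Sequence[str]) -> str:
    """Run a command and return its standard output.

    Raises RuntimeError when the process exits with a non-zero status,
    which is how llvm-ir2vec is documented to signal failure. If the
    executable cannot be found at all, subprocess.run raises
    FileNotFoundError before any exit status exists.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Assumption: the tool reports the reason on stderr.
        raise RuntimeError(
            f"{cmd[0]} failed with exit status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout
```

For instance, ``run_tool(["llvm-ir2vec", "--mode=triplets", "input.bc", "-o", "-"])`` would return the triplets as a string on success and raise on any of the failure cases listed above.
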
+
+SEE ALSO
+--------
+
+:doc:`../MLGO`
+
+For more information about the IR2Vec algorithm and approach, see:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.

llvm/docs/MLGO.rst

Lines changed: 11 additions & 1 deletion

@@ -468,6 +468,13 @@ The core components are:
 Using IR2Vec
 ------------
 
+.. note::
+
+   This section describes how to use IR2Vec within LLVM passes. A standalone
+   tool :doc:`CommandGuide/llvm-ir2vec` is available for generating the
+   embeddings and triplets from LLVM IR files, which can be useful for
+   training vocabularies and generating embeddings outside of compiler passes.
+
 For generating embeddings, first the vocabulary should be obtained. Then, the
 embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
 
@@ -524,6 +531,10 @@ Further Details
 For more detailed information about the IR2Vec algorithm, its parameters, and
 advanced usage, please refer to the original paper:
 `IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For information about using IR2Vec tool for generating embeddings and
+triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
 The LLVM source code for ``IR2Vec`` can also be explored to understand the
 implementation details.
 
@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
 where the ``name`` is a path fragment. We will expect to find 2 files,
 ``<name>.in`` (readable, data incoming from the managing process) and
 ``<name>.out`` (writable, the model runner sends data to the managing process)
-

llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp

Lines changed: 2 additions & 6 deletions

@@ -276,12 +276,8 @@ int main(int argc, char **argv) {
       "Generates embeddings for a given LLVM IR and "
       "supports triplet generation for vocabulary "
       "training and embedding generation.\n\n"
-      "Usage:\n"
-      " Triplet mode: llvm-ir2vec --mode=triplets input.bc\n"
-      " Embedding mode: llvm-ir2vec --mode=embeddings "
-      "--ir2vec-vocab-path=vocab.json --level=func input.bc\n"
-      " Levels: --level=inst (instructions), --level=bb (basic blocks), "
-      "--level=func (functions)\n");
+      "See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more "
+      "information.\n");
 
   // Validate command line options
   if (Mode == TripletMode && Level.getNumOccurrences() > 0)
