Tokenize and colorize asm strings #2417

Merged
merged 64 commits into from
Jul 27, 2022

Rot127 (Member) commented Mar 16, 2022

SQUASH ME

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not, I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Detailed description

This is the first implementation for tokenization of asm strings. It adds:

  • A general way to split the colorless asm string into tokens.
  • A type for each token (e.g. mnemonic, register, separator).
  • Coloring of these tokens according to their type.
  • For number tokens: the number's value is stored in the token.

It also cleans up the coloring of the asm strings quite a bit. Having the asm string cleanly split into parts may also come in handy in the future.
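
To make the idea concrete, here is a minimal, self-contained sketch of the approach described above: the asm string is split into tokens that record only a byte offset, a length, a type and, for numbers, the parsed value. Every name in it (`DemoToken`, `demo_tokenize`, `demo_is_register`) is hypothetical, the register list is hard-coded, and the classification rules are deliberately naive. It is an illustration under those assumptions, not the code added by this PR.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token types, loosely modeled on the list above. */
typedef enum {
	DEMO_TOKEN_UNKNOWN = 0,
	DEMO_TOKEN_MNEMONIC,
	DEMO_TOKEN_OPERATOR,
	DEMO_TOKEN_NUMBER,
	DEMO_TOKEN_REGISTER,
	DEMO_TOKEN_SEPARATOR,
} DemoTokenType;

/* A token stores no string, only where it sits in the asm string. */
typedef struct {
	size_t start; /* byte offset into the asm string */
	size_t len; /* length in bytes */
	DemoTokenType type;
	unsigned long long number; /* only meaningful for DEMO_TOKEN_NUMBER */
} DemoToken;

/* Assumption: a tiny hard-coded list stands in for a real register profile. */
static int demo_is_register(const char *s, size_t len) {
	static const char *regs[] = { "eax", "ebx", "ecx", "edx", "rax", "rsp" };
	for (size_t i = 0; i < sizeof(regs) / sizeof(regs[0]); i++) {
		if (strlen(regs[i]) == len && strncmp(regs[i], s, len) == 0) {
			return 1;
		}
	}
	return 0;
}

/* Splits asm_str into at most max tokens and returns how many were written. */
size_t demo_tokenize(const char *asm_str, DemoToken *out, size_t max) {
	assert(asm_str && out);
	size_t n = 0, i = 0;
	int seen_mnemonic = 0;
	while (asm_str[i] && n < max) {
		unsigned char c = (unsigned char)asm_str[i];
		DemoToken t = { i, 1, DEMO_TOKEN_UNKNOWN, 0 };
		if (isalnum(c) || c == '_') {
			/* Consume a whole word: identifier or number literal. */
			size_t j = i;
			while (isalnum((unsigned char)asm_str[j]) || asm_str[j] == '_') {
				j++;
			}
			t.len = j - i;
			if (isdigit(c)) {
				/* Base 0 accepts decimal, 0x-prefixed hex and 0-prefixed octal. */
				char *end = NULL;
				unsigned long long v = strtoull(asm_str + i, &end, 0);
				if (end == asm_str + j) {
					t.type = DEMO_TOKEN_NUMBER;
					t.number = v;
				}
			} else if (!seen_mnemonic) {
				/* Naive rule: the first word is the mnemonic. */
				t.type = DEMO_TOKEN_MNEMONIC;
				seen_mnemonic = 1;
			} else if (demo_is_register(asm_str + i, t.len)) {
				t.type = DEMO_TOKEN_REGISTER;
			}
		} else if (strchr(" \t,[]()", c)) {
			t.type = DEMO_TOKEN_SEPARATOR;
		} else if (strchr("+-*", c)) {
			t.type = DEMO_TOKEN_OPERATOR;
		}
		out[n++] = t;
		i += t.len;
	}
	return n;
}
```

With this sketch, `mov eax, 0x10` produces six tokens: the mnemonic `mov`, a space separator, the register `eax`, the comma and space separators, and the number `0x10` with the value 16 stored in the token. A colorizer can then walk the tokens and pick a color per type without re-parsing the string.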

Still missing:

  • Move token logic into RzAsm
  • Replace all usage of current coloring functions with the token based one.
    • vmenus.c::colorize_asm_string()
    • print.c::rz_print_colorize_opcode()
    • print.c::colorize_asm_string()
  • Do not colorize if it is turned off.
  • Replace token_val with strtoull().
    • 0 and 0x0 are not recognized as numbers.
  • Tests.
    • Generic tokenizer.
    • Hexagon custom tokenizer
    • Coloring generic
      • x86
      • arm
    • Coloring Hexagon custom
  • Allow to color only specific terms instead of all mnemonic tokens.
  • Some documentation is missing.
  • Allow custom coloring schemes.
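
For the `strtoull()` item in the list above, the following is a small sketch of how a number token's value could be parsed so that `0` and `0x0` are accepted. The helper name `demo_parse_number` is hypothetical, and how `token_val` currently works is not shown in this conversation, so this is only an illustration of the standard C function. The key point is that success is judged by the end pointer, not by the returned value, so a legitimate zero is not confused with a failed parse.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parses the whole string s as an unsigned number.
 * Base 0 lets strtoull() accept decimal, 0x-prefixed hex and 0-prefixed octal.
 * Returns true and stores the value in *out only if all of s was consumed.
 */
bool demo_parse_number(const char *s, unsigned long long *out) {
	assert(out);
	/* strtoull() would skip whitespace and accept a sign; a token should not. */
	if (!s || !isdigit((unsigned char)*s)) {
		return false;
	}
	char *end = NULL;
	errno = 0;
	unsigned long long v = strtoull(s, &end, 0);
	/* Reject partial parses ("12abc") and values that overflow. */
	if (end == s || *end != '\0' || errno == ERANGE) {
		return false;
	}
	*out = v;
	return true;
}
```

Note that with base 0 a literal such as `010` is read as octal. Whether that is desirable for every assembly syntax is a separate question this sketch does not answer.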

Test plan

Added

Closing issues

closes rizinorg/rz-hexagon#9

Examples

Here are some (out-of-date) pictures comparing the current coloring with the token-based coloring.
The token-based coloring is on top; the current coloring is below it.

[Screenshots, one per architecture: x86, arm, mips, dex, hexagon]

@github-actions github-actions bot added the API label Mar 16, 2022
ret2libc (Member) left a comment

Could you provide some unit tests for this new API before moving too far ahead, so that we can get an idea of how it is going to be used in practice?

XVilka (Member) left a comment

Please also add tests for strings containing more complex Unicode; see, for example, #2125
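
One property such Unicode tests could check: the token struct quoted later in this review requires each token's `start` to sit exactly on a UTF-8 codepoint boundary. The sketch below shows one way to test that for a byte offset. The function name `demo_is_utf8_boundary` is hypothetical, and the check assumes the input is already valid UTF-8; it does not validate the encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if byte offset off in s is a place where
 * a token may start or end, i.e. it does not point into the middle of a
 * multi-byte UTF-8 sequence. Continuation bytes have the bit pattern 10xxxxxx.
 */
bool demo_is_utf8_boundary(const char *s, size_t off) {
	assert(s);
	size_t len = strlen(s);
	if (off > len) {
		return false;
	}
	if (off == len) {
		return true; /* the end of the string is always a boundary */
	}
	return (((unsigned char)s[off]) & 0xC0) != 0x80;
}
```

A unit test could tokenize a string containing multi-byte characters and assert that `start` and `start + len` of every token pass this check.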

@Rot127 Rot127 mentioned this pull request Jul 2, 2022
@Rot127 Rot127 requested a review from XVilka July 2, 2022 18:52
The color mode needed to be set to COLOR_MODE_16; otherwise the default colors for red, green, etc. on Windows and Linux differed too much.
@XVilka XVilka added the refactor Refactoring requests label Jul 25, 2022
wargio (Member) left a comment

the lea hack is just horrible, please add a comment that says that it needs to be refactored and simplified

Rot127 (Member, Author) commented Jul 26, 2022

@wargio

> the lea hack is just horrible, please add a comment that says that it needs to be refactored and simplified

This is already addressed here: #2766

ret2libc (Member) commented

[Screenshot]

Any chance those functions in the call X instructions can be colored? I find white to be a bit anonymous. My theme is "default".

ret2libc (Member) left a comment

Apart from comment above, looks good to me, good work :)

Rot127 (Member, Author) commented Jul 27, 2022

Please wait with merging. Just found a bug with the pi command.

Rot127 (Member, Author) commented Jul 27, 2022

> Any chance those functions in the call X instructions can be colored? I find white to be a bit anonymous. My theme is "default".

@ret2libc I couldn't find a quick fix for that. The call addresses are replaced somewhere before the tokenization; otherwise they would be yellow, just like the other flag name at 0x4440. I could look into it next week and add it in another PR.

ret2libc (Member) commented

> > Any chance those functions in the call X instructions can be colored? I find white to be a bit anonymous. My theme is "default".
>
> @ret2libc I couldn't find a quick fix for that. The call addresses are replaced somewhere before the tokenization; otherwise they would be yellow, just like the other flag name at 0x4440. I could look into it next week and add it in another PR.

I'm ok with it, thanks. I consider that a small regression, but I think it's fine if it is fixed before the next release.

ret2libc (Member) left a comment

The rz_asm_token_pattern_builder pattern may not really be necessary yet; it can be introduced once we see the code being repeated in multiple places. Right now there is only hexagon.

Comment on lines +38 to +44
pat = RZ_NEW0(RzAsmTokenPattern);
pat->type = RZ_ASM_TOKEN_META;
pat->pattern = strdup(
	"(#{1,2})|(\\}$)|" // Immediate prefix, Closing packet bracket
	"\\.new|:n?t|:raw|<err>" // .new and jump hints
);
rz_pvector_push(pvec, pat);
Member

For next PR or a following one... I feel these operations are not really specific to hexagon. Maybe a sort of "builder API" would help here to avoid duplication.

Something like:

RzPVector *pvec = rz_asm_token_pattern_builder();
rz_asm_token_pattern_builder_add(RZ_ASM_TOKEN_META, ".....");

Member Author

Sounds like a great idea to me!

return pvec;
}

static void compile_token_patterns(RZ_INOUT RzPVector /* RzAsmTokenPattern* */ *patterns) {
Member

also this function likely will be common to all?

rz_asm_token_pattern_builder_compile(pvec);
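
As a rough illustration of the builder idea discussed above, the sketch below collects typed regex patterns and compiles them in one shared step. To stay self-contained it uses POSIX `<regex.h>` in place of RzRegex and a fixed-size array in place of RzPVector; all `demo_` names, the pattern limit and the signatures are assumptions made for this example, not the API that was eventually adopted.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_PATTERNS 16 /* arbitrary limit for the sketch */

typedef enum { DEMO_PAT_UNKNOWN = 0, DEMO_PAT_META } DemoPatType;

typedef struct {
	DemoPatType type;
	char *pattern; /* owned copy of the regex source */
	regex_t regex; /* valid only if compiled is true */
	bool compiled;
} DemoPattern;

typedef struct {
	DemoPattern items[DEMO_MAX_PATTERNS];
	size_t count; /* must start at 0 */
} DemoPatternBuilder;

/* Registers one pattern for a token type; nothing is compiled yet. */
bool demo_builder_add(DemoPatternBuilder *b, DemoPatType type, const char *pattern) {
	assert(b && pattern);
	if (b->count >= DEMO_MAX_PATTERNS) {
		return false;
	}
	char *copy = malloc(strlen(pattern) + 1);
	if (!copy) {
		return false;
	}
	strcpy(copy, pattern);
	DemoPattern *p = &b->items[b->count++];
	p->type = type;
	p->pattern = copy;
	p->compiled = false;
	return true;
}

/* Compiles every pattern that is not compiled yet (the step shared by all plugins). */
bool demo_builder_compile(DemoPatternBuilder *b) {
	assert(b);
	for (size_t i = 0; i < b->count; i++) {
		DemoPattern *p = &b->items[i];
		if (p->compiled) {
			continue;
		}
		if (regcomp(&p->regex, p->pattern, REG_EXTENDED) != 0) {
			return false;
		}
		p->compiled = true;
	}
	return true;
}

/* Returns the type of the first pattern that matches s and reports where. */
DemoPatType demo_builder_match(const DemoPatternBuilder *b, const char *s, size_t *start, size_t *len) {
	assert(b && s && start && len);
	for (size_t i = 0; i < b->count; i++) {
		regmatch_t m;
		if (b->items[i].compiled && regexec(&b->items[i].regex, s, 1, &m, 0) == 0) {
			*start = (size_t)m.rm_so;
			*len = (size_t)(m.rm_eo - m.rm_so);
			return b->items[i].type;
		}
	}
	return DEMO_PAT_UNKNOWN;
}

/* Releases compiled regexes and pattern copies. */
void demo_builder_free(DemoPatternBuilder *b) {
	assert(b);
	for (size_t i = 0; i < b->count; i++) {
		if (b->items[i].compiled) {
			regfree(&b->items[i].regex);
		}
		free(b->items[i].pattern);
	}
	b->count = 0;
}
```

An architecture plugin would then only list its patterns with `demo_builder_add()` and call `demo_builder_compile()` once, instead of repeating the allocate, assign, push and compile steps for every pattern.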

Comment on lines +45 to +90
typedef enum {
	RZ_ASM_TOKEN_UNKNOWN = 0, //< Does not fit to any token below.
	RZ_ASM_TOKEN_MNEMONIC, //< Asm mnemonics like: mov, push, lea...
	RZ_ASM_TOKEN_OPERATOR, //< Arithmetic operators: +,-,<< etc.
	RZ_ASM_TOKEN_NUMBER, //< Numbers
	RZ_ASM_TOKEN_REGISTER, //< Registers
	RZ_ASM_TOKEN_SEPARATOR, //< Brackets, comma etc.
	RZ_ASM_TOKEN_META, //< Meta information (e.g Hexagon packet prefix, ARM & Hexagon number prefix).

	RZ_ASM_TOKEN_LAST,
} RzAsmTokenType;

/**
 * \brief A token of an asm string holding meta data.
 */
typedef struct {
	size_t start; //< byte-offset into `str` where this token starts. Must be exactly at a utf-8 codepoint boundary.
	size_t len; //< `str` length of token in bytes.
	RzAsmTokenType type;
	union {
		ut64 number; //< Number of RZ_ASM_TOKEN_NUMBER
	} val;
} RzAsmToken;

/**
 * \brief An tokenized asm string.
 */
typedef struct {
	ut32 op_type; ///< RzAnalysisOpType. Mnemonic color depends on this.
	RzStrBuf *str; //< Contains the raw asm string
	RzVector /* <RzAsmToken> */ *tokens; //< Contains only the tokenization meta-info without strings, ordered by start for log2(n) access
} RzAsmTokenString;

typedef struct {
	const RzRegSet *reg_sets; ///< Array of reg sets used to lookup register names during parsing.
	ut32 ana_op_type; ///< Analysis op type (see: _RzAnalysisOpType) of the token string to parse.
} RzAsmParseParam;

/**
 * \brief Pattern for a asm string token.
 */
typedef struct {
	RzAsmTokenType type; //< Asm token type.
	char *pattern; //< The regex pattern describing the tokens.
	RzRegex *regex; //< Compiled regex pattern.
} RzAsmTokenPattern;
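
The comment on `tokens` above mentions that keeping the vector ordered by `start` allows log2(n) access. A minimal sketch of such a lookup, finding the token that covers a given byte offset by binary search, is shown below. `DemoSpan` and `demo_find_token` are hypothetical names, and the sketch assumes the tokens are sorted by `start` and do not overlap; it does not use the rizin vector API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the start/len part of a token. */
typedef struct {
	size_t start; /* byte offset where the token begins */
	size_t len; /* length in bytes */
} DemoSpan;

/*
 * Returns the index of the token that contains byte offset off, or -1 if no
 * token covers it. Assumes toks is sorted by start and tokens do not overlap,
 * so each step can discard half of the remaining range.
 */
long demo_find_token(const DemoSpan *toks, size_t n, size_t off) {
	assert(toks || n == 0);
	size_t lo = 0, hi = n;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (off < toks[mid].start) {
			hi = mid; /* target lies before this token */
		} else if (off >= toks[mid].start + toks[mid].len) {
			lo = mid + 1; /* target lies after this token */
		} else {
			return (long)mid;
		}
	}
	return -1;
}
```

A lookup like this is what makes it cheap to answer questions such as "which token is under the cursor" without scanning the whole token list.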
Member

should these structs really be in RzPrint? They are also called RzAsmsomething so maybe they belong to RzAsm?

WDYT?

Rot127 (Member, Author) commented Jul 27, 2022

I agree that they should go to the RzAsm module. But rz_print_colorize_asm_str needs RzAsmTokenString as a type.
So I see two options:

  • Give a const void * to rz_print_colorize_asm_str instead of RzAsmTokenString * (a bit ugly),
  • or those structs belong to RzPrint (which doesn't fit that well).

Or is there a third way?

Member

move rz_print_colorize_asm_str to RzCore?

XVilka (Member) commented Jul 27, 2022

I suggest opening an issue with suggestions/feedback here so it's not forgotten, we also can add more during the process.


Successfully merging this pull request may close these issues.

Syntax highlighting is broken
5 participants