-
-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenize and colorize asm strings #2417
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you provide some unit tests for this new API before moving too ahead so that we can get an idea of how it is going to be used in practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add also test for strings containing also more complex Unicode strings, see, for example, #2125
The color mode needed to be set to COLOR_MODE_16 otherwise the default colors for red, green etc. on windows and linux differed to much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lea hack is just horrible, please add a comment that says that it needs to be refactored and simplified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from comment above, looks good to me, good work :)
Please wait with merging. Just found a bug with the |
@ret2libc Couldn't find a quick fix for that. But the call addresses were replaced before the tokenization somewhere. Otherwise would be yellow, just like the other flag name at 0x4440. I could look into it next week and add it in another PR. |
I'm ok with it, thanks. I consider that a small regression but i think it's fine if fixed before next release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rz_asm_token_pattern_builder pattern maybe is not really necessary and it can be introduced only once we see the code being repeated in multiple places. Right now there is only hexagon.
pat = RZ_NEW0(RzAsmTokenPattern); | ||
pat->type = RZ_ASM_TOKEN_META; | ||
pat->pattern = strdup( | ||
"(#{1,2})|(\\}$)|" // Immediate prefix, Closing packet bracket | ||
"\\.new|:n?t|:raw|<err>" // .new and jump hints | ||
); | ||
rz_pvector_push(pvec, pat); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For next PR or a following one... I feel these operations are not really specific to hexagon. Maybe a sort of "builder API" would help here to avoid duplication.
Something like:
RzPVector *pvec = rz_asm_token_pattern_builder();
rz_asm_token_pattern_builder_add(RZ_ASM_TOKEN_META, ".....");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a great idea to me!
return pvec; | ||
} | ||
|
||
static void compile_token_patterns(RZ_INOUT RzPVector /* RzAsmTokenPattern* */ *patterns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this function likely will be common to all?
rz_asm_token_pattern_builder_compile(pvec);
typedef enum { | ||
RZ_ASM_TOKEN_UNKNOWN = 0, //< Does not fit to any token below. | ||
RZ_ASM_TOKEN_MNEMONIC, //< Asm mnemonics like: mov, push, lea... | ||
RZ_ASM_TOKEN_OPERATOR, //< Arithmetic operators: +,-,<< etc. | ||
RZ_ASM_TOKEN_NUMBER, //< Numbers | ||
RZ_ASM_TOKEN_REGISTER, //< Registers | ||
RZ_ASM_TOKEN_SEPARATOR, //< Brackets, comma etc. | ||
RZ_ASM_TOKEN_META, //< Meta information (e.g Hexagon packet prefix, ARM & Hexagon number prefix). | ||
|
||
RZ_ASM_TOKEN_LAST, | ||
} RzAsmTokenType; | ||
|
||
/** | ||
* \brief A token of an asm string holding meta data. | ||
*/ | ||
typedef struct { | ||
size_t start; //< byte-offset into `str` where this token starts. Must be exactly at a utf-8 codepoint boundary. | ||
size_t len; //< `str` length of token in bytes. | ||
RzAsmTokenType type; | ||
union { | ||
ut64 number; //< Number of RZ_ASM_TOKEN_NUMBER | ||
} val; | ||
} RzAsmToken; | ||
|
||
/** | ||
* \brief An tokenized asm string. | ||
*/ | ||
typedef struct { | ||
ut32 op_type; ///< RzAnalysisOpType. Mnemonic color depends on this. | ||
RzStrBuf *str; //< Contains the raw asm string | ||
RzVector /* <RzAsmToken> */ *tokens; //< Contains only the tokenization meta-info without strings, ordered by start for log2(n) access | ||
} RzAsmTokenString; | ||
|
||
typedef struct { | ||
const RzRegSet *reg_sets; ///< Array of reg sets used to lookup register names during parsing. | ||
ut32 ana_op_type; ///< Analysis op type (see: _RzAnalysisOpType) of the token string to parse. | ||
} RzAsmParseParam; | ||
|
||
/** | ||
* \brief Pattern for a asm string token. | ||
*/ | ||
typedef struct { | ||
RzAsmTokenType type; //< Asm token type. | ||
char *pattern; //< The regex pattern describing the tokens. | ||
RzRegex *regex; //< Compiled regex pattern. | ||
} RzAsmTokenPattern; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these structs really be in RzPrint? They are also called RzAsmsomething so maybe they belong to RzAsm?
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that they should go to RzAsm modukle. But rz_print_colorize_asm_str
needs the RzAsmTokenStr as type.
So I see
- Give a
const void *
torz_print_colorize_asm_str
instead ofRzAsmTokenString*
(bit ugly) - or those structs blong to
RzPrint
(which doesn't fit that well).
Or is there a third way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move rz_print_colorize_asm_str
to RzCore?
I suggest opening an issue with suggestions/feedback here so it's not forgotten, we also can add more during the process. |
SQUASH ME
Your checklist for this pull request
Detailed description
This is the first implementation for tokenization of asm strings. It adds:
It also cleans up the coloring of the asm strings quite a bit. Additionally it possibly comes in handy in the future to have the asm string nicely split into parts.
Still missing:
token_val
withstrtoull()
.0
and0x0
are not recognized as number.Test plan
Added
Closing issues
closes rizinorg/rz-hexagon#9
Examples
Here are some (out of date) pictures to compare the current coloring vs. the token based coloring.
On the top is the token base coloring. Below that the current way.
Examples
x86
arm
mips
dex
hexagon