Improved extraction of function-level files (i.e. bytes, bytes-masked, func-cfg) #60

Open
valbucci wants to merge 36 commits into br0kej:dev from valbucci:grouped-function-extraction

@valbucci

Contributor

Several changes improve the performance of bin2ml when extracting function-level information. The overhead was barely noticeable on small files, but binaries with tens of thousands of functions took needlessly long and used more resources than necessary. This PR makes the following improvements:

  • Single command execution at defined address.
    • Previously, running a command (e.g. agfj) at a given address (e.g. 0xdeadbeef) took two commands:
      • s 0xdeadbeef
      • agfj
    • This update merges the two into a single command, e.g. agfj @ 3735928559
      • Note that the hexadecimal address 0xdeadbeef is passed as the decimal integer 3735928559: addresses are stored as u64 values, so this avoids an unnecessary hex conversion.
  • Single initialisation of the binary list of functions with get_function_list and setup_function_list
    • Previously, if multiple modes needed a list of functions, they would each run aflj.
    • Now aflj is executed only once per file, by setup_function_list.
    • To access the list of functions, call get_function_list: it runs setup_function_list if the list has not been built yet, and otherwise returns the existing list.
  • Since function file names may not be consistent (e.g. invalid ASCII characters replaced, length truncated, user-defined format), this PR introduces a function index inside the extracted folder.
    • For example, for the extraction mode bytes, the index would be stored in the path path/to/bin2ml_output/binary-name_bytes/00-func-index_bytes.csv.
    • The index contains the function name, address, byte size, number of instructions, number of basic blocks, and the path to the function's output data file.
    • This makes it easier to locate function files, independently of any transformations applied to the file names.
  • Sometimes the user may only want data for functions of a certain size, so this PR introduces a filter on the minimum number of basic blocks.
    • For example, a user may run bin2ml extract --min-basic-blocks 5 -f path/to/bin -o /path/to/bin2ml_output/ -m bytes to extract the bytes of all functions with at least 5 basic blocks.
    • The filter is disabled by default.
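
For reviewers who want a concrete picture, the merged command, the run-once function list, and the basic-block filter described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the FunctionInfo and Extractor types, their fields, the command_at and filter_min_blocks helpers, and the list_calls counter are all hypothetical, and the aflj call is replaced by hard-coded data. Only the names get_function_list and setup_function_list come from the description.

```rust
// Hypothetical sketch; the types and helpers here are assumptions,
// not bin2ml's actual implementation.

/// Minimal stand-in for one entry of the function list (as returned by `aflj`).
#[derive(Debug, Clone, PartialEq)]
struct FunctionInfo {
    name: String,
    addr: u64,
    nbbs: u64, // number of basic blocks
}

/// Build one command that runs `cmd` at `addr`, instead of a seek
/// (`s <addr>`) followed by the command. The u64 address is written
/// in decimal, so no hex formatting is needed.
fn command_at(cmd: &str, addr: u64) -> String {
    format!("{} @ {}", cmd, addr)
}

/// Caches the function list so the listing runs at most once per file.
struct Extractor {
    function_list: Option<Vec<FunctionInfo>>,
    list_calls: usize, // how many times the listing was actually run
}

impl Extractor {
    fn new() -> Self {
        Extractor { function_list: None, list_calls: 0 }
    }

    /// Stand-in for running `aflj` and parsing its JSON output.
    fn setup_function_list(&mut self) {
        self.list_calls += 1;
        self.function_list = Some(vec![
            FunctionInfo { name: "main".to_string(), addr: 0x1000, nbbs: 7 },
            FunctionInfo { name: "helper".to_string(), addr: 0x2000, nbbs: 2 },
        ]);
    }

    /// Return the cached list, building it first if necessary.
    fn get_function_list(&mut self) -> &[FunctionInfo] {
        if self.function_list.is_none() {
            self.setup_function_list();
        }
        self.function_list.as_deref().unwrap_or(&[])
    }
}

/// Keep functions with at least `min` basic blocks; `None` disables the filter.
fn filter_min_blocks(funcs: &[FunctionInfo], min: Option<u64>) -> Vec<&FunctionInfo> {
    funcs
        .iter()
        .filter(|f| min.map_or(true, |m| f.nbbs >= m))
        .collect()
}
```

In this sketch, command_at("agfj", 0xdeadbeef) produces the string agfj @ 3735928559, repeated calls to get_function_list trigger only one listing, and passing None to filter_min_blocks mirrors the filter being disabled by default.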

valbucci added 30 commits June 11, 2025 10:51