Function file name sanitisation #49

Open
valbucci wants to merge 5 commits into br0kej:dev from valbucci:func-file-names

@valbucci valbucci commented Jun 12, 2025

Contributor

File name sanitization

There used to be a bug, described in issue #45, where data extraction and generation fail if a function's name is too long.

This fix checks whether a file name is longer than 100 characters and, if so, truncates it so that the leading 50 and trailing 50 characters are retained (to avoid duplicates) and the characters in the middle are removed. For example, a function named:

  • sym.slices.siftDownCmpFunc_go.shape.interface__Info____io_fs.FileInfo__error___IsDir___bool__Name___string__Type___io_fs.FileMode (129 characters)
  • is renamed to:
  • sym.slices.siftDownCmpFunc_go.shape.interface__InfIsDir___bool__Name___string__Type___io_fs.FileMode (100 characters)

The function sanitize_filename (see src/utils.rs:L95) replaces all invalid characters with _ and performs the bidirectional truncation described above.
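A minimal sketch of this behaviour, for illustration only. It is not the code at src/utils.rs:L95: the set of characters treated as valid, the constant names, and the char-based (rather than byte-based) length check are all assumptions.

```rust
/// Names longer than this are truncated (value taken from the description above).
const MAX_LEN: usize = 100;
/// Number of characters kept from each end of an over-long name.
const KEEP: usize = 50;

/// Hypothetical re-implementation of the described behaviour.
fn sanitize_filename(name: &str) -> String {
    // Assumption: only ASCII alphanumerics, '.', '-' and '_' are valid;
    // every other character is replaced with '_'.
    let cleaned: Vec<char> = name
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || matches!(c, '.' | '-' | '_') {
                c
            } else {
                '_'
            }
        })
        .collect();

    // Short enough: return the cleaned name unchanged.
    if cleaned.len() <= MAX_LEN {
        return cleaned.into_iter().collect();
    }

    // Too long: keep the first and last KEEP characters, drop the middle.
    let head = &cleaned[..KEEP];
    let tail = &cleaned[cleaned.len() - KEEP..];
    head.iter().chain(tail.iter()).collect()
}
```

Keeping both ends rather than only a prefix matters because many long symbols share the same leading characters and differ only near the end.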

In the extract mode I added an option, --func-filename, which accepts "symbol" (default), "address", or a custom template. This only works with the bytes mode, since it is the only one that extracts function-level data files.

For example, a function called main0 at address 0xdeadbeef will be extracted with the following file names, depending on the specified option:

| Option | Filename |
| --- | --- |
| symbol | main0.bin |
| address | deadbeef.bin |
| {address}.{symbol} | deadbeef.main0 |
| func-{symbol}.bin | func-main0.bin |
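The mapping in the table can be sketched as a small function. The name func_filename and its signature are hypothetical, and the sketch assumes that {symbol} and {address} are the only placeholders and that addresses are rendered as lowercase hex without a 0x prefix, as in the examples above.

```rust
/// Hypothetical helper: builds a file name from the value given to `--func-filename`.
fn func_filename(option: &str, symbol: &str, address: u64) -> String {
    // Lowercase hex without the "0x" prefix, matching the examples above.
    let addr = format!("{:x}", address);
    match option {
        // The two built-in options append ".bin" themselves.
        "symbol" => format!("{}.bin", symbol),
        "address" => format!("{}.bin", addr),
        // Anything else is treated as a template and used verbatim
        // after substituting the placeholders.
        template => template
            .replace("{address}", &addr)
            .replace("{symbol}", symbol),
    }
}
```

Note that a custom template produces exactly what it spells out, so {address}.{symbol} yields a name with no .bin extension.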

It would be nice to have this functionality in bin2ml generate as well, but it would reduce performance during feature extraction: for example, it would be necessary to run multiple commands for each function instead of agCj @@f.

Secondary changes

I also made a few secondary changes:

  • I added FunctionToBeProcessed (see extract.rs:L706) to hold all the function-related logic such as getters and the byte extraction script.
  • I updated extraction_job_matcher and get_job_type_suffix so that they are based on a single source of truth (see HashMap in extract.rs:L60).
  • I changed some struct data types from signed integers to unsigned integers where appropriate (function size, number of edges).
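The single-source-of-truth idea from the second bullet can be sketched as follows. This is not the code in extract.rs: JobKind, job_table, match_job, job_suffix, the "cg" entry, and the suffix strings are all hypothetical, and only the "bytes" mode name comes from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical job kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobKind {
    Bytes,
    CallGraph,
}

/// The one table both lookups read from: mode name -> (job kind, file suffix).
fn job_table() -> HashMap<&'static str, (JobKind, &'static str)> {
    HashMap::from([
        ("bytes", (JobKind::Bytes, "bytes")),
        ("cg", (JobKind::CallGraph, "cg")),
    ])
}

/// Resolves a mode name to its job kind, or None if the mode is unknown.
fn match_job(mode: &str) -> Option<JobKind> {
    job_table().get(mode).map(|(kind, _)| *kind)
}

/// Resolves a mode name to its output suffix, or None if the mode is unknown.
fn job_suffix(mode: &str) -> Option<&'static str> {
    job_table().get(mode).map(|(_, suffix)| *suffix)
}
```

Because both functions read the same table, adding a mode means adding one entry, and the matcher and the suffix lookup cannot drift apart.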

@valbucci valbucci changed the title Function file name sanitisation (issue #45) Function file name sanitisation Jun 12, 2025