[Question] A Question Regarding the Handling of Special Tokens

First and foremost, thank you for all the excellent work on this project! The code seems very elegant and well-structured.

I have a question about a specific detail, and I apologize in advance if it's something I have overlooked.

Throughout the process, we make use of four special tokens:
- `<|begin_of_solution|>`
- `<|end_of_solution|>`
- `<|begin_of_explanation|>`
- `<|end_of_explanation|>`

My question is whether these have been registered as "special tokens" with the tokenizer. In other words, is each one parsed as a single, unique token?

I noticed that with a tokenizer from the Qwen2.5 family, for example, these markers appear to be split into multiple sub-tokens. `<|begin_of_explanation|>`, for instance, might be tokenized into something like this:

```
'<'
'|'
'begin'
'_of'
'_ex'
'planation'
'|'
'>\n\n'
```

I understand that even with this approach, the model can certainly learn to recognize this sequence of sub-tokens as a coherent pattern.

However, I would be grateful if you could share your thoughts on this implementation detail. Was this multi-token representation an intentional design choice, or have these tokens already been registered as single special tokens and I may have missed that step?
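For reference, here is a minimal sketch of how one could check this and, if desired, register the markers as single tokens. It assumes the Hugging Face `transformers` tokenizer interface (`encode`, `add_tokens`, `convert_ids_to_tokens`, and `resize_token_embeddings` on the model). The helper names `split_report`, `register_markers`, and `demo`, as well as the checkpoint name `Qwen/Qwen2.5-0.5B-Instruct`, are my own choices for illustration and are not taken from this repository.

```python
# Hypothetical helpers for illustration; not part of this project's code.
MARKERS = [
    "<|begin_of_solution|>",
    "<|end_of_solution|>",
    "<|begin_of_explanation|>",
    "<|end_of_explanation|>",
]


def split_report(tokenizer, markers=MARKERS):
    """Map each marker to the list of token ids it encodes to.

    A marker that is registered as a single token yields a list of length 1.
    """
    return {m: tokenizer.encode(m, add_special_tokens=False) for m in markers}


def register_markers(tokenizer, model=None, markers=MARKERS):
    """Add the markers to the vocabulary as special tokens.

    Returns the number of tokens that were actually added. If a model is
    passed and the vocabulary grew, its embedding matrix is resized to match.
    """
    added = tokenizer.add_tokens(markers, special_tokens=True)
    if model is not None and added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return added


def demo(model_name="Qwen/Qwen2.5-0.5B-Instruct"):
    """Print the split before and after registration (downloads a tokenizer)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    for marker, ids in split_report(tok).items():
        print("before:", marker, len(ids), tok.convert_ids_to_tokens(ids))

    register_markers(tok)
    for marker, ids in split_report(tok).items():
        print("after: ", marker, len(ids), tok.convert_ids_to_tokens(ids))
```

Calling `demo()` should show how many sub-tokens each marker produces before and after registration. Note that newly added tokens start with freshly initialized embeddings, so registering them would only be useful if the model is fine-tuned afterwards.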

Thank you for your time and for any insight you can provide.
