First and foremost, thank you for all the excellent work on this project! The code seems very elegant and well-structured.
I have a question about a specific detail, and I apologize in advance if it's something I have overlooked.
Throughout the process, we make use of four special tokens:
<|begin_of_solution|>
<|end_of_solution|>
<|begin_of_explanation|>
<|end_of_explanation|>
My question is whether these have been registered as "special tokens" with the tokenizer. In other words, is each one parsed as a single, unique token?
I noticed that when using a tokenizer, such as from the Qwen2.5 family, these tokens appear to be broken down into multiple sub-tokens. For example, <|begin_of_explanation|> might be tokenized into something like this:
'<'
'|'
'begin'
'_of'
'_ex'
'planation'
'|'
'>\n\n'
I understand that even with this approach, the model can certainly learn to recognize this sequence of sub-tokens as a coherent pattern.
However, I would be grateful if you could share your thoughts on this implementation detail. Was this multi-token representation an intentional design choice, or have these tokens already been registered as single special tokens and I may have missed that step?
Thank you for your time and for any insight you can provide.
First and foremost, thank you for all the excellent work on this project! The code seems very elegant and well-structured.
I have a question about a specific detail, and I apologize in advance if it's something I have overlooked.
Throughout the process, we make use of four special tokens:
<|begin_of_solution|><|end_of_solution|><|begin_of_explanation|><|end_of_explanation|>My question is whether these have been registered as "special tokens" with the tokenizer. In other words, is each one parsed as a single, unique token?
I noticed that when using a tokenizer, such as from the Qwen2.5 family, these tokens appear to be broken down into multiple sub-tokens. For example,
<|begin_of_explanation|>might be tokenized into something like this:I understand that even with this approach, the model can certainly learn to recognize this sequence of sub-tokens as a coherent pattern.
However, I would be grateful if you could share your thoughts on this implementation detail. Was this multi-token representation an intentional design choice, or have these tokens already been registered as single special tokens and I may have missed that step?
Thank you for your time and for any insight you can provide.