Add multisymbol RLE implementation#1052
Conversation
708cfaa to
7d9f08b
Compare
|
can you rebase? |
|
@proppy I've rebased it on to the newest main |
Can you detail a little a bit more in the PR description (or the |
|
I'll update |
Curious why there is three transaction here, and two separate packets for |
|
|
||
| // This encoder is implemented as a net of 4 processes. | ||
| // 1. Reduce stage - this process takes incoming symbols and symbol_valid | ||
| // and reduces them into symbol count pairs. This stage is stateless. |
There was a problem hiding this comment.
if the stage is stateless, does it need to be a proc?
There was a problem hiding this comment.
As far as I understand DSLX if I want to perform channel operations I have to use proc's.
| // This encoder is implemented as a net of 4 processes. | ||
| // 1. Reduce stage - this process takes incoming symbols and symbol_valid | ||
| // and reduces them into symbol count pairs. This stage is stateless. | ||
| // 2. Realign stage - this process moves pairs emitted from previous stage |
There was a problem hiding this comment.
can you add a note describing the state for each "stage" when they have one?
|
|
||
| // This encoder is implemented as a net of 4 processes. | ||
| // 1. Reduce stage - this process takes incoming symbols and symbol_valid | ||
| // and reduces them into symbol count pairs. This stage is stateless. |
There was a problem hiding this comment.
isn't this stage equivalent to the current rle_enc can the implementation be shared?
There was a problem hiding this comment.
I can't see any obvious code that can be shared between reduce step and rle_enc.
Main difference is that rle_enc code has to take into account state and update it f necessary,
while reduce step doesn't have one.
| // 1. Reduce stage - this process takes incoming symbols and symbol_valid | ||
| // and reduces them into symbol count pairs. This stage is stateless. | ||
| // 2. Realign stage - this process moves pairs emitted from previous stage | ||
| // so that they are align to the left, it also calculates propagation distance |
There was a problem hiding this comment.
can you give an example to show 'align to the left` refer to?
There was a problem hiding this comment.
Sure, let the input to 1 step be [A, A, B, B].
Output from the first step will be [..., (A,2), ..., (B,2)].
2 step will transform it to [(A,2), (B,2), ..., ...].
There was a problem hiding this comment.
see the comment below can you also explain what ... refers to?
There was a problem hiding this comment.
Here ... was used for an invalid data symbol, counter pair.
| // 1. Reduce stage - this process takes incoming symbols and symbol_valid | ||
| // and reduces them into symbol count pairs. This stage is stateless. | ||
| // 2. Realign stage - this process moves pairs emitted from previous stage | ||
| // so that they are align to the left, it also calculates propagation distance |
There was a problem hiding this comment.
can you give an example of a propagation distance?
There was a problem hiding this comment.
Let's assume that count is 2 bit wide and input can ingest 4 symbols at once.
Let the input be [A,A,A,A], so after the second stage it will look like:
[(A,3), (A,1), ..., ...] and propagation distance will be 1.
For input [A,B,B,B] second step will emit [(A,1), (B,3), ..., ...] with propagation distance 0.
There was a problem hiding this comment.
see the comment below, can you update the comment with a textual explanation of what the propagation distance refer to? (here is seems that it mainly indicate with a packet has been broken or not)
| // 2. Realign stage - this process moves pairs emitted from previous stage | ||
| // so that they are align to the left, it also calculates propagation distance | ||
| // for the first pair. | ||
| // 3. Core stage - this stage is stateful. It takes align pairs, |
There was a problem hiding this comment.
can you document what the state is?
| // so that they are align to the left, it also calculates propagation distance | ||
| // for the first pair. | ||
| // 3. Core stage - this stage is stateful. It takes align pairs, | ||
| // and combines them with its state.It outputs multiple symbol/count pairs. |
There was a problem hiding this comment.
can you give an example of the output to show how it different from the received input?
| // for the first pair. | ||
| // 3. Core stage - this stage is stateful. It takes align pairs, | ||
| // and combines them with its state.It outputs multiple symbol/count pairs. | ||
| // 4 - Adjust Width stage - this stage takes output from the core stage. |
| // This encoder is implemented as a net of 4 processes. | ||
| // 1. Reduce stage - this process takes incoming symbols and symbol_valid | ||
| // and reduces them into symbol count pairs. This stage is stateless. | ||
| // 2. Realign stage - this process moves pairs emitted from previous stage |
There was a problem hiding this comment.
does it takes input from the preceding "stage"? if yes, maybe we should state it explicitly.
| // 2. Realign stage - this process moves pairs emitted from previous stage | ||
| // so that they are align to the left, it also calculates propagation distance | ||
| // for the first pair. | ||
| // 3. Core stage - this stage is stateful. It takes align pairs, |
There was a problem hiding this comment.
does it takes input from the preceding "stage"? if yes, maybe we should state it explicitly.
| // 3. Core stage - this stage is stateful. It takes align pairs, | ||
| // and combines them with its state.It outputs multiple symbol/count pairs. | ||
| // 4 - Adjust Width stage - this stage takes output from the core stage. | ||
| // If output can handle more or equal number of pairs as |
There was a problem hiding this comment.
How does the "stage" know that the "output can handle more or equal the number of pairs", can you give an example
There was a problem hiding this comment.
Output and input widths are parametrized INPUT_WIDTH and OUTPUT_WIDTH.
If OUTPUT_WIDTH is greater or equal to INPUT_WIDTH, adjust width step has nothing to adjust.
| // If output can handle more or equal number of pairs as | ||
| // input number of symbols. This stage does nothing. | ||
| // If the output is narrower than the input, | ||
| // this stage will serialize symbol counter pairs. |
There was a problem hiding this comment.
this stage will serialize symbol counter pairs.
as it will split them to match the size of the desired output? can you give an example?
There was a problem hiding this comment.
Let input width be 4 and output with is 2. Core step can emit the following : [(A,1), (B,1) , (A,1), (B,1)].
Adjust Width step will serialize it to [(A,1(, (B,1)], [(A,1), (B,1)]
This was with an assumption that RLE counter is 2 bit wide, so it can only hold at most 3 symbols, while C was repeated 4 times. |
c9daa7d to
5a7ead3
Compare
6bb2500 to
981283c
Compare
| pub proc RunLengthDecoder<SYMBOL_WIDTH: u32, COUNT_WIDTH: u32> { | ||
| input_r: chan<DecInData<SYMBOL_WIDTH, COUNT_WIDTH>> in; | ||
| output_s: chan<DecOutData<SYMBOL_WIDTH>> out; | ||
| input_r: chan<DecInData<SYMBOL_WIDTH, COUNT_WIDTH, 1>> in; |
There was a problem hiding this comment.
I wonder if it would be more friendly to parameterize with max_count and use https://google.github.io/xls/dslx_std/#stdclog2 + 1 to represent the width?
There was a problem hiding this comment.
As far as I remember there are parts of code that assume max_count to be 2^k-1.
I'm not sure when user would like to set max_count to be something other than power of 2 -1.
| // The behavior of the encoder is presented on the waveform below: | ||
|
|
||
| // This encoder is implemented as a net of 4 proc. | ||
| // 1. Reduce proc - this process takes incoming symbols and symbol_valid |
There was a problem hiding this comment.
is 1. effectively equivalent to the existing RLE enc implementation?
There was a problem hiding this comment.
It's not exactly the same thing as existing RLE enc. Existing RLE has state that is combined with incoming data,
this stage doesn't have state, it only works on data visible in a single channel payload and doesn't include stream history.
| // distance for the first pair. | ||
| // Example behaviours: | ||
| // 1) | ||
| // input: [.., (A, 2), .., (B, 2)] |
There was a problem hiding this comment.
It was to mark invalid data and data, count pairs.
| // and reduces them into symbol count pairs. This step is stateless. | ||
| // 2. Realign proc - this process moves pairs emitted from the reduce step | ||
| // so that they are aligned to the left, it also calculates propagation | ||
| // distance for the first pair. |
There was a problem hiding this comment.
can you also explain what "propagation distance" mean here? (it's not clear, at least to me, from just reading the example)
There was a problem hiding this comment.
Here propagation distance is the number of consecutive pairs, starting form first pair, that have the same symbol value as the first one and have their counter maxed out. This way 'Core proc' can only check "propagation distance"th pair value and counter.
- If value differs from the internal state, then all pairs from [0:"propagation distance") will differ and core should combine payload with its state.
- If value is equal internal state, then "Core proc" only checks if internal state counter + "propagation distance"th pair counter overflow.
| // output: [(A, 2), (B, 2), .., ..] | ||
| // propagation distance: 0 | ||
| // 2) | ||
| // input: [.., .., (A, 3), (A, 1)] |
There was a problem hiding this comment.
(A, 3), (A, 1) can this actually happen? wouldn't the previous step simply emit (A, 4)?
There was a problem hiding this comment.
In the examples I was assuming counter 2 bit wide, so that's why there is (A, 3), (A, 1) and not (A, 4).
| // 3. Core proc - this step is stateful. It takes align pairs from | ||
| // the realign step, and combines them with its state to create multiple | ||
| // symbol/count pairs output. State is represented by following tuple | ||
| // `<symbol, count, last>`. It contains symbol and count from last pair |
There was a problem hiding this comment.
does it effectively means that we're counting again? is that right to assume that if each transaction only include one symbol this does nothing?
There was a problem hiding this comment.
This is the first proc with a state that holds stream history. If a stream would only have one data value in all symbols.
First processing state would reduce incoming data stream into (data_value, input_count), this stage would combine this pair with one that preceded it, and only emit output pair when either internal counter overflows or stream ends.
| ||| | ||
| |-----|-------| | ||
| |input|output | | ||
| |[(A, True), (A, True), (A, True), (A, True)]|[.., .., (A, 3), (A, 1)]| |
There was a problem hiding this comment.
is the boolean after the symbol in the example corresponding to !last?
There was a problem hiding this comment.
Thanks for catch this. I've forgotten to put last in this example.
This boolean value next to the symbol is data valid value.
I'll update this example to include last value
| repeating symbols to the output stream that contains the symbols and | ||
| the number of its consecutive occurrences in the input stream. | ||
|
|
||
| Overall, we can break down the data processing into four stages: reduction, alignment, compression, and output generation. The division of responsibility allowed the specialized blocks to efficiently process data and made it possible test each functionality separately. |
There was a problem hiding this comment.
did we measure the overhead of splitting the process into different proc in term of additional registers and wires, compared to a single proc implementation?
There was a problem hiding this comment.
I haven't checked how much more resource does multi process implementation uses over single process implementation, and how max frequency is affected by this.
|
|
||
| ### Process | ||
| 1. Reduce step - this process takes incoming symbols and symbol_valid | ||
| and reduces them into symbol count pairs. This step is stateless. |
There was a problem hiding this comment.
couldn't stateless proc be in theory implemented as fn?
There was a problem hiding this comment.
All state less processes are effectively a fns, but as far as I know you can't put recv/send operations into fns.
Using fn would potentially reduce fmax, as first proc would have to do all the processing before it would be able to update it's state, or we would reduce pipeline throughput when using II>1 to improve fmax.
| ## Encoder processing pipeline detailed breakdown. | ||
|
|
||
| ### Initial conditions | ||
| - input width is 4 symbols wide, |
There was a problem hiding this comment.
does this effectively assume that the input bit-width is always a multiple of the symbols size? i.e: you couldn't have block with a u9 input bus that process repeating u2 symbol?
There was a problem hiding this comment.
Yes, implementation assumes that the input is always x symbols wide.
Handling u9 to u2 conversion should be done by additional proc.
As it will know how to handle that extra one bit,
if that bit is a part of symbol that was split on u9 boundary, or it has some other meaning.
981283c to
e4e50e9
Compare
This commit adds valid signals to PlainData interface. CompressedData interface uses `count > 0` to define valid symbol count pair. This is in preparation for a multisymbol RLE encoder implementation. Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
This encoder is capable of ingesting multiple symbols and produces multiple compressed pairs. It should offer faster compression in exchange for area used. Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
e4e50e9 to
d28d81b
Compare
|
Should we close this in favor of the work on #1211 ? |
|
@tmichalak closing this now that the more relevant #1315 has been merged. |
This PR adds more advanced implementation of RunLengthEncoder that is capable of ingesting
multiple symbols at once and produce multiple pairs in output.
This PR depends on #1006
It updates implementations of simple encoder and simple decoder to work with new
channel description.
Main differentiating factor is that original RLE allows for symbol width to be adjusted.
This implementation adds ability to process multiple symbols at once.
Let's take a base RLE and multisybol RLE with input that takes up to 4 symbols and outputs up to 2 pairs
and following sequence:
AABCCCCBA, A, B, C, C, C, C, B.Generating output takes require 5 transactions:
(2,A), (1,B), (3,C), (1,C), (1,B)AABC, CCCB.Generating output takes only 3 transactions:
[(2,A), (1,B)], [(3,C), (1,C)], [(1,B)]Multisybol RL encoder should have higher throughput but wt will use more logic to implement.