Add multisymbol RLE implementation by mtdudek · Pull Request #1052 · google/xls

mtdudek · 2023-07-03T16:33:57Z

This PR adds a more advanced implementation of RunLengthEncoder that can ingest
multiple symbols at once and produce multiple pairs per output transaction.

This PR depends on #1006
It updates the implementations of the simple encoder and simple decoder to work with the new
channel description.

The main differentiating factor: the original RLE allows the symbol width to be adjusted,
while this implementation adds the ability to process multiple symbols at once.
Let's take a base RLE and a multisymbol RLE whose input takes up to 4 symbols and whose output emits up to 2 pairs,
and the following sequence: AABCCCCB

The base RLE has to ingest symbols one by one. It takes 8 channel transactions: A, A, B, C, C, C, C, B.
Generating the output requires 5 transactions: (2,A), (1,B), (3,C), (1,C), (1,B)
The multisymbol RLE will ingest the symbols in 2 transactions, as AABC, CCCB.
Generating the output takes only 3 transactions: [(2,A), (1,B)], [(3,C), (1,C)], [(1,B)]

The multisymbol RLE encoder should have higher throughput, but it will use more logic to implement.
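
The transaction counts above can be reproduced with a small behavioural model. This is only a sketch in Python, not the DSLX code from this PR: the names `rle_pairs` and `transactions` are hypothetical, the counter is assumed to be 2 bits wide (so a run saturates at 3), and output pairs are assumed to be packed densely into transactions.

```python
from math import ceil


def rle_pairs(symbols, count_width=2):
    """Encode `symbols` into (count, symbol) pairs with a saturating counter."""
    max_count = (1 << count_width) - 1  # assumption: 2-bit counter -> max 3
    pairs = []
    for sym in symbols:
        if pairs and pairs[-1][1] == sym and pairs[-1][0] < max_count:
            # Same symbol and the counter still has room: extend the run.
            pairs[-1] = (pairs[-1][0] + 1, sym)
        else:
            # New symbol, or the counter is saturated: start a new pair.
            pairs.append((1, sym))
    return pairs


def transactions(n_items, width):
    """Channel transactions needed to move `n_items` items, `width` per transaction."""
    return ceil(n_items / width)


sequence = "AABCCCCB"
pairs = rle_pairs(sequence)
print(pairs)  # [(2, 'A'), (1, 'B'), (3, 'C'), (1, 'C'), (1, 'B')]
print("base:       ", transactions(len(sequence), 1), "in,", transactions(len(pairs), 1), "out")
print("multisymbol:", transactions(len(sequence), 4), "in,", transactions(len(pairs), 2), "out")
```

With these assumptions the model gives 8 input and 5 output transactions for the base RLE, and 2 input and 3 output transactions for the multisymbol RLE.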

mtdudek · 2023-07-20T09:00:09Z

@proppy I've updated this PR to depend on #1073.

proppy · 2023-07-24T18:32:16Z

can you rebase?

mtdudek · 2023-07-25T09:47:41Z

@proppy I've rebased it onto the newest main.

proppy · 2023-07-25T10:01:26Z

implementation of RunLengthEncoder that is capable of ingesting multiple symbols at once and produce multiple pairs in output

Can you detail a little bit more in the PR description (or the .x header) how this differs from the original implementation, maybe with examples of input / output? The original implementation did support variable symbol width, so it'd be good to highlight how "multisymbol" differs from "larger symbol".

mtdudek · 2023-07-31T12:42:06Z

I'll update the .x file with a similar explanation and a waveform.
I've updated the PR description.

proppy · 2023-08-24T14:50:29Z

[(2,A), (1,B)], [(3,C), (1,C)], [(1,B)]

Curious why there are three transactions here, and two separate packets for C?

proppy · 2023-08-24T14:58:35Z

+
+// This encoder is implemented as a net of 4 processes.
+// 1. Reduce stage - this process takes incoming symbols and symbol_valid
+// and reduces them into symbol count pairs. This stage is stateless.


if the stage is stateless, does it need to be a proc?

As far as I understand DSLX, if I want to perform channel operations I have to use procs.

proppy · 2023-08-24T15:00:26Z

+// This encoder is implemented as a net of 4 processes.
+// 1. Reduce stage - this process takes incoming symbols and symbol_valid
+// and reduces them into symbol count pairs. This stage is stateless.
+// 2. Realign stage - this process moves pairs emitted from previous stage


can you add a note describing the state for each "stage" when they have one?

proppy · 2023-08-24T15:02:17Z

+
+// This encoder is implemented as a net of 4 processes.
+// 1. Reduce stage - this process takes incoming symbols and symbol_valid
+// and reduces them into symbol count pairs. This stage is stateless.


isn't this stage equivalent to the current rle_enc? can the implementation be shared?

I can't see any obvious code that can be shared between the reduce step and rle_enc.
The main difference is that the rle_enc code has to take its state into account and update it if necessary,
while the reduce step doesn't have one.

proppy · 2023-08-24T15:03:41Z

+// 1. Reduce stage - this process takes incoming symbols and symbol_valid
+// and reduces them into symbol count pairs. This stage is stateless.
+// 2. Realign stage - this process moves pairs emitted from previous stage
+// so that they are align to the left, it also calculates propagation distance


can you give an example to show what 'align to the left' refers to?

Sure, let the input to step 1 be [A, A, B, B].
The output from the first step will be [..., (A,2), ..., (B,2)].
Step 2 will transform it to [(A,2), (B,2), ..., ...].

see the comment below; can you also explain what ... refers to?

Here ... stands for an invalid (symbol, counter) pair.
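
To make the reduce and realign examples concrete, here is a rough Python model of the two steps. It is a sketch under assumptions, not the DSLX implementation: `reduce_step` and `realign_step` are hypothetical names, all input symbols are treated as valid, the counter is assumed to be 2 bits wide, and the slot in which a pair lands (the position of the last symbol of its run) is inferred from the examples in this thread. `None` plays the role of `...`.

```python
def reduce_step(symbols, count_width=2):
    """Collapse runs inside ONE transaction; no state is kept between calls."""
    max_count = (1 << count_width) - 1  # assumption: 2-bit counter -> max 3
    out = [None] * len(symbols)         # None marks an invalid pair ("...")
    count = 0
    for i, sym in enumerate(symbols):
        count += 1
        run_continues = i + 1 < len(symbols) and symbols[i + 1] == sym
        if run_continues and count < max_count:
            continue                    # keep counting, this slot stays invalid
        out[i] = (sym, count)           # run ended or counter is maxed out
        count = 0
    return out


def realign_step(pairs):
    """Move valid pairs to the left, keeping their order; pad with invalid slots."""
    valid = [p for p in pairs if p is not None]
    return valid + [None] * (len(pairs) - len(valid))


reduced = reduce_step(["A", "A", "B", "B"])
print(reduced)                # [None, ('A', 2), None, ('B', 2)]
print(realign_step(reduced))  # [('A', 2), ('B', 2), None, None]
```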

proppy · 2023-08-24T15:03:51Z

+// 1. Reduce stage - this process takes incoming symbols and symbol_valid
+// and reduces them into symbol count pairs. This stage is stateless.
+// 2. Realign stage - this process moves pairs emitted from previous stage
+// so that they are align to the left, it also calculates propagation distance


can you give an example of a propagation distance?

Let's assume that the count is 2 bits wide and the input can ingest 4 symbols at once.
Let the input be [A,A,A,A]; after the second stage it will look like
[(A,3), (A,1), ..., ...] and the propagation distance will be 1.
For input [A,B,B,B] the second step will emit [(A,1), (B,3), ..., ...] with propagation distance 0.

see the comment below; can you update the comment with a textual explanation of what the propagation distance refers to? (here it seems that it mainly indicates whether a packet has been broken or not)

proppy · 2023-08-24T15:04:26Z

+// 2. Realign stage - this process moves pairs emitted from previous stage
+// so that they are align to the left, it also calculates propagation distance
+// for the first pair.
+// 3. Core stage - this stage is stateful. It takes align pairs,


can you document what the state is?

proppy · 2023-08-24T15:05:04Z

+// so that they are align to the left, it also calculates propagation distance
+// for the first pair.
+// 3. Core stage - this stage is stateful. It takes align pairs,
+// and combines them with its state.It outputs multiple symbol/count pairs.


can you give an example of the output to show how it different from the received input?

proppy · 2023-08-24T15:05:32Z

+// for the first pair.
+// 3. Core stage - this stage is stateful. It takes align pairs,
+// and combines them with its state.It outputs multiple symbol/count pairs.
+// 4 - Adjust Width stage - this stage takes output from the core stage.


proppy · 2023-08-24T15:06:16Z

+// This encoder is implemented as a net of 4 processes.
+// 1. Reduce stage - this process takes incoming symbols and symbol_valid
+// and reduces them into symbol count pairs. This stage is stateless.
+// 2. Realign stage - this process moves pairs emitted from previous stage


does it take input from the preceding "stage"? if yes, maybe we should state it explicitly.

proppy · 2023-08-24T15:06:57Z

+// 2. Realign stage - this process moves pairs emitted from previous stage
+// so that they are align to the left, it also calculates propagation distance
+// for the first pair.
+// 3. Core stage - this stage is stateful. It takes align pairs,


does it take input from the preceding "stage"? if yes, maybe we should state it explicitly.

proppy · 2023-08-24T15:07:45Z

+// 3. Core stage - this stage is stateful. It takes align pairs,
+// and combines them with its state.It outputs multiple symbol/count pairs.
+// 4 - Adjust Width stage - this stage takes output from the core stage.
+// If output can handle more or equal number of pairs as


How does the "stage" know that the "output can handle more or equal the number of pairs"? Can you give an example?

Output and input widths are parametrized by INPUT_WIDTH and OUTPUT_WIDTH.
If OUTPUT_WIDTH is greater than or equal to INPUT_WIDTH, the adjust width step has nothing to adjust.

proppy · 2023-08-24T15:08:29Z

+// If output can handle more or equal number of pairs as
+// input number of symbols. This stage does nothing.
+// If the output is narrower than the input,
+// this stage will serialize symbol counter pairs.


this stage will serialize symbol counter pairs.

as in, it will split them to match the size of the desired output? can you give an example?

Let the input width be 4 and the output width be 2. The core step can emit the following: [(A,1), (B,1), (A,1), (B,1)].
The Adjust Width step will serialize it to [(A,1), (B,1)], [(A,1), (B,1)]
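
The serialization can be sketched in a few lines of Python. This is a behavioural model only: `adjust_width` is a hypothetical name, invalid pairs are represented by `None`, and the padding of a partially filled output transaction is left out.

```python
def adjust_width(pairs, output_width):
    """Split one core-step payload into transactions of at most `output_width` pairs."""
    valid = [p for p in pairs if p is not None]  # drop invalid ("...") slots
    return [valid[i:i + output_width] for i in range(0, len(valid), output_width)]


core_output = [("A", 1), ("B", 1), ("A", 1), ("B", 1)]
print(adjust_width(core_output, 2))  # [[('A', 1), ('B', 1)], [('A', 1), ('B', 1)]]
print(adjust_width(core_output, 4))  # one transaction: nothing to adjust
```

When the output width is greater than or equal to the input width, the payload always fits into a single transaction, which corresponds to the case where the stage does nothing.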

mtdudek · 2023-08-25T15:40:54Z

[(2,A), (1,B)], [(3,C), (1,C)], [(1,B)]

Curious why there are three transactions here, and two separate packets for C?

This assumed that the RLE counter is 2 bits wide, so it can count at most 3 symbols, while C was repeated 4 times.

proppy · 2023-08-31T15:02:07Z

 pub proc RunLengthDecoder<SYMBOL_WIDTH: u32, COUNT_WIDTH: u32> {
-  input_r: chan<DecInData<SYMBOL_WIDTH, COUNT_WIDTH>> in;
-  output_s: chan<DecOutData<SYMBOL_WIDTH>> out;
+  input_r: chan<DecInData<SYMBOL_WIDTH, COUNT_WIDTH, 1>> in;


I wonder if it would be more friendly to parameterize with max_count and use https://google.github.io/xls/dslx_std/#stdclog2 + 1 to represent the width?

As far as I remember, there are parts of the code that assume max_count to be 2^k-1.
I'm not sure when a user would want to set max_count to something other than a power of 2 minus 1.

proppy · 2023-09-19T08:00:56Z

+// The behavior of the encoder is presented on the waveform below:
+
+// This encoder is implemented as a net of 4 proc.
+// 1. Reduce proc - this process takes incoming symbols and symbol_valid


is 1. effectively equivalent to the existing RLE enc implementation?

It's not exactly the same thing as the existing RLE enc. The existing RLE has state that is combined with incoming data;
this stage doesn't have state, it only works on the data visible in a single channel payload and doesn't include stream history.

proppy · 2023-09-19T08:06:38Z

+// distance for the first pair.
+// Example behaviours:
+// 1)
+//    input:  [.., (A, 2), .., (B, 2)]


what does .. mean here?

It marks invalid data and invalid (data, count) pairs.

proppy · 2023-09-19T08:08:01Z

+// and reduces them into symbol count pairs. This step is stateless.
+// 2. Realign proc - this process moves pairs emitted from the reduce step
+// so that they are aligned to the left, it also calculates propagation
+// distance for the first pair.


can you also explain what "propagation distance" means here? (it's not clear, at least to me, from just reading the example)

Here the propagation distance is the number of consecutive pairs, starting from the first pair, that have the same symbol value as the first one and have their counter maxed out. This way the 'Core proc' only needs to check the symbol value and counter of the "propagation distance"th pair.

If the value differs from the internal state, then all pairs in [0:"propagation distance") will differ and the core should combine the payload with its state.

If the value is equal to the internal state, then the "Core proc" only checks whether the internal state counter + the "propagation distance"th pair counter overflows.
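
The definition above can be written down as a short Python function. It is a sketch, not the PR's DSLX code: `propagation_distance` is a hypothetical name, pairs are `(symbol, count)` tuples with `None` for invalid slots, and the counter is assumed to be 2 bits wide as in the other examples.

```python
def propagation_distance(aligned, count_width=2):
    """Count leading pairs that share the first pair's symbol and have a maxed-out counter."""
    max_count = (1 << count_width) - 1  # assumption: 2-bit counter -> max 3
    if not aligned or aligned[0] is None:
        return 0
    first_symbol = aligned[0][0]
    distance = 0
    for pair in aligned:
        if pair is None or pair[0] != first_symbol or pair[1] != max_count:
            break                       # the chain of saturated pairs ends here
        distance += 1
    return distance


print(propagation_distance([("A", 3), ("A", 1), None, None]))  # 1
print(propagation_distance([("A", 1), ("B", 3), None, None]))  # 0
```

Both results match the examples given earlier for inputs [A,A,A,A] and [A,B,B,B].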

proppy · 2023-09-19T08:09:49Z

+//    output: [(A, 2), (B, 2), .., ..]
+//    propagation distance: 0
+// 2)
+//    input:  [.., .., (A, 3), (A, 1)]


(A, 3), (A, 1) can this actually happen? wouldn't the previous step simply emit (A, 4)?

In the examples I was assuming counter 2 bit wide, so that's why there is (A, 3), (A, 1) and not (A, 4).

proppy · 2023-09-19T08:16:48Z

+// 3. Core proc - this step is stateful. It takes align pairs from
+// the realign step, and combines them with its state to create multiple
+// symbol/count pairs output. State is represented by following tuple
+// `<symbol, count, last>`. It contains symbol and count from last pair


does it effectively mean that we're counting again? is it right to assume that if each transaction only includes one symbol this does nothing?

This is the first proc with state that holds stream history. Suppose a stream has only one data value in all symbols.
The first processing stage would reduce the incoming data stream into (data_value, input_count); this stage would combine this pair with the one that preceded it, and only emit an output pair when either the internal counter overflows or the stream ends.
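
A rough behavioural model of how the core step combines incoming pairs with its state is sketched below in Python. Several things here are assumptions rather than facts about the PR: the class name `CoreModel`, the choice to emit a saturated pair and keep the remainder in the state on overflow, and the omission of the propagation-distance shortcut, which only speeds up the same decision in hardware. The counter is assumed to be 2 bits wide.

```python
class CoreModel:
    """Keeps the last (symbol, count) pair as state and merges new pairs into it."""

    def __init__(self, count_width=2):
        self.max_count = (1 << count_width) - 1  # assumption: 2-bit counter
        self.state = None                        # pair carried between transactions

    def step(self, aligned_pairs, last=False):
        emitted = []
        for pair in aligned_pairs:
            if pair is None:                     # invalid slot, nothing to merge
                continue
            sym, cnt = pair
            if self.state is None:
                self.state = (sym, cnt)
            elif self.state[0] != sym:           # symbol changed: flush the state
                emitted.append(self.state)
                self.state = (sym, cnt)
            else:
                total = self.state[1] + cnt
                if total > self.max_count:       # counter overflow: emit a full pair
                    emitted.append((sym, self.max_count))
                    self.state = (sym, total - self.max_count)
                else:
                    self.state = (sym, total)
        if last and self.state is not None:      # end of stream: flush the state
            emitted.append(self.state)
            self.state = None
        return emitted


core = CoreModel()
print(core.step([("A", 2), ("B", 1), ("C", 1), None]))            # [('A', 2), ('B', 1)]
print(core.step([("C", 3), ("B", 1), None, None], last=True))     # [('C', 3), ('C', 1), ('B', 1)]
```

The two calls model the AABC and CCCB transactions from the PR description after the reduce and realign steps. The pair (C, 1) left in the state after the first call is what gets combined with (C, 3) from the second one.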

proppy · 2023-09-21T16:01:18Z

+|||
+|-----|-------|
+|input|output |
+|[(A, True), (A, True), (A, True), (A, True)]|[.., .., (A, 3), (A, 1)]|


is the boolean after the symbol in the example corresponding to !last?

Thanks for catching this. I forgot to put last in this example.
The boolean value next to the symbol is the data valid value.
I'll update this example to include the last value.

proppy · 2023-09-21T16:09:54Z

+repeating symbols to the output stream that contains the symbols and
+the number of its consecutive occurrences in the input stream.
+
+Overall, we can break down the data processing into four stages: reduction, alignment, compression, and output generation. The division of responsibility allowed the specialized blocks to efficiently process data and made it possible test each functionality separately.


did we measure the overhead of splitting the process into different procs in terms of additional registers and wires, compared to a single proc implementation?

I haven't checked how many more resources the multi-process implementation uses compared to a single-process implementation, or how the max frequency is affected by this.

proppy · 2023-09-21T16:12:16Z

+
+### Process
+1. Reduce step - this process takes incoming symbols and symbol_valid
+and reduces them into symbol count pairs. This step is stateless.


couldn't stateless proc be in theory implemented as fn?

All stateless processes are effectively fns, but as far as I know you can't put recv/send operations into fns.
Using a fn would potentially reduce fmax, as the first proc would have to do all the processing before it could update its state, or we would reduce pipeline throughput when using II>1 to improve fmax.

proppy · 2023-09-21T16:21:44Z

+## Encoder processing pipeline detailed breakdown.
+
+### Initial conditions
+- input width is 4 symbols wide,


does this effectively assume that the input bit-width is always a multiple of the symbol size? i.e. you couldn't have a block with a u9 input bus that processes repeating u2 symbols?

Yes, the implementation assumes that the input is always x symbols wide.
Handling the u9 to u2 conversion should be done by an additional proc,
as it will know how to handle that extra bit:
whether that bit is part of a symbol that was split on the u9 boundary, or has some other meaning.

This commit adds valid signals to PlainData interface. CompressedData interface uses `count > 0` to define valid symbol count pair. This is in preparation for a multisymbol RLE encoder implementation. Signed-off-by: Maciej Dudek <mdudek@antmicro.com>

This encoder is capable of ingesting multiple symbols and produces multiple compressed pairs. It should offer faster compression in exchange for area used. Signed-off-by: Maciej Dudek <mdudek@antmicro.com>

Signed-off-by: Maciej Dudek <mdudek@antmicro.com>

proppy · 2024-04-03T08:19:17Z

Should we close this in favor of the work on #1211 ?

proppy · 2024-08-19T04:24:33Z

@tmichalak closing this now that the more relevant #1315 has been merged.

proppy changed the title ~~Add mutlisymbol RLE implementation~~ Add multisymbol RLE implementation Jul 19, 2023

mtdudek mentioned this pull request Jul 19, 2023

modules/rle: consider offseting symbol count by one #1070

Open

mtdudek force-pushed the MultiSymbolRLE branch 2 times, most recently from 708cfaa to 7d9f08b Compare July 20, 2023 08:55

mtdudek force-pushed the MultiSymbolRLE branch from 7d9f08b to bcd00e3 Compare July 25, 2023 09:43

mtdudek force-pushed the MultiSymbolRLE branch from bcd00e3 to c9daa7d Compare July 31, 2023 14:55