The algorithm we present is a strategy for finding a good reference for Relative Lempel-Ziv (RLZ) by leveraging supermaximal repeats. Let
A supermaximal repeat (SMR) is a repeat that is not contained inside any other repeat. The notions of repeats and supermaximal repeats can naturally be extended to string sets. One way of doing this is to convert the set
We define k(S) for a string S as the following function:
- We define the set $SMR'$ as all the supermaximal repeats of $S$.
- We join (concatenate) each element of $SMR'$, while making sure that no suffix of $SMR'[k-1]$ is a prefix of $SMR'[k]$ for any $k$ in the interval $1 \le k \le len(SMR')$. For periodic strings, we write only the primitive root of a substring, i.e., $pr(xyxyxy) = xy$.
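The two steps above can be sketched in Python as follows. This is a minimal illustration for small inputs, not the repository's implementation: the function names (`supermaximal_repeats`, `primitive_root`, `overlap`, `k_reference`) are hypothetical, the repeats are found by brute force rather than with a suffix structure, the join order (lexicographic) is an assumption because the text does not fix one, and the suffix/prefix condition is interpreted as dropping the overlapping prefix of each element before appending it.

```python
def supermaximal_repeats(s: str) -> list[str]:
    """Brute-force SMRs: repeats not contained inside any other repeat."""
    counts: dict[str, int] = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            sub = s[i:j]
            counts[sub] = counts.get(sub, 0) + 1  # overlapping occurrences count
    repeats = {sub for sub, c in counts.items() if c >= 2}
    # Keep a repeat only if it is not a proper substring of another repeat.
    # Assumption: lexicographic order, since the text does not specify one.
    return sorted(r for r in repeats
                  if not any(r != other and r in other for other in repeats))


def primitive_root(s: str) -> str:
    """Shortest r with s == r * m for some m >= 1, e.g. pr(xyxyxy) = xy."""
    if not s:
        return s
    # The first occurrence of s inside s + s after index 0 is at the period.
    return s[:(s + s).find(s, 1)]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def k_reference(s: str) -> str:
    """Join the (primitive roots of the) SMRs of s, skipping overlaps."""
    out, prev = "", ""
    for r in map(primitive_root, supermaximal_repeats(s)):
        out += r[overlap(prev, r):]  # drop the prefix already covered by prev
        prev = r
    return out


if __name__ == "__main__":
    print(k_reference("abcdabcxbcd"))  # SMRs are "abc" and "bcd"
```

The brute-force enumeration is cubic or worse in the input length, so it only serves to make the definition concrete; a practical implementation would compute the repeats from a suffix array or a similar index.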
In other words,
As such, it is easy to notice that
We use the LZ77 (
To simplify notation, when writing
The bit-size of RLZ with the reference is:
We have observed experimentally (Figure 1) that the size of RLZ under this strategy has a global minimum, so we can optimize the reference with the combined bit size of RLZ + reference in mind. For that, we recurse until
For now, we do not build the Relative Lempel-Ziv data structure in memory, but we plan to.
Clone the repository with `git clone --recursive`. To build the code, run `make`. Then, save the file you want to encode in `data/` and run `./build/smr < data/file.txt`.
