Skip to content

jvtoppa/smr-kernel

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Super Maximal Repeats Kernel

The algorithm we present is a strategy for finding a good reference for Relative Lempel Ziv by leveraging Super Maximal Repeats. Let $S[1,n] \in \Sigma^n$ be a string.

A supermaximal repeat (SMR) is a repeat that is not contained inside any other repeat. The notions of repeats and supermaximal repeats can naturally be extended to string sets. One way of doing this is to convert the set ${S_1, ... , S_k}$ to the string $ S' = S_1 \$_1 S_2 \$_2 \dots \$_{k-1} S_k$, where $\$_1, ... , \$_k$ are all distinct and not part of the original alphabet. However, to construct a reference, we can define the following function:

We define k(S) for a string S as the following function:

  1. We define the set SMR' as all the Supermaximal Repeats of S
  2. We join (concatenate) each element of SMR', while making sure that no suffix of $SMR'[k-1]$ is a prefix of $SMR'[k]$ for any $k$ in the interval $1 <= k <= len(SMR')$. For periodic strings, we write only the primitive root of a substring i.e., $pr(xyxyxy) = xy$

In other words,

$$ \prod_{i=1}^{n}pr(SMR'_i) = k(S) $$

As such, it is easy to notice that $k(k(S), k(k(...(k(S))..)) => k^m(S) = \epsilon$, which corresponds to the m-th (last) recursion step of the kernelization process.

Algorithm

We use the LZ77 ($Z$) /RLZ ($\bar{Z}$) variant where each phrase is the shortest substring of length $\ell$ that does not appear before. Each phrase can therefore be encoded as a triple $(i,\ell-1,c)$ where $i$ is the pointer pointing to the source of the phrase, $\ell-1$ is the number of copied characters and c is the extra character. In practice, i and c are encoded with no compression, while $\ell$ is encoded with Elias-Fano.

To simplify notation, when writing $\log(x)$ we actually mean $max(1, ceil[\log_2(x)])$

The bit-size of RLZ with the reference is:

$$ R\log(\Sigma) + \bar{Z}\log(R) + \bar{Z}\log(n/\bar{z}) + \bar{Z}\log(\Sigma) + 2\bar{Z} = B_\bar{Z} $$

english.50mb (first 100k chars)

We have noticed experimentally (Figure 1) that there exists a global minimum for the size of RLZ with this strategy - as such, we can simply optimize our reference with the bitsize of RLZ + reference in mind. For that, we do a recursion call until $$B_\bar{Z}[\bar{Z}(k^{i+1}(S))] > B_\bar{Z}[RLZ[\bar{Z}(k^i(S))]$$ and simply use $\bar{Z}(k^i(S))$ as the chosen reference.

For now, we do not build the Relative Lempel Ziv data struture to memory, but we plan to.

Running the code

Clone the repository with git clone --recursive. To build the code, run make. Then, save the file you want to encode in data/ and run ./build/smr < data/file.txt.

About

string kernelization based on supermaximal repeats

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • C++ 53.3%
  • Python 43.1%
  • Makefile 3.6%