Monthly Archives: April 2011

Talk on Wednesday on Data Structures.

Next wednesday (May 4th) I’ll give a talk on a basic data structure problem at DTU (see the full abstract below). Students at 02283 have all the basic requirement to follow the talk (I’ll do lots of heavy path decompositions!) and I’ll keep it self-contained. You are all welcome. It’s good chance to see what kind of research is happening at the moment at DTU in algorithms and get inspiration for thesis topics.



Wednesday 4th May 2011 at 12:15
Room 030, Building 322, DTU Informatics

Random Access to Grammar Compressed Strings

Philip Bille,
Technical University of Denmark


Grammar based compression,  where one replaces a long string  by a small
context-free grammar that generates the string, is a simple and powerful
paradigm  that  captures  many   of  the  popular  compression  schemes,
including   the  Lempel-Ziv   family,  Run-Length   Encoding,  Byte-Pair
Encoding,  Sequitur, and  Re-Pair. In  this  paper, we  present a  novel
grammar  representation  that  allows  efficient random  access  to  any
character or substring  without decompressing the string.

Let $S$ be a string of length $N$ compressed into a context-free grammar
$\mathcal{S}$   of  size   $n$.  We   present  two   representations  of
$\mathcal{S}$   achieving $O(\log  N)$  random access  time, and  either
$O(n\cdot  \alpha_k(n))$  construction time  and  space  on the  pointer
machine model, or  $O(n)$ construction time and space on  the RAM. Here,
$\alpha_k(n)$  is  the  inverse  of  the  $k^{th}$  row  of  Ackermann’s
function. Our representations also  efficiently support decompression of
any substring in  $S$: we can decompress any substring  of length $m$ in
the  same complexity  as a  single  random access  query and  additional
$O(m)$  time.   Combining  these   results  with  fast   algorithms  for
uncompressed  approximate string  matching  leads  to several  efficient
algorithms for approximate string matching on grammar compressed strings
without  decompression.   For  instance,  we can  find  all  approximate
occurrences  of  a  pattern  $P$  with   at  most  $k$  errors  in  time
$O(n(\min\{|P|k, k^4  + |P|\} +  \log N) +  \occ)$, where $\occ$  is the
number of occurrences of $P$ in $S$. Finally,  we are able to generalize
our  results to  navigation and  other operations  on grammar-compressed
{\em trees}.

All of the  above bounds significantly improve the  currently best known
results. To  achieve these bounds,  we introduce several  new techniques
and  data structures  of independent  interest, including  a predecessor
data structure,  two “biased” weighted  ancestor data structures,  and a
compact representation of heavy-paths in grammars.

Joint work with: Gad M. Landau, Rajeev Raman, Kunihiko Sadakane,
Srinivasa Rao Satti, and Oren Weimann