Could you give me some explain of LZ77 Algorithm?-CodePudding

I'm trying to learn LZ77 algorithm with my friend, and some case give us a confusion.

for example)

init

search buffer size: 7
look-ahead buffer size: 8
original string: abcabbcabbcabca

current window: abcabbc view: abbcabca

What I thought the LLD tuple is : Literal: 'a' Length: 4 Distance: 4

What my friend thought the LLD tuple is : Literal: 'c' Length: 6 Distance: 4

I thought the algorithm to find longest match string only checks in search buffer, but he says there is no limit to find match string.

Who's answer is correct?

CodePudding user response：

I can't give you a definitive answer, but I can show there's no reason LZ77 couldn't support "overly long" tuples.

I believe you're asking if the length component of a tuple can be larger than its distance component.

In other words, I believe you're asking which of the following streams of tuples would be produced:

( Length: 0, Distance: 0, Char: 'a' )
(0,0,'b')
(0,0,'c')
(2,3,'b')
(4,4,'c')^[1] Length capped to distance.
(2,4,'c')
(0,0,'a')

(0,0,'a')
(0,0,'b')
(0,0,'c')
(2,3,'b')
(7,4,'c')^[2] Length larger than distance.
(0,0,'a')

Obviously, the latter would produce better compression.

I can't give you a definitive answer, but I can show that the decoder can handle "overly long" tuples just as easily as the encoder can produce them. As such, there's no reason LZ77 couldn't support them.

Reconstructing the input from the first stream

(0,0,'a'): "" "a" ⇒ "a"
(0,0,'b'): "" "b" ⇒ "ab"
(0,0,'c'): "" "c" ⇒ "abc"
(2,3,'b'): "ab" "b" ⇒ "abcabb"
(4,4,'c'): "cabb" "c" ⇒ "abcabbcabbc"
(2,4,'c'): "ab" "c" ⇒ "abcabbcabbcabc"
(0,0,'a'): "" "a" ⇒ "abcabbcabbcabca"

Simple.

Reconstructing the input from the second stream

(0,0,'a'): "" "a" ⇒ "a"
(0,0,'b'): "" "b" ⇒ "ab"
(0,0,'c'): "" "c" ⇒ "abc"
(2,3,'b'): "ab" "b" ⇒ "abcabb"

(7,4,'c')

So far, we've produced abcabb. The tuple says the next 7 characters start 4 characters from the end of abcabb.

a b c c a b b _ _ _ _ _ _ _ c
      | | | | ^ ^ ^ ^ ^ ^ ^
      | | | | | | | | | | |
       -)-)-)-  | | | ? ? ?
         -)-)---  | |
           -)-----  |
             -------

We're missing three characters! Are are we? If it's all one buffer and we copy from left to right, we just have to keep going.

                   ------- 
                 -)-----  |
               -)-)---  | |
              | | |   | | |
              | | |   v v v
a b c c a b b _ _ _ _ _ _ _ c
      | | | | ^ ^ ^ ^
      | | | | | | | |
       -)-)-)-  | | |
         -)-)---  | |
           -)-----  |
             -------

So this works with no extra effort!

(7,4,'c'): "cabbcab" "c" ⇒ "abcabbcabbcabc"

(0,0,'a'): "" "a" ⇒ "abcabbcabbcabca"

You incorrectly said (4,4,'a').
You incorrectly said (6,4,'c').

CodePudding user response：

Your friend is right. A simpler example is that LZ77 accomplishes run-length coding with a distance 1 match, with length n-1 for a run of n copies of the same byte. (The first occurrence of the byte is a literal.)