/* lComp file compressor/archiver. Release by Andreas Morphis, Aug. 22, 2008
Copyright (C) 2009 Larry Mugim
LICENSE
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of
the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
Visit <http://www.gnu.org/copyleft/gpl.html>.
To install and use in Windows:
- To install, put lComp.exe or a shortcut to it on your desktop.
- To compress a file or folder, drop it on the lComp icon.
- To decompress, drop a .lComp file on the icon.
A .lComp extension is added for compression, removed for decompression.
The output will go in the same folder as the input.
While lComp is working, a command window will appear and report
progress. When it is done you can close the window by pressing
ENTER or clicking [X].
COMMAND LINE INTERFACE
- To install, put lComp.exe somewhere in your PATH.
- To compress: lComp [-N] file1 [file2...]
- To decompress: lComp [-d] file1.lComp [dir2]
- To view contents: more < file1.lComp
The compressed output file is named by adding ".lComp" extension to
the first named file (file1.lComp). Each file that exists will be
added to the archive and its name will be stored without a path.
The option -N specifies a compression level ranging from -0
(fastest) to -9 (smallest). The default is -5. If there is
no option and only one file, then the program will pause when
finished until you press the ENTER key (to support drag and drop).
If file1.lComp exists then it is overwritten.
If the first named file ends in ".lComp" then it is assumed to be
an archive and the files within are extracted to the same directory
as the archive unless a different directory (dir2) is specified.
The -d option forces extraction even if there is not a ".lComp"
extension. If any output file already exists, then it is compared
with the archive content and the first byte that differs is reported.
No files are overwritten or deleted. If there is only one argument
(no -d or dir2) then the program will pause when finished until
you press ENTER.
For compression, if any named file is actually a directory, then all
files and subdirectories are compressed, preserving the directory
structure, except that empty directories are not stored, and file
attributes (timestamps, permissions, etc.) are not preserved.
During extraction, directories are created as needed. For example:
lComp -4 c:\tmp\foo bar
compresses foo and bar (if they exist) to c:\tmp\foo.lComp at level 4.
lComp -d c:\tmp\foo.lComp .
extracts foo and compares bar in the current directory. If foo and bar
are directories then their contents are extracted/compared.
There are no commands to update an existing archive or to extract
part of an archive. Files and archives larger than 2GB are not
supported (but might work on 64-bit machines, not tested).
File names with nonprintable characters are not supported (spaces
are OK).
TO COMPILE
There are 2 files: lComp.cpp (C++) and paq7asm.asm (NASM/YASM).
paq7asm.asm is the same as in paq7 and paq8x. lComp.cpp recognizes the
following compiler options:
-DWINDOWS (to compile in Windows)
-DUNIX (to compile in Unix, Linux, Solaris, MacOS/Darwin, etc)
-DNOASM (to replace paq7asm.asm with equivalent C++)
-DDEFAULT_OPTION=N (to change the default compression level from 5 to N).
If you compile without -DWINDOWS or -DUNIX, you can still compress files,
but you cannot compress directories or create them during extraction.
You can extract directories if you manually create the empty directories
first.
Use -DDEFAULT_OPTION=N to change the default compression level to support
drag and drop on machines with less than 256 MB of memory. Use
-DDEFAULT_OPTION=4 for 128 MB, 3 for 64 MB, 2 for 32 MB, etc.
Use -DNOASM for non x86-32 machines, or older than a Pentium-MMX (about
1997), or if you don't have NASM or YASM to assemble paq7asm.asm. The
program will still work but it will be slower. For NASM in Windows,
use the options "--prefix _" and either "-f win32" or "-f obj" depending
on your C++ compiler. In Linux, use "-f elf".
Recommended compiler commands and optimizations:
MINGW g++:
nasm paq7asm.asm -f win32 --prefix _
g++ lComp.cpp -DWINDOWS -O2 -Os -s -march=pentiumpro -fomit-frame-pointer -o lComp.exe paq7asm.obj
Borland:
nasm paq7asm.asm -f obj --prefix _
bcc32 -DWINDOWS -O -w-8027 lComp.cpp paq7asm.obj
Mars:
nasm paq7asm.asm -f obj --prefix _
dmc -DWINDOWS -Ae -O lComp.cpp paq7asm.obj
UNIX/Linux (PC):
nasm -f elf paq7asm.asm
g++ lComp.cpp -DUNIX -O2 -Os -s -march=pentiumpro -fomit-frame-pointer -o lComp paq7asm.o
Non PC (e.g. PowerPC under MacOS X)
g++ lComp.cpp -O2 -DUNIX -DNOASM -s -o lComp
MinGW produces faster executables than Borland or Mars, but Intel 9
is about 4% faster than MinGW.
ARCHIVE FILE FORMAT
An archive has the following format. It is intended to be both
human and machine readable. The header ends with CTRL-Z (Windows EOF)
so that the binary compressed data is not displayed on the screen.
lComp -N CR LF
size TAB filename CR LF
size TAB filename CR LF
...
CTRL-Z
compressed binary data
-N is the option (-0 to -9), even if a default was used.
Plain file names are stored without a path. Files in compressed
directories are stored with path relative to the compressed directory
(using UNIX style forward slashes "/"). For example, given these files:
123 C:\dir1\file1.txt
456 C:\dir2\file2.txt
Then
lComp archive \dir1\file1.txt \dir2
will create archive.lComp with the header:
lComp -5
123 file1.txt
456 dir2/file2.txt
The command:
lComp archive.lComp C:\dir3
will create the files:
C:\dir3\file1.txt
C:\dir3\dir2\file2.txt
Decompression will fail if the first 7 bytes are not "lComp -". Sizes
are stored as decimal numbers. CR, LF, TAB, CTRL-Z are ASCII codes
13, 10, 9, 26 respectively.
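The header layout above can be sketched in C++ (an illustration only;
writeHeader is a hypothetical name, not the program's actual I/O code, and
the file names are assumed to be already-relative, forward-slash paths):

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Write the archive header: "lComp -N CR LF", one "size TAB name CR LF"
// line per file, then CTRL-Z (ASCII 26) so the binary data that follows
// is not displayed by "more".
void writeHeader(FILE* f, int level,
                 const std::vector<std::pair<long, std::string>>& files) {
  std::fprintf(f, "lComp -%d\r\n", level);
  for (const auto& p : files)
    std::fprintf(f, "%ld\t%s\r\n", p.first, p.second.c_str());
  std::fputc(26, f);  // CTRL-Z ends the human-readable header
  // compressed binary data follows
}
```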
ARITHMETIC CODING
The binary data is arithmetic coded as the shortest base 256 fixed point
number x = SUM_i x_i 256^(-i-1) such that p(<y) <= x < p(<=y), where y is the
input string, x_i is the i'th coded byte, p(<y) (and p(<=y)) means the
probability that a string is lexicographically less than (less than
or equal to) y according to the model, _ denotes subscript, and ^ denotes
exponentiation.
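The coding loop can be sketched as a carry-free binary range coder of the
kind used throughout the paq family (a minimal sketch only: the 32-bit
width and byte-at-a-time renormalization are assumptions, and a real coder
needs a matching decoder and a more careful flush):

```cpp
#include <cstdint>
#include <vector>

// Minimal binary arithmetic coder sketch: the interval [x1, x2] is split
// in proportion to the predicted probability p1 of a 1 bit, and leading
// bytes are emitted as soon as x1 and x2 agree on them.
struct Encoder {
  uint32_t x1 = 0, x2 = 0xffffffff;  // current coding interval
  std::vector<uint8_t> out;          // emitted bytes x_0, x_1, ...

  void encode(int y, double p1) {
    // Split point: the top p1 fraction of the interval codes a 1 bit.
    uint32_t xmid = x1 + (uint32_t)((x2 - x1) * p1);
    if (y) x2 = xmid; else x1 = xmid + 1;
    // Renormalize: shift out bytes on which x1 and x2 already agree.
    while ((x1 >> 24) == (x2 >> 24)) {
      out.push_back((uint8_t)(x1 >> 24));
      x1 <<= 8;
      x2 = (x2 << 8) | 255;
    }
  }

  // Emit one more byte of x1; a full flush writes enough bytes to let the
  // decoder disambiguate (details omitted in this sketch).
  void flush() { out.push_back((uint8_t)(x1 >> 24)); }
};
```

With a good model the output length approaches the coding cost: 100 one
bits coded at p1 = 0.9 cost about -100 log2(0.9) = 15 bits, so only a few
bytes are emitted.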
The model p(y) for y is a conditional bit stream,
p(y) = PROD_j p(y_j | y_0..j-1) where y_0..j-1 denotes the first j
bits of y, and y_j is the next bit. Compression depends almost entirely
on the ability to predict the next bit accurately.
MODEL MIXING
lComp uses a neural network to combine a large number of models. The
i'th model independently predicts
p1_i = p(y_j = 1 | y_0..j-1), p0_i = 1 - p1_i.
The network computes the next bit probability
p1 = squash(SUM_i w_i t_i), p0 = 1 - p1 (1)
where t_i = stretch(p1_i) is the i'th input, p1_i is the prediction of
the i'th model, p1 is the output prediction, stretch(p) = ln(p/(1-p)),
and squash(s) = 1/(1+exp(-s)). Note that squash() and stretch() are
inverses of each other.
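In C++ these are simply the logit and logistic functions (a floating-point
sketch; the program itself uses fixed-point lookup tables):

```cpp
#include <cmath>

// stretch(p) = ln(p/(1-p)): maps a probability in (0,1) to (-inf, +inf).
double stretch(double p) {
  return std::log(p / (1.0 - p));
}

// squash(s) = 1/(1+exp(-s)): the inverse of stretch().
double squash(double s) {
  return 1.0 / (1.0 + std::exp(-s));
}
```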
After bit y_j (0 or 1) is received, the network is trained:
w_i := w_i + eta t_i (y_j - p1) (2)
where eta is an ad-hoc learning rate, t_i is the i'th input, (y_j - p1)
is the prediction error for the j'th input bit, and w_i is the i'th
weight. Note that this differs from back propagation:
w_i := w_i + eta t_i (y_j - p1) p0 p1 (3)
which is a gradient descent in weight space to minimize root mean square
error. Rather, the goal in compression is to minimize coding cost,
which is -log(p0) if y = 1 or -log(p1) if y = 0. Taking
the partial derivative of cost with respect to w_i yields (2).
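Equations (1) and (2) amount to a small online logistic-regression step,
sketched here (Mixer and eta are illustrative names; the real code uses
fixed-point arithmetic, and the eta value below is an assumption):

```cpp
#include <cmath>
#include <vector>

struct Mixer {
  std::vector<double> w;  // weights w_i
  std::vector<double> t;  // stretched inputs t_i = stretch(p1_i)
  double eta = 0.002;     // ad-hoc learning rate (illustrative value)

  // Equation (1): p1 = squash(SUM_i w_i t_i)
  double predict() const {
    double s = 0;
    for (size_t i = 0; i < w.size(); ++i) s += w[i] * t[i];
    return 1.0 / (1.0 + std::exp(-s));
  }

  // Equation (2): w_i := w_i + eta t_i (y - p1), after bit y is seen.
  void train(int y) {
    double err = y - predict();
    for (size_t i = 0; i < w.size(); ++i) w[i] += eta * t[i] * err;
  }
};
```

Note there is no p0 p1 factor in train(), which is exactly the difference
from back propagation noted in (3).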
MODELS
Most models are context models. A function of the context (last few
bytes) is mapped by a lookup table or hash table to a state which depends
on the bit history (prior sequence of 0 and 1 bits seen in this context).
The bit history is then mapped to p1_i by a fixed or adaptive function.
There are several types of bit history states:
- Run Map. The state is (b,n) where b is the last bit seen (0 or 1) and
n is the number of consecutive times this value was seen. The initial
state is (0,0). The output is computed directly:
t_i = (2b - 1)K log(n + 1).
where K is ad-hoc, around 4 to 10. When bit y_j is seen, the state
is updated:
(b,n) := (b,n+1) if y_j = b, else (y_j,1).
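A run map can be sketched directly from these rules (RunMap is an
illustrative name; K = 8 is one value in the stated 4-10 range):

```cpp
#include <cmath>

// Run map: state (b, n) where b is the last bit seen and n is the number
// of consecutive times it was seen. Initial state is (0, 0).
struct RunMap {
  int b = 0, n = 0;
  double K = 8.0;  // ad-hoc constant, around 4 to 10

  // t_i = (2b - 1) K log(n + 1)
  double output() const {
    return (2 * b - 1) * K * std::log(n + 1.0);
  }

  // (b,n) := (b,n+1) if y = b, else (y,1)
  void update(int y) {
    if (y == b) ++n;
    else { b = y; n = 1; }
  }
};
```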
- Stationary Map. The state is p, initially 1/2. The output is
t_i = stretch(p). The state is updated at ad-hoc rate K (around 0.01):
p := p + K(y_j - p)
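A stationary map is essentially a one-line exponential moving average
(sketch; StationaryMap is an illustrative name):

```cpp
// Stationary map: state p, initially 1/2, updated at ad-hoc rate K.
struct StationaryMap {
  double p = 0.5;
  double K = 0.01;                          // rate from the text
  void update(int y) { p += K * (y - p); }  // p := p + K(y_j - p)
};
```

Each update moves p a fixed fraction of the way toward the observed bit,
so old statistics decay geometrically rather than being discarded outright.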
- Nonstationary Map. This is a compromise between a stationary map, which
assumes uniform statistics, and a run map, which adapts quickly by
discarding old statistics. An 8 bit state represents (n0,n1,h), initially
(0,0,0) where:
n0 is the number of 0 bits seen "recently".
n1 is the number of 1 bits seen "recently".
n = n0 + n1.
h is the full bit history for 0 <= n <= 4,
the last bit seen (0 or 1) if 5 <= n <= 15,
0 for n >= 16.
The primary output is t_i := stretch(sm(n0,n1,h)), where sm(.) is
a stationary map with K = 1/256, initialized to
sm(n0,n1,h) = (n1+(1/64))/(n+2/64). Four additional inputs are also
computed to improve compression slightly:
p1_i = sm(n0,n1,h)
p0_i = 1 - p1_i
t_i := stretch(p1_i)
t_i+1 := K1 (p1_i - p0_i)
t_i+2 := K2 stretch(p1_i) if n0 = 0, -K2 stretch(p1_i) if n1 = 0, else 0
t_i+3 := K3 (-p0_i if n1 = 0, p1_i if n0 = 0, else 0)
t_i+4 := K4 (-p0_i if n0 = 0, p1_i if n1 = 0, else 0)
where K1..K4 are ad-hoc constants.
h is updated as follows:
If n < 4, append y_j to h.
Else if n <= 16, set h := y_j.
Else h = 0.
The update rule is biased toward newer data in a way that allows
n0 or n1, but not both, to grow large by discarding counts of the
opposite bit. Large counts are incremented probabilistically.
Specifically, when y_j = 0 then the update rule is:
n0 := n0 + 1, n0 < 29
n0 + 1 with probability 2^(27-n0)/2 else n0, 29 <= n0 < 41
n0, n0 = 41.
n1 := n1, n1 <= 5
round(8/3 lg n1), if n1 > 5
swapping (n0,n1) when y_j = 1.
Furthermore, to allow an 8 bit representation for (n0,n1,h), states
exceeding the following values of n0 or n1 are replaced with the
state with the closest ratio n0:n1 obtained by decrementing the
smaller count: (41,0,h), (40,1,h), (12,2,h), (5,3,h), (4,4,h),
(3,5,h), (2,12,h), (1,40,h), (0,41,h). For example:
(12,2,1) 0-> (7,1,0) because there is no state (13,2,0).
- Match Model. The state is (c,b), initially (0,0), where c is 1 if
the context was previously seen, else 0, and b is the next bit in
this context. The prediction is:
t_i := (2b - 1)Kc log(m + 1)
where m is the length of the context. The update rule is c := 1,
b := y_j. A match model can be implemented efficiently by storing
input in a buffer and storing pointers into the buffer into a hash
table indexed by context. Then c is indicated by a hash table entry
and b can be retrieved from the buffer.
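The buffer-plus-hash-table scheme can be sketched at byte granularity (an
illustration only: the order-4 context, the std::unordered_map, and the
byte-level prediction are simplifications of the real bit-level model,
whose output is t_i := (2b - 1) K c log(m + 1) as above):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct MatchModel {
  std::vector<uint8_t> buf;                     // all input seen so far
  std::unordered_map<std::string, size_t> pos;  // context -> follower index
  static const size_t ORDER = 4;                // assumed context length

  // Predicted next byte b, or -1 if the context was not seen (c = 0).
  int predictByte() const {
    if (buf.size() < ORDER) return -1;
    std::string ctx(buf.end() - ORDER, buf.end());
    auto it = pos.find(ctx);
    if (it == pos.end()) return -1;
    return buf[it->second];  // byte that followed this context last time
  }

  void update(uint8_t byte) {
    if (buf.size() >= ORDER) {
      std::string ctx(buf.end() - ORDER, buf.end());
      pos[ctx] = buf.size();  // the byte being appended follows ctx
    }
    buf.push_back(byte);
  }
};
```

After seeing "abcdeabcd", the trailing context "abcd" was last followed by
'e', so the model predicts 'e'.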
CONTEXTS
High compression is achieved by combining a large number of contexts.
Most (not all) contexts start on a byte boundary and end on the bit
immediately preceding the predicted bit. The contexts below are
modeled with both a run map and a nonstationary map unless indicated.
- Order n. The last n bytes, up to about 16. For general purpose data.
Most of the compression occurs here for orders up to about 6.
An order 0 context includes only the 0-7 bits of the partially coded
byte and the number of these bits (255 possible values).
- Sparse. Usually 1 or 2 of the last 8 bytes preceding the byte containing
the predicted bit, e.g (2), (3),..., (8), (1,3), (1,4), (1,5), (1,6),
(2,3), (2,4), (3,6), (4,8). The ordinary order 1 and 2 context, (1)
or (1,2) are included above. Useful for binary data.
- Text. Contexts consist of whole words (a-z, converted to lower case
and skipping other values). Contexts may be sparse, e.g (0,2) meaning
the current (partially coded) word and the second word preceding the
current one. Useful contexts are (0), (0,1), (0,1,2), (0,2), (0,3),
(0,4). The preceding byte may or may not be included as context in the
current word.
- Formatted text. The column number (determined by the position of
the last linefeed) is combined with other contexts: the character to
the left and the character above it.
- Fixed record length. The record length is determined by searching for
byte sequences with a uniform stride length. Once this is found, then
the record length is combined with the context of the bytes immediately
preceding it and the corresponding byte locations in the previous
one or two records (as with formatted text).
- Context gap. The distance to the previous occurrence of the order 1
or order 2 context is combined with other low order (1-2) contexts.
- FAX. For 2-level bitmapped images. Contexts are the surrounding
pixels already seen. Image width is assumed to be 1728 bits (as
in calgary/pic).
- Image. For uncompressed 24-bit color BMP and TIFF images. Contexts
are the high order bits of the surrounding pixels and linear
combinations of those pixels, including other color planes. The
image width is detected from the file header. When an image is
detected, other models are turned off to improve speed.
- JPEG. Files are further compressed by partially uncompressing back
to the DCT coefficients to provide context for the next Huffman code.
Only baseline DCT-Huffman coded files are modeled. (This is about