Calgary Corpus
Ian H. Witten and Timothy C. Bell compiled the so-called "Calgary Text Compression Corpus" and first published it in 1989. The large version consists of 18 files representing 9 different data types.
All text files are in English and encoded in the ASCII character set. Despite its name, the "Calgary Text Compression Corpus" also contains machine code, scientific data, and graphic data (about 27% of the total size).
| File | Size (bytes) | Contents |
|---|---|---|
| bib | 111,261 | structured text (bibliography), structure well-suited to import data into a database |
| book1 | 768,771 | text, novel |
| book2 | 610,856 | formatted text, scientific |
| geo | 102,400 | geophysical data |
| news | 377,109 | formatted text, script with news |
| obj1 | 21,504 | program code (object file), executable machine code |
| obj2 | 246,814 | program code (object file), executable machine code |
| paper1 | 53,161 | formatted text, scientific |
| paper2 | 82,199 | formatted text, scientific |
| paper3 | 46,526 | formatted text, scientific |
| paper4 | 13,286 | formatted text, scientific |
| paper5 | 11,954 | formatted text, scientific |
| paper6 | 38,105 | formatted text, scientific |
| pic | 513,216 | image data (black and white) |
| progc | 39,611 | source code |
| progl | 71,646 | source code |
| progp | 49,379 | source code |
| trans | 93,695 | transcript terminal data |
| | 3,251,493 | Sum |
| | 3,265,024 | TAR |
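The sum row and the share of roughly 27% for non-text data can be checked directly from the listed sizes. The following is a minimal Python sketch. It assumes that "non-text" means the four files geo, obj1, obj2 and pic; that grouping is an assumption and is not stated explicitly in the source.

```python
# File sizes in bytes, copied from the table of the Calgary Corpus.
SIZES = {
    "bib": 111261, "book1": 768771, "book2": 610856, "geo": 102400,
    "news": 377109, "obj1": 21504, "obj2": 246814, "paper1": 53161,
    "paper2": 82199, "paper3": 46526, "paper4": 13286, "paper5": 11954,
    "paper6": 38105, "pic": 513216, "progc": 39611, "progl": 71646,
    "progp": 49379, "trans": 93695,
}

# Assumption: these four files are the machine code, scientific and graphic data.
NON_TEXT = ("geo", "obj1", "obj2", "pic")

total = sum(SIZES.values())
non_text = sum(SIZES[name] for name in NON_TEXT)
share = 100.0 * non_text / total

print(f"files:    {len(SIZES)}")
print(f"total:    {total} bytes")
print(f"non-text: {non_text} bytes ({share:.1f}%)")
```

Under this grouping the total comes to 3,251,493 bytes and the non-text share to about 27.2%, which agrees with the figures given above.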
The Calgary Corpus has since become a quasi-standard for comparing lossless compression methods and formats. Its name is derived from the University of Calgary, where one of the authors, Ian Witten, was employed at the time.
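A comparison of this kind usually reports the compressed size relative to the input size. The sketch below is one possible way to do that in Python with the standard-library codecs zlib, bz2 and lzma. The choice of codecs, the helper name `bits_per_byte` and the stand-in sample data are all assumptions made for illustration; they are not part of the corpus or of any published comparison.

```python
import bz2
import lzma
import zlib

# Standard-library codecs, used here only as examples of lossless methods.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def bits_per_byte(data: bytes) -> dict:
    """Return the compressed size in bits per input byte for each codec."""
    if not data:
        raise ValueError("input must not be empty")
    return {name: 8.0 * len(compress(data)) / len(data)
            for name, compress in CODECS.items()}


# Stand-in input so the sketch runs without the corpus files.
sample = b"the quick brown fox jumps over the lazy dog\n" * 200

for name, value in bits_per_byte(sample).items():
    print(f"{name}: {value:.3f} bits per byte")
```

To apply this to a corpus file, the sample can be replaced by the file's contents, for example `pathlib.Path("book1").read_bytes()`, assuming the file has been downloaded into the working directory (a hypothetical location).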
External Links:
Download University of Calgary (FTP)