Wednesday, October 20, 2010

Compression With C#, Part 3 - Archives

In this third and (for now) last post in the series "Compression with C#", I want to show one way to compress multiple files into a single archive and to unpack them again.
Having read part 1 and part 2 helps in understanding this post.
Packing multiple files into one archive requires a little trick in C#, because the gzip format by itself only supports compressing a single file. Collections of files are therefore usually first bundled into one file with tar and then compressed with gzip, which leads to the extension .tar.gz.
Although we cannot access the tar format directly in .NET, we can still compress multiple files: we simply treat the many files as one big file in which the individual files are separated by special marker strings. The big file is then compressed; on decompression it is split into the individual files again with the help of these markers.

The technique:
To separate the files, I write the following header in front of every file in the stream:
|*START*OF*HEADER*|*||SIZE_OF_FILE||NAME_OF_FILE*|*END*OF*HEADER*|
Problems could arise in the unlikely case that the content of a file resembles this structure. This can be solved, but for simplicity I will stick with this simple header.
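
As a small illustration of the header layout, the following sketch builds the header for one file. The marker strings are the ones shown above; the class and method names (HeaderSketch, BuildHeader) are hypothetical and only serve this example.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class HeaderSketch
{
    // Marker strings as described in the post.
    const string HeaderStart = "|*START*OF*HEADER*|*"; // 20 characters
    const string Separator = "||";                     // 2 characters
    const string HeaderEnd = "*|*END*OF*HEADER*|";     // 18 characters

    // Hypothetical helper: returns the ASCII bytes of the header for one file.
    public static byte[] BuildHeader(string fileName, long fileSize)
    {
        string header = HeaderStart
            + Separator + fileSize.ToString(CultureInfo.InvariantCulture)
            + Separator + fileName
            + HeaderEnd;
        return Encoding.ASCII.GetBytes(header);
    }
}
```

For a file a.txt with 5 bytes, BuildHeader("a.txt", 5) yields the bytes of |*START*OF*HEADER*|*||5||a.txt*|*END*OF*HEADER*|. The size field always begins at offset 22 (20 characters of start marker plus the 2 characters of the first separator).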

Creating / Compressing an Archive:
The paths of the files to be compressed are passed to the compress function as a string array, together with the path and name of the archive. The function iterates over the files and, for each one, writes the corresponding header followed by the file content to a MemoryStream.
The data from this stream is then copied to a byte array, which is written to the archive file via a GZipStream.
Because all files are first collected and then written through a single GZipStream, the archive can usually be compressed better than if each file were compressed separately.
The code:

        // Requires: using System.IO; using System.IO.Compression; using System.Text;
        private void CreateArchive(string[] files, string archiv)
        {
            Encoding encoder = Encoding.ASCII; // note: non-ASCII characters in file names become '?'
            byte[] HeaderStart = encoder.GetBytes("|*START*OF*HEADER*|*");
            byte[] HeaderEnd = encoder.GetBytes("*|*END*OF*HEADER*|");
            byte[] Separator = encoder.GetBytes("||");

            // Collect all headers and file contents in memory first.
            using (MemoryStream TempStream = new MemoryStream())
            {
                foreach (string file in files)
                {
                    byte[] Content = File.ReadAllBytes(file); // reads the complete file
                    byte[] FileSize = encoder.GetBytes(Content.Length.ToString()); // size of the current file
                    byte[] FileName = encoder.GetBytes(Path.GetFileName(file)); // name of the current file, without directory

                    TempStream.Write(HeaderStart, 0, HeaderStart.Length);
                    TempStream.Write(Separator, 0, Separator.Length);
                    TempStream.Write(FileSize, 0, FileSize.Length);
                    TempStream.Write(Separator, 0, Separator.Length);
                    TempStream.Write(FileName, 0, FileName.Length);
                    TempStream.Write(HeaderEnd, 0, HeaderEnd.Length);
                    TempStream.Write(Content, 0, Content.Length);
                }

                // Compress the collected data in one go; "using" closes the streams even if an exception occurs.
                byte[] BigFileContent = TempStream.ToArray();
                using (FileStream ArchiveStream = new FileStream(archiv, FileMode.Create))
                using (GZipStream CompressStream = new GZipStream(ArchiveStream, CompressionMode.Compress))
                {
                    CompressStream.Write(BigFileContent, 0, BigFileContent.Length);
                }
            }
        }
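
To illustrate why collecting the data before compressing can pay off, here is a minimal sketch that compares compressing two buffers separately with compressing them as one block. The class and method names (SolidCompressionSketch, GzipSize, CompareSizes) are hypothetical, and the test data is artificial (two identical blocks of pseudo-random bytes), so real files will usually show a smaller difference.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class SolidCompressionSketch
{
    // Returns the number of bytes GZipStream produces for the given input.
    public static long GzipSize(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // leaveOpen = true, so that output can still be inspected afterwards
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(data, 0, data.Length);
            } // disposing the GZipStream writes the remaining compressed data
            return output.Length;
        }
    }

    // Compresses two identical "files" separately and as one block.
    public static void CompareSizes(out long separately, out long together)
    {
        var file = new byte[2000];
        new Random(1).NextBytes(file); // pseudo-random, so hard to compress on its own

        var both = new byte[file.Length * 2];
        Buffer.BlockCopy(file, 0, both, 0, file.Length);
        Buffer.BlockCopy(file, 0, both, file.Length, file.Length);

        separately = 2 * GzipSize(file);
        together = GzipSize(both);
    }
}
```

Calling CompareSizes and printing both values should show that the combined block is clearly smaller, because the second copy can be encoded as references to the first. How large the gain is in practice depends on how similar the files are and on how far apart the repeated data lies in the stream.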


Unpacking / Decompressing the Archive:
Compression was the easy part; decompression is a bit harder. The decompress function gets the path and name of the archive as well as the directory into which the files should be extracted under their original names.
First we have to decompress the archive, so it is read through a GZipStream.
Its content is copied to a MemoryStream, which then writes its content to a byte array (this is easier from a MemoryStream, hence the detour).
The byte array is then converted to a string with an instance of the class ASCIIEncoding. The string is evaluated in a loop that handles one file per iteration. Two pointers mark positions in the string: the first stores the current position, the second the current search position.
At the start of each iteration, the first points to the position where the current header starts, and the second points 22 bytes further, which is where the file size field begins (20 characters of start marker plus 2 characters of separator). Since the structure of the header is known, file size and name can be read out by searching for "||" from the search position onwards and then advancing the positions.
After the current position has been moved past the header, a FileStream writes the part of the byte array that contains the current file (from the current position to the current position + file size) to a new file, which is created in the target directory under the original name.
The code:
        private void OpenArchive(string archiv, string decompressPath)
        {
            Encoding encoder = Encoding.ASCII;
            string EmptyHeader = "|*START*OF*HEADER*|*||||*|*END*OF*HEADER*|"; // "prototype" of the header
            int EmptyHeaderLength = encoder.GetByteCount(EmptyHeader); // length of a header without size and name
            byte[] BigFileContent;

            // Decompress the whole archive into memory.
            using (FileStream ArchiveStream = new FileStream(archiv, FileMode.Open))
            using (GZipStream DecompressStream = new GZipStream(ArchiveStream, CompressionMode.Decompress))
            using (MemoryStream TempStream = new MemoryStream())
            {
                DecompressStream.CopyTo(TempStream);
                BigFileContent = TempStream.ToArray();
            }

            // ASCII yields exactly one character per byte, so string positions equal byte positions.
            string StringFromBytes = encoder.GetString(BigFileContent);

            long CurrentPosition = 0; // current position in the file (start of the current header)
            long CurrentSearchPosition = 22; // current searching position (start of the file size field)

            while (CurrentSearchPosition < BigFileContent.Length)
            {
                // the file size is written in the bytes from position 22 in the header to EndSize
                int EndSize = StringFromBytes.IndexOf("||", (int)CurrentSearchPosition);
                string SizeText = StringFromBytes.Substring((int)CurrentSearchPosition, EndSize - (int)CurrentSearchPosition);
                long FileLength = long.Parse(SizeText);

                int StartFileName = EndSize + 2;
                int EndFileName = StringFromBytes.IndexOf("*|*", StartFileName);
                string FileName = StringFromBytes.Substring(StartFileName, EndFileName - StartFileName); // read out the file name

                // skip the complete header: prototype + digits of the size + file name
                CurrentPosition += EmptyHeaderLength + SizeText.Length + FileName.Length;

                using (FileStream NormalFileStream = new FileStream(Path.Combine(decompressPath, FileName), FileMode.Create))
                {
                    NormalFileStream.Write(BigFileContent, (int)CurrentPosition, (int)FileLength);
                }

                CurrentPosition += FileLength;
                CurrentSearchPosition = CurrentPosition + 22;
            }
        }
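
The pointer arithmetic described above can also be looked at in isolation. The following sketch reads a single header from a byte array and reports the file name, the file size and the position of the first content byte. The class and method names (HeaderParseSketch, ParseHeader) are hypothetical; the sketch assumes well-formed input and does no error handling.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class HeaderParseSketch
{
    const int SizeFieldOffset = 22;               // "|*START*OF*HEADER*|*" (20) + "||" (2)
    const string HeaderEnd = "*|*END*OF*HEADER*|"; // 18 characters

    // Reads the header that starts at headerStart in the decompressed archive.
    public static void ParseHeader(byte[] archive, int headerStart,
        out string fileName, out long fileSize, out int contentStart)
    {
        // ASCII maps every byte to exactly one char, so string indexes equal byte offsets.
        // (Decoding the whole array on every call is wasteful, but keeps the sketch short.)
        string text = Encoding.ASCII.GetString(archive);

        int sizeStart = headerStart + SizeFieldOffset;
        int sizeEnd = text.IndexOf("||", sizeStart, StringComparison.Ordinal);
        fileSize = long.Parse(text.Substring(sizeStart, sizeEnd - sizeStart),
            CultureInfo.InvariantCulture);

        int nameStart = sizeEnd + 2; // skip the second "||"
        int nameEnd = text.IndexOf("*|*", nameStart, StringComparison.Ordinal);
        fileName = text.Substring(nameStart, nameEnd - nameStart);

        contentStart = nameEnd + HeaderEnd.Length; // first byte after the header
    }
}
```

For an archive that starts with |*START*OF*HEADER*|*||5||a.txt*|*END*OF*HEADER*|hello, calling ParseHeader with headerStart 0 gives the name a.txt, the size 5 and the content position 48. The next header would then start at 48 + 5 = 53.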
