Tar (file format)
Tar | |
---|---|
File extension: | .tar |
MIME type: | application/x-tar |
Uniform Type Identifier: | public.tar-archive |
Magic: | ustar at byte 257 |
Type of format: | file archive |
Container for: | anything |
Contained by: | compress, gzip, bzip2 |
In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of the program used to handle such files. The format was standardized by POSIX.1-1988 and later POSIX.1-2001. Initially developed as a raw format for tape backup and other sequential access devices, it is now commonly used to collect many files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.
tar's linear roots can still be seen in its ability to work on any data stream and in its slow partial extraction: it has to read through the whole archive to extract only the final file. A tar file (somefile.tar), when subsequently compressed using a compression utility such as gzip, bzip2 or, formerly, compress, produces a compressed tar file with a filename extension indicating the type of compression (e.g. somefile.tar.gz). A .tar file is commonly referred to as a tarball, whether it is compressed or not.
As is common with Unix utilities, tar is a single specialist program. It follows the Unix philosophy in that it can "do only one thing" (archive), "but do it well". tar is most commonly used in tandem with an external compression utility, since it has no built-in data compression facilities. These compression utilities generally compress only a single file, hence the pairing with tar, which can produce a single file from many files. To ease this common usage, the BSD and GNU versions of tar support the command line options -z (gzip), -j (bzip2), and -Z (compress), which compress or decompress the archive file being worked on, although even in this case the (de)compression is still performed by an external program. Compression is sometimes avoided for long-term storage, because damage to a small part of a compressed archive can make a much larger part of the data unrecoverable.
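The pairing of archiving and compression described above can also be tried from Python's standard tarfile module. The following is only a minimal sketch: the helper names and the sample file names are made up for illustration, and unlike tar -z the tarfile module compresses in-process instead of calling an external program.

```python
import os
import tarfile
import tempfile


def make_tarball(paths, archive_path, compression="gz"):
    """Bundle several files into one archive.

    compression may be "gz", "bz2", or "" for a plain .tar file.
    """
    mode = "w:" + compression if compression else "w"
    with tarfile.open(archive_path, mode) as archive:
        for path in paths:
            # arcname stores only the base name, not the full source path
            archive.add(path, arcname=os.path.basename(path))


def list_tarball(archive_path):
    """Return member names; "r:*" lets tarfile detect the compression."""
    with tarfile.open(archive_path, "r:*") as archive:
        return archive.getnames()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sources = []
        for name in ("a.txt", "b.txt"):  # hypothetical sample files
            path = os.path.join(workdir, name)
            with open(path, "w") as handle:
                handle.write("sample contents of " + name + "\n")
            sources.append(path)
        target = os.path.join(workdir, "somefile.tar.gz")
        make_tarball(sources, target)
        print(list_tarball(target))
```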
Format details
A tar file is the concatenation of one or more files. Each file is preceded by a header block. The file data is written unaltered except that its length is rounded up to a multiple of 512 bytes and the extra space is zero filled. The end of an archive is marked by at least two consecutive zero-filled blocks.
Early tape drives could only write data in 512-byte blocks. As a result, data in tar files is arranged in 512-byte blocks.
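The block layout can be illustrated with a short sketch that walks an archive header by header. It assumes the file name sits at offset 0 and the octal file size at offset 124 of each header (taken from the header field table), stops at the first zero-filled block instead of checking for two, and does not handle the binary size extension. The function names are made up for this example.

```python
BLOCK = 512


def padded_size(size):
    """Round a file size up to the next multiple of 512 bytes."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def iter_members(stream):
    """Yield (name, size) for each member by reading headers in order."""
    while True:
        header = stream.read(BLOCK)
        if len(header) < BLOCK or header == bytes(BLOCK):
            return  # a zero-filled block (or a short read) ends the walk
        name = header[0:100].split(b"\0", 1)[0].decode("ascii")
        text = header[124:136].decode("ascii").strip(" \0")
        size = int(text, 8) if text else 0
        yield name, size
        # Read past the data and its zero padding to reach the next header.
        # Reading instead of seeking keeps this usable on plain streams.
        remaining = padded_size(size)
        while remaining:
            chunk = stream.read(min(remaining, 65536))
            if not chunk:
                return
            remaining -= len(chunk)
```

Because every header must be read in turn, finding the last member means passing over everything before it, which is the slow partial extraction that the format is known for.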
File header
The file header block contains metadata about a file. To ensure portability across different architectures with different byte orderings, the information in the header block is encoded in ASCII. Thus if all the files in an archive are text files, then the archive is essentially an ASCII file.
The fields defined by the original Unix tar format are listed in the table below. When a field is unused it is zero filled. The header is padded with zero bytes to make it up to a 512 byte block.
Field Offset | Field Size | Field |
---|---|---|
0 | 100 | File name |
100 | 8 | File mode |
108 | 8 | Owner user ID |
116 | 8 | Owner group ID |
124 | 12 | File size in bytes |
136 | 12 | Last modification time |
148 | 8 | Checksum for header block |
156 | 1 | Link indicator |
157 | 100 | Name of linked file |
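The offsets and sizes in the table translate directly into byte slices. The sketch below cuts a header block into its raw fields without decoding them; the dictionary keys and the function name are illustrative choices, not names defined by the format.

```python
# (offset, size) pairs copied from the table of original tar header fields
OLD_FIELDS = {
    "name": (0, 100),
    "mode": (100, 8),
    "uid": (108, 8),
    "gid": (116, 8),
    "size": (124, 12),
    "mtime": (136, 12),
    "chksum": (148, 8),
    "linkflag": (156, 1),
    "linkname": (157, 100),
}


def split_header(block):
    """Cut a 512-byte header block into its raw, undecoded fields."""
    if len(block) != 512:
        raise ValueError("a tar header block is exactly 512 bytes")
    return {
        field: block[offset:offset + size]
        for field, (offset, size) in OLD_FIELDS.items()
    }
```

The listed fields cover the first 257 bytes of the block; in the original format the rest is zero padding.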
The Link indicator field can have the following values:
Value | Meaning |
---|---|
0 | Normal file |
(ASCII NUL)[1] | Normal file |
1 | Hard link |
2 | Symbolic link[2] |
3 | Character special |
4 | Block special |
5 | Directory |
6 | FIFO |
7 | Contiguous file[3] |
A directory is also indicated by a trailing slash (/) in the name.
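As a small illustration, the link indicator table and the trailing-slash rule can be combined into one lookup. The function name and the "unknown" fallback are assumptions made for this sketch.

```python
# Values copied from the link indicator table
LINK_INDICATORS = {
    "0": "normal file",
    "\0": "normal file",  # ASCII NUL is also treated as a normal file
    "1": "hard link",
    "2": "symbolic link",
    "3": "character special",
    "4": "block special",
    "5": "directory",
    "6": "FIFO",
    "7": "contiguous file",
}


def describe_entry(name, indicator):
    """Return a readable entry type for a name and its link indicator."""
    if name.endswith("/"):
        return "directory"  # a trailing slash also marks a directory
    return LINK_INDICATORS.get(indicator, "unknown")
```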
For historical reasons numerical values are encoded in octal with leading zeroes. The final character is either a null or a space. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation some versions of tar, including the GNU implementation, support an extension in which the file size is encoded in binary. Additionally, versions of GNU tar from 1999 and before pad the values with space characters instead of zero characters.
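The octal encoding and the resulting size limit can be checked with a few lines. This is a sketch with made-up helper names; it covers the zero-padded form and tolerates the space-padded form on reading, but not the binary size extension.

```python
def encode_octal(value, width):
    """Encode value as zero-padded octal plus a NUL, filling width bytes."""
    digits = width - 1  # the final byte is the terminator
    if value < 0 or value >= 8 ** digits:
        raise ValueError("value does not fit in %d octal digits" % digits)
    return ("%0*o" % (digits, value)).encode("ascii") + b"\0"


def decode_octal(field):
    """Decode an octal field, ignoring NUL or space terminators and padding."""
    text = field.decode("ascii").strip(" \0")
    return int(text, 8) if text else 0


# Largest size that fits in the 11 octal digits of the 12-byte size field
MAX_SIZE = 8 ** 11 - 1

if __name__ == "__main__":
    print(MAX_SIZE)                 # one byte less than 8 * 2**30
    print(encode_octal(MAX_SIZE, 12))
```

Eleven octal digits hold 33 bits, so the largest representable size is 8^11 - 1 = 8,589,934,591 bytes, one byte short of 8 gigabytes (counting a gigabyte as 2^30 bytes).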
The checksum is calculated by taking the sum of the byte values of the header block, with the eight checksum bytes taken to be ASCII spaces (value 32). It is stored as a six-digit octal number with leading zeroes, followed by a NUL and then a space.
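The checksum rule can be written out as follows. This is a sketch with illustrative function names; it assumes the checksum field occupies bytes 148 to 155 of the header, as listed in the field table.

```python
def header_checksum(block):
    """Sum all 512 header bytes, counting the checksum field as spaces."""
    if len(block) != 512:
        raise ValueError("a tar header block is exactly 512 bytes")
    # Bytes 148..155 hold the checksum itself and count as 8 spaces (32 each)
    return sum(block[:148]) + 8 * 32 + sum(block[156:])


def format_checksum(value):
    """Format as six octal digits, then a NUL, then a space."""
    return ("%06o" % value).encode("ascii") + b"\0 "


def with_checksum(block):
    """Return a copy of the block with its checksum field filled in."""
    return block[:148] + format_checksum(header_checksum(block)) + block[156:]
```

Since the stored checksum is excluded from its own calculation, running with_checksum on a header that is already valid returns the same bytes, which gives a simple way to verify a header.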
USTAR format
Most modern tar programs read and write archives in the new USTAR (Uniform Standard Tape Archive) format, which has an extended header definition as defined by the POSIX (IEEE P1003.1) standards group. Older tar programs will ignore the extra information, while newer programs will test for the presence of the "ustar" string to determine if the new format is in use. The USTAR format allows for longer file names and stores extra information about each file.
Field Offset | Field Size | Field |
---|---|---|
0 | 156 | (as in old format) |
156 | 1 | Type flag |
157 | 100 | (as in old format) |
257 | 6 | USTAR indicator |
263 | 2 | USTAR version |
265 | 32 | Owner user name |
297 | 32 | Owner group name |
329 | 8 | Device major number |
337 | 8 | Device minor number |
345 | 155 | Filename prefix |
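A reader can test for the new format and rebuild a long file name roughly as sketched below. Two points are assumptions beyond the table: the prefix and name are joined with a slash, as the POSIX definition of the format specifies, and the prefix is only trusted when the indicator is the NUL-terminated "ustar", because other variants of the indicator may use that area differently. The function names are made up for this example.

```python
def _text(field):
    """Decode a NUL-terminated ASCII field."""
    return field.split(b"\0", 1)[0].decode("ascii")


def is_ustar(block):
    """True when the "ustar" string sits at offset 257 of the header."""
    return block[257:262] == b"ustar"


def full_name(block):
    """Join the filename prefix (offset 345) with the name field (offset 0)."""
    name = _text(block[0:100])
    # Assumption: only the NUL-terminated indicator guarantees a prefix field
    if block[257:263] != b"ustar\0":
        return name
    prefix = _text(block[345:500])
    return prefix + "/" + name if prefix else name
```

With a 155-byte prefix and a 100-byte name, a path of up to 256 characters can be stored as long as it can be split at a slash into two parts that fit.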
Example
The example below shows a dump of a header block from a tar file created using the GNU tar program. It was dumped with the od program, which shows each byte as a character and gives the offsets in octal. The "ustar" magic string can be seen, meaning that the tar file is in USTAR format.
    0000000   e   t   c   /   p   a   s   s   w   d nul nul nul nul nul nul
    0000020 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
    *
    0000140 nul nul nul nul   0   1   0   0   6   4   4 nul   0   0   0   0
    0000160   0   0   0 nul   0   0   0   0   0   0   0 nul   0   0   0   0
    0000200   0   0   4   1   3   5   5 nul   1   0   1   5   5   0   6   1
    0000220   1   0   5 nul   0   1   1   5   5   6 nul  sp   0 nul nul nul
    0000240 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
    *
    0000400 nul   u   s   t   a   r  sp  sp nul   r   o   o   t nul nul nul
    0000420 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
    0000440 nul nul nul nul nul nul nul nul nul   r   o   o   t nul nul nul
    0000460 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
    *
    0001000
Note that the tar in OpenBSD 3.7 does not write the two space characters after ustar; it writes nul characters instead.
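The numeric fields in this dump can be decoded by reading them as octal. The sketch below transcribes the field contents by hand from the dump (an assumption to double-check against the offsets in the field table) and interprets the modification time as seconds since the Unix epoch, which is how tar stores it.

```python
from datetime import datetime, timezone

# Field contents transcribed from the dump (terminating characters omitted)
EXAMPLE = {
    "mode": "0100644",
    "uid": "0000000",
    "gid": "0000000",
    "size": "00000041355",
    "mtime": "10155061105",
    "chksum": "011556",
}

# Every numeric header field is an octal number
decoded = {field: int(text, 8) for field, text in EXAMPLE.items()}
modified = datetime.fromtimestamp(decoded["mtime"], tz=timezone.utc)

print("permission bits:", oct(decoded["mode"] & 0o7777))
print("owner and group IDs:", decoded["uid"], decoded["gid"])
print("size in bytes:", decoded["size"])
print("modified (UTC):", modified.isoformat())
print("checksum:", decoded["chksum"])
```

The size field, for instance, decodes to 17133 bytes, and both the user ID and group ID are 0, matching the owner and group name "root" visible later in the dump.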
Tarbombs
Tarbomb is derogatory hacker slang for a tarball whose files untar into the current directory instead of into a directory of their own. This can cause problems if it overwrites files with the same names in the current directory. It is also a nuisance for the user, who then has to delete all the files scattered among the directory's other contents. This often happens in the user's home directory. Creating such an archive is generally considered bad etiquette on the part of the archive's creator.
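One way to check an archive before extracting it is to look at the first path component of every member. The sketch below uses Python's standard tarfile module; the function names and the rule "more than one top-level entry means tarbomb" are a simple heuristic assumed for this example, not an established definition.

```python
import tarfile


def top_level_entries(archive_path):
    """Return the set of first path components used by the members."""
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
    tops = set()
    for name in names:
        # Ignore empty components and a leading "./"
        parts = [part for part in name.split("/") if part not in ("", ".")]
        if parts:
            tops.add(parts[0])
    return tops


def looks_like_tarbomb(archive_path):
    """Heuristic: members spread over several top-level entries."""
    return len(top_level_entries(archive_path)) > 1
```

A cautious user could call looks_like_tarbomb first and, if it returns True, extract into a freshly created directory instead of the current one.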
Tarpit
Tarpit is a term for a method of revision control in which a tar archive is used to capture the state of development of a software module at a particular point in time. Descriptive archive names loosely play the role that tags and branches play in revision control software.
Notes
- ^ This is probably a workaround for buggy tar implementations (the byte 0x00 is ASCII NUL).
- ^ GNU tar's headers mark this field as "Reserved".
- ^ Apparently relevant on an OS called RTU, this would be a normal file written in one contiguous section on disk. GNU tar's headers mark this field as "Reserved", and such items will probably be extracted as normal files on other operating systems.