TAR(5) | File Formats Manual | TAR(5) |
tar - format of tape archive files
The tar archive format collects any number of files, directories, and other file system objects (symbolic links, device nodes, etc.) into a single stream of bytes. The format was originally designed to be used with tape drives that operate with fixed-size blocks, but is widely used as a general packaging mechanism.
A tar archive consists of a series of 512-byte records. Each file system object requires a header record which stores basic metadata (pathname, owner, permissions, etc.) and zero or more records containing any file data. The end of the archive is indicated by two records consisting entirely of zero bytes.
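The record layout above can be sketched in C as follows. This is a minimal illustration, not code from any tar implementation: the names TAR_RECORD_SIZE, tar_data_records, and tar_record_is_zero are hypothetical, and the sketch assumes that the final data record of a file is padded out to a full 512 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAR_RECORD_SIZE 512  /* hypothetical name for the record size */

/* Number of 512-byte data records needed to hold `size` bytes of file data. */
static uint64_t tar_data_records(uint64_t size)
{
    return size / TAR_RECORD_SIZE + (size % TAR_RECORD_SIZE != 0);
}

/* Returns 1 if the record consists entirely of zero bytes, otherwise 0. */
static int tar_record_is_zero(const unsigned char record[TAR_RECORD_SIZE])
{
    assert(record != NULL);
    for (size_t i = 0; i < TAR_RECORD_SIZE; i++) {
        if (record[i] != 0)
            return 0;
    }
    return 1;
}
```

A reader would typically skip tar_data_records(size) records after each header to reach the next header, and stop once two consecutive records pass tar_record_is_zero().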
For compatibility with tape drives that use fixed block sizes, programs that read or write tar files always read or write a fixed number of records with each I/O operation. These “blocks” are always a multiple of the record size. The maximum block size supported by early implementations was 10240 bytes or 20 records. This is still the default for most implementations although block sizes of 1 MiB (2048 records) or larger are commonly used with modern high-speed tape drives. (Note: the terms “block” and “record” here are not entirely standard; this document follows the convention established by John Gilmore in documenting pdtar.)
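The blocking arithmetic above can be illustrated with a short sketch. The function name tar_padded_length and its parameters are hypothetical; the sketch only shows how an archive length rounds up to a whole number of blocks for a given number of records per block.

```c
#include <assert.h>
#include <stdint.h>

#define TAR_RECORD_SIZE 512  /* hypothetical name for the record size */

/* Round an archive length in bytes up to a whole number of blocks. */
static uint64_t tar_padded_length(uint64_t archive_bytes,
                                  unsigned records_per_block)
{
    assert(records_per_block > 0);
    uint64_t block_size = (uint64_t)records_per_block * TAR_RECORD_SIZE;
    uint64_t blocks = archive_bytes / block_size
                    + (archive_bytes % block_size != 0);
    return blocks * block_size;
}
```

For example, an archive holding one header, one data record, and the two terminating zero records occupies 2048 bytes, which the default of 20 records per block pads to 10240 bytes.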
The original tar archive format has been extended many times to include additional information that various implementors found necessary. This section describes the variant implemented by the tar command included in Version 7 AT&T UNIX, which seems to be the earliest widely-used version of the tar program.
The header record for an old-style tar archive consists of the following:
struct header_old_tar {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char checksum[8];
    char linkflag[1];
    char linkname[100];
    char pad[255];
};
Early tar implementations varied in how they terminated these fields. The tar command in Version 7 AT&T UNIX used the following conventions (this is also documented in early BSD manpages): the pathname must be null-terminated; the mode, uid, and gid fields must end in a space and a null byte; the size and mtime fields must end in a space; the checksum is terminated by a null and a space. Early implementations filled the numeric fields with leading spaces. This seems to have been common practice until the IEEE Std 1003.1-1988 (“POSIX.1”) standard was released. For best portability, modern implementations should fill the numeric fields with leading zeros.
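Because writers differ in padding and termination, a tolerant reader usually accepts both styles. The following is a minimal sketch of such a parser, not the code of any particular implementation; the name tar_parse_octal is hypothetical, and treating an all-blank field as zero is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Parse an octal header field of `len` bytes. Leading spaces (early
 * writers) and leading zeros (later writers) are both accepted; parsing
 * stops at the first space or null byte after the digits, or at the end
 * of the field. Returns 0 on success, or -1 if a non-octal byte is
 * found or the value does not fit in 64 bits.
 */
static int tar_parse_octal(const char *field, size_t len, uint64_t *out)
{
    size_t i = 0;
    uint64_t value = 0;

    assert(field != NULL && out != NULL);
    while (i < len && field[i] == ' ')
        i++;
    for (; i < len && field[i] != ' ' && field[i] != '\0'; i++) {
        if (field[i] < '0' || field[i] > '7')
            return -1;
        if (value > (UINT64_MAX >> 3))
            return -1;  /* the next shift would overflow */
        value = (value << 3) | (uint64_t)(field[i] - '0');
    }
    *out = value;
    return 0;
}
```

With this sketch, a mode field written as three spaces, "644", and a trailing space parses to the same value as one written as "0000644" followed by a null byte.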
An early draft of IEEE Std 1003.1-1988 (“POSIX.1”) served as the basis for John Gilmore's pdtar program and many system implementations from the late 1980s and early 1990s. These archives generally follow the POSIX ustar format described below with the following variations:
IEEE Std 1003.1-1988 (“POSIX.1”) defined a standard tar file format to be read and written by compliant implementations of tar(1). This format is often called the “ustar” format, after the magic value used in the header. (The name is an acronym for “Unix Standard TAR”.) It extends the historic format with new fields:
struct header_posix_ustar {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char checksum[8];
    char typeflag[1];
    char linkname[100];
    char magic[6];
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char prefix[155];
    char pad[12];
};
Note that all unused bytes must be set to NUL.
Field termination is specified slightly differently by POSIX than by previous implementations. The magic, uname, and gname fields must have a trailing NUL. The pathname, linkname, and prefix fields must have a trailing NUL unless they fill the entire field. (In particular, it is possible to store a 256-character pathname if it happens to have a / as the 156th character.) POSIX requires numeric fields to be zero-padded in the front, and requires them to be terminated with either space or NUL characters.
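The 256-character case follows from how a reader rebuilds the pathname: a non-empty prefix, a / separator, and then the name field. The sketch below illustrates that reconstruction under the termination rules just described. The names field_len, ustar_full_path, USTAR_NAME_LEN, and USTAR_PREFIX_LEN are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define USTAR_NAME_LEN   100  /* size of the name field */
#define USTAR_PREFIX_LEN 155  /* size of the prefix field */

/* Length of a header text field that is NUL-terminated unless it is full. */
static size_t field_len(const char *field, size_t max)
{
    const char *nul = memchr(field, '\0', max);
    return nul ? (size_t)(nul - field) : max;
}

/*
 * Build the full pathname from the ustar prefix and name fields.
 * `out` must hold at least USTAR_PREFIX_LEN + 1 + USTAR_NAME_LEN + 1
 * bytes. Returns the length of the resulting string.
 */
static size_t ustar_full_path(const char prefix[USTAR_PREFIX_LEN],
                              const char name[USTAR_NAME_LEN],
                              char *out)
{
    size_t plen = field_len(prefix, USTAR_PREFIX_LEN);
    size_t nlen = field_len(name, USTAR_NAME_LEN);
    size_t pos = 0;

    assert(out != NULL);
    if (plen > 0) {
        memcpy(out, prefix, plen);
        pos = plen;
        out[pos++] = '/';  /* the separator itself is not stored in the header */
    }
    memcpy(out + pos, name, nlen);
    pos += nlen;
    out[pos] = '\0';
    return pos;
}
```

When both fields are completely full, the result is 155 + 1 + 100 = 256 characters long, with the / at position 156.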
Currently, most tar implementations comply with the ustar format, occasionally extending it by adding new fields to the blank area at the end of the header record.
There have been several attempts to extend the range of sizes or times supported by modifying how numbers are stored in the header.
One obvious extension to increase the size of files is to eliminate the terminating characters from the various numeric fields. For example, the standard only allows the size field to contain 11 octal digits, reserving the twelfth byte for a trailing NUL character. Using all 12 bytes for octal digits permits file sizes just under 64 GiB (8^12 - 1 bytes).
Another extension, utilized by GNU tar, star, and other newer tar implementations, permits binary numbers in the standard numeric fields. This is flagged by setting the high bit of the first byte. The remainder of the field is treated as a signed two's-complement value. This permits 95-bit values for the length and time fields and 63-bit values for the uid, gid, and device numbers. In particular, this provides a consistent way to handle negative time values. GNU tar supports this extension for the length, mtime, ctime, and atime fields. Joerg Schilling's star program and the libarchive library support this extension for all numeric fields. Note that this extension is largely obsoleted by the extended attribute record provided by the pax interchange format.
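A decoder for this binary form might look like the sketch below. It is an illustration under stated assumptions rather than the code of GNU tar, star, or libarchive: the function name tar_parse_base256 is hypothetical, the bytes are assumed to be stored most significant first, and values that do not fit in a signed 64-bit integer are rejected rather than clamped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a binary numeric field of `len` bytes whose first byte has the
 * high bit set. The remaining 8*len - 1 bits are read as a big-endian
 * two's-complement value. Returns 0 on success, or -1 if the field is
 * not in binary form or the value does not fit in an int64_t.
 */
static int tar_parse_base256(const unsigned char *field, size_t len,
                             int64_t *out)
{
    assert(field != NULL && out != NULL && len >= 1);
    if ((field[0] & 0x80) == 0)
        return -1;  /* ordinary octal field, not handled here */

    int negative = (field[0] & 0x40) != 0;  /* sign bit of the value */
    unsigned char fill = negative ? 0xff : 0x00;
    uint64_t value = negative ? UINT64_MAX : 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char b = field[i];
        if (i == 0 && !negative)
            b &= 0x7f;  /* drop the flag bit; for negatives it doubles as sign extension */
        if (len - i > 8) {
            if (b != fill)
                return -1;  /* significant bits beyond 64: out of range */
        } else {
            value = (value << 8) | b;
        }
    }
    if ((value >> 63) != (uint64_t)negative)
        return -1;  /* sign would change when narrowed to 64 bits */
    *out = negative ? -(int64_t)(~value) - 1 : (int64_t)value;
    return 0;
}
```

A reader would try this path only when the high bit of the first byte is set, and fall back to octal parsing otherwise.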
Another early GNU extension allowed base-64 values rather than octal. This extension was short-lived and is no longer supported by any implementation.
There are many attributes that cannot be portably stored in a POSIX ustar archive. IEEE Std 1003.1-2001 (“POSIX.1”) defined a “pax interchange format” that uses two new types of entries to hold text-formatted metadata that applies to following entries. Note that a pax interchange format archive is a ustar archive in every respect. The new data is stored in ustar-compatible archive entries that use the “x” or “g” typeflag. In particular, older implementations that do not fully support these extensions will extract the metadata into regular files, where the metadata can be examined as necessary.
An entry in a pax interchange format archive consists of one or two standard ustar entries, each with its own header and data. The first optional entry stores the extended attributes for the following entry. This optional first entry has an "x" typeflag and a size field that indicates the total size of the extended attributes. The extended attributes themselves are stored as a series of text-format lines encoded in the portable UTF-8 encoding. Each line consists of a decimal number, a space, a key string, an equals sign, a value string, and a newline. The decimal number indicates the length of the entire line, including the initial length field and the trailing newline. An example of such a field is:
25 ctime=1084839148.1212\n
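Because the length field counts its own digits, a writer cannot compute it in a single step. The sketch below shows one way to do it by growing the digit count until it is self-consistent. The function name pax_format_record is hypothetical and the sketch is not taken from any existing implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one pax extended-attribute line "LEN KEY=VALUE\n" into `out`.
 * LEN counts every byte of the line, including its own digits.
 * Returns the line length, or 0 if `out` is too small.
 */
static size_t pax_format_record(const char *key, const char *value,
                                char *out, size_t out_size)
{
    assert(key != NULL && value != NULL && out != NULL);
    /* Bytes other than the length digits: the space, '=', and newline. */
    size_t body = strlen(key) + strlen(value) + 3;
    size_t digits = 1;
    size_t total;

    /* Grow the digit count until LEN's own digits are accounted for. */
    for (;;) {
        total = body + digits;
        size_t needed = 1;
        for (size_t t = total; t >= 10; t /= 10)
            needed++;
        if (needed == digits)
            break;
        digits = needed;
    }
    if (total + 1 > out_size)
        return 0;  /* no room for the line plus the terminating NUL */
    snprintf(out, out_size, "%zu %s=%s\n", total, key, value);
    return total;
}
```

Called with the key ctime and the value 1084839148.1212, this produces the 25-byte line shown in the example above.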
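atime, ctime, mtime
    File access, inode change, and modification times.
hdrcharset
    The character set used by the pax extension values.
uname, uid, gname, gid
    User name, numeric user ID, group name, and numeric group ID.
linkpath
    The full path of the linked-to file.
path
    The full pathname of the entry.
realtime.*, security.*
    These keys are reserved.
size
    The size of the file.
SCHILY.*
    Vendor-specific attributes used by Joerg Schilling's star implementation.
SCHILY.acl.access, SCHILY.acl.default, SCHILY.acl.ace
    The access, default, and NFSv4 ACLs, stored as text.
SCHILY.devminor, SCHILY.devmajor
    The full minor and major numbers for device nodes.
SCHILY.fflags
    The file flags.
SCHILY.realsize
    The full size of the file on disk.
SCHILY.dev, SCHILY.ino, SCHILY.nlinks
    The device number, inode number, and link count for the entry. In particular, note that a pax interchange format archive using Joerg Schilling's SCHILY.* extensions can store all of the data from struct stat.
LIBARCHIVE.*
    Vendor-specific attributes used by the libarchive library and programs that use it.
LIBARCHIVE.creationtime
    The time when the file was created.
LIBARCHIVE.xattr.namespace.key
    Extended attributes stored by libarchive, using keys of this form.
VENDOR.*
    Other vendor-specific extensions.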
Any values stored in an extended attribute override the corresponding values in the regular tar header. Note that compliant readers should ignore the regular fields when they are overridden. This is important, as existing archivers are known to store non-compliant values in the standard header fields in this situation. There are no limits on length for any of these fields. In particular, numeric fields can be arbitrarily large. All text fields are encoded in UTF-8. Compliant writers should store only portable 7-bit ASCII characters in the standard ustar header and use extended attributes whenever a text value contains non-ASCII characters.
In addition to the x entry described above, the pax interchange format also supports a g entry. The g entry is identical in format, but specifies attributes that serve as defaults for all subsequent archive entries. The g entry is not widely used.
Besides the new x and g entries, the pax interchange format has a few other minor variations from the earlier ustar format. The most troubling one is that hardlinks are permitted to have data following them. This allows readers to restore any hardlink to a file without having to rewind the archive to find an earlier entry. However, it creates complications for robust readers, as it is no longer clear whether or not they should ignore the size field for hardlink entries.
The GNU tar program started with a pre-POSIX format similar to that described earlier and has extended it using several different mechanisms: It added new fields to the empty space in the header (some of which was later used by POSIX for conflicting purposes); it allowed the header to be continued over multiple records; and it defined new entries that modify following entries (similar in principle to the x entry described above, but each GNU special entry is single-purpose, unlike the general-purpose x entry). As a result, GNU tar archives are not POSIX compatible, although more lenient POSIX-compliant readers can successfully extract most GNU tar archives.
struct header_gnu_tar {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char checksum[8];
    char typeflag[1];
    char linkname[100];
    char magic[6];
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char atime[12];
    char ctime[12];
    char offset[12];
    char longnames[4];
    char unused[1];
    struct {
        char offset[12];
        char numbytes[12];
    } sparse[4];
    char isextended[1];
    char realsize[12];
    char pad[17];
};
Note that the "D" typeflag specifically violates POSIX, which requires that unrecognized typeflags be restored as normal files. In this case, restoring the "D" entry as a file could interfere with subsequent creation of the like-named directory.
struct gnu_sparse_header {
    struct {
        char offset[12];
        char numbytes[12];
    } sparse[21];
    char isextended[1];
    char padding[7];
};
For M type files, the current entry is only a portion of the file. In that case, the POSIX size field will indicate the size of this entry; the realsize field will indicate the total size of the file.
GNU tar 1.14 (XXX check this XXX) and later will write pax interchange format archives when you specify the --posix flag. This format follows the pax interchange format closely, using some SCHILY tags and introducing new keywords to store sparse file information. There have been three iterations of the sparse file support, referred to as “0.0”, “0.1”, and “1.0”.
GNU.sparse.numblocks, GNU.sparse.offset, GNU.sparse.numbytes, GNU.sparse.size
    The “0.0” format used a GNU.sparse.numblocks attribute to indicate the number of blocks in the file, a pair of GNU.sparse.offset and GNU.sparse.numbytes to indicate the offset and size of each block, and a single GNU.sparse.size to indicate the full size of the file. This is not the same as the size in the tar header because the latter value does not include the size of any holes. This format required that the order of attributes be preserved and relied on readers accepting multiple appearances of the same attribute names, which is not officially permitted by the standards.
GNU.sparse.map
    The “0.1” format used this single attribute to store a comma-separated list of decimal numbers giving the offset and size of each block of data.
GNU.sparse.major, GNU.sparse.minor, GNU.sparse.name, GNU.sparse.realsize
    The “1.0” format stores the sparse block map in the entry body ahead of the file data. The pax attributes indicate the existence of this map (via the GNU.sparse.major and GNU.sparse.minor fields) and the full size of the file. The GNU.sparse.name holds the true name of the file. To avoid confusion, the name stored in the regular tar header is a modified name so that extraction errors will be apparent to users.
XXX More Details Needed XXX
Solaris tar (beginning with SunOS XXX 5.7 ?? XXX) supports an “extended” format that is fundamentally similar to pax interchange format, with the following differences:
Extended attributes are stored in an entry whose type is X, not x, as used by pax interchange format. The detailed format of this entry appears to be the same as detailed above for the x entry.
An additional A header is used to store an ACL for the following regular entry. The body of this entry contains a seven-digit octal number followed by a zero byte, followed by the textual ACL description. The octal value is the number of ACL entries plus a constant that indicates the ACL type: 01000000 for POSIX.1e ACLs and 03000000 for NFSv4 ACLs.
XXX More details needed XXX
AIX tar uses a ustar-formatted header with the type A for storing coded ACL information. Unlike the Solaris format, AIX tar writes this header after the regular file body to which it applies. The pathname in this header is either NFS4 or AIXC to indicate the type of ACL stored. The actual ACL is stored in platform-specific binary format.
The tar distributed with Apple's Mac OS X stores most regular files as two separate files in the tar archive. The two files have the same name except that the first one has “._” prepended to the last path element. This special file stores an AppleDouble-encoded binary blob with additional metadata about the second file, including ACL, extended attributes, and resources. To recreate the original file on disk, each separate file can be extracted and the Mac OS X copyfile() function can be used to unpack the separate metadata file and apply it to the regular file. Conversely, the same function provides a “pack” option to encode the extended metadata from a file into a separate file whose contents can then be put into a tar archive.
Note that the Apple extended attributes interact badly with long filenames. Since each file is stored with the full name, a separate set of extensions needs to be included in the archive for each one, doubling the overhead required for files with long names.
The following list is a condensed summary of the type codes used in tar header records generated by different tar implementations. More details about specific implementations can be found above:
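0  POSIX standard type code for a regular file.
1  POSIX hard link description.
2  POSIX symbolic link description.
3  POSIX character device node.
4  POSIX block device node.
5  POSIX directory.
6  POSIX FIFO node.
7  POSIX reserved.
7  GNU tar used for pre-allocated files on some systems.
A  Solaris tar ACL description stored prior to a regular file header.
A  AIX tar ACL description stored after the file body.
D  GNU tar directory dump.
K  GNU tar long linkname for the following header.
L  GNU tar long pathname for the following header.
M  GNU tar multivolume marker, indicating the file is a continuation of a file from the previous volume.
N  GNU tar long filename support. Deprecated.
S  GNU tar sparse regular file.
V  GNU tar tape/volume header name.
X  Solaris tar general-purpose extension header.
g  POSIX pax interchange format global extensions.
x  POSIX pax interchange format per-file extensions.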
The tar utility is no longer a part of POSIX or the Single Unix Standard. It last appeared in Version 2 of the Single UNIX Specification (“SUSv2”). It has been supplanted in subsequent standards by pax(1). The ustar format is currently part of the specification for the pax(1) utility. The pax interchange file format is new with IEEE Std 1003.1-2001 (“POSIX.1”).
A tar command appeared in Seventh Edition Unix, which was released in January 1979. It replaced the tp program from Fourth Edition Unix, which in turn replaced the tap program from First Edition Unix.
John Gilmore's pdtar public-domain implementation (circa 1987) was highly influential and formed the basis of GNU tar (circa 1988). Joerg Schilling's star archiver is another open-source (CDDL) archiver (originally developed circa 1985) which features complete support for pax interchange format.
This documentation was written as part of the libarchive and bsdtar project by Tim Kientzle ⟨kientzle@FreeBSD.org⟩.
December 27, 2016 | Mac OS X 12 |