wget(1)
NAME
wget - a utility to retrieve files from the World Wide Web
SYNOPSIS
wget [options] [URL-list]
WARNING
The information in this man page is an extract from the
full documentation of Wget. It is well out of date.
Please refer to the info page for full, up-to-date docu-
mentation. You can view the info documentation with the
Emacs info subsystem or the standalone info program.
DESCRIPTION
Wget is a utility designed for retrieving binary documents
across the Web, through the use of HTTP (Hyper Text Trans-
fer Protocol) and FTP (File Transfer Protocol), and saving
them to disk. Wget is non-interactive, which means it can
work in the background while the user is not logged in,
unlike most web browsers (thus you may start the program
and log off, letting it do its work). Analysing
server responses, it distinguishes between correctly and
incorrectly retrieved documents, and retries retrieving
them as many times as necessary, or until a user-specified
limit is reached. REST is used in FTP on hosts that sup-
port it. Proxy servers are supported to speed up the
retrieval and lighten network load.
Wget supports a full-featured recursion mechanism, through
which you can retrieve large parts of the web, creating
local copies of remote directory hierarchies. Of course,
the maximum level of recursion and other parameters can be
specified. Infinite recursion loops are always avoided by
hashing the retrieved data. All of this works for both
HTTP and FTP.
The retrieval is conveniently traced by printing dots,
each dot representing one kilobyte of received data.
Builtin features offer mechanisms to tune which links you
wish to follow (cf. -L, -D and -H).
URL CONVENTIONS
Most of the URL conventions described in RFC1738 are sup-
ported. Two alternative syntaxes are also supported, which
means you can use three forms of address to specify a
file:
Normal URL (recommended form):
http://host[:port]/path
http://fly.cc.fer.hr/
ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz
ftp://username:password@host/dir/file
FTP only (ncftp-like): hostname:/dir/file
HTTP only (netscape-like):
hostname(:port)/dir/file
You may encode your username and/or password in the URL
using the form:
ftp://user:password@host/dir/file
If you do not understand these syntaxes, just use the
plain ordinary syntax with which you would call lynx or
netscape. Note that the alternative forms are deprecated,
and may cease being supported in the future.
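For example, based on the forms above, the following two
invocations request the same FTP file, first with the
recommended URL form and then with the deprecated
ncftp-like form:
wget ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz
wget ftp.xemacs.org:/pub/xemacs/xemacs-19.14.tar.gz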
OPTIONS
There are quite a few command-line options for wget. Note
that you do not have to know or to use them unless you
wish to change the default behaviour of the program. For
simple operations you need no options at all. It is also a
good idea to put frequently used command-line options in
.wgetrc, where they can be stored in a more readable form.
This is the complete list of options with descriptions,
sorted in descending order of importance:
-h --help
Print a help screen. You will also get help if you
do not supply command-line arguments.
-V --version
Display version of wget.
-v --verbose
Verbose output, with all the available data. The
default output consists only of saving updates and
error messages. When the output goes to stdout, verbose is
the default.
-q --quiet
Quiet mode, with no output at all.
-d --debug
Debug output; it works only if Wget was compiled
with -DDEBUG. Note that even when the program is
compiled with debug support, debug output is not
printed unless you specify -d.
-i filename --input-file=filename
Read URL-s from filename, in which case no URL-s
need to be on the command line. If there are URL-s
both on the command line and in the input file, those
on the command line are retrieved first. The
filename need not be an HTML document (but no harm
if it is) - it is enough if the URL-s are just
listed sequentially.
However, if you specify --force-html, the document
will be regarded as HTML. In that case you may have
problems with relative links, which you can solve
either by adding <base href="url"> to the document
or by specifying --base=url on the command-line.
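For example (the file names are only illustrative), URLs
can be read from a plain list, or from a local HTML file
with a base reference supplied on the command line:
wget -i url-list.txt
wget -F -B http://fly.cc.fer.hr/ -i index.html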
-o logfile --output-file=logfile
Log messages to logfile, instead of the default stdout.
Verbose output is the default when logging to a file. If
you do not wish it, use -nv (non-verbose).
-a logfile --append-output=logfile
Append to logfile - the same as -o, but appending to
logfile (or creating it if it does not exist) instead
of overwriting the old log file.
-t num --tries=num
Set number of retries to num. Specify 0 for infi-
nite retrying.
--follow-ftp
Follow FTP links from HTML documents.
-c --continue-ftp
Continue retrieval of FTP documents, from where it
was left off. If you specify
"wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z", and there is
already a file named ls-lR.Z in the current directory,
Wget will continue retrieval from the offset equal to the
length of the existing file. Note that you do not
need to specify this option if the only thing you
want is wget to continue retrieving where it left
off when the connection is lost - wget does this by
default. You need this option when you want to con-
tinue retrieval of a file already halfway
retrieved, saved by other FTP software, or left by
wget being killed.
-g on/off --glob=on/off
Turn FTP globbing on or off. By default, globbing
will be turned on if the URL contains globbing
characters (an asterisk, for example). Globbing means you
may use the special characters (wildcards) to
retrieve more files from the same directory at
once, like wget ftp://gnjilux.cc.fer.hr/*.msg.
Globbing currently works only on UNIX FTP servers.
-e command --execute=command
Execute command, as if it were a part of .wgetrc
file. A command invoked this way will take prece-
dence over the same command in .wgetrc, if there is
one.
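For example, to disable proxy use for a single run without
editing .wgetrc (the use_proxy command is described under
STARTUP FILE below), something like this should work:
wget -e "use_proxy = off" http://www.yahoo.com/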
-N --timestamping
Use the so-called time-stamps to determine whether
to retrieve a file. If the last-modification date
of the remote file is equal to or older than that of
the local file, and the sizes of the two files are equal,
the remote file will not be retrieved. This option
is useful for weekly mirroring of HTTP or FTP
sites, since it will not permit downloading of the
same file twice.
-F --force-html
When input is read from a file, force it to be
HTML. This enables you to retrieve relative links
from existing HTML files on your local disk, by
adding <base href> to HTML, or using --base.
-B base_href --base=base_href
Use base_href as base reference, as if it were in
the file, in the form <base href="base_href">. Note
that the base in the file will take precedence over
the one on the command-line.
-r --recursive
Recursive web-suck. According to the protocol of
the URL, this can mean two things. Recursive
retrieval of an HTTP URL means that Wget will download
the URL you want, parse it as an HTML document
(if it is one), and retrieve the files
this document refers to, down to a certain
depth (default 5; change it with -l). Wget will
create a hierarchy of directories locally, corre-
sponding to the one found on the HTTP server.
This option is ideal for presentations, where slow
connections should be bypassed. The results will be
especially good if relative links were used, since
the pages will then work on the new location with-
out change.
When using this option with an FTP URL, Wget will
retrieve all the data from the given directory and
subdirectories, similar to HTTP recursive
retrieval.
You should be warned that invoking this option may
cause grave overloading of your connection. The
load can be minimized by lowering the maximal
recursion level (see -l) and/or by lowering the
number of retries (see -t).
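For instance, a relatively gentle recursive retrieval,
limited to two levels and one try per document, might look
like:
wget -r -l2 -t1 http://fly.cc.fer.hr/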
-m --mirror
Turn on mirroring options. This will set recursion
and time-stamping, combining -r and -N.
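According to the description above, the following two
commands should be equivalent:
wget -m http://fly.cc.fer.hr/
wget -r -N http://fly.cc.fer.hr/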
-l depth --level=depth
Set recursion depth level to the specified level.
Default is 5. After the given recursion level is
reached, the sucking will proceed from the parent.
Thus specifying -r -l1 should be equivalent to a
recursion-less retrieval. Setting the level to zero
makes the recursion depth (theoretically) unlimited.
Note that the number of retrieved documents will
increase exponentially with the depth level.
-H --span-hosts
Enable spanning across hosts when doing recursive
retrieving. See -r and -D. Refer to FOLLOWING LINKS
for a more detailed description.
-L --relative
Follow only relative links. Useful for retrieving a
specific homepage without any distractions, not
even those from the same host. Refer to FOLLOWING
LINKS for a more detailed description.
-D domain-list --domains=domain-list
Set domains to be accepted and DNS looked-up, where
domain-list is a comma-separated list. Note that it
does not turn on -H. This speeds things up, even if
only one host is spanned. Refer to FOLLOWING LINKS
for a more detailed description.
-A acclist / -R rejlist --accept=acclist /
--reject=rejlist
Comma-separated list of extensions to
accept/reject. For example, if you wish to download
only GIFs and JPEGs, you will use -A gif,jpg,jpeg.
If you wish to download everything except cumber-
some MPEGs and .AU files, you will use -R
mpg,mpeg,au.
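For example, to recursively fetch only GIF and JPEG
images, or everything except MPEG and AU files:
wget -r -A gif,jpg,jpeg http://fly.cc.fer.hr/
wget -r -R mpg,mpeg,au http://fly.cc.fer.hr/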
-X list --exclude-directories list
Comma-separated list of directories to exclude from
FTP fetching.
-P prefix --directory-prefix=prefix
Set directory prefix ("." by default) to prefix.
The directory prefix is the directory where all
other files and subdirectories will be saved to.
-T value --timeout=value
Set the read timeout to a specified value. Whenever
a read is issued, the file descriptor is checked
for a possible timeout, which could otherwise leave
a pending connection (uninterrupted read). The
default timeout is 900 seconds (fifteen minutes).
-Y on/off --proxy=on/off
Turn proxy on or off. The proxy is on by default if
the appropriate environment variable is defined.
-Q quota[KM] --quota=quota[KM]
Specify download quota, in bytes (default), kilo-
bytes, or megabytes. This option is more useful in the
rc file; see below.
-O filename --output-document=filename
The documents will not be written to the appropriate
files; instead, all of them will be concatenated into
the single file specified by this option. The number of
tries will be automatically set to 1. If this file-
name is `-', the documents will be written to std-
out, and --quiet will be turned on. Use this option
with caution, since it turns off all the diagnos-
tics Wget can otherwise give about various errors.
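For example, to write a single document to standard output
(with --quiet turned on, as noted above):
wget -O - http://fly.cc.fer.hr/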
-S --server-response
Print the headers sent by the HTTP server and/or
responses sent by the FTP server.
-s --save-headers
Save the headers sent by the HTTP server to the
file, before the actual contents.
--header=additional-header
Define an additional header. You can define more
than one additional header. Do not try to terminate
the header with CR or LF.
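For example (the header values are only illustrative),
several additional headers can be supplied by repeating
the option:
wget --header="Accept-Charset: iso-8859-2" --header="Accept-Language: hr" http://fly.cc.fer.hr/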
--http-user --http-passwd
Use these two options to set username and password
Wget will send to HTTP servers. Wget supports only
the basic WWW authentication scheme.
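For example (the username, password and URL are
placeholders):
wget --http-user=jsmith --http-passwd=secret http://fly.cc.fer.hr/private/index.html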
-nc Do not clobber existing files when saving to direc-
tory hierarchy within recursive retrieval of sev-
eral files. This option is extremely useful when
you wish to continue where you left off with
retrieval. If the files are .html or (yuck) .htm,
they will be loaded from the disk and parsed as if
they had been retrieved from the Web.
-nv Non-verbose - turn off verbose without being com-
pletely quiet (use -q for that), which means that
error messages and basic information still get
printed.
-nd Do not create a hierarchy of directories when
retrieving recursively. With this option turned on,
all files will get saved to the current directory,
without clobbering (if a name shows up more than
once, the filenames will get extensions .n).
-x The opposite of -nd -- Force creation of a hierar-
chy of directories even if it would not have been
done otherwise.
-nh Disable time-consuming DNS lookup of almost all
hosts. Refer to FOLLOWING LINKS for a more detailed
description.
-nH Disable host-prefixed directories. By default,
http://fly.cc.fer.hr/ will produce a directory
named fly.cc.fer.hr in which everything else will
go. This option disables such behaviour.
--no-parent
Do not ascend to parent directory.
-k --convert-links
Convert the non-relative links to relative ones
locally.
FOLLOWING LINKS
Recursive retrieving has a mechanism that allows you to
specify which links wget will follow.
Only relative links
When only relative links are followed (option -L),
recursive retrieving will never span hosts, gethostbyname
will never get called, and the process will be very
fast, with the minimum strain on the network. This
will suit your needs most of the time, especially
when mirroring the output of *2html converters, which
generally produce only relative links.
Host checking
The drawback of following the relative links solely
is that humans often tend to mix them with absolute
links to the very same host, and the very same
page. In this mode (which is the default), all URL-
s that refer to the same host will be retrieved.
The problem with this option is aliases of hosts
and domains. Thus there is no way for Wget to know
that regoc.srce.hr and www.srce.hr are the same
host, or that fly.cc.fer.hr is the same as
fly.cc.etf.hr. Whenever an absolute link is
encountered, gethostbyname is called to check
whether we are really on the same host. Although
the results of gethostbyname are hashed, so that it
will never get called twice for the same host, it
still presents a nuisance, e.g. in large indexes of
different hosts, when each of them has to be
looked up. You can use -nh to prevent such complex
checking, and then Wget will just compare the
hostnames. Things will run much faster, but also
much less reliably.
Domain acceptance
With the -D option you may specify domains that
will be followed. The nice thing about this option
is that hosts that are not from those domains will
not get DNS-looked up. Thus you may specify
-Dmit.edu, just to make sure that nothing outside
.mit.edu gets looked up. This is very important
and useful. It also means that -D does not imply -H
(it must be specified explicitly). Feel free to use
this option, since it will speed things up greatly,
with almost all the reliability of full host checking.
Of course, domain acceptance can be used to limit
the retrieval to particular domains while freely
spanning hosts within those domains, but then you
must explicitly specify -H.
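For example, to retrieve recursively while spanning hosts,
but only within the srce.hr domain (using a host mentioned
above as the starting point):
wget -r -H -Dsrce.hr http://www.srce.hr/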
All hosts
When -H is specified without -D, all hosts are
being spanned. It is useful to set the recursion
level to a small value in those cases. This
combination is rarely useful.
FTP The rules for FTP are somewhat specific, since they
have to be. To have FTP links followed from HTML
documents, you must specify --follow-ftp (follow_ftp).
If you do specify it, FTP links will be able to span hosts
even if span_hosts is not set. Option rela-
tive_only (-L) has no effect on FTP. However,
domain acceptance (-D) and suffix rules (-A/-R)
still apply.
STARTUP FILE
Wget supports the use of an initialization file, .wgetrc.
First a system-wide init file will be looked for
(/usr/local/lib/wgetrc by default) and loaded. Then the
user's file will be searched for in two places: in the
environment variable WGETRC (which is presumed to hold
the full pathname) and $HOME/.wgetrc. Note that the
settings in the user's startup file may override the
system settings, which includes the quota settings (he he).
The syntax of each line of the startup file is simple:
variable = value
Valid values are different for different variables. The
complete set of commands is listed below, the word after
the equals sign denoting the type of value the command
takes: on/off for on or off (which can also be 1 or 0),
string for any string, or N for a positive integer. For
example, you may specify "use_proxy = off" to disable the
use of proxy servers by default. You may use inf for an
infinite value (the role of 0 on the command line), where
appropriate.
The commands are case-insensitive and underscore-insensi-
tive, thus DIr__Prefix is the same as dirprefix. Empty
lines, lines consisting of spaces, or lines beginning with
'#' are skipped.
Most of the commands have their equivalent command-line
option, except some more obscure or rarely used ones. A
sample init file is provided in the distribution, named
sample.wgetrc.
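A minimal .wgetrc might look like the following (the
values are only illustrative; all of the commands used are
described below):
# Turn off proxy support by default.
use_proxy = off
# Retry each URL up to five times.
num_tries = 5
# Limit recursive retrievals to three levels.
reclevel = 3
# Stop after 5 megabytes have been downloaded.
quota = 5m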
accept/reject = string
Same as -A/-R.
add_hostdir = on/off
Enable/disable host-prefixed directories. -nH
disables it.
always_rest = on/off
Enable/disable continuation of the retrieval, the
same as -c.
base = string
Set base for relative URL-s, the same as -B.
convert_links = on/off
Convert non-relative links locally. The same as -k.
debug = on/off
Debug mode, same as -d.
dir_mode = N
Set permission modes of created subdirectories
(default is 755).
dir_prefix = string
Top of directory tree, the same as -P.
dirstruct = on/off
Turning dirstruct on or off, the same as -x or -nd,
respectively.
domains = string
Same as -D.
follow_ftp = on/off
Follow FTP links from HTML documents, the same as
--follow-ftp.
force_html = on/off
If set to on, force the input filename to be
regarded as an HTML document, the same as -F.
ftp_proxy = string
Use the string as FTP proxy, instead of the one
specified in environment.
glob = on/off
Turn globbing on/off, the same as -g.
header = string
Define an additional header, like --header.
http_passwd = string
Set HTTP password.
http_proxy = string
Use the string as HTTP proxy, instead of the one
specified in environment.
http_user = string
Set HTTP user.
input = string
Read the URL-s from filename, like -i.
kill_longer = on/off
Consider data longer than specified in content-
length header as invalid (and retry getting it).
The default behaviour is to save as much data as
there is, provided it is at least as much as the
value in the content-length header.
logfile = string
Set logfile, the same as -o.
login = string
Your user name on the remote machine, for FTP.
Defaults to "anonymous".
mirror = on/off
Turn mirroring on/off. The same as -m.
noclobber = on/off
Same as -nc.
no_parent = on/off
Same as --no-parent.
no_proxy = string
Use the string as the comma-separated list of
domains to avoid in proxy loading, instead of the
one specified in environment.
num_tries = N
Set number of retries per URL, the same as -t.
output_document = string
Set the output filename, the same as -O.
passwd = string
Your password on the remote machine, for FTP.
Defaults to username@hostname.domainname.
quiet = on/off
Quiet mode, the same as -q.
quota = quota
Specify the download quota, which is useful to put
in /usr/local/lib/wgetrc. When download quota is
specified, wget will stop retrieving after the
download sum has become greater than quota. The
quota can be specified in bytes (default), kbytes
('k' appended) or mbytes ('m' appended). Thus
"quota = 5m" will set the quota to 5 mbytes. Note
that the user's startup file overrides system set-
tings.
reclevel = N
Recursion level, the same as -l.
recursive = on/off
Recursive on/off, the same as -r.
relative_only = on/off
Follow only relative links (the same as -L). Refer
to section FOLLOWING LINKS for a more detailed
description.
robots = on/off
Use (or not) robots.txt file.
server_response = on/off
Choose whether or not to print the HTTP and FTP
server responses, the same as -S.
simple_host_check = on/off
Same as -nh.
span_hosts = on/off
Same as -H.
timeout = N
Set timeout value, the same as -T.
timestamping = on/off
Turn timestamping on/off. The same as -N.
use_proxy = on/off
Turn proxy support on/off. The same as -Y.
verbose = on/off
Turn verbose on/off, the same as -v/-nv.
SIGNALS
Wget will catch the SIGHUP (hangup signal) and ignore it.
If the output was on stdout, it will be redirected to a
file named wget-log. This is also convenient when you wish
to redirect the output of Wget interactively.
$ wget http://www.ifi.uio.no/~larsi/gnus.tar.gz &
$ kill -HUP %% # to redirect the output
Wget will not try to handle any signals other than SIGHUP.
Thus you may interrupt Wget using ^C or SIGTERM.
EXAMPLES
Get URL http://fly.cc.fer.hr/:
wget http://fly.cc.fer.hr/
Force non-verbose output:
wget -nv http://fly.cc.fer.hr/
Set an unlimited number of retries:
wget -t0 http://www.yahoo.com/
Create a mirror image of fly's web (with the same directory structure
the original has), up to six recursion levels, with only one try per
document, saving the verbose output to log file 'log':
wget -r -l6 -t1 -o log http://fly.cc.fer.hr/
Retrieve from yahoo host only (depth 50):
wget -r -l50 http://www.yahoo.com/
ENVIRONMENT
http_proxy, ftp_proxy, no_proxy, WGETRC, HOME
FILES
/usr/local/lib/wgetrc, $HOME/.wgetrc
UNRESTRICTIONS
Wget is free; anyone may redistribute copies of Wget to
anyone under the terms stated in the General Public
License, a copy of which accompanies each copy of Wget.
SEE ALSO
lynx(1), ftp(1)
AUTHOR
Hrvoje Niksic <hniksic@srce.hr> is the author of Wget.
Thanks to the beta testers and all the other people who
helped with useful suggestions.