Mirroring your website to your local PC with Wget


There is nothing worse for a site owner than having the site hacked with no backup to restore from. Many people rely on their hosting provider's backup feature or, if that is unavailable, make a copy themselves on a regular basis. Unfortunately, ‘regular’ can mean weeks or months, depending on how seriously the site owner or webmaster takes security. People are not entirely to blame, however; for most, data backup is a chore that simply needs to get done, much like flossing after a good meal.

We can, however, ease the backup process by automating it (tough luck with the flossing, though). One of my favorite tools for this is Wget, a command-line program for retrieving web content on *nix systems, also available for Windows.

A simple way to mirror your complete site with Wget is the following command. You will need to supply your website's FTP username and password.

wget --mirror ftp://user:pass@yourwebsite.com

This copies the complete site to your local machine over FTP, preserving the site's directory structure. Because --mirror turns on time-stamping, later runs download only files that are new or have changed on the server. This makes Wget easy to run as a cron job, so that your local copy is refreshed on a regular schedule.
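
As a rough sketch of the cron setup mentioned above, the script below wraps the mirror command in a small function. The script name, the backup directory, the log file and the schedule are all assumptions made for illustration; --directory-prefix and --output-file are standard Wget options. The sketch defaults to a dry run that only prints the command, so nothing is downloaded until DRY_RUN=0 is set.

```shell
#!/bin/sh
# mirror-site.sh: hypothetical wrapper script meant to be called from cron.
# Every value below is a placeholder; adjust them to your own setup.
SITE_URL="${SITE_URL:-ftp://user:pass@yourwebsite.com}"
BACKUP_DIR="${BACKUP_DIR:-$HOME/site-backup}"   # where the mirror is stored
LOG_FILE="${LOG_FILE:-$HOME/site-backup.log}"   # where Wget's messages go

mirror_site() {
    # Build the command as an argument list so values with spaces stay intact.
    set -- wget --mirror \
        --directory-prefix="$BACKUP_DIR" \
        --output-file="$LOG_FILE" \
        "$SITE_URL"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"    # dry run (the default here): only print the command
    else
        "$@"         # real run: start the download
    fi
}

mirror_site

# Example crontab entry (added with "crontab -e") for a nightly run at 02:30.
# The script path is an assumption; DRY_RUN=0 enables the real download.
# 30 2 * * * DRY_RUN=0 /bin/sh /home/you/bin/mirror-site.sh
```

Since such a script contains the FTP password, it is worth keeping it readable only by your own user account.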

Note that every short Wget option also has a long form. Long options are easier to remember but take longer to type. You may freely mix the two styles, and you may also place options after the command-line arguments. A long option name starts with two hyphens, while a short option name starts with a single hyphen.

Here is the above command with the short option name:

wget -m ftp://user:pass@yourwebsite.com

However, for large sites this can take a long time, or you may be interested only in one important directory that changes regularly. To mirror a particular directory, use the --no-parent option (short name -np) along with the directory path. This option guarantees that Wget never ascends to the parent directory, so only files below the given directory are downloaded. For example, if you need to regularly mirror your WordPress ‘uploads’ folder to your local PC, the following Wget command will do the trick.

wget --mirror --no-parent ftp://user:pass@yourwebsite.com/wp-content/uploads

Often you may be interested in backing up only certain types of files, for which we can use the --accept (-A) option. For example, to download only ‘png’ files we can use the following command. Note that we have used the short option names to keep the command short; each is preceded by a single hyphen.

wget -m -np -A png ftp://user:pass@yourwebsite.com/wp-content/uploads
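
The --accept option also takes a comma-separated list, so several file types can be backed up in one run. Here is a minimal sketch, assuming png, jpg and gif are the types you care about. The accept_list helper is hypothetical and not part of Wget, and the script only prints the resulting command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Join the given file extensions into the comma-separated list
# that Wget's -A / --accept option expects.
accept_list() {
    # "$*" joins the arguments using the first character of IFS;
    # the subshell keeps the IFS change from leaking out.
    ( IFS=,; echo "$*" )
}

# Hypothetical set of extensions; the URL is the placeholder used above.
ACCEPT="$(accept_list png jpg gif)"

# Print the command instead of running it.
echo "wget -m -np -A $ACCEPT ftp://user:pass@yourwebsite.com/wp-content/uploads"
```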

One important note regarding the username and password: if either contains a ‘#’ character, you will have to write it as the percent-encoded string ‘%23’ instead, because ‘#’ is a reserved character in URLs (it introduces a fragment). For example, if your FTP password is mypass#gh647, you will need to specify mypass%23gh647.
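
The substitution described above can be automated. The following is a small sketch of a percent-encoding helper in plain POSIX shell; the urlencode function and the FTP_USER and FTP_PASS variables are made up for this example, and the helper is meant for ASCII input only.

```shell
#!/bin/sh
# Percent-encode a string for use in the user:pass part of a URL.
# Letters, digits and "-", "_", ".", "~" are kept; every other
# character becomes %XX. Intended for ASCII input only.
# The body runs in a subshell, so its variables stay local.
urlencode() (
    input=$1
    output=""
    while [ -n "$input" ]; do
        rest=${input#?}            # everything after the first character
        char=${input%"$rest"}      # the first character itself
        case $char in
            [A-Za-z0-9._~-]) output="$output$char" ;;
            *) output="$output$(printf '%%%02X' "'$char")" ;;
        esac
        input=$rest
    done
    printf '%s\n' "$output"
)

# Hypothetical credentials, using the example password from the text.
FTP_USER='user'
FTP_PASS='mypass#gh647'

# Prints: ftp://user:mypass%23gh647@yourwebsite.com
echo "ftp://$(urlencode "$FTP_USER"):$(urlencode "$FTP_PASS")@yourwebsite.com"
```

Wget also has --ftp-user and --ftp-password options, which take the credentials outside the URL and are an alternative worth considering when a password contains awkward characters.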

Wget is a comprehensive tool with many other options. As our primary goal in this post was to mirror a site, we looked at only a few of them. Check the Wget manual for the full list.

 

 

 


This site is a digital habitat of Sameer Borate, a freelance web developer working in PHP, MySQL and WordPress. I also provide web scraping services, website design and development, and integration of various open source APIs. Contact me at metapix[at]gmail.com for any new project requirements and price quotes.
