Spiderman with wget


Through many different needs, I’ve learned many different things. Occasionally I need to mirror a site. I’ve downloaded and tried many programs, utilities and other little gadgets to pull down a site, only to find that it then takes some work to get the files into a usable form to place on a web server somewhere. One of the best books I’ve purchased is ‘Spidering Hacks’ from O’Reilly. It covers a range of techniques for fetching data from other places. There are some really cool scripts in there, and my favorite part is ‘Hack the hack’, where you’re encouraged to expand on what you just learned. Although you can jump into the book at any point, many of the later hacks build on and reference earlier ones.

Hack #27 [ More Advanced wget Techniques ] builds on Hack #26 [ Downloading with curl and wget ], and together they outline the power of this little command-line utility. If you want a pretty GUI, this isn’t for you. If you need the power and elegance of a Unix(-like) command, then this is your tool.

Windows users can download GNU wget from the official site. Once it’s installed, add it to your path and you’re good to go. Here’s the ‘snag a site’ command syntax, which will spider through an entire site and mirror it on your system:
wget --mirror --accept=html,css,js,jpg,gif http://www.somesite.com

Obviously, if you want to include .htm or other file types (.mp3, .avi, etc.), you’ll have to add them to the --accept list.
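If the extension list keeps changing, a tiny shell wrapper can save some retyping. This is only a rough sketch: build_mirror_cmd is a made-up helper name (it is not from the book or from wget itself), and it just prints the command so you can check it before running anything.

```shell
#!/bin/sh
# Hypothetical helper (my own name, not part of wget): takes a URL followed
# by any number of file extensions, joins the extensions with commas, and
# prints the resulting wget command instead of running it.
build_mirror_cmd() {
    url=$1
    shift
    # Inside the subshell, "$*" joins the remaining arguments using the
    # first character of IFS, which is set to a comma here.
    accept=$(IFS=,; printf '%s' "$*")
    printf 'wget --mirror --accept=%s %s\n' "$accept" "$url"
}

# Example: the original list plus .htm, .mp3 and .avi
build_mirror_cmd http://www.somesite.com html htm css js jpg gif mp3 avi
```

Running it prints wget --mirror --accept=html,htm,css,js,jpg,gif,mp3,avi http://www.somesite.com, which you can then paste or pipe to sh once it looks right.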

There is a mess of other options; the documentation lists them all, so have fun and mirror away.
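To give a taste of those options, here is one combination that may help when the goal is a copy you can browse locally or drop onto another web server. Treat it as a sketch under my own assumptions, not the book's recipe: the choice of flags and the one-second wait are mine, and the RUN variable is just a hypothetical safety switch so the script prints the command unless you ask it to run.

```shell
#!/bin/sh
# Dry run by default: collect the arguments, print the command, and only
# call wget when RUN=1 is set in the environment (RUN is a made-up switch).
#
# --mirror           recursive download with timestamping, infinite depth
# --convert-links    rewrite links in the saved pages for local viewing
# --page-requisites  also fetch the images, CSS, etc. a page needs
# --no-parent        never climb above the starting directory
# --wait=1           pause one second between requests (assumed value)
set -- --mirror \
       --convert-links \
       --page-requisites \
       --no-parent \
       --wait=1 \
       http://www.somesite.com

echo "wget $*"

if [ "${RUN:-0}" = 1 ]; then
    wget "$@"
fi
```

Save it under any name you like and run it with RUN=1 in front of the command when you are ready to actually hit the site.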
