Tue Dec 20 05:20:44 GMT 2005

wget



I wanted to grab every mp3 episode from a podcast site, and the answer is wget. It can download a single file, a list of specified files, or a recursive chain of files. For example, the following command will download an entire site, following all links as long as they stay within the same domain.


wget -r http://www.jlamp.com/

This command does the same but also pulls in referenced CSS, inline images, and other page requisites.


wget -p -r http://www.jlamp.com/
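
(As an aside, a variant I didn't need here, so treat it as a sketch: wget also has -m, which turns on its usual mirroring options, and -k, which rewrites links so the downloaded copy browses correctly offline. Combined with -p it produces a locally viewable mirror.)


wget -m -p -k http://www.jlamp.com/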

In this case, I wanted only the files with the "mp3" extension, skipping everything else. My first thought was to use the -A option to "accept", and therefore download, only mp3 files.


wget -p -r -A mp3 http://www.escapepod.org/

The problem, though, is that Escape Pod, like many podcast sites, has its actual mp3 files hosted by a third party to reduce bandwidth costs. I could have let the recursive download cross domains, but that seemed a bit dangerous.
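
(If you do want to let wget leave the starting domain, it can be kept on a leash: -H allows spanning hosts and -D restricts the crawl to a comma-separated list of domains. A sketch only; the second domain below is a placeholder for wherever the files actually live, not something I verified for Escape Pod.)


wget -r -H -D escapepod.org,media.example.com -A mp3 http://www.escapepod.org/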

In the end, I ran three commands. The first downloads the html files for the entire site. The second scans that html for full URLs and uses sed to filter out everything else (if I knew sed better, this could probably be a shorter command). Note the use of find to walk every file in the tree, egrep to keep only lines containing actual URLs, sed to strip the irrelevant parts of each line, and sort/uniq to remove duplicates.

Finally, the third command uses wget to download every file found by the previous step. (Note: the trailing \ continues the command onto the next line.)


wget -r -A htm,html http://www.escapepod.org/

cat `find . -name \*htm\* -print` | egrep "http.*mp3" | \
sed "s/.*\(http:\/\/.*mp3\).*$/\1/" | sort | uniq > files.txt

wget -i files.txt
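
(A possible shortcut, untested: if your grep supports -o, as GNU grep does, it can print just the matching URLs and replace the sed step entirely.)


cat `find . -name \*htm\* -print` | grep -o "http://[^\"' ]*\.mp3" | sort -u > files.txt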


