How to scrape an entire website onto your hard drive
I got a new hard drive recently and decided to dump in some of those shady websites I know and love, for the purposes of building a sort of database that can be indexed and mined for connections.
The easiest way to scrape a website is the wget command in the Terminal:
wget -m --tries=5 "http://www.debka.com"
which will scan through DEBKA's site and drop all the files into your current directory, inside another directory named www.debka.com. wget digs the links out of every page it fetches and follows them, cloning every file it can reach.
The --tries=5 option caps retries at five per file, so wget won't get stuck on some jammed file. The -m option turns on mirroring. wget is also helpful for getting files from places that download badly; the -c option resumes a partial download where it left off:
wget -c http://www.whatever.com/fatfile.zip
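The two invocations above can be wrapped into small shell functions so you don't have to retype the flags. This is only a rough sketch: the function names mirror_site and resume_download are made up for illustration, and the --wait=1 option is my own addition (not in the original command) to pause a second between requests.

```shell
# Hypothetical helpers; the names are illustrative, not part of wget.

# Mirror a whole site into ./<hostname>/, retrying each file up to 5 times.
# --wait=1 is an assumption added here: it pauses one second between
# requests so the server is not hammered.
mirror_site() {
    wget -m --tries=5 --wait=1 "$1"
}

# Resume a partially downloaded file instead of starting over.
resume_download() {
    wget -c "$1"
}
```

After pasting these into a shell, mirror_site "http://www.debka.com" would behave like the first command, just slower and gentler on the server.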
wget can also spoof which browser it reports itself as to the server. Just use
wget --user-agent="Mozilla 4.0" -m http://whatever.com
when you need to pretend to be another kind of browser, because some servers try to block wget.
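A picky server may not be fooled by a short string, since real browsers send a longer one. Here is a sketch of the same trick with a fuller user-agent; the UA value below is just an example of the format old Internet Explorer versions sent, and the function name mirror_as_browser is hypothetical.

```shell
# Example browser identity; this particular value is an assumption,
# any plausible browser string can be substituted.
UA="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

# Hypothetical helper: mirror a site while reporting $UA as the browser.
mirror_as_browser() {
    wget --user-agent="$UA" -m --tries=5 "$1"
}
```

Call it as mirror_as_browser "http://whatever.com" and change UA if the server still refuses.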
DEBKAfile is a site run by crusty Israeli intelligence officers and their scheming friends. There are lots of stories about various figures in the mideast world. I wouldn't believe everything on a site like that, but it is useful to know what crusty Israeli intel types want to publicize about Mugniyeh and the other kingpins they like to prattle on about.
If you are interested in an organization or individual that might get discredited or scandalized, the information published on the site might be short-lived, so it is good to get dumps or scrapes of sites before they get scrubbed by Public Relations flacks, like how all the congressmen deleted their pictures with Mark Foley.
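Since the point is to catch pages before they change, one option is to keep each scrape in its own dated folder so a later run never overwrites an earlier copy. A minimal sketch, assuming a snapshots/ directory under wherever you run it; the function name snapshot_site and the folder layout are made up for illustration.

```shell
# Hypothetical helper: mirror a site into a folder named after today's date
# (snapshots/YYYY-MM-DD/<hostname>/), so older copies are left untouched.
snapshot_site() {
    dest="snapshots/$(date +%Y-%m-%d)"
    mkdir -p "$dest"
    # -P tells wget which directory to save the mirrored files under.
    wget -m --tries=5 -P "$dest" "$1"
}
```

Running snapshot_site "http://www.debka.com" on different days leaves one copy per day, which can then be compared to see what got scrubbed.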
So I am going to yank down some choice bits of the internet and see if the data-mining software I've got spits out anything interesting. Solid. It's all public stuff, though, just snooping through like any other search engine. I'm just cloning some sites to make my own little haystack and see if there are any needles.



