Felk
9 months ago
Funny seeing this here now, as I _just_ finished archiving an old MyBB PHP forum. I used `wget`, though, and it took 2 weeks and 260GB of uncompressed disk space (12GB compressed with zstd); the process wasn't interruptible, so I had to start over each time my hard drive got full. Maybe I should have given HTTrack a shot to see how it compares.
If anyone wants to know the specifics of how I used wget, I wrote it down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive
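For context, the general shape of such a mirror run is something like this (a generic sketch with a placeholder domain, not the exact command from the repo):

```
# recursive mirror that keeps page requisites and rewrites links for
# offline browsing; the rate limiting is just to be polite to the server
wget --mirror \
     --page-requisites \
     --adjust-extension \
     --convert-links \
     --wait=1 --random-wait \
     -e robots=off \
     https://forum.example.com/
```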
Also, if anyone has experience archiving similar websites with HTTrack and maybe knows how it compares to wget for my use case, I'd love to hear about it!
smashed
9 months ago
I've tried both in order to archive EOL websites and I've had better luck with wget; it seems to recognize more links/resources and does a better job, so it was probably not a bad choice.
fmajid
9 months ago
Conversely, httrack was the only tool that could archive the JS-heavy microsite my realtor made to sell our old house. The command-line interface is horrendous, but it does handle rewriting complex sites better than wget does.
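For anyone who hasn't touched it, a basic invocation looks roughly like this (placeholder URL and filter; the pain starts once you need its dozens of extra options):

```
# mirror a site into a local directory, following only links on the same domain
httrack "https://example.com/microsite/" \
        -O "./microsite-archive" \
        "+*.example.com/*" \
        -v
```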
codetrotter
9 months ago
> it took 2 weeks and 260GB of uncompressed disk space
Is most of that data because there are like a zillion different views and sortings of the same posts? That’s been the main difficulty for me when wanting to crawl some sites. There’s like an infinite number of permutations of URLs with different parameters, because every page has a bunch of different links with auto-generated URL parameters for various things, which results in retrieving the same data over and over again throughout an attempted crawl. And sometimes URL parameters are needed and sometimes not, so it’s not like you can just strip all URL parameters either.
So then you start adding things to your crawler, like starting with the shortest URLs first, and then maybe making it so that whenever you pick the next URL to visit, it takes the one that is most different from what you’ve seen so far. And after that you start adding super specific rules for different paths of a specific site.
Felk
9 months ago
The slowdown wasn't due to a lot of permutations, but mostly because a) wget just takes a considerable amount of time to process large HTML files with lots of links, and b) MyBB has a "threaded mode", where each post of a thread gets a dedicated page with links to all other posts of that thread. The largest thread had around 16k posts, so that's 16k pages with roughly 16k links each, i.e. about 16k² (256 million) links to parse for that thread alone.
In terms of possible permutations, MyBB is pretty tame thankfully. Only the forums are sortable, and posts only have the regular view and the aforementioned threaded mode. Even the calendar widget only goes from 1901 to 2030, otherwise wget might have crawled forever.
I originally considered excluding threaded mode using wget's `--reject-regex` and then just adding an nginx rule later to redirect any incoming links of that kind to the normal view mode. Basically just saying "fuck it, you only get this version". That might be worth a try for your case.
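Something along these lines (the `mode=threaded` parameter is an assumption about how MyBB marks those links, so verify against the real URLs first):

```
# crawl side: skip threaded-mode views entirely
wget --mirror --page-requisites --adjust-extension --convert-links \
     --reject-regex 'mode=threaded' \
     https://forum.example.com/

# archive side: nginx snippet (inside the server {} block) to send old
# threaded-mode deep links to the linear view of the same thread
#
#   if ($arg_mode = "threaded") {
#       return 301 /showthread.php?tid=$arg_tid;
#   }
```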
criddell
9 months ago
Is there a friendly way to do this? I'd feel bad burning through hundreds of gigabytes of bandwidth for a non-corporate site. Would a database snapshot be as useful?
z33k
9 months ago
MyBB PHP forums have a web interface through which one can download the database as a single .sql file. It will most likely be a mess, depending on the addons that were installed on the forum.
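If you go that route, loading the dump into a local MySQL/MariaDB for inspection is the usual routine (database and file names here are placeholders):

```
# create an empty database and import the dump into it
mysql -u root -p -e 'CREATE DATABASE mybb_archive'
mysql -u root -p mybb_archive < forum_backup.sql
```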
Felk
9 months ago
Downloading a DB dump and crawling locally is possible, but it had two gnarly showstoppers for me with wget: first, the forum's posts often link to other posts, and those links are absolute; getting wget to crawl those links through localhost is hardly easy (a local reverse proxy with content rewriting?). Second, the forum and its server were really unmaintained; I didn't want to spend a lot of time replicating it locally and preferred to archive it as-is while it was still barely running.
dbtablesorrows
9 months ago
If you want to customize the scraping, there's the Scrapy Python framework. You would still need to download the HTML, though.
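For example, a single standalone spider file can be run without setting up a full Scrapy project (hypothetical file name; the `scrapy.Spider` subclass inside it is yours to write):

```
# install scrapy, run the spider standalone and dump the scraped items to a file
pip install scrapy
scrapy runspider forum_spider.py -o threads.json
```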
squigz
9 months ago
Isn't bandwidth mostly dirt cheap/free these days?
begrid
9 months ago
wget2 has an option for parallel downloading: https://github.com/rockdaboot/wget2
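Something like this (flag name as documented for wget2; defaults and exact behaviour may differ between versions):

```
# recursive download using several parallel connections
wget2 --recursive --max-threads=5 https://forum.example.com/
```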