HTTrack Website Copier

71 points, posted 13 hours ago
by iscream26

15 Comments

Felk

9 hours ago

Funny seeing this here now, as I _just_ finished archiving an old MyBB PHP forum. Though I used `wget` and it took 2 weeks and 260GB of uncompressed disk space (12GB compressed with zstd), and the process was not interruptible and I had to start over each time my hard drive got full. Maybe I should have given HTTrack a shot to see how it compares.

If anyone wants to know the specifics of how I used wget, I wrote it down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive
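For a rough idea, a generic wget mirroring run of that sort looks something like this (a sketch only, not the exact command from the repo above; the forum URL, user agent, and option values are placeholders):

    # Recursive mirror with local link rewriting and polite pacing.
    wget --mirror --page-requisites --adjust-extension --convert-links \
         --no-parent --wait=1 --random-wait \
         --user-agent="forum-archiver (me@example.com)" \
         https://forum.example.com/

    # --mirror turns on timestamping (-N), so a re-run after an interruption
    # should skip most files that are already present and unchanged rather
    # than refetching everything.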

Also, if anyone has experience archiving similar websites with HTTrack and maybe knows how it compares to wget for my use case, I'd love to hear about it!

smashed

8 hours ago

I've tried both to archive EOL websites, and I've had better luck with wget: it seems to recognize more links/resources and do a better job, so wget was probably not a bad choice.

xnx

11 hours ago

Great tool. Does it still work for the "modern" web (i.e. now that even simple/content websites have become "apps")?

alganet

10 hours ago

Nope. It is for the classic web (the only websites worth saving anyway).

freedomben

9 hours ago

Even for the classic web, if a site is behind Cloudflare, HTTrack no longer works.

It's a sad point to be at. Fortunately, the SingleFile extension still works really well for single pages, even when they are built dynamically by JavaScript on the client side. There isn't a solution for cloning an entire site, though, at least not one that I know of.

alganet

8 hours ago

If it is Cloudflare human verification, then httrack will have an issue. But in the end it's just a cookie: you can use a browser with JS to grab the cookie, then feed it to httrack headers.

If Cloudflare's DDoS protection is the issue, you can throttle httrack's requests.
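A rough sketch of that workflow, assuming HTTrack's documented cookies.txt support and its -F (user agent), -c (simultaneous connections) and -%c (requests per second) options; the domain, paths and UA string are placeholders:

    # 1. Pass the Cloudflare check in a normal browser, then export the
    #    site's cookies in Netscape cookies.txt format (a browser extension
    #    can do this).
    # 2. Put the file in the HTTrack project directory before starting:
    mkdir -p mirrors/example.com
    cp ~/Downloads/cookies.txt mirrors/example.com/cookies.txt
    # 3. Crawl with the same User-Agent the browser used, throttled so the
    #    DDoS protection isn't tripped:
    httrack "https://example.com/" -O mirrors/example.com \
        -F "Mozilla/5.0 (X11; Linux x86_64) ..." \
        -c2 -%c1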

acheong08

6 hours ago

> you can use a browser with JS to grab the cookie, then feed it to httrack headers

They also check your user agent, IP, and JA3 fingerprint (and ensure they match the ones that got the cookie), so it's not as simple as copying some cookies. This might just be for paying customers, though, since it doesn't do such heavy checks for some sites.

corinroyal

10 hours ago

One time I was trying to create an offline backup of a botanical medicine site for my studies. Somehow I turned off the link-depth limit and made it follow offsite links. I forgot about it. A few days later the machine crashed due to a full disk, from trying to cram as much of the WWW onto it as it could.
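For anyone wanting to avoid the same mistake, these are roughly the HTTrack options that keep a crawl contained (option names as in the HTTrack manual; the URL and depth value are illustrative placeholders):

    # --depth limits how many links deep the mirror goes;
    # --ext-depth=0 stops it from following links to other sites at all.
    httrack "https://botany.example.org/" -O mirrors/botany \
        --depth=5 --ext-depth=0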

oriettaxx

8 hours ago

I don't get it: the last release is from 2017, while on GitHub I see more recent releases...

So did the developer of the GitHub repo take over and keep updating/upgrading it? Very good!

subzero06

4 hours ago

I use this to double-check which of my web app's folders/files are publicly accessible.
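Something along these lines would be a minimal version of that check (a sketch only, not necessarily the exact workflow; the hostname and paths are placeholders): crawl the app as an anonymous client, then list what ended up in the mirror.

    httrack "https://myapp.example.com/" -O /tmp/audit --depth=3
    # Files under the mirrored host's folder were publicly fetchable:
    find /tmp/audit/myapp.example.com -type f | sort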

woutervddn

3 hours ago

Also known as: static site generator for any original website platform...

dark-star

11 hours ago

Oh wow, that brings back memories. I used httrack in the late '90s and early 2000s to mirror interesting websites from the early internet, over a modem connection (and early DSL).

Good to know it's still around. However, now that the web is much more dynamic, I guess it's not as useful as it was back then.

dspillett

8 hours ago

> now that the web is much more dynamic I guess it's not as useful anymore as it was back then

It's also less useful because the web is now so easy to access. I remember using it back then to draw things down over the university link for reference in my room (1st year, no network access in rooms at all) or at the house (on per-minute-billed modem access).

Of course sites can still vanish easily these days, so having a local copy can be a bonus, but they're just as likely to go out of date or get replaced, and if not, they're usually archived elsewhere already.