hedora
7 hours ago
It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.
I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.
The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5-year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.
toomuchtodo
7 hours ago
Pick the items you want to mirror and seed them via their torrent file.
https://help.archive.org/help/archive-bittorrents/
https://github.com/jjjake/internetarchive
https://archive.org/services/docs/api/internetarchive/cli.ht...
u/stavros wrote a design doc for a system (codename "Elephant") that would scale this up: https://news.ycombinator.com/item?id=45559219
(no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to and be kind to their infrastructure)
billyhoffman
5 hours ago
There are real problems with the torrent files for collections. They are automatically created when a collection is first created and uploaded, so they only include the files of the initial upload. For very large collections (100+ GB) it is common for a creator to add files to a collection in batches, but the torrent file is never regenerated, so downloading via the torrent yields only a small subset of the entire collection.
https://www.reddit.com/r/torrents/comments/vc0v08/question_a...
The solution is to use one of the several IA downloader scripts on GitHub, which download content via the collection's file list. I don't like downloading directly since I know that costs IA more, but torrents really aren't an option for some collections.
Turns out, there are a lot of 500GB-2TB collections of ROMs/ISOs for video game consoles through the 7th and 8th generation available on the IA...
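The file-list approach mentioned above can be sketched roughly as follows. This assumes the file list comes from the public metadata endpoint (https://archive.org/metadata/<identifier>) and that files are served under https://archive.org/download/<identifier>/<name>. The function names and the example identifier are hypothetical, and this is not a description of how any particular downloader script on GitHub works.

```python
import json
import urllib.parse
import urllib.request

METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"


def fetch_metadata(identifier):
    """Fetch an item's metadata JSON (network call; be kind to IA's servers)."""
    url = METADATA_URL.format(identifier=urllib.parse.quote(identifier))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def download_urls(identifier, metadata, suffixes=None):
    """Build one download URL per file listed in the metadata.

    `suffixes` optionally restricts the result to names ending in one of
    the given extensions (case-insensitive), e.g. (".iso", ".zip").
    """
    wanted = tuple(s.lower() for s in suffixes) if suffixes else None
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name")
        if not name:
            continue  # skip malformed entries without a file name
        if wanted and not name.lower().endswith(wanted):
            continue
        urls.append(
            DOWNLOAD_URL.format(
                identifier=urllib.parse.quote(identifier),
                # quote() leaves "/" alone, so files in subfolders keep their path
                name=urllib.parse.quote(name),
            )
        )
    return urls
```

Usage would look like `download_urls("example-item", fetch_metadata("example-item"), suffixes=(".iso",))`, with the resulting URLs handed to whatever downloader you prefer, ideally one that throttles and resumes.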
Wowfunhappy
5 hours ago
Is this something the Internet Archive could fix? I would have expected the torrent to get replaced when an upload is changed, maybe with some kind of 24 hour debounce.
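For illustration, a debounce of this kind could look like the sketch below. It is a generic pattern, not anything the Internet Archive is known to run: the class name, the polling design, and the injectable clock are all assumptions made for the example.

```python
import time


class DebouncedTask:
    """Run `action` only after `quiet_seconds` pass with no new changes."""

    def __init__(self, action, quiet_seconds=24 * 60 * 60, clock=time.monotonic):
        self.action = action
        self.quiet_seconds = quiet_seconds
        self.clock = clock
        self._last_change = None  # None means nothing is pending

    def notify_change(self):
        """Record a change (e.g. a new upload) and restart the quiet period."""
        self._last_change = self.clock()

    def poll(self):
        """Call periodically; runs the action once the quiet period has elapsed.

        Returns True if the action ran on this call, False otherwise.
        """
        if self._last_change is None:
            return False
        if self.clock() - self._last_change < self.quiet_seconds:
            return False
        self._last_change = None
        self.action()
        return True
```

Each upload calls `notify_change()`, and a periodic job calls `poll()`. A batch of uploads spread over several hours then triggers a single regeneration, 24 hours after the last one.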
rincebrain
4 hours ago
"They're working on it." [1]
It sounds like they put a mechanism in place that stopped regenerating large torrents incrementally after it caused massive slowdowns for them. They haven't finished building something to fix it automatically, but for now they will fix individual ones on demand.
[1] - https://www.reddit.com/r/theinternetarchive/comments/1ij8go9...
textfiles
3 hours ago
It is on my desk to fix this soon.
nodja
6 hours ago
It's insane to me that in 2008 a bunch of pervs decentralized storage and made hentai@home to host hentai comics. Yet here we are almost 20 years later and we haven't generalized this solution. Yes I'm aware of the privacy issues h@h has (as a hoster you're exposing your real IP and people reading comics are exposing their IP to you) but those can be solved with tunnels, the real value is the redundant storage.
qingcharles
6 hours ago
The fact that AI companies are strip-mining IA for content and not helping to be part of the solution is egregious.
philipkglass
7 hours ago
I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.
[1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get an HTTP 5xx error or "connection refused."
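When a tool fails with intermittent 5xx errors or refused connections, one common workaround is to fetch snapshots directly and retry with exponential backoff. The sketch below assumes the usual Wayback URL form https://web.archive.org/web/<timestamp>id_/<url>, where the `id_` flag requests the original bytes without the toolbar. The function names, retry count, and delays are hypothetical choices, and retries will not help if the service is down for long stretches.

```python
import time
import urllib.error
import urllib.request


def snapshot_url(original_url, timestamp):
    """URL of the capture nearest `timestamp` (YYYYMMDDhhmmss or a prefix)."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def _urlopen_bytes(url):
    """Default fetcher: plain GET returning the response body."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def fetch_with_backoff(url, fetch=None, retries=5, base_delay=2.0, sleep=time.sleep):
    """Fetch `url`, retrying on HTTP 5xx and connection errors.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error once `retries` attempts have failed.
    HTTP 4xx errors are raised immediately, since retrying rarely helps.
    """
    if fetch is None:
        fetch = _urlopen_bytes
    for attempt in range(retries):
        try:
            return fetch(url)
        except urllib.error.HTTPError as exc:
            if exc.code < 500 or attempt == retries - 1:
                raise
        except (urllib.error.URLError, ConnectionError):
            if attempt == retries - 1:
                raise
        sleep(base_delay * 2 ** attempt)
```

For example, `fetch_with_backoff(snapshot_url("http://example.com/", "2005"))` asks for the capture closest to 2005 and backs off for 2, 4, 8, and 16 seconds between failed attempts.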
toomuchtodo
7 hours ago
philipkglass
7 hours ago
Yes, there are documents and third party projects indicating that it has a free public API, but I haven't been able to get it to work. I presume that a paid API would have better availability and the possibility of support.
I just tried waybackpy and I'm getting errors with it too when I try to reproduce their basic demo operation:
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://nuclearweaponarchive.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    save_api.save()
    ~~~~~~~~~~~~~^^
  File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 210, in save
    self.get_save_request_headers()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 99, in get_save_request_headers
    raise TooManyRequestsError(
    ...<4 lines>...
    )
waybackpy.exceptions.TooManyRequestsError: Can not save 'https://nuclearweaponarchive.org'. Save request refused by the server. Save Page Now limits saving 15 URLs per minutes. Try waiting for 5 minutes and then try again.
toomuchtodo
7 hours ago
Reach out to patron services, support @ archive dot org. Also, your API limits will be higher if you make requests with the API key from your IA account rather than anonymously.
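Independently of which key is used, a client can pace itself to stay under a limit like the "15 URLs per minute" quoted in the error above. The sketch below is a generic sliding-window limiter. The class name and defaults are hypothetical, and the real server-side limits may differ from the figure in that error message.

```python
import collections
import time


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls=15, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self._stamps = collections.deque()  # times of recent calls, oldest first

    def _expire(self, now):
        """Forget calls that have fallen out of the current window."""
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        self._expire(now)
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self._stamps[0]))
            now = self.clock()
            self._expire(now)
        self._stamps.append(now)
```

Calling `limiter.wait()` before each save request keeps a single-threaded client under the configured rate. It does not coordinate across processes or machines.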
986aignan
2 hours ago
I wish there were some kind of file search for the Wayback Machine. Like "list all .S3M files on members.aol.com before 1998". It would've made looking for obscure nostalgia much easier.
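The Wayback CDX index can approximate part of this when the host is known in advance, since it lists captured URLs under a prefix and accepts a regex filter on the original URL. The sketch below only builds such a query URL. The function name and defaults are hypothetical, and this searches captured URLs rather than file contents, so it only finds files whose URL ends in the extension.

```python
import re
import urllib.parse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(prefix, extension, before_year, limit=1000):
    """Build a CDX query for captures under `prefix` ending in `.extension`.

    Only captures made before January 1 of `before_year` are requested.
    """
    # Spell out both cases per letter so the match is case-insensitive
    # without relying on regex flags, e.g. "S3M" -> "[sS]3[mM]".
    ext = "".join(
        f"[{c.lower()}{c.upper()}]" if c.isalpha() else re.escape(c)
        for c in extension
    )
    params = [
        ("url", prefix),
        ("matchType", "prefix"),                  # everything under the prefix
        ("to", f"{before_year - 1}1231235959"),   # last second of the prior year
        ("filter", f"original:.*\\.{ext}$"),      # regex on the original URL
        ("collapse", "urlkey"),                   # one row per distinct URL
        ("output", "json"),
        ("limit", str(limit)),
    ]
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)
```

`cdx_query_url("members.aol.com", "S3M", 1998)` yields a URL whose JSON response is a list of rows with a header row first. A prefix as large as members.aol.com may be slow to scan, so a narrower prefix or a small `limit` is a reasonable starting point.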
quux
4 hours ago
Is running an IPFS node and pinning the internet archive's collections a good way to do this?
Gormo
an hour ago
> $25-30M per year is a lot for a non-profit
$25 million a year is not remotely a lot for a non-profit doing any kind of work at scale. Wikimedia's budget is about seven times that. My local Goodwill chapter has an annual budget greater than that.
Medium_Taco
11 minutes ago
You're being purposefully obtuse. Most non-profits don't function at scale (nor do they do best at scale). They serve their local community.
hinkley
5 hours ago
I'd like a Public Broadcasting Service for the Internet but I'm afraid that money would just be pulled from actual PBS at this point to support it.
xp84
3 hours ago
Too late, PBS is already defunded. CPB was deleted. PBS is now an indie organization without a dime of public money. They should probably rebrand and lose the word “Public”.
skywhopper
2 hours ago
Don’t put any stock into the numbers in the article. They are mostly made up out of thin air.