miki123211
5 days ago
If you need this sort of thing in any other language, there's a free, no-auth, no-api-key-required, no-strings-attached API that can do this at https://jina.ai/reader/
You just fetch a URL like `https://r.jina.ai/https://www.asimov.press/p/mitochondria`, and get a markdown document for the "inner" URL.
I've actually used this, and it's not perfect: there are websites (mostly those behind Cloudflare and other such proxies) that it can't handle. But it does 90% of the job, and it's a one-liner in most languages with a decent HTTP requests library.
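For illustration, here is a minimal Python sketch of that fetch using only the standard library. The `https://r.jina.ai/` prefix comes from the example above; the function names, the `User-Agent` string, and the timeout are my own assumptions, not anything Jina documents.

```python
from urllib.request import Request, urlopen

# Prefix taken from the example URL above; everything else here is illustrative.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the Reader URL by prepending the prefix to the full "inner" URL."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the markdown rendering of target_url.

    Raises urllib.error.URLError / HTTPError on failure, e.g. for sites
    the service can't handle or when a free-use limit is hit.
    """
    # The User-Agent value is a hypothetical placeholder.
    request = Request(reader_url(target_url), headers={"User-Agent": "reader-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_markdown("https://www.asimov.press/p/mitochondria")` should return the markdown as a string, assuming the service is reachable and accepts the request.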
petercooper
5 days ago
I use this too and, not to detract from your enthusiasm, it's not exactly no-strings-attached. There's a token limit on free use and you can't use it for any commercial purposes. Luckily, the pricing for unrestricted use is reasonable, at 2 cents per million tokens.
People will also want to note that it's LLM-powered, which has pros and cons. One pro is that you can download and run their model yourself for non-commercial use cases: https://huggingface.co/jinaai/reader-lm-1.5b
JohannesKauf
5 days ago
Thanks, Jina actually looks quite nice for use in LLMs.
I also provide a REST API [1] that you can use for free (within limits). However, you have to get an API key by registering with GitHub (see the reason below).
---
The demo was previously hosted on Vercel. Someone misused the demo and sent ~5 million requests per day, and would not stop, which quickly pushed me over Vercel's bandwidth limits. And bandwidth is really, really expensive!
So that is the reason for requiring API Keys and hosting it on a VPS… Lessons learned!
emptiestplace
5 days ago
Seems pretty risky to not implement rate limits either way.
JohannesKauf
5 days ago
The problem was that rate limiting at the application level was not enough. Once a request hit my backend, the incoming bandwidth was already consumed, and I was charged for it.
I contacted Vercel's Support to block that specific IP address but unfortunately they weren't helpful.
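To make the limitation concrete, here is a rough sketch of what application-level rate limiting typically looks like: a per-client token bucket. This is not the converter's actual code; the class names and parameters are hypothetical. The point is that `allow()` can only run after the request has already reached the backend, so it can reject the work but cannot prevent the inbound traffic.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True and spend one token if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client identifier (for example, an IP address)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_id, now=None):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_id] = bucket
        return bucket.allow(now)
```

A handler would call something like `limiter.allow(client_ip)` and return HTTP 429 when it is False. By then the request has already been received, which is why blocking has to happen upstream (at the host or a proxy) to save bandwidth.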
emptiestplace
5 days ago
So you're probably still vulnerable to this even with the key requirement, but they stopped once you removed the incentive? Did you notice what they were scraping?
JohannesKauf
5 days ago
Sorry, I mixed up a few topics here:
- Moved everything to a VPS - way better value for money. An extra TB of traffic only costs €1-10 with Hetzner/DigitalOcean, compared to €400 with Vercel's old pricing.
- Put Cloudflare in front - gives me an extra layer of control (if I ever need it)
- Built a proper REST API - now there's an official way to use the converter programmatically
- Made email registration mandatory for API keys - lets me reach out before having to block anyone
That other server was probably running a scraper and then converting the HTML pages to markdown. After about 2 weeks they noticed that I was just returning garbage, and it stopped :)
emptiestplace
5 days ago
Ah! Makes sense now, thanks for sharing.
I've had good success with Cloudflare's free-tier features for rate limiting. If you haven't tried it, it only takes a couple minutes to enable and should be pretty set-and-forget for your API.