Guy running a Google rival from his laundry room

124 points posted 3 hours ago
by coloneltcb

91 Comments

renegat0x0

an hour ago

Well, I created my own domain index. I have not crawled every page inside domains, but it is not my goal.

I have 1542766 domains. Might not be much, but it is honest work.

It is available as a GitHub repo, so anybody who wants to start crawling has some initial data to kick off.

Links

https://github.com/rumca-js/Internet-Places-Database

hobs

a minute ago

Can't you just request ICANN's zone files and have the canonical list of the day?

raybb

40 minutes ago

What a nice project. What inspired this initially?

FYI there's a broken link in your readme:

    https://rumca-js.github.io/internet full internet search

luizfelberti

2 hours ago

I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching, though. It is (like others here have pointed out) building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.

I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...

mhitza

an hour ago

You might want to bookmark https://openwebsearch.eu/open-webindex/

While the index is currently not open source, it should be at some point. Maybe when they get out of the beta stage (?); details are still unclear.

moduspol

31 minutes ago

Is the common crawl usable for something like this?

https://commoncrawl.org

giancarlostoro

14 minutes ago

Most likely it is, the issue then becomes being able to store and afford the storage for all the files.

wordpad

an hour ago

Why can't crawling be crowdsourced? It would solve IP rotation and spread the load.

Poomba

27 minutes ago

That’s how residential proxies work, in a perverse way

ge96

an hour ago

The IP thing is interesting. I was once trying to make a CSGO bot to scrape Steam's prices, and there are proxy services out there that you can rent. I tried at least one and it was blocked by Steam. So I wonder if people buy real IPs.

6510

28 minutes ago

The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?

ofrzeta

an hour ago

"The beefy CPU running this setup, a 32-core AMD EPYC 7532, underlines just how fast technology moves. At the time of its release in 2020, the processor alone would have cost more than $3,000. It can now be had on eBay for less than $200"

why do I never get deals like that when I am shopping for the homelab on eBay?

progval

an hour ago

You need to spend a lot of time looking through badly labeled offers, and be willing to buy from sellers with no reputation.

robrtsql

an hour ago

I searched "AMD EPYC 7532" and there are a ton of listings for $150-$200. Are you just regretful that it wasn't like this when you were shopping parts for your homelab?

_fat_santa

an hour ago

Not for a CPU but earlier this year I bought a Thinkpad workstation off eBay for $500. It's a machine from 2020 and when it was new cost $5,700.

I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.

saalweachter

38 minutes ago

Has eBay fixed their "and then they ship you a box of rocks" problem?

I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.

buildbot

17 minutes ago

Yes, it’s extremely rare to be stuck with a broken/wrong/missing item as a buyer on eBay. Selling is quite risky in some ways because eBay will nearly always side with a buyer. Every missing or broken thing I have purchased has been refunded or replaced. On the other hand, 3 things I have sold were claimed to not arrive. The only case where eBay decided in my favor was when the buyer had signed for the package in a literal USPS office :)

apetresc

26 minutes ago

My understanding is that eBay sides with the buyer on all disputes, to the point of ridiculousness. So you should be fine.

The real issue is being a seller and solving the "and then the customer claims I shipped them a box of rocks" problem.

buildbot

15 minutes ago

Yep, selling is way more risky. eBay might be the safest (refund-wise) marketplace for buyers… I have more trouble with Amazon.

ThatMedicIsASpy

42 minutes ago

Epyc 7000 + MB + 256GB-512GB RAM (from China) usually starts at 800 euros + import tax

cheema33

3 hours ago

I tried the search site at https://searcha.page/ by searching for something random and got the following message:

"An error has occurred building the search results."

authnopuz

3 hours ago

hug of death? I fear the temperature will get very high in his laundry room

DannyBee

3 hours ago

I'm sure it depends on how much laundry he is doing - his dryer is probably heated entirely by servers.

He can then exhaust the remaining server heat through the dryer vent stack.

debo_

2 hours ago

Keep going. I love dry humor.

ArekDymalski

an hour ago

Until the exhaust starts "Feeling leaky" I guess.

ape4

2 hours ago

Change it to a sauna?

BLKNSLVR

2 hours ago

Great innovation plus cloud-skeptic self-hosting. There should be much much more of this!

ytrt54e

9 minutes ago

Crashed? The curse of Hacker News!

tolerance

an hour ago

The great thing about this is that with the decentralization/recentralization of the Web, it may become easier for certain people to roll their own search engines for their respective communities and crawl/index pages only according to their shared tastes.

The bad thing about this is...read above.

mooiedingen

42 minutes ago

Nothing new, as it has been done before. The concept is simple enough. Step 1: an indexer (Solr/Lucene). Step 2: a crawler, of which there are several FOSS options; or build one yourself? Or you just run YaCy, which is a combo of the above. Combine it with an old-school searx instance and you will be granted the title of seeker by the spirit of Fravia+, who was elder of the searchlores!!! Not only will you filter crap made by machine learning models, but thou shall find what thou seek!

I refuse to call a 16-line for loop triggering in-memory tokenized data, where the data can be anything from a scientific paper hallucinated by a chatbot to a message between two lovers, anything intelligent, for it is not intelligence but a blob of tokenized fcking data in memory getting triggered for an output by a derp with a 16-line for loop!!!
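
To make the "step 1: indexer" idea in the comment above concrete, here is a toy inverted index in Python. It is only a sketch: it is not Solr, Lucene, or YaCy, and says nothing about how those work internally. The function names (`tokenize`, `build_index`, `search`) and the sample pages are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index

def search(index, query):
    """Return URLs containing every query token (boolean AND), sorted."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)

# Made-up sample data standing in for crawler output (hypothetical URLs).
pages = {
    "https://example.com/a": "Self-hosted search engine in a laundry room",
    "https://example.com/b": "Laundry tips for a small room",
}
index = build_index(pages)
print(search(index, "laundry room"))
print(search(index, "search engine"))
```

This only answers "which pages contain all of these words"; it does no ranking at all, which is the part other commenters in this thread describe as the hard problem.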

iam_saurabh

an hour ago

I love stories like this—tech history is full of scrappy beginnings. Even if this project doesn’t succeed, it reminds us that giant companies aren’t unshakable.

vlucas

2 hours ago

> “I think it’s definitely lowered the barrier,” Lin says of the LLM’s role in enabling DIY search engines. “To me, it seems like the only barrier to actually competing with Google, creating an alternate search engine, is not so much the technology, it’s mostly the market forces.”

Oh sweet summer child

ourguile

2 hours ago

I greatly prefer Kagi https://help.kagi.com/kagi/company/ but it's very nice to see more competition in this space in general.

eurekin

an hour ago

Kagi user here.

When I started using it (~2 years), it was necessary. Google was simply not solving any of my actual issues (software related).

Now, it seems that Google might have improved a bit. I check from time to time and the gap isn't as huge as when Kagi started.

shayway

2 hours ago

How does your experience with Searcha compare? It seems to be down at the moment.

the_third_wave

2 hours ago

Do Kagi users get paid for shilling the company? Nearly all threads relating to the subject of search have a few mentions of the glory of Kagi, often including links to the site. I suspect this is not as effective as the Kagi crew thinks, since there is likely to be a large overlap between their potential customers and those who are really turned off by such shilling.

datadrivenangel

an hour ago

Kagi customer here. Not getting paid to shill. I think it's worth occasionally mentioning alternatives that are good enough to pay for so that other people know there are other people using other options.

But full disclosure, sometimes I'm using DuckDuckGo and it's also good enough most of the time that I occasionally forget until I go down some rabbit hole and realize that I'm using the wrong search engine.

dawnerd

an hour ago

Flip side how much does Google pay you to defend their monopoly? Kagi is a solid product with a team that clearly cares about what they’re building. They’re transparent and post change logs when things update. I simply trust them infinitely more than Google.

hamdingers

2 hours ago

Have you considered it's a good product that causes its users to become advocates?

foobarian

an hour ago

Could also be a form of effort justification. [1]

[1] https://en.wikipedia.org/wiki/Effort_justification

tolerance

an hour ago

> The effect is most likely to occur when there are no obvious reasons for performing the task. Because expending effort to perform a useless or unenjoyable task, or experiencing unpleasant consequences in doing so, is cognitively inconsistent (see cognitive dissonance), people are assumed to shift their evaluations of the task in a positive direction to restore consistency.

I’m not following you.

https://dictionary.apa.org/effort-justification

alexjplant

an hour ago

I understand skepticism in the age of LLM-generated content and CAPTCHA-solving bots. What I don't understand is why people choose such weird hills to die on and think that posting about it will accomplish anything. Do you think people will read your comment and go "gee, I was going to use Kagi but now I won't because this random person has a bad feeling about a series of comments they remember seeing"?

I signed up for a specialist forum not too long ago and posted an honest review of a product because I hadn't been able to find one anywhere on the internet. Immediately a bunch of people accused me of being a "shill" for a direct-to-consumer business that's been powered by a Yahoo storefront for the last 20 years, as though a business that's run by a guy with an AOL e-mail address is sophisticated enough to figure out Fiverr and astroturf their reputation on a phpBB forum.

Think about it for just a moment - do you really think that the Hacker News audience is large enough or full of enough tastemakers to sway an alternative search engine's market share? It isn't. If Kagi wanted to do that they'd hire TikTok influencers.

throwaway290

25 minutes ago

No one else would pay for search. People on HN are probably 90% of their total possible market.

testdelacc1

an hour ago

Disclaimer: Not a Kagi user. Unlikely to use it.

I just don’t understand people who get so upset that someone might like something enough to talk about liking it. So upset that they won’t ever try the thing. Like … ok I guess? You do you. It’s just a strange way to make decisions.

At least this is just a consumer product. Worse is when people here say they make technical decisions using the same process. They'd blacklist certain tech because they've heard people talking about how it solved their problems. Also OK, but now I know I should avoid them professionally.

mdaniel

an hour ago

I get the impression it's the volume of the folks who sing its praises. There was a web3 crowd for a while, Bitwarden champions would show up to any mention of a password manager, and (ahem) some AI champions can be over the top

In all of these cases, a reasonable counterpoint is that if it were that applicable for all audiences, one wouldn't need to sing its praises, it would sing its own praises

ufmace

33 minutes ago

It sings its own praises... how exactly? Maybe by a bunch of happy users talking about how they like it and it's a better solution to the problem that the thread or article is about without being explicitly paid? Which is exactly what's happening here and some people are complaining about it?

testdelacc1

an hour ago

How does a password manager sing its own praises?

koakuma-chan

an hour ago

I tried it. It's slow and bad, the free tier is only 100 requests, and it's too expensive; the price is unjustified. I use Gemini with Google Search grounding.

lelandbatey

an hour ago

Nope, it's just a nice thing I like. It is nearly the platonic ideal of a search engine for me. It causes me no problems and doesn't try to sell me garbage.

It's like discovering that there's a better pair of shoes that's more comfortable. Everybody can use a slightly improved, more comfortable pair of shoes, so it comes up frequently.

tmdetect

2 hours ago

Kagi is a polished product. This is drying someone's laundry.

HardCodedBias

42 minutes ago

I know that Google engineers have a cushy life, but I actually find it unlikely that a guy who isn't attempting some radical new type of search (like PageRank back in the day) can hope to compete with the orgs in Google that support search.

Again, those orgs are likely too comfortable and less productive than people would like, but we're talking about many, many thousands of people and, depending upon how you define "the work" of search, upwards of 10k.

I didn't see any new secret sauce in the article, and Google has said that since 2015 (?) Google Brain has been involved in search.

This is not to say that Google couldn't be dislodged by search via LLM or similar; that is "new" research.

the_real_cher

2 hours ago

I always wondered why someone couldn't do this.

Google was invented many years ago by two guys in a dorm room. Since then there have been so many white papers and advancements in the public sphere, and the actual underlying problem has not changed that much, so it seems like it could be done by a small group or an independent person.

non_aligned

2 hours ago

I think there are two factors that helped Google. First, the search engine landscape back then was absolutely abysmal. I'm sure someone will chime in saying that it's abysmal today as well, but the reality is that 99%+ of consumer searches get good results today. And that's simply because the nature of search has changed: we have billions of people using the internet, and they overwhelmingly just search for products to buy, local restaurants that offer takeout, or for familiar pop content to watch or listen to. And there's some SEO spam there, but also pretty fierce quality assurance by search engines.

Second, the internet was different: when all nerds declared that Google is good, that was CNN-grade newsworthy (and CNN used to matter a lot more back then), simply because the internet seemed kinda important, but there was no other authority on the topic. Today, that's not the case. If you need someone to opine on the internet on air, you invite some political pundit or a business analyst.

So no, I don't think you can repeat the success of Google the same way. It was a product of its time.

dec0dedab0de

2 hours ago

Crawling is much more difficult than it used to be. Significantly more content is behind a login, JavaScript is required for way more than it should be, and almost the entire web is behind Cloudflare or another type of CAPTCHA.

jrm4

2 hours ago

More to the point, it's a shame that we can't collectively grok (dammit, they took that from us too) concepts like "personal" and/or "curated" directories, e.g. individual and group wikis and so forth on perhaps more directed topics with lists of good links.

cosmicgadget

a few seconds ago

Other than the obvious (but surmountable) technical challenges with crawling and indexing, trying to establish "goodness" for a given user is tough. For a blogger it will be "hey, you are reading this so you probably like what I like". That's often true but as soon as you try to have a centralized service with arbitrary users, it is hard to do anything better than filtering purely commercial content.

sdf4j

an hour ago

What do you mean we can't? There are a lot of curated content directories out there.

jrm4

35 minutes ago

Right, I suppose I mean "getting more people to think about why a few of these bookmarked for your favorite topics, especially tied to a trustworthy person, is a million times better than just hitting up Google."

Or, perhaps, "a better Google should just take you to these."

Something like that.

balder1991

2 hours ago

We have Marginalia which serves a specific use-case: https://about.marginalia-search.com/

mdaniel

an hour ago

That's what I was expecting this submission to be about, although to be honest I'm not certain that Marginalia would want the influx of a fastcompany sized tire kicking

CalRobert

2 hours ago

Among other things, I think crawling is a lot harder now.

ambicapter

2 hours ago

Google basically invented the modern cloud in order to efficiently use the hardware necessary to actually build those search engine indices. It's not really a question of implementing a good algorithm and away we go.

lif

an hour ago

Provided they have the kind of massive government support Google has had from the get-go, sure!

OutOfHere

2 hours ago

The actual underlying problem has changed altogether. PageRank is easily gamed by SEO.

Search candidates and rankings now require assessment by LLM. Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.

Crawling too requires innovative approaches to bypass server filters.

I doubt any independent person can afford to run a vector database or LLMs at immense scale.

kcbanner

2 hours ago

> users want the results intelligently synthesized into a text response with references rather than as raw results.

The reason I pay for Kagi is that I specifically don't want this to occur.

OutOfHere

2 hours ago

If you pay for a service (web search) that 99.9% use for free, you're an extreme outlier, and not necessarily a justifiable one either. After all, DDG, Google and various others still have raw results for free.

Workaccount2

2 hours ago

How much do you technologically relate to the average person on the street though?

Every person I have seen (outside the tiny tech bubble) google something has just read the AI overview without skipping a beat.

yepitwas

2 hours ago

That's worrisome since I've seen those be for-sure wrong a pretty high percentage of the time.

[EDIT] Incidentally, are there any sites that do actual web search any more, better than Yandex? I'd rather avoid a Russian site if I can, but there are whole topics where it's impossible to find anything useful on heavily "massaged" allegedly-Web-search-but-not-really sites like Google and DDG (Bing), but I can find what I want on page 1 or 2 of a Yandex search. Is Kagi as good as that, or is their index simply ignoring a whole bunch of the Web like so many others? I don't mind paying.

degamad

an hour ago

Google "Web" results (not the default results you get when you search) still seem okay for me. You can force them with the udm=14 url trick, or select the "Web" tab in the results. No AI, no images or shopping results, and slightly better text results.
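
For anyone who wants to script the udm=14 trick mentioned above, a minimal Python sketch follows. The helper name `google_web_url` is made up; the only thing taken from the comment is the `udm=14` query parameter, and nothing here guarantees Google will keep honoring it.

```python
from urllib.parse import urlencode

def google_web_url(query):
    """Build a Google search URL that requests the "Web" results tab via udm=14."""
    # urlencode escapes the query; spaces become "+".
    return "https://www.google.com/search?" + urlencode({"q": query, "udm": "14"})

print(google_web_url("amd epyc 7532"))
# prints https://www.google.com/search?q=amd+epyc+7532&udm=14
```

The same URL pattern can be saved as a custom search engine entry in a browser, with the browser's own placeholder in place of the query.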

franktankbank

an hour ago

Yep, same here. Ask it "should I wash venison tenderloin" and you get an initial "No, because" followed by a general "yes, it's important to clean, including with water" in the longer description. Wow, a self-contradictory answer! Good job!

jkestner

2 hours ago

We’re being force fed them. I’m an AI hater and I catch myself reading those sometimes.

Yes, people want the answer directly. Google wants you to stay on their site to read some mishmash. I think the ideal would be to immediately go to the source’s site.

throwmeaway222

2 hours ago

At this point the web is also so centralized you only need 3 bookmarks these days (your news, YouTube and Amazon)

A search is just learning what you don't know and AI does a better job than search has ever done for me - and I'm in tech.

ricardo81

2 hours ago

>Pagerank

Also, a lot of site owners are reluctant to link out. So much so that 'nofollow' has been reduced to a hint rather than a directive.
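
As a small illustration of what a crawler actually sees here, the Python sketch below collects links from a page and records whether each one carries `rel="nofollow"`. The class name `LinkCollector` and the sample HTML are made up, and what to do with a flagged link (ignore it, down-weight it, treat it as a hint) is left entirely to the crawler; this only reads the attribute.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, is_nofollow) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        # rel is a space-separated token list, e.g. "nofollow noopener".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))

# Made-up sample markup with hypothetical URLs.
html = (
    '<a href="https://example.com/x" rel="nofollow noopener">x</a> '
    '<a href="https://example.com/y">y</a>'
)
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```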

iamacyborg

2 hours ago

> Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.

Citation needed

OutOfHere

2 hours ago

You mean all the users of chat services aren't evidence? Chat services increasingly incorporate web links for references in their responses, and this is what users seek. The tide continues to shift from traditional search to LLM synthesis.

iamacyborg

2 hours ago

I suspect there are more users of traditional search than there are of LLM chat apps.

p3rls

11 minutes ago

i've been thinking that google could use its own AI to evaluate URLs instead of relying on pagerank and backlinks which are almost completely valueless as a signal in 2025. in my niche there's more slop than ever being produced daily and it's all hitting rank 1. it's tragic what google is doing to the internet.

Oarch

an hour ago

I'm sure there's a money laundering joke in here somewhere