
Guy running a Google rival from his laundry room

246 points, posted 5 months ago

by coloneltcb

(fastcompany.com)

155 Comments

renegat0x0

5 months ago

Well, I created my own domain index. I have not crawled every page inside the domains, but that is not my goal.

I have 1542766 domains. Might not be much, but it is honest work.

It is available as a github repo, so anybody that wants to start crawling has some initial data to kick off.

Links

https://github.com/rumca-js/Internet-Places-Database

raybb

5 months ago

What a nice project. What inspired this initially?

FYI there's a broken link in your readme:

    https://rumca-js.github.io/internet full internet search

renegat0x0

5 months ago

thanks, I replaced it with another demo link

hobs

5 months ago

Can't you just request ICANN's zone files and have the canonical list of the day?

Any link list, or domain list is not worth much without any rating, or meta. I lead a hobby project, and I am not expert, so I provide ratings based on what kind of data pages provide (title, social, description), and my own manual voting system. It is not ideal, but it is something. Also I provide tags, so it is easily known what the domain provides, or domains can be filtered by tags.

I know that you cannot count and visit every domain, so the list will never be finished, but I am happy with the results.
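The rating approach described above (score a domain by which metadata its pages expose, add manual votes, filter by tags) could look roughly like the sketch below. The field names and weights are assumptions made up for illustration; they are not taken from the Internet-Places-Database repo.

```python
from dataclasses import dataclass


@dataclass
class DomainEntry:
    """Hypothetical record for one domain; field names are illustrative only."""
    domain: str
    title: str = ""
    description: str = ""
    social_image: str = ""
    manual_votes: int = 0
    tags: tuple = ()


# Assumed weights, not the real project's values.
METADATA_WEIGHTS = {"title": 1, "description": 1, "social_image": 1}
VOTE_WEIGHT = 2


def rate(entry):
    """One point per metadata field present, plus weighted manual votes."""
    metadata_score = sum(
        weight
        for field, weight in METADATA_WEIGHTS.items()
        if getattr(entry, field)
    )
    return metadata_score + VOTE_WEIGHT * entry.manual_votes


def filter_by_tag(entries, tag):
    """Keep only the entries carrying the given tag."""
    return [entry for entry in entries if tag in entry.tags]
```

Sorting a list of entries with `sorted(entries, key=rate, reverse=True)` would then give a crude ranking; anything beyond that (spam signals, freshness) is out of scope for this sketch.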

hobs

5 months ago

Well, if you are curating every link then it's a different story, and looks like a more classic webring - I missed that part of the work - I thought it looked like a big set of crawler data that wasn't as manually curated.

beaugunderson

5 months ago

you can, though you must provide a reason compelling enough to the person maintaining access (I provided a few sentences and was approved for most but maybe 20% of registrars declined my request):

https://czds.icann.org/home

also be prepared for thousands of emails about status changes to your access.

egberts1

5 months ago

Avoiding GIGO (Garbage In, Garbage Out).

This is why we have computer-variants of Library Science and Archeology, Forensic Science and a bunch of other advanced knowledge (not AI, mind you).

hobs

5 months ago

I don't see how this applies as it's aggregating a bunch of stuff from random crawlers - if you want to crawl a list of actual domains that's generally considered the list of things that could resolve, so seems like a good starting place.

egberts1

5 months ago

Smashing stuff together by pure probabilistic word association like AI does today is really, tsk tsk.

That's why it is important to clearly define words like love, oh wait.

didip

5 months ago

This is amazing. Thanks for sharing!

bufferoverflow

5 months ago

[dead]

luizfelberti

5 months ago

I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching though, it is (like others here have pointed out), building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.

I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...

mhitza

5 months ago

You might want to bookmark https://openwebsearch.eu/open-webindex/

While the index is currently not open source, it should be at some point. Maybe when they get out of the beta stage (?); details are still unclear.

3RTB297

5 months ago

You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and anubis and Cloudflare tests every time I try and look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free?

I'll add it to the mile-long list of things that should exist and be online public goods.

moduspol

5 months ago

Is the common crawl usable for something like this?

https://commoncrawl.org

chiefsearchaco

5 months ago

I'm the creator of searcha.page and seek.ninja, and Common Crawl is the basis of my index. The biggest problem with ONLY using that is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless and I would say Common Crawl should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more but I had to scrub them out due to resources. My native crawling is much more targeted and I'd be lucky to pull 100k, but as long as my heuristics for choosing the right targets hold up, those will be very high-value pulls.

giancarlostoro

5 months ago

Most likely it is, the issue then becomes being able to store and afford the storage for all the files.

moduspol

5 months ago

Sure, and that's not easy, but it's a lot easier than having to crawl the entire public Internet yourself.

wordpad

5 months ago

Why can't crawling be crowd sourced? It would solve ip rotation and spread the load

6510

5 months ago

https://yacy.net

catlikesshrimp

5 months ago

Too bad it doesn't support android. It is much more energy efficient than anything else I can spare (for 100% uptime contribution)

Poomba

5 months ago

That’s how residential proxies work, in a perverse way

user

5 months ago

[deleted]

chiefsearchaco

5 months ago

Common crawl sort of serves this function. I use it. It's a really good foundation.

6510

5 months ago

The crawl seems hard but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200 and do those results still count as having them?

ge96

5 months ago

The IP thing is interesting, I was trying to make this CSGO bot one time to scrape steam's prices and there are proxy services out there you rent, tried at least one and it was blocked by steam. So I wonder if people buy real IPs.

kccqzy

5 months ago

Yeah people buy residential IPs on the black market. They are essentially infected home PCs and botnets.

Bratmon

5 months ago

Not just the black market anymore!

https://www.proxyrack.com/residential-proxies/

immibis

5 months ago

you can get paid about $0.10/GB in cryptocurrency (at a few GB per month) to run one on your PC. Apparently they also just buy actual connections sometimes. It's not even unethical - it's just two groups of equally bad businesspeople trying to spend money to block the other one.

typpilol

5 months ago

I've heard a few horror stories... Since the people using residential proxies aren't necessarily always good people

cheema33

5 months ago

I tried the search site at https://searcha.page/ by searching for something random and got the following message:

"An error has occurred building the search results."

authnopuz

5 months ago

hug of death? I fear the temperature will get very high in his laundry room

DannyBee

5 months ago

I'm sure it depends on how much laundry he is doing - his dryer is probably heated entirely by servers.

He can then exhaust the remaining server heat through the dryer vent stack.

debo_

5 months ago

Keep going. I love dry humor.

egberts1

5 months ago

Its dryer sheets soften the soul.

user

5 months ago

[deleted]

ArekDymalski

5 months ago

Until the exhaust starts "Feeling leaky" I guess.

robofanatic

5 months ago

Might not even need a dryer :-)

ape4

5 months ago

Change it to a sauna?

doublerabbit

5 months ago

I thought of this a while ago when I was a Datacentre monkey. In the winter it was pleasant to walk down the hot aisles.

However the exhausted hot air never had the same feel of a sauna. It left the air stale and dry.

chiefsearchaco

5 months ago

Yep, my usage increased 20x week over week. It was actually the context expansion that was my bottleneck, not the search itself. My usage graph looks almost vertical. Not sure if this counts as a good week or a bad week.

HelloUsername

5 months ago

Yup; same at https://seek.ninja/s?q=beatles

eschulz

5 months ago

Before this happened to me, my first search returned an impressive SERP.

lucb1e

5 months ago

It claims I reached the article limit. The last time I saw a fastcompany link must have been a decade ago! I was nostalgically looking forward to reading another article of theirs. Alas...

https://archive.is/HA7y4

Some bits and pieces:

> his new search engine, the robust Search-a-Page <https://searcha.page>, which has a privacy-focused variant called Seek Ninja <https://seek.ninja>

> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding

> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.

> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech

And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532, half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage"

udkl

5 months ago

I absolutely devoured Wilson Lin's articles recently .. they are very high quality and informative for any amateur interested in search engines and LLMs! - https://blog.wilsonl.in/search-engine/

wvenable

5 months ago

Reader mode in Firefox (plus sometimes a page refresh) gets me past most paywalls -- including this article.

ofrzeta

5 months ago

"The beefy CPU running this setup, a 32-core AMD EPYC 7532, underlines just how fast technology moves. At the time of its release in 2020, the processor alone would have cost more than $3,000. It can now be had on eBay for less than $200"

why do I never get deals like that when I am shopping for the homelab on eBay?

progval

5 months ago

You need to spend a lot of time looking through badly labeled offers, and be willing to buy from sellers with no reputation.

_fat_santa

5 months ago

Not for a CPU but earlier this year I bought a Thinkpad workstation off eBay for $500. It's a machine from 2020 and when it was new cost $5,700.

I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.

saalweachter

5 months ago

Has eBay fixed their "and then they ship you a box of rocks" problem?

I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.

buildbot

5 months ago

Yes, it’s extremely rare to be stuck with a broken/wrong/missing item as a buyer on eBay. Selling is quite risky in some ways because eBay will nearly always side with a buyer. Every missing or broken thing I have purchased has been refunded or replaced. On the other hand, 3 things I have sold were claimed to not arrive. The only case where eBay decided in my favor was when the buyer had signed for the package in a literal USPS office :)

throwawayffffas

5 months ago

You don't get that with used old stuff, you get it with unrealistic low prices for new stuff.

A 7532 CPU is now ewaste for all the datacenters out there, so 1/10 of the original price is reasonable, but the latest Nvidia GPU for 200 bucks is obviously a scam.

apetresc

5 months ago

My understanding is that eBay sides with the buyer on all disputes, to the point of ridiculousness. So you should be fine.

The real issue is being a seller and solving the "and then the customer claims I shipped them a box of rocks" problem.

buildbot

5 months ago

Yep selling is way more risky. Ebay might be the most safe (refund wise) marketplace for buyers… I have more trouble with amazon.

accrual

5 months ago

> Has eBay fixed their "and then they ship you a box of rocks" problem?

I've personally never had that problem after over a decade and hundreds of purchases on eBay. I've had some defective parts, but never outright fraud. IME eBay favors buyers.

mjh2539

5 months ago

Every single laptop I've bought off of ebay (all of which were used) over the past ten years has functioned perfectly and flawlessly. You just pay attention to the number of recent sales the account has had and their overall rating.

robrtsql

5 months ago

I searched "AMD EPYC 7532" and there are a ton of listings for $150-$200. Are you just regretful that it wasn't like this when you were shopping parts for your homelab?

throwawayffffas

5 months ago

I got a 7551p plus motherboard and ram for about 600 bucks from China this January. I may have overpaid but it works great, and gets the job done.

Gormo

5 months ago

TheServerStore.com often has good deals. I actually bought a brand new 64-core EPYC 7702 server with 256 GB RAM and 8TB NVMe storage for about $3K fully assembled earlier this year.

ThatMedicIsASpy

5 months ago

Epyc7000+MB+256GB-512GB RAM (from china) usually starts at 800 euros + import tax

chiefsearchaco

5 months ago

Get a QC type chip and roll the dice, that's how I got mine. The biggest cost for me is disk and to a lesser extent ram, the chip itself was relatively cheap.

user

5 months ago

[deleted]

renewiltord

5 months ago

AliExpress broseph. You'll get it in no time. I've gotten. Go do QS if you have some risk tolerance and ES if you also have time tolerance.

phendrenad2

5 months ago

This is a cool project, and I hope he has fun with it.

I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.

Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.

We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. Tiktok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. Ebay has ALL the collectables.

None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?

Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?

chiefsearchaco

5 months ago

Well I can't respond to everyone - I am the one running the search engine. And yes, it did crash today from load. Usage increased 20x this week vs last and I was totally unprepared. I don't know if that counts as a good launch or a bad one. For some reason in my head I imagined usage would be some slow steady ramp.

Thank you for those who tried it, and I'm sorry if you were one of the people it didn't perform for. As far as load goes this was the first day it truly had a "trial by fire".

OJFord

5 months ago

'Google rival' is quite a stretch, surely 'search engine' is not just more accurate, but clearer too with all that Google does today, as if that's new.

amelius

5 months ago

https://archive.ph/HA7y4

BLKNSLVR

5 months ago

Great innovation plus cloud-skeptic self-hosting. There should be much much more of this!

evanjrowley

5 months ago

Search websites by Ryan Pearce:

- SearchaPage - Web Search Engine https://searcha.page/

- Seek Ninja - Stealthy Search Engine https://seek.ninja/

317070

5 months ago

https://searcha.page/s?q=blog https://seek.ninja/s?q=blog

Both of them are erroring out right now?

chiefsearchaco

5 months ago

Yep, it was load. Usage increased 20x week over week, especially today. I think I failed my trial by fire. Got a good plan for scaling capacity and better UX for when it's under strain.

kitd

5 months ago

Were you trying them via Chrome, by any chance? ;)

jslakro

5 months ago

firefox here and it's not working

thm

5 months ago

I'm running one for news https://mozberg.com - not in my basement though.

cosmicgadget

5 months ago

Where is it?

lxe

5 months ago

This is a cool hobby project, but why is this notable? Why a FastCompany article? I'm trying to figure out anything that sets this apart from thousands of other little hobby search projects.

I understand companies like Perplexity or Brave or DuckDuckGo "rivaling Google", but building a hobby index and crawler is nice, and worthy of a "Show HN: "... but an actual media article?

gowld

5 months ago

It's only notable as a clickbait narrative for ignorant readers -- FastCompany's target market

the_real_cher

5 months ago

I always wondered why someone couldn't do this.

Google was invented many years ago by two guys in a dorm room and since then there's been so many white papers and advancements in the public sphere and the actual underlying problem has not changed that much, that it seems like it could be done by a small group or independent person.

dec0dedab0de

5 months ago

Crawling is much more difficult than it used to be. Significantly more content is behind a login, Javascript is required for way more than it should be, and almost the entire web is behind cloudflare or another type of captcha.

marginalia_nu

5 months ago

These things are actually fairly small problems.

The parts that absolutely require JS can't be reliably linked to and nobody indexes that stuff. Most apparent SPAs serve an HTML alternative if you don't claim to be a web browser in the UA.

Cloudflare and the like are also fairly easy to deal with as long as your crawler is well behaved. You can register the fingerprint and mostly get access to cf:ed websites.
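As a rough illustration of the User-Agent point above: a crawler can send its own identifying UA string instead of imitating a browser. The bot name and info URL below are placeholders, and whether a given site actually serves an HTML alternative to such a UA is something to check per site, not a guarantee.

```python
import urllib.request

# Hypothetical crawler identity; a real one should link to a page describing the bot.
CRAWLER_UA = "example-hobby-crawler/0.1 (+https://example.com/bot)"


def build_request(url, user_agent=CRAWLER_UA):
    """Build a GET request that identifies itself as a crawler, not a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10.0):
    """Fetch a page with the crawler UA and decode it using the declared charset."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A well-behaved crawler would also honor robots.txt and rate-limit per host; both are left out here to keep the sketch short.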

non_aligned

5 months ago

I think there are two factors that helped Google. First, the search engine landscape back then was absolutely abysmal. I'm sure someone will chime in saying that it's abysmal today as well, but the reality is that 99%+ of consumer searches get good results today. And that's simply because the nature of search has changed: we have billions of people using the internet, and they overwhelmingly just search for products to buy, local restaurants that offer takeout, or for familiar pop content to watch or listen to. And there's some SEO spam there, but also pretty fierce quality assurance by search engines.

Second, the internet was different: when all nerds declared that Google is good, that was CNN-grade newsworthy (and CNN used to matter a lot more back then), simply because the internet seemed kinda important, but there was no other authority on the topic. Today, that's not the case. If you need someone to opine on the internet on air, you invite some political pundit or a business analyst.

So no, I don't think you can repeat the success of Google the same way. It was a product of its time.

snek_case

5 months ago

Google maps is probably a big moat that's very hard to replicate. You can't as easily just crawl all of that data. It's not easy to generate directions. The average user doesn't want to use your search engine for one thing and Google for everything else, they just want a one stop shop for search.

cadamsdotcom

5 months ago

The average user might want a one stop shop.

That's not a showstopper. It's ok to not be everything to everyone.

balder1991

5 months ago

We have Marginalia which serves a specific use-case: https://about.marginalia-search.com/

mdaniel

5 months ago

That's what I was expecting this submission to be about, although to be honest I'm not certain that Marginalia would want the influx of a fastcompany sized tire kicking

marginalia_nu

5 months ago

To be fair I'm on a colocated server now. No more apartment hosting for me.

OutOfHere

5 months ago

The actual underlying problem has changed altogether. Pagerank is easily gamed by SEO.

Search candidates and rankings now require assessment by LLM. Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.

Crawling too requires innovative approaches to bypass server filters.

I doubt any independent person can afford to run a vector database or LLMs at immense scale.

kcbanner

5 months ago

> users want the results intelligently synthesized into a text response with references rather than as raw results.

The reason I pay for Kagi is that I specifically don't want this to occur.

OutOfHere

5 months ago

If you pay for a service (web search) that 99.9% use for free, you're an extreme outlier, and not necessarily a justifiable one either. After all, DDG, Google and various others still have raw results for free.

Workaccount2

5 months ago

How much do you technologically relate to the average person on the street though?

Every person I have seen (outside the tiny tech bubble) google something has just read the AI overview without skipping a beat.

yepitwas

5 months ago

That's worrisome since I've seen those be for-sure wrong a pretty high percentage of the time.

[EDIT] Incidentally, are there any sites that do actual web search any more, better than Yandex? I'd rather avoid a Russian site if I can, but there are whole topics where it's impossible to find anything useful on heavily "massaged" allegedly-Web-search-but-not-really sites like Google and DDG (Bing), but I can find what I want on page 1 or 2 of a Yandex search. Is Kagi as good as that, or is their index simply ignoring a whole bunch of the Web like so many others? I don't mind paying.

degamad

5 months ago

Google "Web" results (not the default results you get when you search) still seem okay for me. You can force them with the udm=14 url trick, or select the "Web" tab in the results. No AI, no images or shopping results, and slightly better text results.
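For anyone who wants to script the `udm=14` trick mentioned above, a minimal sketch is just a matter of adding that query parameter to a normal search URL. The helper name is made up; only the `udm=14` parameter comes from the comment.

```python
from urllib.parse import urlencode


def google_web_only_url(query):
    """Build a Google search URL that requests the plain "Web" results tab."""
    return "https://www.google.com/search?" + urlencode({"q": query, "udm": "14"})
```

The same URL pattern, with the query replaced by a placeholder, can usually be registered as a custom search engine in a browser.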

franktankbank

5 months ago

Yep, same here. Ask it "should I wash venison tenderloin" and you get an initial "No, because" followed by a general "yes, it's important to clean, including with water" in the longer description. Wow, a self-contradictory answer! Good job!

jkestner

5 months ago

We’re being force fed them. I’m an AI hater and I catch myself reading those sometimes.

Yes, people want the answer directly. Google wants you to stay on their site to read some mishmash. I think the ideal would be to immediately go to the source’s site.

throwmeaway222

5 months ago

At this point the web is also so centralized you only need 3 bookmarks these days (your news, youtube and Amazon)

A search is just learning what you don't know and AI does a better job than search has ever done for me - and I'm in tech.

ricardo81

5 months ago

>Pagerank

Also a lot of site owners are reluctant to link out. So much so that 'nofollow' had been reduced to a hint rather than a directive.

freeopinion

5 months ago

> users want the results intelligently synthesized into a text response with references rather than as raw results

This leads directly to another big change.

People used to submit their sites to search engines and now they might actively block search engines. So a search engine author might have to spend a lot of effort in adversarial games.

iamacyborg

5 months ago

> Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.

Citation needed

OutOfHere

5 months ago

You mean all the users of chat services aren't evidence? Chat services increasingly incorporate web links for references in their responses, and this is as the users seek. The tide continues to shift from traditional search to LLM synthesis.

iamacyborg

5 months ago

I suspect there are more users of traditional search than there are of llm chat apps.

freeopinion

5 months ago

I suspect that chat apps dominate (80+%?) the under-20 demographic, and have a sizable chunk of the under-30 demographic. Within the next five years it will probably represent 50+% of total search traffic. Maybe it already does. It makes sense that any search site that wants to be in the game tomorrow would keep racing down the AI chat path.

jrm4

5 months ago

More to the point, it's a shame that we can't collectively grok (dammit, they took that from us too) concepts like "personal" and/or "curated" directories, e.g. individual and group wikis and so forth on perhaps more directed topics with lists of good links.

cosmicgadget

5 months ago

Other than the obvious (but surmountable) technical challenges with crawling and indexing, trying to establish "goodness" for a given user is tough. For a blogger it will be "hey, you are reading this so you probably like what I like". That's often true but as soon as you try to have a centralized service with arbitrary users, it is hard to do anything better than filtering purely commercial content.

sdf4j

5 months ago

what do you mean we can't? there are a lot of curated content directories out there.

jrm4

5 months ago

Right, I suppose I mean "getting more people to think about why a few of these bookmarked for your favorite topics, especially tied to a trustworthy person, is a million times better than just hitting up Google."

Or, perhaps, a "a better Google should just take you to these."

Something like that.

CalRobert

5 months ago

Among other things, I think crawling is a lot harder now.

ambicapter

5 months ago

Google basically invented the modern cloud in order to efficiently use the hardware necessary to actually build those search engine indices. It's not really a question of implementing a good algorithm and away we go.

lif

5 months ago

Provided they have the kind of massive government support Google has had from the get-go, sure!

_joel

5 months ago

The photo of the power socket right next to the sink looks safe

throwway120385

5 months ago

Looks like a GFCI. Should be fine.

chiefsearchaco

5 months ago

I'm planning on running a cord through my wall, I just keep putting it off :D

zrobotics

5 months ago

Absolutely don't run an extension cord through a wall, that's only slightly less of a fire hazard than storing a gasoline can on top of the server. Extension cords are normally very derated, expecting occasional use and ample cooling, not being inside a wall. Better to keep it as-is, or have a 20A dedicated circuit run.

authnopuz

5 months ago

https://archive.is/HA7y4

HardCodedBias

5 months ago

I know that Google engineers have a cushy life but I actually find it unlikely that a guy, who isn't attempting some radical new type of search (like pagerank back in the day) can hope to compete with the orgs in Google who support search.

Again, those orgs are likely too comfortable and less productive than people would like, but we're talking about many-many thousands and depending upon how you define "the work" of search upwards of 10k.

I didn't see any new secret sauce in the article and Google has said that since 2015 (?) Google Brain has been involved in search.

This is not to say that Google couldn't be dislodged by search via LLM or similar, that is "new" research.

freeopinion

5 months ago

If you wrote that 100 people could outwork one person, I'd nod my head. If you wrote that 10k people could outwork 1k people, I'd shrug. If you tell me that 100 people can combine to tie my shoe faster than I can, I'd question that.

Building a state-of-the-art search engine is not shoelaces. But upwards of 10k workers is not impressive in the right direction.

One person starting out with anything at all can quickly grow into one person with one or two really innovative ideas. One or two good ideas can catch fire pretty quickly. Don't be too dismissive.

freedomben

5 months ago

> Why the laundry room? Two reasons: Heat and noise. Pearce’s server was initially in his bedroom, but the machine was so hot, it actually made it too uncomfortable to sleep.

This is a rite of passage and a badge of honor for homelabbers/tinkerers/hackers to discover for themselves IMHO. If you haven't tried it, you should. The heat is bad enough to warrant moving it, but add the noise too, sprinkle in a few nights of bad sleep, and it becomes an effective form of torture :-D

Just don't decide to move it to a closet unless you also install some fans in there. I ended up finding a cozy spot under the staircase which worked quite well

tolerance

5 months ago

The great thing about this is that with the decentralization/recentralization of the Web, it may become easier for certain people to roll their own search engines for their respective communities and crawl/index pages only according to their shared tastes.

The bad thing about this is...read above.

risico

5 months ago

One of my dream projects as well, sadly it feels a lot harder to crawl the internet these days, as others have said around here as well.

What are some good practices these days to ensure a good crawl/scrape? Invest in proxies, preferably residential?

mooiedingen

5 months ago

Nothing new as it has been done before, the concept is simple enough: step 1: indexer, solr/lucene Step 2: crawler of which there are several foss, build one yourself? or you just run yacy which is a combo of the above, hook combine with an oldschool searx instance and you will be granted the title as seeker by the spirit of Fravia+ who was elder of the searchlores!!! Not only will you filter crap made by machine learning models, but thou shall find what thou seek! I refuse to call a 16 line long for loop triggering in memory loaded tokenized data where data can be anything from a scientific paper hallucinated by a chatbot to a message between two lovers anything intelligent for it is not intelligence but a blob of tokenized fcking data in memory getting triggered for an output by a derp with a 16 line long for loop!!!
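To make the "step 1: indexer" part concrete, here is a toy inverted index in plain Python. It is only a sketch of the data structure that engines like Solr/Lucene or Xapian implement far more thoroughly (ranking, stemming, on-disk storage); the class and function names are invented for the example.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


class InvertedIndex:
    """Maps each term to the set of URLs whose text contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, url, text):
        """Index one crawled page under every term it contains."""
        for term in tokenize(text):
            self.postings[term].add(url)

    def search(self, query):
        """Return URLs containing every query term (boolean AND, unranked)."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A crawler (step 2) would feed `add()` with each fetched page; ordering the matches is the hard part and is deliberately not attempted here.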

rurban

5 months ago

xapian is easier and faster. No Java memory eater.

I've once built a good company wide search engine with custom crawlers, and result hooks, eg to crazy SAP or other ticket systems. Gmane was also legendary.

rurban

5 months ago

Just switched to Seek Ninja as my default search engine on my Android firefox. No tracking, faster, better than duckduckgo. Now I'm just looking how to get search suggestions enabled.

jp191919

5 months ago

I wonder what his ISP is, and what speed he has to subscribe to...

iam_saurabh

5 months ago

I love stories like this—tech history is full of scrappy beginnings. Even if this project doesn’t succeed, it reminds us that giant companies aren’t unshakable.

ourguile

5 months ago

I greatly prefer Kagi https://help.kagi.com/kagi/company/ but it's very nice to see more competition in this space in general.

eurekin

5 months ago

Kagi user here.

When I started using it (~2 years ago), it was necessary. Google was simply not solving any of my actual issues (software related).

Now it seems that Google might have improved a bit. I check from time to time and the gap isn't as huge as when Kagi started

shayway

5 months ago

How does your experience with Searcha compare? It seems to be down at the moment.

tmdetect

5 months ago

Kagi is a polished product. This is drying someone's laundry.

the_third_wave

5 months ago

[flagged]

tomhow

5 months ago

Please don't post insinuations about astroturfing, shilling, brigading, foreign agents, and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.

https://news.ycombinator.com/newsguidelines.html

dawnerd

5 months ago

Flip side: how much does Google pay you to defend their monopoly? Kagi is a solid product with a team that clearly cares about what they’re building. They’re transparent and post changelogs when things update. I simply trust them infinitely more than Google.

hamdingers

5 months ago

Have you considered it's a good product that causes its users to become advocates?

foobarian

5 months ago

Could also be a form of effort justification. [1]

[1] https://en.wikipedia.org/wiki/Effort_justification

tolerance

5 months ago

> The effect is most likely to occur when there are no obvious reasons for performing the task. Because expending effort to perform a useless or unenjoyable task, or experiencing unpleasant consequences in doing so, is cognitively inconsistent (see cognitive dissonance), people are assumed to shift their evaluations of the task in a positive direction to restore consistency.

I’m not following you.

https://dictionary.apa.org/effort-justification

immibis

5 months ago

It's not limited to physical effort. Wikipedia's example has embarrassment in place of effort; presumably, money could also work.

tolerance

5 months ago

I interpreted it to mean that using a search engine is “useless or unenjoyable, or experiencing unpleasant consequences...”, with attention given to the last two feelings. And I can't figure out what that has to do with people who like Kagi and why it’s wrong or irritating for them to do so.

Granted I’ve been annoyed by similar occurrences with other services, but not to the point of suspecting collusion between the service and the public like the GP comment did.

Searching on the web takes effort. I don’t think this sentiment is controversial. Especially not on HN.

But do you think that because/if searching on the web takes effort and because people have to pay for Kagi, they are compelled to exaggerate its usefulness in public to justify the cost?

immibis

5 months ago

Switching to a non-default technology takes effort, and switching to Kagi in particular also costs money, which also counts as effort for the purposes of the psychological effect known as effort justification. Therefore, people would be likely to rate switching to Kagi as a good thing even if it was exactly the same as Google (says the effect). Therefore, people who say Kagi is good find it exactly the same as Google (implies the commenter).

glenstein

5 months ago

TIL about effort justification! I think signing up for Kagi is not particularly effort-intensive however.

datadrivenangel

5 months ago

Kagi customer here. Not getting paid to shill. I think it's worth occasionally mentioning alternatives that are good enough to pay for so that other people know there are other people using other options.

But full disclosure, sometimes I'm using DuckDuckGo and it's also good enough most of the time that I occasionally forget until I go down some rabbit hole and realize that I'm using the wrong search engine.

jasonvorhe

5 months ago

Whenever I fall back to Google and see how terrible it has become, I feel sorry for everyone still using it as their main search engine, so I tend to link people to Kagi because it's just so much better. Especially the customization aspects. I also like the idea of mainstreaming paying for critical services like search. No paid shilling whatsoever. Back in the early 2000s people used to drop links to Google whenever search engines were discussed because the alternatives were mostly bad.

Today we have Brave and the alternative Bing frontends, but Kagi is still unrivaled because of how easy it is to remove shitty results.

lelandbatey

5 months ago

Nope, it's just a nice thing I like. It is nearly the platonic ideal of a search engine for me. It causes me no problems and doesn't try to sell me garbage.

It's like discovering that there's a better pair of shoes that're more comfortable. Everybody can use a slightly improved, more comfortable pair of shoes, so it comes up frequently.

testdelacc1

5 months ago

Disclaimer: Not a Kagi user. Unlikely to use it.

I just don’t understand people who get so upset that someone might like something enough to talk about liking it. So upset that they won’t ever try the thing. Like … ok I guess? You do you. It’s just a strange way to make decisions.

At least this is just a consumer product. Worse is when people here say they make technical decisions using the same process. They’d blacklist certain tech because they’ve heard people talking about how it solved their problems. Also ok, but now I know I should avoid them professionally.

mdaniel

5 months ago

I get the impression it's the volume of the folks who sing its praises. There was a web3 crowd for a while, Bitwarden champions would show up to any mention of a password manager, and (ahem) some AI champions can be over the top.

In all of these cases, a reasonable counterpoint is that if it were that applicable for all audiences, one wouldn't need to sing its praises; it would sing its own praises.

ufmace

5 months ago

It sings its own praises... how exactly? Maybe by a bunch of happy users talking about how they like it and it's a better solution to the problem that the thread or article is about without being explicitly paid? Which is exactly what's happening here and some people are complaining about it?

testdelacc1

5 months ago

How does a password manager sing its own praises?

koakuma-chan

5 months ago

I tried it: it's slow and bad, the free tier is only 100 requests, it's too expensive, and the price is unjustified. I use Gemini with Google Search grounding.

alexjplant

5 months ago

I understand skepticism in the age of LLM-generated content and CAPTCHA-solving bots. What I don't understand is why people choose such weird hills to die on and think that posting about it will accomplish anything. Do you think people will read your comment and go "gee, I was going to use Kagi but now I won't because this random person has a bad feeling about a series of comments they remember seeing"?

I signed up for a specialist forum not too long ago and posted an honest review of a product because I hadn't been able to find one anywhere on the internet. Immediately a bunch of people accused me of being a "shill" for a direct-to-consumer business that's been powered by a Yahoo storefront for the last 20 years, as though a business that's run by a guy with an AOL e-mail address is sophisticated enough to figure out Fiverr and astroturf their reputation on a phpBB forum.

Think about it for just a moment - do you really think that the Hacker News audience is large enough or full of enough tastemakers to sway an alternative search engine's market share? It isn't. If Kagi wanted to do that they'd hire TikTok influencers.

throwaway290

5 months ago

No one else would pay for search. People on HN are probably 90% of their total possible market.

yelling_cat

5 months ago

I love this project, but that server setup makes me hope that Ryan doesn't live in earthquake country.

ytrt54e

5 months ago

Crashed? The curse of Hacker News!

elite_barnacle

5 months ago

Reminds me of this XKCD: https://xkcd.com/908/

mips_avatar

5 months ago

It’s amazing what indie builders are doing with vector search, but I’m not sure how long it will last. Pure vector search works well today largely because no one is seriously trying to game it yet. Once adversaries start targeting it like they do SEO, we could see the same problems. You can already glimpse the risk in Pinterest, where roughly half the results for many queries are AI slop, since their primary search is built on image vectors.
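
To make the gaming concern concrete, here is a rough sketch of how a pure vector ranker orders results by cosine similarity alone. The three-dimensional vectors and the document names are invented for illustration (real embeddings have hundreds of dimensions and come from a model), and nothing here reflects how Pinterest or any other specific engine works.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, docs):
    """Return document ids sorted by descending cosine similarity to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query_vec, docs[doc_id]), reverse=True)


# Hypothetical embeddings, hand-picked for the example.
DOCS = {
    "genuine-page": [0.9, 0.1, 0.3],
    "unrelated-page": [0.0, 1.0, 0.0],
    # A page tuned so that its embedding sits almost on top of the query.
    "optimized-slop": [0.8, 0.05, 0.6],
}
QUERY = [0.8, 0.0, 0.6]

print(rank(QUERY, DOCS))
```

The ranker has no notion of quality or provenance: whichever page lands closest to the query vector wins, so a page generated to match popular queries can outrank the genuine one.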

vlucas

5 months ago

> “I think it’s definitely lowered the barrier,” Lin says of the LLM’s role in enabling DIY search engines. “To me, it seems like the only barrier to actually competing with Google, creating an alternate search engine, is not so much the technology, it’s mostly the market forces.”

Oh sweet summer child

Oarch

5 months ago

I'm sure there's a money laundering joke in here somewhere

p3rls

5 months ago

I've been thinking that Google could use its own AI to evaluate URLs instead of relying on PageRank and backlinks, which are almost completely valueless as a signal in 2025. In my niche there's more slop than ever being produced daily, and it's all hitting rank 1. It's tragic what Google is doing to the internet.
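
For context on why backlinks are easy to inflate, here is a minimal sketch of textbook PageRank by power iteration. It is the classic published algorithm, not whatever Google runs today, and the tiny link graph (a "slop" page propped up by three link-farm pages versus an "original" page with a single backlink) is a made-up assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outbound links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page first receives its share of the "random jump" probability.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outbound links): spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three link-farm pages all point at "slop".
LINKS = {
    "blog": ["original"],
    "farm1": ["slop"],
    "farm2": ["slop"],
    "farm3": ["slop"],
    "original": [],
    "slop": [],
}

RANKS = pagerank(LINKS)
print(sorted(RANKS, key=RANKS.get, reverse=True)[:2])
```

In this toy graph the page with three manufactured backlinks ends up ranked above the page with one organic link, which is the weakness being described.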