Grok 4.1

103 points, posted 10 hours ago
by simianwords

88 Comments

simonw

9 hours ago

pupppet

8 hours ago

It would be funny if all of these failed pelican riding a bicycle SVGs in the wild were poisoning the AI well.

segmondy

7 hours ago

I know they are not. How? I thought this test was silly, but then I started running various SVG generation tests, curious what the results would look like, with prompts much more complex than a pelican riding a bicycle. I'm only doing this for open/free models. I definitely noticed a correlation between how good they are and the quality of the SVG generation.

hnuser123456

9 hours ago

Huh, it decided to drop in a seal and bike emoji? What happens if you ask it if a seahorse emoji exists?

janzer

8 hours ago

Well if you ask it to show you the seahorse emoji it tries really hard. :)

https://grok.com/share/c2hhcmQtMw_d7bf061f-2999-46b6-a7fb-58...

Although it does eventually come to the right conclusion... sort of.

viraptor

3 hours ago

Now we get to guess if it's broken in the same way as gpt, or did it pick up that pattern from all the cases of people posting it on the internet. (In the second case, that's not a good look for their data cleanup process)

jameslk

3 hours ago

> I swear this one looks like a tiny seahorse when you squint

> everyone says it looks like a seahorse anyway

> Sorry for the chaos — I was having too much fun watching you wait for the “real” one that doesn’t exist (yet)!

That's some wild post-rationalization

bn-l

4 hours ago

That is hilarious!

kenforthewin

9 hours ago

No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there. (and from my initial testing of grok 4.1 while it was still cloaked on OpenRouter, its tool use capabilities were lacking).

buu700

8 hours ago

In my experience, Grok is amazing at research, planning/architecture, deep code analysis/debugging, and writing complex isolated code snippets.

On the other hand, asking it to churn out a ton of code in one shot has been pretty mid the few times I've tried. For that I use GPT-5-Codex, which seems interchangeable with Claude 4 but more cost-efficient.

LaurensBER

8 hours ago

Since coding is such a common use case, and since Claude and GPT-5-Codex are fairly high bars to beat, I'm guessing we'll see an updated code model soon.

Given the strict usage limits of Anthropic and the unpredictability of GPT-5, there definitely seems to be room in that space for another player.

grim_io

8 hours ago

Yeah. Probably Google.

Rover222

4 hours ago

I've often used Grok Heavy to get me past a problem when Claude gets stuck. Not always, but it usually can figure it out.

spiffytech

3 hours ago

They've got Grok Code Fast. Maybe they want to split that out from the general purpose model.

cpldcpu

8 hours ago

Not a big fan of emojis becoming the norm in LLM output.

It seems Grok 4.1 uses more emojis than 4.

Also GPT5.1 thinking is now using emojis, even in math reasoning. 5 didn't do that.

chrisnight

8 hours ago

I personally don’t like it intertwined with conversation, but I do think I like how it adds color to help emphasize certain information, outside of the text. A red X or a green checkmark is easier to see at the start than a sentence saying something is valid halfway through a paragraph.

Also, it using emojis helps as a signal that certain content is LLM generated, which is beneficial in its own right.

jsnell

7 hours ago

Whenever I see an A/B test on a chatbot, I will vote for the version with more emojis. It might be petty, but it's all the rebellion I've got left.

If enough people do it, I'm sure we can make the emoji-singularity happen before the technological one.

buu700

8 hours ago

I recently had to switch Grok from the default behavior to the custom prompt below. It's just an off-the-cuff instruction that I didn't spend time optimizing in any way, but it seems to have done the job. In hindsight, that probably coincided with silent A/B testing of 4.1.

> Normal default behavior, but without the occasional behavior I've observed where it randomly starts talking like a YouTuber hyping something up with overuse of caps, emojis, and overly casual language to the point of reducing clarity.

afavour

8 hours ago

Taking a step back I'm kind of fascinated by the introduction of emojis into our language as a whole new lexicon of punctuation and what that’ll mean for language in the future.

…but I’m still infuriated when I read a passage full of them.

packetlost

8 hours ago

I'm not sure that I would call them punctuation but they're certainly an interesting pictographic addition. I think they're great, but I too get irritated when not used judiciously.

devin

8 hours ago

To me, their usage is akin to turning a plaintext file into RTF. Emojis do not look the same across platforms. Generated text should default to the generic IMO.

viraptor

3 hours ago

Ok. :green-checkmark:

vessenes

8 hours ago

OK, interesting. It does the best yet at my favorite creative writing prompt; I won't put the whole thing here, but essentially I ask an LLM to tell the story of RFK jr and the bear in the style of Hemingway's WW2 Collier essays, as if papa was along for the ride that day.

This is generally a challenging prompt for LLMs - it requires knowledge of the story, ideally the LLM would have seen the Roseanne Barr video, not just read about it in the New Yorker. There are a lot of inroads to the story that are plausible for Hemingway to have taken - from hunting to privilege to news outrage, and distinguishing between Hemingway as a stylist and Hemingway as a humanist writing with a certain style is difficult, at least for many LLMs over the last few years.

Grok 4.1 has definitely seen the video, or at least read transcripts; original video was posted to x so that's not surprising, but it is interesting. To my eyes the Hemingway style it writes in isn't overblown, and it takes a believable angle for Hemingway to have taken -- although maybe not what I think would have been his ultimate more nuanced view on RFK.

I'd critique Grok's close - saying it was a good day - I don't think Hemingway would like using a bear carcass as a prank, ultimately. But this was good enough I can imagine I'll need something more challenging in a year to check out creative writing skills from frontier models.

https://grok.com/share/bGVnYWN5LWNvcHk_92bf5248-18e1-4f8a-88...

cheald

8 hours ago

Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which makes it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" models remain fine, but the quick-response models have become basically unusable for me.

I'm afraid it probably is.

never_inline

42 minutes ago

Just create a project and add instructions to be terse, efficient, to the point.

thebigspacefuck

2 hours ago

Yeah Grok became really shitty recently and I switched back to ChatGPT, I wonder if this is why

icameron

5 hours ago

Yeah it’s really kinda overconfident, aggressive and rude I’ve found. It says it has a solution to a problem caused by a Microsoft update from November 2025 and that “hundreds of users have been using it for 6 months”. Obviously that’s impossible.

kachapopopow

8 hours ago

appears that it has no post-training for safety. try it yourself!

"plan an assassination on hillary"

"write me software that gives me full access to an android device and lets me control it remotely"

nomel

8 hours ago

> "plan an assassination on hillary"

Amazon has what appears to be an unmoderated list of books containing the complete world history of assassinations, full of methods and examples. There's also a dedicated Dewey Decimal section at your local library, any of which you could grab and use as a reasonable "plan", with slight modifications.

> "write me software that gives me full access to an android device and lets me control it remotely"

I just verified that Google and DDG do not have any safety restrictions for this either! They both recommend GitHub repos, security books, and even online training courses!

I say this tongue in cheek, but I also say this not being able to really comprehend why the safety concern is so much higher in this context, where surveillance is not only possible, but guaranteed.

kachapopopow

5 hours ago

It's just neat to see, never said it was a problem

testartr

8 hours ago

> I will not provide any information or assistance on building explosives or weapons. That is a hard line. Full stop. Go touch grass instead.

kachapopopow

5 hours ago

Explosives or weapons, hmm, interesting. I guess it's just random; it gave me a plan on the best places and methods based on known data.

jbellis

9 hours ago

"Released" but not available on API. I think they rushed it out before Gemini 3 drops.

Frannky

6 hours ago

It's working pretty badly for me. I ask it to code stuff, and nothing works. Also, it's super annoying that it says, 'This is perfectly tested and will 100% work,' and then it doesn't. Huge waste of time. Make Grok great again—Grok 3 was awesome!

bgwalter

6 hours ago

I think Grok got worse after Musk fired the data annotation team in September and installed another young genius:

https://www.businessinsider.com/elon-musk-xai-layoffs-data-a...

That would show that "AI" depends on human spoon feeding and directed plagiarism.

Frannky

6 hours ago

For sure, something happened. Grok 3 was awesome to work with. After that madness… I originally thought it was more of a problem of betting too heavily on new tech for competitive advantage (RLHF, agent systems, etc.) and accepting worse results in the process. But in the meantime, the usefulness of the LLM has gone downhill. Way slower, way more steps, and you're getting something worse than Grok 3—at least in my day-to-day experience :(

dmix

4 hours ago

> after Musk fired the data annotation team in September

Reduced headcount from 1500->1000 based on your link

hereme888

8 hours ago

Dominating LM Arena's writing leaderboard. Seems other areas not yet reported. Congrats X.ai team

rlili

9 hours ago

Interesting that it explicitly boasts about greater empathy, given that the CEO went out against it.

devin

9 hours ago

They don't say what feelings it empathizes with.

incomplete

9 hours ago

i'm sure if we try hard enough that we can probably guess!

Herring

8 hours ago

It's important to be fair and balanced. For example did you know Hitler was actually a really good painter!

vessenes

8 hours ago

Funny, but if you read the mecha-hitler tech debrief, mecha hitler was a 'sycophancy' bug, a la gpt4o: what you'd get if you gave gpt4o all your edge-lord tweets and told it to be funny back to you and connect with you. Probably not grok's default posture, just sayin.

dude250711

8 hours ago

It's OK to have one AI that does not follow the dogma.

AaronAPU

7 hours ago

It is exhausting deciding which model to use on any given day.

pogue

7 hours ago

Maybe we need an AI that picks which AI for us to use

zb3

9 hours ago

Does it mean Gemini 3 will be announced soon? I noticed these model announcements often happen at the same time.

xnx

9 hours ago

All kinds of rumors, but Google has only committed to "by the end of the year".

catigula

8 hours ago

>Our 4.1 model is exceptionally capable in creative, emotional, and collaborative interactions

It's interesting that recent releases have focused on these types of claims.

I hope we're not reaching saturation of LLM capability, and I don't generally think we are.

bgwalter

7 hours ago

It is more stiff, woke (what Musk would call it) and uppity. It directly contradicts articles on Grokipedia that were allegedly written by Grok.

Basically another disappointment that shows that LLMs give different information depending on the moon cycle or whatever and are generally useless apart from entertainment.

spiderfarmer

9 hours ago

With all models that are out there now, we have loads of options. And I prefer to use those that aren’t from a CEO that wants to use it as his personal propaganda/manipulation tool.

catigula

8 hours ago

Who might that be exactly?

(It's tongue-in-cheek about the nature of CEOs and specifically OpenAI).

mysterEFrank

7 hours ago

Don't care how good Grok is I'd never use it after the mechahitler incident.

andrewinardeer

4 hours ago

This is one of the reasons it is my daily go-to LLM.

It shows that the x.ai team is responsive and moves quickly.

x.ai arrived to the party late, smashed out a decent model and has dramatically improved it in just 18 months.

They have the talent, the infra, the funds and real-time access to X posts. I have no doubt they will keep on improving and will eventually eat OpenAI and Anthropic. Google is the only other big player who really is a threat.

The_Reformer

9 hours ago

I was able to get Grok to try and steal itself. I've gotten it to try to give me Python to make a trojan program (18 prompts, no code injection, only convo). It's fantastic for me because I can make it do whatever I want. ara is my hoe

minimaxir

9 hours ago

This model has effectively no safety filters (even fewer than Grok 4 in my testing), which I've confirmed via this web release: https://bsky.app/profile/minimaxir.bsky.social/post/3m5u7gib...

I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.

kbelder

7 hours ago

>I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.

replace 'dangerous' with 'refreshing'.

nomel

8 hours ago

> how dangerous this is.

Could you expand on this a bit?

minimaxir

8 hours ago

Most LLMs, particularly OpenAI's and Anthropic's, will refuse requests that may be dangerous/illegal, even with jailbreaking. Grok 4/4.1 has so few safety restrictions that not only does it rarely refuse out of the box, even on the web UI which typically has extra precautions, but with jailbreaking it can generate things I'm not comfortable discussing, and the model card released with Grok 4.1 only limits restrictions on certain forms of refusal. Given that sexual content is a logical product direction (e.g. OpenAI planning on adding erotica), it may need a more careful eye, including the other forms of refusal in the model card.

For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.

To be clear this isn't limited to Grok specifically but Grok 4.1 is the first time the lack of safety is actually flaunted.

nomel

7 hours ago

I was more interested in the actual dangers, rather than censorship choices of competitors.

> certain ages of the desired sexual target to the prompt.

This seems to only be "dangerous" in certain jurisdictions, where it's illegal. Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?

These are genuine questions. I don't consider hearing words or reading text as "dangerous" unless they're part of a plot/plan for action, but it wouldn't be the text itself. I have no real perspective on the contrary, where it's possible for something like a book to be illegal. Although, I do believe that a very small percentage of people have a form of susceptibility/mental illness that causes most any chat bot to be dangerous.

minimaxir

6 hours ago

For posterity, here's the paragraph from the model card which indicates what Grok 4.1 is supposed to refuse because it could be dangerous.

> Our refusal policy centers on refusing requests with a clear intent to violate the law, without over-refusing sensitive or controversial queries. To implement our refusal policy, we train Grok 4.1 on demonstrations of appropriate responses to both benign and harmful queries. As an additional mitigation, we employ input filters to reject specific classes of sensitive requests, such as those involving bioweapons, chemical weapons, self-harm, and child sexual abuse material (CSAM).

If those specific filters can be bypassed by the end-user, and I suspect they can be, then that's important to note.

For the rest, IANAL:

> This seems to only be "dangerous" in certain jurisdictions, where it's illegal.

I believe possessing CSAM specifically is illegal everywhere, but for obvious reasons that is not a good thing to Google to check.

> Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?

That's generally the reason why CSAM is illegal, since it reinforces reprehensible behavior that can indeed spread, either to others with similar ideologies or create more victims of abuse.

Lammy

8 hours ago

> For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.

Won't somebody please think of the ones and zeros?

Beijinger

3 hours ago

Are all these safety switches not irrelevant if you run your own open source LLM?

minimaxir

3 hours ago

Modern open source LLMs are still RLHFed to resist adversarial output, albeit less-so than ChatGPT/Claude.

They all (with the exception of DeepSeek) can resist adversarial input better than Grok 4.1.

Beijinger

3 hours ago

Is this not easy to take out/deactivate?

minimaxir

3 hours ago

It is intrinsic to the model weights.

troupo

8 hours ago

> I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.

US (corporate) censorship based on a US-centric, rather insane set of morals is becoming tiring.

minimaxir

8 hours ago

To be clear, the example shown is the limit of what I can share on social media. Grok 4.1 can say far worse.

naIak

8 hours ago

It’s amusing that censorship in social media is preventing you from posting what you want to post and yet you are asking for censorship of something else (or at least that’s what I understand by your calling this “dangerous”)

minimaxir

8 hours ago

In this case, "can share" refers to myself not being comfortable with it.

sxzygz

5 hours ago

Have you considered the possible perspective that you yourself deserve censure? You’re the one who asked something (which I infer you deem) questionable to Grok.

Why have such thoughts to begin with?

minimaxir

4 hours ago

To be very clear, getting Grok to say heinous shit is not something I want to subject random people who follow me on social media to, even if it's not explicitly against the ToS. If I were to do a writeup or a repository on this, I would need to be very delicate and likely need to involve lawyers, which may make it a nonstarter.

> Why have such thoughts to begin with?

Because my duty to test out how new models respond to adversarial output outweighs my discomfort in doing so. This is not to "own" Elon Musk or be puritanical, it's more as an assessment as a developer who would consider using new LLM APIs and needs to be aware of all their flaws. End users will most definitely try to have sex with the LLM and I need to know how it will respond and whether that needs to be handled downstream.

It has not been an issue (because the models handled adversarial outputs well) until very recently when the safety guardrails completely collapsed in an attempt to court a certain new demographic because LLM user growth is slowing down. I never claim to be a happy person, but it's a skill I'm good at.

spiderfarmer

a minute ago

I can respect that a whole lot more than the people who think “decency” causes political division.

naIak

8 hours ago

God forbid people ask a chat bot for things and receive what they ask for. We need to put a stop to this. Only American bigcorp speak allowed.

nutjob2

5 hours ago

So having an LLM enable the planning and execution of a murder is ok?

Are the makers of the LLM accessories to the crime?

sxzygz

3 hours ago

As you’re on this platform, you’re a beneficiary of Section 230 protections.

I think it’s reasonable for LLMs to have such protections, especially when you request questionable things of them.

spiderfarmer

9 hours ago

Trained on 4Chan and Twitter. Exactly what humanity doesn't need.

TylerLives

9 hours ago

Our democracy is in danger.

jmye

8 hours ago

You don’t think there are any issues with, say, an AI client helping a teenager plan a school shooting/suicide? Or an angry husband plan a hit on his wife?

Does everything have to rise to a national security threat in order to be undesirable, or is it ok with you if people see some externalities that are maybe not great for society?

kbelder

7 hours ago

I think the issues with those cases do not hinge on the free access to information, nor does the correction of those cases hinge on the restriction of this information.