Comment Re:Mathematician commentary included (Score 1) 83

by mesterha on Sunday May 31, 2026 @01:42PM (#66168240) Attached to: OpenAI Claims It Solved an 80-Year-Old Math Problem

Kidding aside, doesn't this exchange terrify you just the least little bit ? We're seriously using this technology out-of-the-lab ? Did it ever pass its regression test suite (did it ever have one?) ? Is all the universe a beta for these tools ?

Interesting. I was probably using the same model, but I do have a paid account, so mine comes with thinking, which doesn't seem to be available on the free account. However, I tried again with a browser where I wasn't logged into Google, and the non-thinking 3.5 Flash gave me essentially the same answer I got the first time.

But I don't essentially disagree with either answer. They are just biased toward different perspectives. And somewhat negative answers are not surprising, since a lot of that reflects the opinions in what it was trained on. (LLMs might need therapy.) My opinion is that we don't know what these models are doing, and simplistic statistical explanations that they are just predicting the next most likely word are misleading. While those numbers do drive the algorithm, there is no grounded way to accurately measure the probability of the next word that follows from the previous 10000 tokens. And there is solid evidence it is actually "planning" multiple words in advance. Current models even explicitly build that into the training.

But yes it is a bit problematic for production deployment if these models are so unpredictable. However, I'm more interested in the research side. Both in understanding the models and using them to help with research. For proofs, it's fine. A proof is always independently verified before being published.

Comment Re:Mathematician commentary included (Score 1) 83

by mesterha on Sunday May 24, 2026 @01:39PM (#66158558) Attached to: OpenAI Claims It Solved an 80-Year-Old Math Problem

do llms perform logical inference

I put your query into Gemini. It said:

The short answer is yes, but not in the way a traditional computer program or a human logician does. LLMs can successfully solve complex logic puzzles, deduce conclusions from premises, and write flawless mathematical proofs. However, because they are built on probabilistic text prediction rather than an explicit "logic engine," their reasoning capabilities have fascinating quirks and strict limitations.

It might give a different answer because of memories or just plain randomness. Or it might be that you actually had a longer conversation and pushed it your way. Those models love to please. I hope you didn't just try to get it to agree with you.

Then I asked "How is their logical inference similar to humans particularly when they are allowed to use reasoning techniques and tools", and it said:

When we give an LLM the space to "think" (via extended reasoning chains) and the ability to interact with tools (like code interpreters, calculators, or search engines), its process begins to look remarkably similar to human dual-process cognition. Psychologists often map human intelligence into two systems: System 1 (fast, intuitive, subconscious, and prone to cognitive biases) and System 2 (slow, deliberate, analytical, and rule-following). By default, a standard LLM operates purely like System 1: it spits out the next word based on intuition and pattern matching. But when we equip them with reasoning techniques and tools, they effectively simulate a System 2 mind.

I would guess you are using these models the wrong way. Ask it to clarify issues and define terms. Push against what it says. I see a lot of people using these things as echo chambers to try to win an argument. They don't have the background to understand the topic, so they just skim the answer and query it to get what they are looking for. You can actually learn what generalization means in ML if you just ask the model. This knowledge will give you the ability to say something interesting.

Comment Re:Mathematician commentary included (Score 1) 83

by mesterha on Saturday May 23, 2026 @04:26PM (#66157490) Attached to: OpenAI Claims It Solved an 80-Year-Old Math Problem

Humans have agreed on the notion of implication (in the logical, 100%/Boolean inference sense) for thousands of years, and have been able to check each other's proofs for the same length of time.

You need to brush up on your history of science and math. Notions of proof have changed over time and many mistakes have been found and fixed. Published papers are loaded with mistakes. Most are fixable but not always. And LLMs can fit right into this framework. They can generate proofs, just like humans, that will go through peer review and be accepted or rejected by a consensus of experts. For some subset of problems, we can even formalize the proof and run verification on it.

I'm guessing there's something qualitatively different between present-day neural networks and the human brain, some kind of ability to generalize

Sure they are different, but they do have something in common. We don't really know how either of them successfully generalizes.

Which is not to say that NNs may not exhibit such behavior, I just haven't seen it yet.

Well you don't really seem to know the history or the terminology. You should learn that first. A good way is to ask a modern LLM. They can teach you all about this stuff.

Comment Re:Mathematician commentary included (Score 1) 83

by mesterha on Friday May 22, 2026 @11:08PM (#66156710) Attached to: OpenAI Claims It Solved an 80-Year-Old Math Problem

So what do you mean by statistical model? Some people claim there are logical ML models (which few use) and everything else is a statistical model, so it might be easier for you to define a logical ML model.

It might seem like I'm being pedantic, but it's important to establish these definitions as I feel the statistical qualifier is often misleading. It's often confused with probability which is the easier problem. Probability starts with a model and answers questions about the data the model generates. Statistics is the much harder inverse problem. Start with the data and create a model that generated that data.
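To make the forward/inverse distinction concrete, here is a minimal sketch using a biased coin. The function names, the bias of 0.3, and the sample size are all made up for illustration. Going from model to data is a one-liner; going from data back to a model needs an estimator, and even in this toy case it only recovers the parameter approximately.

```python
import random

def simulate_flips(p_heads, n, rng):
    """Probability (forward problem): the model is known, generate data from it."""
    return [rng.random() < p_heads for _ in range(n)]

def estimate_p_heads(flips):
    """Statistics (inverse problem): only the data is known, infer the model parameter."""
    return sum(flips) / len(flips)

rng = random.Random(0)                      # fixed seed so the run is repeatable
data = simulate_flips(0.3, 10_000, rng)     # 0.3 is a hypothetical "true" bias
estimate = estimate_p_heads(data)
print(f"estimated p(heads) = {estimate:.3f}")  # close to 0.3, but not exact
```

For a real model with billions of parameters, the inverse direction is of course far harder than taking an average.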

So are LLMs building probability models? Sure, but not in the way most people imagine. Does this mean they can't output a logical proof? I don't see any evidence to support this. I'm guessing you're hung up on the fact they don't verify the proof with 100% certainty, but that's also true of humans. And just like a human, if so desired, they can use tools to verify the proofs. Or more typically, they give the result to others and have them verify it.

Comment Re:Mathematician commentary included (Score 1) 83

by mesterha on Friday May 22, 2026 @01:03AM (#66155208) Attached to: OpenAI Claims It Solved an 80-Year-Old Math Problem

The words to describe the process that you have identified are statistical inference, not logical inference. I don't believe that you can square that circle; it's why NNs are said to interpolate, but not extrapolate. But my beliefs are, shall we say, flexible -- I'm open to a counter-argument.

So what is statistical inference to you, and how does it relate to what LLMs do? While we're at it, what is extrapolation? I feel like you're using a lot of terminology in an incorrect or at least imprecise way.

Comment Re:Mathematician commentary included (Score 1) 83

by mesterha on Friday May 22, 2026 @12:56AM (#66155194) Attached to: OpenAI Claims It Solved an 80-Year-Old Math Problem

It's not clear exactly what the AI did, since it was "human-digested, somewhat simplified, and somewhat generalized."

The mathematicians in the paper you linked all contend that the AI came up with the key counterexample. Just because humans rewrote it in a more readable, slightly generalized form doesn't change the fact that the counterexample the AI came up with works for the original conjecture. Are you going to say the same of Grigori Perelman's proof?

This quote from Melanie Matchett Wood is clarifying:

If by clarifying, you mean misleading. Wood is just pointing out that the AI didn't give proper citations, and because it was trained on such related work, it should. This is a famous 80+ year open conjecture, and no one is claiming that it was simple to take the existing work and generate this result. No ideas are truly independent of previous work, and one should include proper citations. For these systems, provenance can be difficult.

Comment Re: So, they invented... (Score 1) 262

by mesterha on Wednesday April 08, 2026 @12:50PM (#66083436) Attached to: CIA Reportedly Used Secret Quantum Tool To Find Downed Airman in Iran

I have some experience with SQUIDS and while you can do some cool things with them the idea you could isolate a human heartbeat beyond a few yards or meters is nonsense.

It's probably what they told Trump. Classic disinformation.

Comment Re: Anyone got examples (Score 0) 61

by mesterha on Tuesday April 07, 2026 @11:52PM (#66082478) Attached to: Anthropic Unveils 'Claude Mythos', Powerful AI With Major Cyber Implications

They have multiple documented patched zero days and provided sha3 verifiable hashes for ones that will be released in the next 135 days.

But I'm sure they trained on this code. It's just repeating its training data. There is no intelligence.

And yes, I'm kidding since otherwise someone will take me seriously.

Comment Re: Nice AI you have here (Score 1) 195

by mesterha on Wednesday February 25, 2026 @01:23PM (#66009834) Attached to: Hegseth Gives Anthropic Until Friday To Back Down on AI Safeguards

I don't trust Trump's uncle. According to Trump, he knew who the Unabomber was well before his family identified him. Of course, that means Trump also knew. That guy is incredible, in the literal sense of the word.

Comment Re:Duh (Score 1) 84

by mesterha on Monday February 23, 2026 @11:45PM (#66006886) Attached to: LLM-Generated Passwords Look Strong but Crack in Hours, Researchers Find

Did you play a bit with the sampler demo?

I must have missed that.

If we have a smart enough LLM, I think there is no reason why it shouldn't produce a uniform distribution while generating letters for the password. Maybe a LLM could be trained to get mostly uniform Top 20 tokens when the current sequence is part of a password to be generated. So I think you have a good point.

In principle, but the current models are smart enough to just run a bash command to generate a password. That password probably uses a better random number generator. The one used by an LLM algorithm to pick tokens is probably fast, but a cryptographer would never use it.
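For reference, this is roughly what "use the proper tool" looks like in Python instead of bash. It is a minimal sketch, not what any particular model actually runs. The 94-character alphabet and the `generate_password` name are my own choices. The `secrets` module draws from the operating system's cryptographic random source.

```python
import secrets
import string

# Assumed alphabet: letters, digits, and punctuation (94 printable ASCII characters).
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 16) -> str:
    """Pick each character independently and uniformly using the OS CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(generate_password())  # different on every run
```

The point is that every string of the given length is equally likely, which is exactly the property token sampling from an LLM does not give you.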

They for example prompt an image model "Create a unique video game character of a plumber" or similar and wonder why they get Mario, when they said "unique".

I tried with Gemini. I guess the result is good. It actually generated a video which isn't what I expected. It created a futuristic Asian cyborg plumber working in a boiler room. That also might fit with what I've seen with unique Gemini queries. Things are moving fast and the uniqueness issue might have been addressed as a random combination of "things" given the constraints of the query. However, I can't test many videos as I can only generate 2 a day.

Do that and you can write a paper about it. Even if it only works partially, a good analysis will be an interesting result.

I've kind of given up writing papers. So many papers get submitted that they all get dumped onto grad students who generally don't know what they are doing. I guess now it's LLMs that review the papers. With the current state of the art, they would do a better job.

Comment Re:Duh (Score 1) 84

by mesterha on Sunday February 22, 2026 @12:45PM (#66004110) Attached to: LLM-Generated Passwords Look Strong but Crack in Hours, Researchers Find

Again I appreciate the effort. It's clear you're thinking about the problem in a way that helps understanding.

First: Passwords. If you ask for a good password, the likely answer is a good password. In isolation it probably is a good password even when you risk that it has the properties of example passwords in the training set.

It probably looks like a typical password, but a good password doesn't exist in isolation. What you really want is a set of potential passwords with a way to pick uniformly over that set.

A good LLM may now determine that it should for example not sample the same token again and again, as it learned that that is a easy-to-crack password.

Maybe yes maybe no. That is not a good way to generate passwords and it might "understand" documentation that explains that.

You already see, that the password entropy is now just 3^length. If you would sample uniformly from ALL tokens, you would get a password generator. But all other parts of the output would also look like passwords.

Entropy is normally measured in bits, so you need to take the lg of that, but it's probably more intuitive to talk about the number of possible outputs, so OK.

What you are saying is that it's not a good password generator. Sure. This is consistent with the article and with what I said. The LLM can use randomness to generate about 1/4 of the entropy you would get from uniform sampling over the set of all character strings of length 16. In terms of counting, that's maybe 2^25, which is only about 33 million passwords and can be brute-forced in certain situations. That's a big difference from 2^100, which is about a quadrillion quadrillions.
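The counting is easy to check. The sketch below assumes an alphabet of 94 printable ASCII characters, which is my assumption and gives slightly more than the round 2^100 figure. The function name is made up for illustration.

```python
import math

def uniform_entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a string whose characters are drawn independently and uniformly."""
    return length * math.log2(alphabet_size)

# Assumed alphabet: 94 printable ASCII characters.
full_bits = uniform_entropy_bits(94, 16)
print(f"uniform 16-character password: about {full_bits:.0f} bits")
print(f"a quarter of that: about {full_bits / 4:.0f} bits")

# The two counts being compared.
print(f"2^25  = {2**25:,} candidates")
print(f"2^100 = {2**100:,} candidates")
```

The first count has 8 digits and the second has 31, which is the whole argument in one line.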

So I agree, it's a bad password generator, but it still has access to randomness, so it can generate passwords; they are just not optimal for their length. (The more troubling part is that it sometimes fails to do the right thing and might generate some passwords with much higher probability.)

Unique is a bit similar problem. People complained for example that image generators do not understand "Generate a unique SUBJECT".

As I said, it's not very well defined. Dictionaries have multiple definitions for a reason. This alone is enough to cause issues, but it gets more complex by speculating what the LLM is doing. However, it is an interesting question.

In image generation it may be an image that was tagged as "So very unique", but if all generations get close the concepts in that image, none of them is unique anymore. The LLM test before associated fantasy styles names with "unique".

Mathematically, if they are close, they are still unique. In this context, what a human probably wants is something semantically meaningful that is sufficiently far, under some metric, from all other images that have ever been generated. Of course, without knowledge of what all other generators are doing, unique, in this sense, is impossible. So at best you're back to a probabilistic solution. You need to define a space of images and have the algorithm uniformly pick from that space. Just playing with an LLM, it seems to do a simple form of this. It combines a bunch of semantic ideas to create a "unique" image. On Gemini, I got a clock owl on a frozen lightning branch in a cosmic storm.

These things work by "associations" and when they are primed on something they often follow predictable patterns.

Yes, there was a blog that talked about how if you played bad chess with an LLM it would also play bad, but if you played a strong game against it, it would improve.

Just read a bit the complaints in the creative writing communities. Try to google for some stories about Elara and you see how LLM even fail at creating unique names.

Interesting, and it still shows up in current models. It's probably a bit of manifest destiny since this issue goes back to 2022. However, when I ask for a unique name for my sci-fi lead, the model claims to do something similar to what I described with images and comes up with some weird names: Vyrith-Esh, Xylanthe-Vane, Kyzant-Nu, Zylpha-Kore.

Comment Re:Duh (Score 1) 84

by mesterha on Saturday February 21, 2026 @01:40PM (#66002858) Attached to: LLM-Generated Passwords Look Strong but Crack in Hours, Researchers Find

I appreciate the effort in your answer, but I still disagree. First, it's important to establish the accepted definition of random, which is not trivial. As I said, I'm using pseudorandom (otherwise one needs specialized hardware or an argument based on the randomness of inputs that are used to generate an entropy pool), which means technically it's deterministic, but it satisfies various randomness tests.

When doing such a test, you repeatedly call the generator, which they don't do. Instead they make a new call to the LLM, which seems similar to calling the pseudorandom generator with the same random seed, which would always give the same answer. They do get some variety, though, probably because the generator the LLM uses on the server is in a different state on each call.
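The seed behavior is easy to demonstrate with any pseudorandom generator. This is only an analogy: I have no idea how the servers actually seed their samplers, and the `sample_password` name and alphabet are made up. Python's `random` module is used here because it can be seeded; it is not suitable for real passwords.

```python
import random

def sample_password(seed, length=16, alphabet="abcdefghijklmnopqrstuvwxyz0123456789"):
    """Draw characters from a PRNG that starts in a state fixed by the seed."""
    rng = random.Random(seed)  # Mersenne Twister: fine for a demo, not for cryptography
    return "".join(rng.choice(alphabet) for _ in range(length))

same_a = sample_password(seed=42)
same_b = sample_password(seed=42)
other = sample_password(seed=43)

print(same_a == same_b)  # True: same starting state, same "random" output
print(same_a, other)     # a different starting state gives a different string
```

So a proper randomness test keeps calling one generator, while the cold-query test restarts it each time and relies on the starting state being different.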

Instead, what they are doing is a use case test where different people get passwords based on a cold query to an LLM. And I guess it fails, but the article's "claims" are a bit inconsistent, so who knows what the actual research claims. It's also interesting that if I ask Claude Code for a password, it runs a Linux command to generate a password, which is the right answer. Just like a human, you should use the proper tool.

So what about your claim that because an LLM has no memory, it can't generate something "unique"? I'm not sure why you focus on unique and names when the topic is passwords. You go from something well defined to something less defined, but whatever.

In principle, I could train (with a suitable training set) an LLM so that, when I want to generate passwords, it has uniform probability over the tokens that consist of single allowed password characters. Of course, training an LLM to do that doesn't make sense since the right answer is to give it a tool to do it, but here we're talking about what an LLM can do given it has no memory. The specific algorithms that commercial models use to make the choice are unknown, but as you explained, it will pick using information from the distribution with a pseudorandom number generator. Let's assume it just picks among the top 3 choices and that the conditional probability is uniform over those 3. This gives an entropy of about 1.58 bits per token, which works out to about 25 bits for a 16-character string (assuming one token per character), which matches well with the "claimed" results. So maybe it is doing something like that, except when it hallucinates and just does the wrong thing.
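Here is the arithmetic behind that estimate as a minimal sketch. The uniform top-3 distribution is the assumption stated above, and the skewed distribution is a made-up example to show how quickly the entropy drops when one token dominates.

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed sampler: uniform over the top 3 tokens, one token per character.
per_token = shannon_entropy_bits([1/3, 1/3, 1/3])
print(f"{per_token:.2f} bits per token")
print(f"{16 * per_token:.1f} bits for 16 tokens")

# Hypothetical skewed top 3: the entropy is lower than in the uniform case.
skewed = shannon_entropy_bits([0.8, 0.15, 0.05])
print(f"{skewed:.2f} bits per token when one choice dominates")
```

Uniform is the best case for a fixed number of choices, so a real sampler restricted to the top 3 would land at or below that 16-token figure.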

Comment Re:Duh (Score 1) 84

by mesterha on Friday February 20, 2026 @08:33PM (#66002092) Attached to: LLM-Generated Passwords Look Strong but Crack in Hours, Researchers Find

As a LLM has no memory, asking it for something "unique" can't work.

Wrong. An LLM uses randomness to generate its answers. In practice most likely a pseudorandom number generator, but that's generally considered good enough. Also the article claims they get 27 bits of entropy out of 16 characters. That means they are claiming it is random. And that's probably better than humans who pick their own passwords.

Comment Re:High end cables are a waste of money. (Score 1) 101

by mesterha on Wednesday February 18, 2026 @12:11AM (#65995744) Attached to: Blind Listening Test Finds Audiophiles Unable To Distinguish Copper Cable From a Banana or Wet Mud

Those high end cables are a waste of money.

Probably true, but peer reviewed audio research does show that they sound consistently different. See https://sites.google.com/view/...

As explained in the research, quickly switching between sources doesn't allow people to properly hear the subtle differences in these high end systems. When extending blind listening sessions to at least 15 minutes, untrained participants could start hearing differences between cables with strong statistical confidence.

While it's not understood why this is the case, there are several theories. One is that it's difficult to forget the details of what you just heard when rapidly switching. Unlike vision, we can't just reinspect something we hear by looking again. The brain has something called echoic memory, which stores 3 to 4 seconds of auditory information and could be causing issues when rapidly switching. There are also theories that a large part of the brain is designed to predict the future. Again, this could obscure differences between two signals that are so similar. The brain's predictive abilities would just fill in any missing details. Last, and somewhat contradictory to the others, is that when quickly comparing two sources you're forcing your brain to use working memory to hold differences. Long-term testing lets you use more effective types of memory for making complex distinctions.

Basically, it kind of means that the double-blind tests that have been used for most audio testing are flawed. It's kind of like using an optical illusion to measure distance.

Comment Re:AI Hype needs money (Score 1) 106

by mesterha on Saturday February 14, 2026 @01:24PM (#65988896) Attached to: Spotify Says Its Best Developers Haven't Written a Line of Code Since December, Thanks To AI

It would be much more useful if you specify which AI tools you are using. They are not all equal.
