Comment FlashAttention (Score 2) 43
I did some math the other day on running local AI models and the net result is most homes can't afford to run the current median models.
They don't just need 80GB of VRAM; they need a GPU architecture new enough to be supported by current CUDA, which in turn has to be supported by PyTorch, etc.
These problems may well be solvable with more clever use of hardware, MoE, acceptable quantization, etc., but today you're in for several grand and something north of 100W idle to use what is effectively a $20/mo plan.
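Roughly how that math goes, as a back-of-the-envelope sketch rather than a real accounting. Only the 100W idle draw and the $20/mo plan come from the numbers above; the $3,000 hardware price, the four-year useful life, and the $0.15/kWh electricity rate are placeholder assumptions, so plug in your own.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_cost_local(hardware_usd, lifetime_months, idle_watts, usd_per_kwh):
    """Amortized hardware cost plus idle power cost, in USD per month.

    Ignores load power, cooling, and resale value, so it is a lower bound.
    """
    amortized = hardware_usd / lifetime_months
    idle_kwh = idle_watts / 1000 * HOURS_PER_MONTH
    return amortized + idle_kwh * usd_per_kwh


# Assumed inputs (hypothetical, not measured): $3,000 box, 48-month life,
# 100 W idle, $0.15 per kWh.
local = monthly_cost_local(
    hardware_usd=3000, lifetime_months=48, idle_watts=100, usd_per_kwh=0.15
)
subscription = 20.0  # the $20/mo plan mentioned above

print(f"local: ${local:.2f}/mo vs subscription: ${subscription:.2f}/mo")
print(f"ratio: {local / subscription:.1f}x")
```

Even with the box sitting idle the whole time, the local setup comes out several times the subscription price under these assumptions, and most of that is the hardware amortization rather than the electricity.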
A small enterprise can afford local, so that's good. We paid more than that for one SGI machine back in the day.
The point of the exercise was to plot the position on the curve. We're at something like 2006 YouTube where nobody could afford the drives or bandwidth that YouTube/Google was giving away for free (aka with VC money). Eventually hard drives got cheaper, people got gigabit at home, Flash video was replaced with H.264/HTML5, phones could stabilize video locally, etc.
So it looks like these AI companies need to stay alive for about seven more years, giving away product at a loss or at least heavily oversubscribing it, before they can turn a profit. Hence the low token allowances, the banning of OpenClaw, etc.
On the other hand, I read the blog of a security researcher yesterday who found an exploit with (IIRC) Claude, tried to refine the PoC, but got dinged on "out of tokens" before he could finalize it. So he just deleted the work and moved on.
It sounds like they're trying to slow the rate at which they lose money while finding a sweet spot where people don't just declare the product too underpowered to use.
A global energy depression may well take out the supermajority of the companies that believe they can burn investment money for seven more years. There is circular financing money, then there is real return on capital money. One is to fool the markets, the other is grounded in current physics.