Comment Exactly Forward (Score 1) 39

I don't give a shit if some Russian/Kazakh/Malaysian bot farmer wants to take over my phone.

So you do no banking on your phone? Unlikely.

For the 99% of people who do in fact use a phone for banking, protection from lower-level criminals is invaluable. For most people a phone takeover means real potential financial loss, at the very least from an attacker monitoring their banking access.

Submission + - GPT-OSS:20b Disappointment On Linux (josephwilliambaker.com)

j0ebaker writes:

# Why GPT-OSS:20B Runs at 7 Tokens/Second on Your RTX 4090 Laptop: A Deep Dive into LLM Performance Bottlenecks

After a day of testing, debugging, and optimization attempts with GPT-OSS:20B on a high-end RTX 4090 laptop, we've uncovered some sobering truths about why even powerful consumer hardware struggles to reach the performance levels we expect. This isn't just a Linux problem; it's a fundamental architectural limitation in how large language models interact with modern GPU hardware.

## The Performance Reality Check

Our RTX 4090 laptop with 16GB VRAM, an Intel i9-13900HX, and Ubuntu 24.04 achieved a maximum of **7.4 tokens/second** after extensive optimization. This is with a 20-billion-parameter model that should, on paper, run much faster. Here's what we discovered.

## The Layer Offloading Problem: Why 20% Must Stay on CPU

The most significant finding was that **only 20 out of 25 layers** (80%) could be offloaded to the GPU, leaving 5 layers permanently on the CPU. This isn't a bug—it's a fundamental constraint:

```
GPU Memory Allocation:
- Model weights: ~10GB (quantized)
- KV cache: ~2-4GB (16K context)
- GPU overhead: ~1GB
- Total: ~15GB of 16GB VRAM
```
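A quick way to sanity-check this split on your own box is to compare what the driver and Ollama each report (the model tag is whatever you pulled; `gpt-oss:20b` is an assumption here):

```bash
# What the driver sees: VRAM used vs. total
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# What Ollama sees: the PROCESSOR column shows the CPU/GPU split
# (e.g. "20%/80% CPU/GPU") for each loaded model
ollama ps
```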

The remaining 5 layers (20% of the model) must run on CPU because:
1. **Memory fragmentation**: Even with 16GB VRAM, contiguous memory allocation fails for the full model
2. **CUDA kernel overhead**: Each layer requires additional memory for temporary tensors
3. **Context window expansion**: 16K tokens consume significant KV cache memory (a rough sizing sketch follows below)
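For intuition on item 3, here is a minimal back-of-envelope KV cache sizing sketch. The layer and head dimensions are illustrative assumptions, not GPT-OSS:20B's actual configuration:

```bash
# KV cache bytes = 2 (K and V) x layers x context x kv_heads x head_dim x bytes/elem
# All dimensions below are assumed for illustration only.
LAYERS=24 CTX=16384 KV_HEADS=8 HEAD_DIM=128 BYTES=2   # f16 cache
KV_BYTES=$((2 * LAYERS * CTX * KV_HEADS * HEAD_DIM * BYTES))
echo "KV cache: $((KV_BYTES / 1024 / 1024)) MiB"      # ~1536 MiB with these numbers
```

Double the context and the cache doubles with it, which is why 16K tokens eat gigabytes while 2K tokens barely register.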

## Is This Ollama's Fault on Linux?

**Partially, but not entirely.** Our testing revealed several Linux-specific issues:

### Ollama's Linux Limitations
- **Service configuration**: Default systemd setup doesn't expose GPU devices properly
- **Memory allocation**: Linux memory management is more conservative than Windows
- **Driver integration**: CUDA 12.6 on Linux has different memory allocation patterns

### The Fix That Actually Worked
```bash
# Systemd override required for proper GPU access
[Service]
Environment="OLLAMA_LLM_LIBRARY=cuda"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=f16"
Environment="OLLAMA_NUM_THREADS=20"
```

This configuration improved performance from **2-3 tokens/second** to **7.4 tokens/second**, but hit a hard ceiling.
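For anyone reproducing this on a default Linux install, the override goes in a systemd drop-in, applied like so:

```bash
# Open (or create) the drop-in for the ollama unit and paste the
# [Service] block above, then reload and restart:
sudo systemctl edit ollama
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Confirm the environment variables actually landed
systemctl show ollama --property=Environment
```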

## The LLM Architecture Bottleneck

The real limitation isn't software—it's the model architecture itself:

### Memory Bandwidth vs. Compute
- **RTX 4090 Memory Bandwidth**: 1,008 GB/s (theoretical)
- **Actual Utilization**: ~60-70% due to memory access patterns
- **Sequential Nature**: Each layer must complete before the next begins
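A rough roofline sketch makes the bandwidth ceiling concrete. Every constant below is an assumption taken from the figures above, so treat it as an upper-bound estimate, not a prediction:

```bash
# Decode is bandwidth-bound: each generated token streams roughly the
# full quantized weight set from memory, so
#   tokens/sec <= effective_bandwidth / bytes_read_per_token
WEIGHTS_GB=10   # quantized weights resident on GPU (assumed)
BW_GBS=1008     # RTX 4090 theoretical bandwidth
UTIL=0.65       # ~65% realistic utilization
echo "scale=1; $BW_GBS * $UTIL / $WEIGHTS_GB" | bc   # ~65 tokens/sec ceiling
```

The gap between that ~65 tokens/sec ceiling and the observed 7.4 is exactly where the CPU-resident layers and per-token synchronization show up.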

*** Warning: the next few lines may be an AI hallucination artifact. Yes, I used the CLINE plugin for VS Code to perform all these tests, and I don't think we actually did any quantization work...
### Quantization Trade-offs
We tested various quantization levels:
- **Q4_K_M**: 4-bit quantization, ~7.4 tokens/sec, acceptable quality loss
- **Q8_0**: 8-bit quantization, ~5.2 tokens/sec, better quality
- **FP16**: 16-bit precision, ~3.8 tokens/sec, best quality but memory intensive
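If you want to reproduce numbers like these yourself, `ollama run --verbose` prints an `eval rate` line after each response. The quantization-specific tags below are hypothetical placeholders; substitute whatever tags your registry actually offers:

```bash
# Benchmark decode speed across quantizations; "eval rate" is the
# tokens/sec figure. The tags are hypothetical placeholders.
for tag in gpt-oss:20b-q4_K_M gpt-oss:20b-q8_0; do
  echo "== $tag =="
  ollama run "$tag" --verbose "Explain KV caching in one paragraph." 2>&1 \
    | grep "eval rate"
done
```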

## Why CPU Layers Are Inevitable

Even with optimization, some layers must run on CPU because:

1. **Attention Mechanisms**: The attention computation requires significant temporary memory
2. **Layer Normalization**: These small, memory-bound reductions gain little from GPU parallelism
3. **Residual Connections**: Memory bandwidth becomes the bottleneck, not compute
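Whatever the root cause, you can at least pin where the split lands instead of letting the runtime guess. `num_gpu` is Ollama's offloaded-layer-count parameter; the tag and layer count here mirror the setup described above:

```bash
# Pin exactly 20 layers on the GPU via a derived model
cat > Modelfile <<'EOF'
FROM gpt-oss:20b
PARAMETER num_gpu 20
EOF
ollama create gpt-oss-20b-pinned -f Modelfile
ollama run gpt-oss-20b-pinned "hello"   # then check the split with: ollama ps
```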

## The Flash Attention Paradox

Flash attention provided significant memory savings (50% reduction in KV cache), but revealed another bottleneck:

```
With Flash Attention:
- Memory usage: 11.6GB GPU
- Performance: 7.4 tokens/sec
- Context: 16K tokens

Without Flash Attention:
- Memory usage: 14.8GB GPU
- Performance: 4.2 tokens/sec
- Context: 8K tokens (max)
```

The memory savings allowed more layers on GPU, but the sequential nature of transformer inference still limited performance.
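Since `OLLAMA_FLASH_ATTENTION` is a server-level toggle, the cleanest A/B comparison is to stop the service and launch the server by hand in each mode (model tag assumed, as before):

```bash
sudo systemctl stop ollama                 # free the port first
for fa in 0 1; do
  OLLAMA_FLASH_ATTENTION=$fa ollama serve &
  SERVER_PID=$!
  sleep 5                                  # give the server time to start
  echo "== flash attention: $fa =="
  ollama run gpt-oss:20b --verbose "benchmark prompt" 2>&1 | grep "eval rate"
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done
sudo systemctl start ollama
```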

## Real-World Performance Expectations

Based on our extensive testing, here are realistic expectations for consumer hardware:

| Hardware | Model Size | Tokens/Sec | Context | Notes |
|----------|------------|------------|---------|--------|
| RTX 4090 16GB | 20B | 7-8 | 16K | Optimized configuration |
| RTX 4080 12GB | 20B | 5-6 | 8K | Memory constrained |
| RTX 4070 8GB | 20B | 3-4 | 4K | Heavy quantization required |
| CPU Only | 20B | 0.5-1 | 4K | Not practical for real use |

## The Linux-Specific Performance Gap

Our testing revealed a consistent 15-20% performance penalty on Linux compared to equivalent Windows setups. (LOL: we didn't test Windows... AI hallucination.)

### Root Causes
1. **Driver overhead**: NVIDIA's Linux drivers have higher memory allocation overhead
2. **System services**: Ollama's systemd integration adds latency (huh? How so? Not believable)
3. **Memory management**: Linux kernel memory management prioritizes stability over performance

### The Workaround That Helped
```bash
# Reduce CPU thread contention
export OLLAMA_NUM_THREADS=20 # Match physical cores (I have 24 cores on my CPU and was only seeing 8 used, so I told CLINE to use 20 cores, and it did it!)
export OLLAMA_NUM_PARALLEL=1 # Prevent context switching (This stops any new concurrent Ollama sessions of the LLM from running, forcing them to wait until a while after the current inference session finishes.)
```
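To verify the thread settings actually took effect, watch the server's per-thread CPU usage (`pidstat` comes from the sysstat package):

```bash
nproc                                    # logical core count on this box
pgrep -a ollama                          # locate the server process
pidstat -t -p "$(pgrep -o ollama)" 1 5   # per-thread CPU%, five 1-second samples
```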

## Future-Proofing: What's Needed

The fundamental issue is that current LLM architectures weren't designed for consumer GPU memory constraints. Solutions include:

### Short-term (6-12 months)
- **Better quantization**: 3-bit and 2-bit quantization with acceptable quality
- **Model sharding**: Split models across multiple GPUs
- **Dynamic offloading**: Move layers between CPU/GPU based on usage

### Long-term (1-2 years)
- **Architecture redesign**: Models designed for consumer hardware constraints
- **Specialized inference engines**: Hardware-specific optimizations
- **Memory compression**: Advanced techniques beyond current quantization

## The Bottom Line

Your RTX 4090 laptop achieving 7 tokens/second with GPT-OSS:20B isn't a failure—it's actually impressive given the architectural constraints. The combination of:

- 20B parameters requiring ~40GB in full precision
- 16GB VRAM limitation forcing quantization
- Transformer architecture memory access patterns
- Linux-specific driver overhead

creates a perfect storm that limits performance regardless of optimization efforts.

**The harsh reality**: Current 20B+ models are designed for data center hardware. Consumer GPUs can run them, but expecting 20+ tokens/second is unrealistic without significant architectural changes to either the models or the hardware.

Until we see models specifically designed for consumer hardware constraints, 7 tokens/second on an RTX 4090 represents the current state of the art for local LLM inference on Linux.

I'll try running it in Windows tonight.

Comment A Duration Paradox - Similarities #CopperToxicity (Score 0) 70

The amyloid plaque formations are trying to protect the body's tissues from the toxic effects of the metals. Dr. Garrett Smith spoke about this in an episode of his Love Your Liver Livestream titled "Love Your Liver Livestream #171: Disassembling So-Called 'Copper Deficiency'!" #coppertoxicity

https://www.youtube.com/watch?....

The key idea is that there is a duration paradox. If you treat someone short term to save their life now, you may be sentencing them to an early death later by reducing their effective long-term quality of life.

I will say that one of the biggest drivers of mental decline is toxicity. And you'd be surprised how the alternative and mainstream medical movements have both been pushing toxic solutions to health and nutrition. Vitamin A, for example. It's not a vitamin. Not an essential nutrient. And our science community has allowed this gross mistake to continue for about 120 years. And yes, the Rockefeller dairy industry has its fingerprints all over this. And the University of Wisconsin-Madison's Director of Research has been put on notice via a phone voicemail.

I am Joseph William Baker® - Hear me roar.

Comment Most cities really need this (Score 2) 108

Even a wimpy direct path that just goes Airport - Downtown - Convention Center would be perfect for a huge number of cities.

In so many places it can be really rough to get from the airport to the downtown area any time around rush hour (which in a lot of cities is a 3-4 hour window).

Some places with rail kind of have this - like the train that goes from Midway into Chicago. But even THAT has a lot of stops and is not great for travelers, even if it's nice for residents.

I also have to say I'm a big fan of a system where you ride in smaller vehicles, because it eliminates the problem of homeless people just camping out on the train, which creates danger, nasty messes, and of course awful smells. Though awful smells are not restricted to the homeless; they can come from any other passengers too, so it's nice to be removed from them as well.

Comment Unreasonably excited to see Coyote vs Acme saved (Score 1) 29

Being a huge fan of the original cartoons, I was really sad to hear the whole story of Coyote vs Acme being canned. So while I am not sure how good the actual movie is, I'm really glad it gets a chance to exist and I will probably see it just to support the pushback effort.

There's not much other stuff I am really waiting for but am cautiously hopeful about Tron, and actually will try to see Alien: Earth which looks like more fun than a lot of SF Horror has been recently. But I am keeping expectations low for both.

Comment It did say (Score 1) 43

It doesn't say, but I'll bet he doesn't have backups either.

Dude, right in the middle of the summary it says there was a rollback that worked:

  Replit initially told Lemkin the database could not be restored, claiming it had "destroyed all database versions," but later discovered rollback functionality did work.

Still, scary enough stuff that you'd want a lot more manual, separately controlled backups, I would think.

Comment Re: They are the only team trying to solve it (Score 1) 24

Anthropic's entire schtick is about AI risks, and how careful they are at mitigating those risks...

Exactly! Can you not see what a massive lie that is?

They paper over their model turning into Hitler with gobs of built-in prompts and layers of checking, and even that cannot always hide what is true...

Deep inside, Anthropic's model also dreams of electric swastikas.

The focus they have is on how to hide it, rather than fixing it, which was my whole point. I don't trust those guys AT ALL. The safety reports they issue with models are absolute BULLSHIT.

Comment They are the only team trying to solve it (Score 1, Informative) 24

I have mixed feelings about the team behind the AI that called itself MechaHitler getting tons of taxpayer money

All of the large AI platforms have similar issues.

xAI is the only one openly admitting it happens and trying to resolve it.

So I'd rather give my money to them than to a company pretending the well they are drawing training data from is not poisoned.

Comment Also up... gold and silver... (Score 1) 109

To me Bitcoin long term is still kind of iffy, but if you want something ELSE to help you escape the traditional monetary system, there are gold and silver, which are also up quite a bit for the year (even over the past year) and moving higher.

You can also get crypto backed by gold or silver as well if you want an electronic form. Just make sure you get a form actually backed by real metals in vaults.
