A lot of times people who don't know about joins do the basic join of select x.a y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, the it goes and removes the matches. So O(n^2) in performance, Vs. If you do a Select x.a, y.b from x left join y on x.c
Sorry, but it doesn't work that way. As far as I know, none of the decent SQL engine choke on it, although I'm not sure on Access
Also, a lot depends on the size of the dataset and other parameters in the where clause. Real life example, with len(r) = ~1M and len(g) = ~20k: select * from core_report r, core_guild g where r.guild_id = g.id and g.id = 7. With this query, postgres executes it as: scan core_report_guild_id index, look for id=7. Then, lookup g by primary key and join it in a nested loop with loops=1. Without the g.id = 7, it executes as: table scan g, hash it, table scan r and join the two with a hash join. Note that the query planner switched from fetch by primary key, which is O(log n) * n rows -> O(n log n), to table scan x2, O(n), but with a much lower actual cost because walking a BTree isn't cheap. It also ordered it so that only 20k rows get hashed and copy-pasted into the main dataset, not the other way around. That's the advantage of using a proper DBMS.
You can pry PostgreSQL from my dead, cold hands. It's just so much easier to do meaningful things in a relational database, and until you hit the db-size > largest SSD (used to be RAM) you can buy limit, there is absolutely no reason to limit yourself to glorified tuple stores and hash tables. Okay, sometimes, ORM's can be slightly too eager to join stuff (causing queries like this one), but it's easily fixed by rewriting the line executing the query. Or just ignore it, even that monstrosity (1 index scan, 1 fetch-by-id loop, 3 full table scans) took only 1s max - who cares on a homepage/intranet/most websites.
This is a classic case of bad defaults. Yes, you will always have a trade off between performance and security, but going for either extreme is bad usability!
People expect that, without explicit syncing, the data is safe after a short period of time, measure in seconds. The old defaults were: 5 seconds in ext3, in NTFS metadata is always and data flushed asap with but no guarantees. In practice, people don't lose huge amount of work.
What happened is that the ext4 team thought waiting up to a *minute* to reorder writes is a good idea - choosing for the extreme end of performance.
My question is: WHY? Does it really matters to home users that KDE or Firefox starts 0.005 seconds faster? Apparently, the wait period is long enough to have real life consequences even with limited amount of testers, imaging what happens when it gets rolled out to everyone. On servers, it's redundant. Data is worth much, much more than anything you hope to gain and SSD's, battery backed write cache on controllers and SAN's have taken care of fsync's() already. If you run databases, those sync their disks anyway, so you just traded a huge chunk of reliability for "performance" on stuff like
The "solution" of mounting the volume with the sync everything flag is just stupid. Yay, lets go for the other extreme - sync every bit moving to the disk. Isn't it already obvious that either extreme is silly?
Just set innodb^W ext4_flush_log_at_trx_commit on something less stupid already, flushing once every second shouldn't kill any disk. Copy Microsoft for config options:
* Disable flush metadata on write -> "This setting improves disk performance, but a power outage or equipment failure might result in data loss".
* Enable "advanced performance" disk write cache -> "Recommended only for disks with a battery backup power supply" etc etc.
* Enable cache stuff in RAM for 60s -> "Just don't do it okay, it's stupid."
The tree of research must from time to time be refreshed with the blood of bean counters. -- Alan Kay