
Comment Re:free markets (Score 1) 586

How exactly does this relate in any way to free markets? Are you suggesting that anarchy is the logical extension of a free market? The absence of a central representative government is called the Dark Ages - where you get mafia-style rule by local thugs. Or Afghanistan, if you must. A free market in the sense of a barter-based society, I'll give you. Is this the direction you were trying to take the conversation?

Next, the targeting of military endeavors is hardly different from tribal disputes in the absence of a central government - organized defense against foreign aggressors is hardly morally abhorrent, and occasionally a desperate local citizenry decides it's in their best interest to aggress against their neighbors. I'm not personally convinced that this is a failure of government. Though I would count it as one when the majority condones/supports a racist government slaughtering segments of its citizenry:
    Hitler, Stalin, Saddam, the Khmer Rouge, etc.

Comment Re:Who is this guy? (Score 1) 182

"You'll have to manage it at the "ego" level, not the tool."
You're only addressing a justification for the actual problem, which is experimental code paths.

"You give them access to a ACLed branch or you give then a different repo and you use the "vendor branch" pattern to merge it back to your trunk."
Are you serious? You want to grant firewall access, VPN access, and internal login names to an outside firm? You want to provide support for their development environment?

In this case, the point is that they go and do their own thing - you require that they use SOME DVCS. Then they zip up a snapshot and send it to you. Deviations from the shipped product can then be co-maintained if need be. This isn't possible with a CVCS, since you'd have irreconcilable forks.

"Read the user manual for, say, Subversion. You probably will be surprised."
Almost the entire DVCS world is a reaction to Subversion/CVS deficiencies.

svn branch merging can be made to work, but it leaves a tremendous amount of metadata behind (every branch range-delta). It can be a nightmare to keep track of if you do any degree of branching beyond feature branches. Your branch tree gets enormous over time, so you'd like to blow away the namespaces - but then the history of the now-phantom branches has a bolted-on feel. And it doesn't take much to bork your data (and not notice it for several versions). Add in a handful of code reformattings and it's a nearly impossible mess to recover from.

Note this isn't a CVCS vs. DVCS issue, it's a Subversion issue. But that was the line item at hand.

"And you do it just the same with a central than with a distributed source code management system."

Well, not really. The use case of Linux is that there are several co-maintained forks. Is this just an open-source problem? Hardly. Try forking your own company's product. Why would you do such a horrible thing, you might ask? Well, if your 'core' is leveraged in, say, a custom cookie-cutter shop, then the core can get overly bloated and require slimming or specialization. Or maybe you just want to start from scratch, but fully intend to maintain the original code base for years to come (possibly with new major features that you'd like to share between the old and new fork).

At a minimum, to maintain reconciliation, the CVCS server needs to be common between the two forks - not plausible for open-source forks. Many VCS plugins do support the Subversion sub-path concept; some do not by default (Trac commit plugins, etc.). And in general it's nice to have a dedicated VCS per project - so forking a project gives it an unfortunate status. And once you decide the fork gets its own CVCS, you're stuck with diff -C5 / patch. Welcome back to RCS.

Comment Re:Who is this guy? (Score 1) 182

If the point is to outsource a subsection of the code, then the implication is that they 'own' a sub-tree of your code. The point is that merging separately mirrored CVCS repositories is hard, merging decoupled VCS repositories is possible, and a DVCS makes the latter trivial.

And even remotely owned code sections can be tweaked and, via email/phone collaboration, merged back on an ongoing basis.

And no, giving them restricted access is not always an option - VPN, firewall rules, logins polluting namespaces, central LDAP/Kerberos (or what-have-you) security risks, etc.

Comment Re:Easy (Score 1) 586

Nice echo-chamber commentary; but do you have any evidence of free markets 'saving' or 'naturally selecting' lives? My intuition, which I think might be easier to support argumentatively, is that life/death is not allowed to be left to the whims of the market - that government, and thereby aggregate social decision making, has produced almost all of the world's life/death decisions and will continue to do so well into the future, as 'greed-based' (aka self-interest-promoting) invisible-hand nonsense continues to demonstrate itself as anything but life-saving.

Free markets didn't decide to clean local drinking water of factory toxins. Free markets didn't decide that California air (and now Hong Kong air) was toxic and thus needed production-killing regulation. Free markets didn't decide that copper-smelting and sulfur-emitting smoke stacks shouldn't be upwind of any living thing. Free markets didn't decide that seat belts, working car lights, air bags, and rigid frames should be installed - yes, some people desire them, but the market clearly pointed in the opposite direction (people argued for years against air bags, or that they'd prefer to be thrown from a car wreck - at 80 mph, even).

There's nothing wrong with the Adam Smith theory of free markets. But that theory is so far divorced from reality as to be counter-productive to society. Especially that little tidbit about all participants being well informed and having equal access to the information that lets them make efficient decisions. Even in the absence of a completely lazy-assed couch-potato middle class, even in the absence of misinformation-spreading corporate machinery, you still have a number of human psychological phenomena which prevent a man from choosing his best outcome. And this extends to the corporate world (where presumably financial incentive would supersede all else).

Comment Re:Who is this guy? (Score 2) 182

There's something to be said for 'bad' use of DVCS in a private company. But here are the good usage patterns IMHO:
1) Check in after every logically complete operation (for What The Fun Just Happened moments).
2) Check in every night (so if I'm sick tomorrow people can get to my work).
3) My code doesn't work: collaborate with someone down the hall or someone geographically remote.
4) I want to experiment with an alternate code path (but don't want to deal with the politics - remember, coders have egos).
4a) I want to experiment with an alternate code path, but don't want any risk to the trunk.
4b) This code is too specialized; we need a much simplified version for this use case (but need to maintain the original code path).
5) Let's say we suck at graphics, so we outsource to a 3rd-party company. How the hell does this happen with central version control? F-no do we give them direct access. And if they email us a zip file of the final product, how do we keep in sync from there on out? Their changes or ours will get overwritten or go into non-versioned hell. With DVCS, we can provide read-only access (possibly via emailed repository clones). DVCS allows trivial re-integration (see the sketch after this list). Security is maintained, reconciliation is trivial, and history and auditing are somewhat maintained (you can obviously fake it). And most importantly, we can switch in a NEW 3rd-party contractor at any time, possibly even AT THE SAME TIME.
6) Rebase (not DVCS specific). If I've got 10 branches (in svn or anywhere else), I know for sure that after a while the history has gotten too complex. In git at least we can say: OK, these feature branches should all be thrown away - let's 'rebase' to produce a pristine trunk and quite literally throw away all the branches by flattening them.
7) A central 'owner'/'maintainer' of a given project. Make sure someone knows everything that's going on with the project by having them integrate or 'bless' an integration. With central repositories, this requires they do the merging. With DVCS, you do the merging as a 'candidate' and they either accept it as their own or not (e.g. fast-forward merging of your repository with theirs).
7a) As with Linux, for larger projects/development teams you can have lieutenants who perform step 7 for sub-sections of the larger project, where each lieutenant 'trusts' the others' official repositories and auto-fast-forward-merges them. The singular project manager can then choose, for political reasons (because we are political in nature), to disagree with a lieutenant's decisions - as they are the primary responsible party (at least in closed-source commercial shops). This works because lieutenants can continue with their private fork until they can form a mutiny - so egos are maintained.
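
For pattern 5 (and the 'candidate' merge of pattern 7), here's a rough sketch of what that re-integration can look like with git driven from Python. The bundle path, branch names, and repo layout are all made up for illustration, and it assumes git is installed and that the contractor shipped their history as a bundle built from their master branch - a sketch, not a recipe.

# Hypothetical sketch of pattern 5 (and the "candidate" merge in 7) using git
# via subprocess. Assumes git is installed, that "vendor_drop.bundle" is the
# repository clone the contractor emailed us, and that branch/remote names
# are invented for illustration.
import subprocess

def run(*args, cwd="."):
    """Run a git command and fail loudly if it breaks."""
    subprocess.run(["git", *args], cwd=cwd, check=True)

def integrate_vendor_drop(repo_dir, bundle_path):
    # Fetch the contractor's emailed bundle into a throwaway branch;
    # no VPN, firewall rule, or internal account required.
    run("fetch", bundle_path, "master:vendor-drop", cwd=repo_dir)

    # Do the merge ourselves on a candidate branch; the maintainer can then
    # fast-forward (or reject) it without redoing the work (pattern 7).
    run("checkout", "-b", "candidate/vendor-merge", "master", cwd=repo_dir)
    run("merge", "--no-ff", "vendor-drop", cwd=repo_dir)

if __name__ == "__main__":
    integrate_vendor_drop("our-product", "/tmp/vendor_drop.bundle")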

Comment Re:No More Deregulation (Score 2) 551

Well, what I haven't heard in this thread yet is that public utilities fall into a special category called natural monopoly. [Land-line] phone lines, bridges, power, railways, and several other things deal with naturally scarce resources or access points/paths, so it doesn't make sense to have 1,000 companies competing the way they do to give you the next MP3 player (may the best brand - or rather the lowest-production-cost supplier - win). Natural monopolies do not have 'free market solutions'. They REQUIRE social intervention (who gets to own the land for the train track?).

The generally idealized natural monopoly is a heavily regulated system (they can only charge cost + 10%), subsidized to overcome its overhead or fixed capital investments. Meaning everybody pays taxes to pay for the infrastructure, then pays cost + 10% for the marginal cost of production (of water, electricity, road maintenance, etc.). Sometimes you can get away with government-backed bonds that handle the capital expenditures. It's extremely anti-market. You have a producer that now WANTS to be wasteful, because its '10%' of a larger cost base gives it better total profits. It also sees little or no value in capital investment - that would generally only be useful to increase efficiency, which again opposes its bottom line. The only time it might want to invest in capital is to increase capacity.

Thus in ideal situations, you the people subsidize redundant providers, so that at least they can compete for a larger share of the pie - but even then you trivially end up with mafia-style oligopolies.

Comment Re:Call me skeptical (Score 1) 222

Not sure what you're saying. Why do you suggest relational models support more situations? You cannot model recursive situations effectively. You cannot model hierarchical data structures - at least not ones with cycles. The join syntax itself is very verbose, and when there are significant numbers of indexes, the number of permutations of possible join strategies grows exponentially (if you had 200 tables joined in a single query, with each table carrying 4 indexes, you'd have a nearly impossible-to-optimize query). Yes, this is an odd query, but only because an RDBMS does not support this style of data traversal - many systems would crap out at 1 to 4KB of SQL syntax. Not to mention the locking overhead would practically serialize access to the DB (yes, I know, why the hell would a non-transactional read cause locks... because joins just plain suck in most RDBMS implementations).

Compare that to an OODBMS like Objectivity, where joins are replaced with 64-bit foreign pointers to virtual addresses in possibly alternate storage spaces. More importantly, the schema replaces a join with a single dot, which is very familiar from object-oriented systems, including ORM layers. So "SELECT person.father.mother.daughters[0].siblings[0].employer.company.website.contact.phone FROM person WHERE id=?" is a legitimate query. It is FAR more expressive than if each step were a separate table join. And each traversal requires a potentially uncached pair of page lookups - one for the virtual mapping table, and one for the actual disk block. Compare that to traditional index-based foreign keys, which require log_base256(n) cold disk hits per join. Both OODBMSs and RDBMSs support B-tree and hash-map symbolic indexing (e.g. login name, range searches, etc.). But for simple graph traversal, an OODBMS is just hard to beat.
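
To put rough numbers on that claim - with my own assumptions that a pointer hop costs two page lookups and an indexed join costs about log base 256 of the table size in cold page reads - a quick back-of-the-envelope comparison:

# Back-of-the-envelope comparison of cold page reads for the traversal above.
# Assumptions (mine, for illustration): each OODBMS pointer hop costs ~2 page
# lookups (virtual-mapping table + data block), while an indexed join costs
# roughly log base 256 of the table size (256-way B-tree fan-out).
import math

def pointer_hops(n_hops):
    return 2 * n_hops

def btree_joins(n_hops, rows_per_table):
    return n_hops * math.log(rows_per_table, 256)

hops = 9  # person.father.mother...contact.phone is ~9 traversals
for rows in (10**6, 10**9):
    print(f"{rows:>12} rows/table: "
          f"OODBMS ~{pointer_hops(hops)} page reads, "
          f"B-tree joins ~{btree_joins(hops, rows):.1f} page reads")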

And NoSQL solutions are really all about documents in general: complex data structures which may or may not be hierarchical, yet have schema-validation support (see the Voldemort JSON data store, or CouchDB). Depending on the schema, new inserts can extend the schema on the fly, or, via DML statements, you can enforce that all NEW requests use a new schema while leaving old records with a previously well-defined schema. The document would have to retain its schema ID. Certainly an RDBMS could do this as well, though most are optimized to support highly structured rectangular data, with only nullability supporting 'optional' schema additions.

I do very much like the set nature of RDBMSs, and for large, complex cross-table index-based queries (e.g. I need an index spanning tables A, B, C on columns that are not their primary/foreign keys), an RDBMS supports some pretty damn complex capabilities. I'm specifically thinking of Postgres, where you can do hash joins of 5 separate queries, each with its own index; covering indexes, where you don't even need to access the main row to get the result (which is essentially what most NoSQL solutions do); synthetic function-output indexes (indexes on the output of functions instead of the data itself); or conditional (partial) indexes, where you fine-tune what you know you'll search on (create index job_state on jobs(state) where state not in ('COMPLETED','ARCHIVED')). Most NoSQL solutions are nowhere near ready to support these complex search optimizations - though things like CouchDB do allow you to have lazy indexes with user-defined functions - but I think they require indexable data on every row.
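
A small, self-contained illustration of the partial-index and function-output-index ideas, using Python's built-in sqlite3 as a stand-in for Postgres (a reasonably recent SQLite accepts this DDL, with nearly identical syntax to Postgres); the table and column names are just made up for the example:

# Partial index and expression index, sketched with sqlite3 as a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, state TEXT);

    -- Partial index: only index the rows we'll actually search on.
    CREATE INDEX job_state ON jobs(state)
        WHERE state NOT IN ('COMPLETED', 'ARCHIVED');

    -- Expression ("synthetic function output") index.
    CREATE INDEX job_name_lower ON jobs(lower(name));
""")

conn.execute("INSERT INTO jobs(name, state) VALUES ('Nightly ETL', 'RUNNING')")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM jobs WHERE state = 'RUNNING'"
).fetchall()
print(plan)  # plan output is planner/version-dependent; the point is the DDL above is legal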

But I don't see most of these advantages as being specific to RDBMSs - just to their maturity. HBase and Cassandra are still in their infancy - not even at official releases yet, from what I remember.

Comment Re:Go Java Go (Score 1) 204

Hence the introduction of convention-based configuration, i.e. zero configuration except where it deviates from convention. In other words, field A matches a class also named A, controller B matches a view named B, HTTP form field C matches input field C. In one sense it's magic/side-effect-based coding. In another, it's intuitive programming.
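
A toy sketch of the idea (mine, not any particular framework's) - nothing is configured, the names just line up:

# Convention over configuration: controller, action, view, and form fields
# are all resolved by name. Class, view, and field names here are invented
# purely for illustration.
class UserController:
    def show(self, user):
        return f"Hello {user['name']}"

def dispatch(controllers, views, request):
    # Controller "user" matches class UserController; action "show" matches
    # the method "show"; form field "name" matches the model key "name".
    controller = controllers[request["controller"].capitalize() + "Controller"]()
    action = getattr(controller, request["action"])
    model = {k: v for k, v in request["params"].items()}   # field C -> key C
    body = action(model)
    view = views.get(request["controller"] + "/" + request["action"] + ".html")
    return view.format(body=body) if view else body

controllers = {"UserController": UserController}
views = {"user/show.html": "<p>{body}</p>"}
request = {"controller": "user", "action": "show", "params": {"name": "Ada"}}
print(dispatch(controllers, views, request))   # <p>Hello Ada</p>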

I've personally taken thousand-file projects (a quarter-million lines of code) and only required about 1,000 lines of XML (almost all dealing with database, bootstrapping, and environment settings - for which you'd have needed more Java code, and 5x as much C++ code, to do the same thing). There are definitely polymorphic cases where this breaks down, but I've found that OO is highly overrated - especially when dealing with database-persistable objects. I've gone back and replaced OO-DB styles with switch-statement dispatches to OO data structures.

Comment Re:Really? (Score 1) 897

Huh? Digging ditches means you work in... a ditch. Many of us consider dealing with the arbitrary, constantly changing shortcomings / inconsistencies / poorly-thought-out OS decisions of the primary .NET platform to be like working in a ditch. And no, you can't do much useful with Mono - all the power of .NET is in its libraries, which are tightly tied to the OS of (dis)choice (for those few of us).

Comment Re:Wrong layer (Score 1) 195

Sorry, but we're either in disagreement, or you're not understanding.

When a Unix partition gets down to its last few percent of remaining space, it is effectively locked down to all but root (the reserved blocks).

When a partition gets to 15% free, all sorts of monitor / alarm bells SHOULD be ringing (if you have a properly configured system).

If you get to the 50% mark, then you need to start planning ahead for an upgrade.

By over-allocating, you can do this at a group level instead of on a per-partition level.

Thus you keep all partitions at least 15% free without actually wasting that disk space - thanks to over-allocation.

Running out of disk space is running out of disk space - whether it's at the ext layer, the LVM layer, or the NAS layer. You should be monitoring and planning ahead no matter which layer it's in.
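
A minimal version of that monitoring, assuming the 15% and 50% thresholds from this thread (the mount points are just examples):

# Free-space check using the thresholds discussed above.
import shutil

THRESHOLDS = [(0.15, "ALARM: under 15% free - act now"),
              (0.50, "WARN: under 50% free - start planning the upgrade")]

def check(mount_point):
    usage = shutil.disk_usage(mount_point)
    free_fraction = usage.free / usage.total
    for limit, message in THRESHOLDS:
        if free_fraction < limit:
            print(f"{mount_point}: {message} ({free_fraction:.0%} free)")
            return
    print(f"{mount_point}: OK ({free_fraction:.0%} free)")

for mp in ("/", "/var", "/home"):   # whatever your partitions are
    check(mp)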

The fact that editing a pre-existing block CAN cause a failure (because of sysadmin negligence) is NOT a fault of the application or the technology. Especially since there is no difference between it failing due to pre-existing disk over-allocation vs. appending to a file (/var/log/messages). It's the same as saying 'well, my application polled the free disk space in second one, then assumed it was safe to allocate an extra 4MB... but then when it came time to do so, there was no free space. waaah.'

Comment Re:Don't forget to weigh in the cost (Score 1) 195

HDFS disk size is meaningless here.

Likewise, with MySQL/InnoDB, I can utilize 25 externally managed 4TB eSATA machines (assuming 4 x 2TB disks RAID-1'd together), each mapped to a 4TB block device, which InnoDB handles efficiently. (Or, if using ext4, I could use LVM to map those eSATA devices together into a general-purpose disk.)

I could even have LVM stripe those remote volumes to get better IOPs.

At $700 per base machine (gray boxes, disks included), that's $17,500.

If I didn't care about random-write-speed, then I could go with RAID5. Put 3 disks in each machine and reduce costs to $15,000.

Or I could go with RAID5 on a hot-swappable 16-disk $1,000 RAID controller and reduce it down to 4 machines, bringing the price down to $11,600.

We're assuming that, whether with HDFS or MySQL or any other app, you build redundancy on TOP of the application - which is the ONLY smart thing to do with enterprise-grade applications.

Failure of a disk is ASSUMED. Meaning your $100,000 NetApp WILL fail one day. You are an IDIOT if you don't believe this. Sure, it may take 10 years. But what happens then? Replace it every 5 years? How about every 3? That's not a capital cost, that's a variable cost. Sure, your data may be worth it. But as a business, can you attain the same degree of reliability cheaper? HELL YEAH. And the 3-year replacement cycle doesn't handle a power surge which blows the hardware (say, a rogue UPS). Sure, put redundant power supplies on isolated UPSes - OK, then I unplug an ethernet cable and accidentally cause an electrical surge which blows the network controller. OOPS, data is safe, no access! 2 days of downtime!!

The point is that RAID was invented to solve a class of problems in a cost-effective way. It doesn't solve every problem, and I completely agree that solutions LIKE NetApp are GREAT when you want to start medium and leave room to grow large; when you want to consolidate and dynamically repartition disk space AND spindles (e.g. VMware setups); when you want lower maintenance costs (avoiding having to rebuild lots of regularly failing gray boxes, constantly swapping out one of hundreds of $100 2TB disks). BUT I submit to you that at scale, this is cheap labor. You hire high-school students to replicate base OS hard drives (with a duplicating station), and you buy 25% overstock of base hardware so you get fast (30-minute) build-new-machine / deploy-into-cluster turnaround. You only need accuracy, you don't need intelligence. Yes, a $30k/year salary is more than the extra $30k you spend on the NetApp - but you'd have to buy dozens of NetApps to really scale an enterprise solution, and that same $30k can usually handle it.

So there is absolutely a scale region where a NetApp makes sense. But it is NOT the high end - which is what they'd like you to believe. And I submit that there are application-level redundancy solutions which are more reliable (though at the cost of semi-static configurations - where a NetApp-type system does provide value-add).

Comment Re:Wrong layer (Score 1) 195

Uhh, what does LVM do then? Oh yeah, you OVER-ALLOCATE. My bad. And yes, with LVM snapshots you very well can crash the system if free space is maxed out. I don't recall exactly, but I believe it drops the snapshot - and since that's a mounted file system, it's just as bad.

There's also commercial NAS hardware which works like this. It has little green, yellow, and red lights next to each physical disk. Supposedly you should swap out a yellow or red disk with a larger one to avoid first automatically reduced RAID redundancy (e.g. 2-disk redundancy dropping to 1-disk redundancy) and then, ultimately, seek errors when no remaining physical blocks can map to a requested virtual block. I forget the name of the vendor in question, but it was far cheaper than a NetApp - though really meant to sit next to your workstation (obviously).

It's not a new concept at all.

Comment Re:Wrong layer (Score 1) 195

Haha. I call small-minded skizzies on you, sir!

Imagine a specialized net appliance (screw NetApp). It has 32GB of RAM and a 512GB high-speed random-access SSD (where read speed is what matters most).

Split the 512GB into two 256GB portions.

The first portion holds 4 bytes of the MD5 sum of each 512B block (representing up to 32TB of block storage).

Every 2048B block being background-scanned for deduping does an SSD lookup against the 256GB SSD hash map, which is open-chained and points back to existing 2048B blocks on disk. This lets us efficiently cross-link (reverse copy-on-write). I'd prefer 512B block boundaries, but most file systems use 2048B blocks (or larger), and HDs are starting to move to this to increase ECC efficiency. Plus it just reduces the overhead.
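
A rough in-memory model of that lookup, where the 'SSD hash map' is just a dict keyed on the first 4 bytes of each block's MD5, with chaining plus a full-block compare to resolve prefix collisions. Block size and storage are simplified stand-ins, not the real appliance:

# Toy block store: dedupe on a 4-byte MD5 prefix, verify on full contents.
import hashlib
from collections import defaultdict

BLOCK = 2048

class BlockStore:
    def __init__(self):
        self.blocks = []                      # "disk": block index -> bytes
        self.index = defaultdict(list)        # md5[:4] -> [block indexes]

    def write(self, data):
        """Return the index of a stored block, reusing an identical one."""
        assert len(data) == BLOCK
        key = hashlib.md5(data).digest()[:4]
        for idx in self.index[key]:           # open chaining on the 4-byte key
            if self.blocks[idx] == data:      # verify: prefixes can collide
                return idx                    # cross-link instead of storing
        self.blocks.append(data)
        self.index[key].append(len(self.blocks) - 1)
        return len(self.blocks) - 1

store = BlockStore()
a = store.write(b"x" * BLOCK)
b = store.write(b"x" * BLOCK)                 # deduplicated
print(a == b, len(store.blocks))              # True 1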

So that's just a minor optimization of what people have already been doing in software. Blah blah. Boring stuff, right?

For those blocks that DIDN'T match...

We do a modified version of zlib compression (which normally only keeps 32KB worth of back-data). We extend this to store 4GB worth of code points (assuming a 4-byte identity-prefix match and a 4-byte SSD disk-block pointer). Each reference covers up to a 256-byte run, which thus fits an 8-bit length field.

So now, as you scan through the 2048-byte blocks being stored to disk, you do a hash lookup of every consecutive 7 bytes: you hash the first 4 bytes and look up in RAM; if matched, you look up the remaining byte string in SSD and see how many bytes match. If more than 7, you store a disk-position + length vector, saving you at least 1 byte (1-byte magic, 4-byte pointer, 1-byte length) and possibly the entire 256 bytes. If you can compress a block down to 1024 or 512 bytes (50% or 25% of the original), you store it at that size; as soon as you reach either of those two boundaries, you stop compressing (though this does assume you're not using 2048B-boundary HDs). You then store into one of 2 special areas on disk that hold 1/2- and 1/4-block-compressed data.

So this handles highly compressible but byte-offset-shifted situations, e.g. I copy sections of source code (at least 512 bytes) and paste them into the same file or some other file.

So long as you don't pull out the HD, the ref map in SSD matches the previous runs on disk, so you don't have to do random disk seeks to reconstruct the blocks. So now reading highly compressed blocks not only reduces the number of bytes read from physical disk, it increases the ratio of SSD to HD reads. :)

I'm only joking, of course. But not really. You hiring, NetApp?

Comment Re:Wrong layer (Score 1) 195

This thread seems to be drifting too far from reality.

Here's the rub.

Checksumming == good. All else being equal, we should have more of it.
But checksumming is expensive (adds latency to your write).

So once you have it, might as well use it.

A background thread can compare checksums of blocks as a starting point for identifying identical blocks (checksum collisions are more than possible - they're only a matter of time. I see colliding MD5 sums all the time in BackupPC; you can tell because it appends a semicolon + sequence ID to the file name to disambiguate).
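
In the same spirit, a toy version of that disambiguation - pool entries keyed by checksum, with a sequence ID appended when two different contents share one (the naming scheme here is my own stand-in, not BackupPC's actual on-disk format):

# Checksum-keyed pool with collision disambiguation via a sequence suffix.
import hashlib

pool = {}   # pool name -> file contents

def pool_name(data):
    digest = hashlib.md5(data).hexdigest()
    seq = 0
    while True:
        name = digest if seq == 0 else f"{digest};{seq}"
        existing = pool.get(name)
        if existing is None or existing == data:
            pool[name] = data                 # reuse or create the entry
            return name
        seq += 1                              # same checksum, different contents

print(pool_name(b"hello"))                    # md5 of "hello"
print(pool_name(b"hello"))                    # same entry reused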

As some posters in this thread have claimed, file names prevent entire files from being block-shared. Rubbish. File names in Unix file systems have never been coupled with file metadata: files are identified by inode numbers, not file names, and file names are metadata stored in directory files (which is why hard links are possible). Now, unless you have noatime in your mount options, deduplicating inode descriptors will be nearly impossible, but those should only be a small fraction of your disk blocks anyway.

Historically, the main way you'd leverage shared blocks is through snapshot images - which all use copy-on-write. LVM and NetApp and, I'm sure, dozens of other vendors supply this because it's trivial to do.

All this is likely doing is extending the existing SNAPSHOT copy-on-write logic to merge blocks across different file systems (which snapshots technically are) AND within the same file system, most likely via block-level checksum comparisons. Though since ext and many other file systems don't natively support checksums at the block level, I doubt this is leveraging file-system-level operations.
