
Comment Re:Forget it (Score 1) 440

It's very true that there are often a lot of files with near duplicate content. Detecting near duplicates is much, much harder and will probably be orders of magnitude slower, even if you can figure out how to do it in the first place.

However, there are also often a lot of files with exactly duplicate content. A government agency I worked with figured out they had over 30% identical duplication of files across their file stores. This was a significant cost for them.

So, while your initial observation has some truth, your conclusion to "forget it" is false. I'm reminded of my old boss, who always used to say "Don't let the perfect be the enemy of the good".

Comment I have to disagree... (Score 1) 440

There is no reason to use a crypto-strength hash. This will simply be slower. MD5 should be perfectly fine - it outputs a 128-bit hash, which is more than enough to avoid accidental collisions, and it's fast. You could match on the size as well as the hash, if you really, really think you might have a hash match on different content, but it's probably not necessary.

It is true that if you're trying to avoid *intentionally malicious* collisions, you should never use MD5 as it's badly broken for that use - but not for detecting duplicate content. You're correct to avoid using CRC - but that's not a hash algorithm, it's a checksum algorithm. Accidental collisions with that algorithm will be very frequent.
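To put rough numbers on "more than enough" versus "very frequent": assuming hash values are spread uniformly at random (an idealisation, not something either algorithm guarantees), the usual birthday approximation for n distinct files and a b-bit digest is

$$
P(\text{collision}) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{b}}\right) \approx \frac{n^{2}}{2^{b+1}} \quad \text{when } n^{2} \ll 2^{b}
$$

The file counts here are purely illustrative. With a billion files and b = 128, that comes to about 10^{18} / 2^{129}, or roughly 1.5 x 10^{-21}. With a 32-bit checksum (CRC-32, say) and only 100,000 files, the exponent is about 1.16, so the chance of at least one accidental collision is already around 0.69.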

The names of files should never be used to distinguish them. Files are often renamed by applications or during normal work by users. In any case, if you already have a hash match, then why do you care if the names are different? The content is already overwhelmingly likely to be identical. If you're really paranoid, then do a byte comparison of those files.
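As a rough sketch of the approach described above (not a finished tool), the following groups files by size plus MD5 and ignores names entirely. The function names and the `paranoid` flag are made up for illustration; the optional byte comparison uses the standard library's `filecmp.cmp` with `shallow=False`.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, paranoid=False):
    """Group regular files under root by (size, MD5); file names play no part.

    With paranoid=True, a group is only kept if every member is byte-for-byte
    identical to the first member of that group.
    """
    by_key = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), md5_of_file(path))
                by_key[key].append(path)

    groups = [sorted(paths) for paths in by_key.values() if len(paths) > 1]
    if paranoid:
        groups = [
            group
            for group in groups
            if all(filecmp.cmp(group[0], other, shallow=False) for other in group[1:])
        ]
    return sorted(groups)
```

Calling `find_duplicates("/some/directory")` returns a list of lists of paths, one list per set of identical files. Hashing every file is the simple version; a cheaper variant would only hash files whose size matches at least one other file.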

Comment Use DROID 6 (Score 4, Informative) 440

There is a digital preservation tool called DROID (Digital Record Object Identification) which scans all the files you ask it to, identifying their file type. It can also optionally generate an MD5 hash of each file it scans. It's available for download from SourceForge (BSD license, requires Java 6, update 10 or higher).

http://sourceforge.net/projects/droid/

It has a fairly nice GUI (for Java, anyway!), and a command line if you prefer scripting your scan. Once you have scanned all your files (with MD5 hash), export the results into a CSV file. If you like, you can first also define filters to exclude files you're not interested in (e.g. small files could be filtered out). Then import the CSV file into the data analysis app or database of your choice, and look for duplicate MD5 hashes. Alternatively, DROID actually stores its results in an Apache Derby database, so you could just connect directly to that rather than export to CSV, if you have a tool that can work with Derby.
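If you don't have a database handy, looking for duplicate hashes in the exported CSV is a few lines of script. This is only a sketch: the column names `MD5_HASH` and `FILE_PATH` are assumptions about the export layout, so check the header row of your own export and pass the real names in. The function name is hypothetical too.

```python
import csv
from collections import defaultdict


def duplicate_hashes(csv_file, hash_column="MD5_HASH", path_column="FILE_PATH"):
    """Return {hash: [paths]} for every hash that occurs on more than one row.

    csv_file is any open text file (or file-like object) with a header row.
    The default column names are assumptions, not a documented export format.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        digest = (row.get(hash_column) or "").strip().lower()
        # Rows without a hash (folders, for example) are skipped.
        if digest:
            groups[digest].append(row[path_column])
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Typical use would be `with open("export.csv", newline="") as handle: dupes = duplicate_hashes(handle)`, where `export.csv` stands in for whatever file you exported.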

One of the nice things about DROID when working over large datasets is you can save the progress at any time, and resume scanning later on. It was built to scan very large government datastores (multiple TB). It has been tested over several million files (this can take a week or two to process, but as I say, you can pause at any time, save or restore, although only from the GUI, not the command line).

Disclaimer: I was responsible for the DROID 4, 5 and 6 projects while working at the UK National Archives. They are about to release an update to it (6.1 I think), but it's not available just yet.

Comment Re:give it a try first (Score 2) 235

I have to echo your experience here. I really disliked Unity in the earlier incarnations, and kept my main machine on 10.10 until support ran out. Eventually I needed to do a full system re-install due to replacing a hard drive, and decided to give 12.04 a go. Despite all the Unity hate, Ubuntu has been good to me for many years, so I gave it a determined go.

Long story short - I like it. It gets out of my way. It avoids unnecessary chrome. It works.

It took me about 2 weeks of using it to realise I really quite liked it, contrary to my expectations. Again echoing the parent post, it was often the things people were complaining about the most that I ended up appreciating the most.

I am humbled to realise that my prior bitching about Unity was mostly unfounded (at least as it is incarnated in 12.04). And that I am far more change-resistant than I previously believed.

Comment blindsight (Score 1) 1365

By Peter Watts is up there. Dysfunctional crew and highly claustrophobic atmosphere in a first contact story that's original and terrifying. Especially the ending, which is a profoundly depressing view of life in the universe.

And the Gap series by Stephen Donaldson is hard to beat for relentless abuse of and by just about every character on every page of the series...

Comment Re:Pilot/copilot anyone? (Score 1) 491

Agile doesn't pamper developers, and it doesn't codify chaos. It's about getting a good quality product built in the face of uncertain requirements and change. In that sort of environment, treating change as an enemy to be resisted is a sure way to build something that no-one really wants or likes, but fulfils the letter of some requirements document agreed way back when. It's not a panacea, but it can work very well. You still need good people.

It makes partners of the stakeholders. It's very lightweight and, done well, helps to create a cooperative and respectful environment to work in. It's not appropriate for everything. It doesn't solve all development problems. It has very little to say about actual project management, or how to interface itself with project management. It doesn't really say much about how you get your requirements in the first place.

Comment Re:Flamed (Score 2) 491

It's a huge and arrogant mistake to think that stakeholders aren't talented people (well, they may or may not be, like anyone else). It is true that they almost always aren't talented at software development, and often not great abstract thinkers. From their perspective, you are not talented at retail, or industrial processes, or finance, or... well, whatever they actually know about and you don't.

I'm actually quite appalled at your derogatory attitude to your customers and development partners. Were you just trying to make a point, or do you really believe what you said?

Comment Re:Flamed (Score 1) 491

Well, just because your development process is working fine doesn't mean Agile is a waste of time.

I've been responsible for implementing scrum / scrumban in two separate and very different organisations over the last 5 years. In each case it's worked very well, but in different ways. In both cases, we adapted it to fit in and provide what each organisation needs. This requires intelligence and an understanding of what the organisation needs and what Agile can (and cannot) provide. It also requires buy-in to experiment and find what works and what doesn't.

It's also not micro-management; quite the opposite. It's about empowering teams to own their work, and to develop a new product in the face of changing requirements. I've seen shy developers suddenly start having great ideas about how to improve the development process, and gaining confidence and stature as they see their ideas implemented.

It also involves the stakeholders more directly in the development process. I've seen stakeholders move from cynical and disengaged to becoming a real development partner (the short iterations and product reviews are wonderful for that). Many people simply can't visualise what they really want from a product until they see something. They are not abstract thinkers. When they can see something, it's amazing what fantastic feedback and ideas you get from the people you're building the product for. It's even better when the same people see the changes they requested appearing a short time later in the product. This empowers the client as much as the development team. Done right, it's amazingly helpful.

I did have a fight with some of the more traditional project managers, who saw getting ongoing good feedback from the customer as a failure of our requirements elicitation stage. But for them, ongoing change is a sign of failure in an earlier stage - very much waterfall thinking.

Anyway, that only scratches the surface. I think Agile has a lot to offer, as long as you understand it's not a magic bullet. You still need good people (and it may be true that good people could use any methodology and produce good results). Done right, it does encourage a team spirit and a partnership attitude, which I haven't got from any other development or project management methodology.

Genuine question: how do you do development - any good ideas you could share?

Comment Re:Explain the mind of a genius? (Score 1) 414

I think that you're absolutely correct in that mathematics requires some kind of actual grasp of the relationships being expressed by the maths.

For you it had to be something you could relate to a real world problem. I'm almost the opposite - I have to get something distilled down to pure abstract relationships before I feel I really get it. But either way, you have to do some work and understand what is being expressed by the maths.

I studied cryptography a while ago, and I had some difficulty working through the maths. I got really stuck at one point in the course material and had to ask for help. The professor responded that they'd fudged that part of the maths a bit to avoid confusing people, and they didn't think anyone would get that far into the maths. He also gave a perfectly clear explanation of the source of my confusion - providing the missing bit of information about the relationships involved in the maths.

I was left amazed at the realisation that most people pass this stuff without actually understanding it in any real way.

Comment Use JHOVE (Score 2) 247

The JSTOR/Harvard Object Validation Environment:

http://hul.harvard.edu/jhove/

It's specifically designed to first probabilistically identify files, then attempt to verify their format.

Disclaimer: I haven't worked on it directly, but I did spend a number of years in the digital preservation space, so I probably know some of the people who have contributed to it.

Comment Short term vs long term freedoms (Score 1) 369

I've wrestled with whether to release under a GPL or BSD style license for two projects. I can see the pros and cons of both, but in the end chose BSD.

The code is of niche appeal, and I'd be happy for anyone to get some use out of it, whether in a proprietary product or anywhere else it's useful. It probably won't have a vast lifetime or attract a large community around it. So the need to preserve the openness of the code over long periods of time just isn't there.

In the shorter term, I'd rather have as many people as free as possible to do anything they like with it. If the code was more generally appealing, I might choose a different license.

Comment Another Anecdotal Datapoint (Score 1) 188

I'm currently writing up API documentation for a large code branch which was never properly documented (and wasn't written by me), but now needs merging into trunk. I've found several serious bugs in the code as a result, all from trying to explain to the client how to use the API. These bugs were actually blindingly obvious when the behaviour of the code had to be explained.

I've also found some horrible design issues, where various settings the code allowed were contradictory or meaningless, or one setting overrode the behaviour of another, unless something else had been configured, in which case... you get the idea. As soon as you try to explain it, the awkwardness makes the design problems incredibly obvious - and in fact a much better way of doing it is also obvious. I can almost picture the evolutionary process by which this code was written, with each developer adding a new feature without regard to how the whole thing hung together.

So documenting after the fact can definitely detect bugs and design weaknesses - although I don't think this is an ideal way of doing so. Having said that, I'm not sure forcing the developers to document their design beforehand would have helped either, as a lot of developers simply regard documentation as a necessary evil (assuming it is enforced), and will simply write down whatever it is they intend to code, no matter how awkward the result.

I guess you have to have a mentality which loves elegance at all levels, not just in the specific lines of code that are written but how the system as a whole functions. Unfortunately (and surprisingly to me), many developers don't seem to care about the bigger picture, even when they have a deep appreciation of clever or elegant code.
