
Comment Re:Suspicious (Score 1) 91

"Chinese and back" is not a valid metric. Translation party sites are fun and all, but translation engines are not symmetric because languages are not symmetric. A translation is often going to be imperfect, so using a raw translation as input just amplifies the error level.

The metric you want is: How much effort would a human have to put in to make the translation output idiomatic for the target language? And that effort is decreasing rapidly with modern, high-quality rule-based translation engines.

Comment Re:Interesting name. (Score 1) 34

MS marketing cannot come up with unique names to save their lives, or they just prefer to take generic terms and slap Microsoft in front. Either way, it's truly getting annoying. MS Surface, MS Edge, MS Stream, Windows Phone, all horrendous names. And when they do come up with original ones, we get Zune.

MS marketing hated the original Xbox name and tried to get it changed, but was shot down by popular vote. They've never been able to figure out what names would resonate with people.

Comment Re:Rule-based still easily best (Score 1) 56

...the problem with rule-based grammars that lack any statistical weights is that they come up with an unbelievably large number of parses for many real-world sentences.

Generative grammars suffer from that problem and scale very poorly, and may indeed be impractical to use for real-world text. Our constraint grammars and finite-state analysers do not have that problem. With CG, we inject all the possible ambiguity into the very first analysis phase, then use contextual constraints to whittle it down, where context is the whole sentence or even multiple sentences. This means performance scales linearly with the number of rules.
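To make the "inject all ambiguity first, then whittle down" idea concrete, here is a toy sketch in Python. It is not VISL's code or real CG rule syntax; the lexicon, the tag names, and the two rules are all invented for illustration. Each word starts with every possible reading, and contextual rules only ever remove readings.

```python
# Toy illustration of the constraint-grammar idea. The lexicon, tags and
# rules are hypothetical; a real grammar has thousands of rules.

LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"NOUN", "VERB"},
}


def analyse(words):
    """Phase 1: attach every possible reading to every word."""
    return [set(LEXICON[w]) for w in words]


def remove_verb_after_det(cohorts, i):
    """Remove VERB/AUX readings if the previous word is unambiguously DET."""
    if i > 0 and cohorts[i - 1] == {"DET"}:
        return cohorts[i] - {"VERB", "AUX"}
    return cohorts[i]


def remove_noun_after_noun(cohorts, i):
    """Remove the NOUN reading if the previous word is unambiguously NOUN."""
    if i > 0 and cohorts[i - 1] == {"NOUN"}:
        return cohorts[i] - {"NOUN"}
    return cohorts[i]


RULES = [remove_verb_after_det, remove_noun_after_noun]


def disambiguate(words):
    """Phase 2: apply rules until no rule removes anything more."""
    cohorts = analyse(words)
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            for i in range(len(cohorts)):
                remaining = rule(cohorts, i)
                # A rule may never remove the last remaining reading.
                if remaining and remaining != cohorts[i]:
                    cohorts[i] = remaining
                    changed = True
    return cohorts


print(disambiguate(["the", "can", "rusts"]))
```

Because every rule only looks at the current readings and removes some of them, you can log which rule removed which reading, which is where the debuggability comes from.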

So the 96% accuracy claim is suspect, not to mention that a comparison of the Google system is already difficult because Spanish =/= English. (Spanish has more morphology on verbs, it's pro-drop, it has relatively free word order compared to English,...)

The paper is for Spanish, because that's what I could find. Our other parsers, including English, are also at the 96% or better stage, but because it's mindbogglingly boring to do a formal evaluation, we don't have up-to-date numbers.

So I don't believe you can say that "Google is hopelessly behind the state of the art."

Given that we had 96% in 2006, 10 years ago, and Google only now has reached 94% (90% for other domains), I feel confident in saying Google is very far behind.

Comment Re:Rule-based still easily best (Score 1) 56

Who said they're giving away their best stuff?

The nature of machine learning does. All they're giving away is an algorithm and a system trained using that algorithm. Linguistic machine learning is a field where even a 0.5% improvement takes years to get and is worth a paper. So even if they aren't giving away their top algorithm, their best one can't be much better.

Comment Re:Rule-based still easily best (Score 2) 56

which seems much more expensive than

It'd seem that way, but it's really not if you factor in the whole chain.

Machine learning needs high quality annotated treebanks to train from. Creating those treebanks takes many many years. It is newsworthy when a new treebank of a mere 50k words is published. Add to that the fact that each treebank likely uses different annotations, and you need to adjust your machine learner for that, or add a filter. Plus each treebank is for a specific domain, so your finished parser is domain-specific. If you want to work with other kinds of text, you need to produce a treebank for that domain and then train on it.

Thus, the bulk of the work is in annotation and mathematical models. Google skipped the step of creating a treebank, and instead used available ones. There aren't any usable treebanks for smaller languages, making the whole machine learning endeavor useless for all but the large languages.

Rule-based parsers are the opposite of that. You can put the same amount of man hours into creating rules as you otherwise would into a treebank plus mathematical model, but you can do so on any old laptop with almost zero data to work from. You just need to know the language. A parser produced in this way is not domain specific, but can be easily specialized for a domain if needed. And a rule-based parser can be used as a bootstrap engine for creating high quality treebanks, because the rules are upwards of 99% accurate, meaning humans only need to put in a fraction of the work on top of it.

And as I wrote, rules are debuggable. You can figure out exactly why a word was misanalyzed, and fix it. Machine learning can't do that. The edit-compile-test loop of machine learning takes hours or weeks - with rules it takes seconds or minutes.

Comment Rule-based still easily best (Score 2) 56

94% syntax is definitely good, for a machine learning parser. Now if you were to come to the land of rule-based parsers, 94% is the norm.

Google loves machine learning, and it's easy to see why. That's how they made their whole stack. They have the huge amounts of data to train on, and the hardware to do so. It's so seductive to just throw a mathematical model at huge amounts of data and let it run for a few weeks.

Rule-based systems don't need any data to work with - they just need a computational linguist to spend a year writing down the few thousand rules. But the end result is vastly better, fully debuggable, easily updatable, understandable, and domain independent. That last bit is really important. A system trained for legalese won't work on newspapers, but a rule-based system usually works equally well for all domains.

In 2006, VISL had a rule-based parser doing 96% syntax for Spanish (PDF) - our other parsers are also in that range, and have naturally improved since then. Google is hopelessly behind the state of the art.

Comment Works for me (Score 1) 267

I've used iTunes on Windows for many years as a music player only, and while it definitely has some annoyances, nothing else seems to do all the things that I want:
- auto-organize its own folder
- not reorganizing external folders
- volume normalization
- smart playlists

It is oddly lacking support for Ogg Vorbis and FLAC, but you can install 3rd party support for those.

I've tried several other music players, but none seem to do all of the above. The most promising ones unfortunately lack the expressive power of iTunes smart playlists, such as a playlist of "matching album Diablo or Torchlight, rated 3+, limited to the 25 least often played items, auto-update list after each play".
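For anyone wondering what that rule amounts to, here is a rough sketch in Python. This is not how iTunes implements it; the Track fields and the smart_playlist function are names I made up, and I'm assuming "matching album" means a case-insensitive substring match.

```python
from dataclasses import dataclass


@dataclass
class Track:
    title: str
    album: str
    rating: int      # stars, 0 to 5
    play_count: int


def smart_playlist(library, album_keywords=("Diablo", "Torchlight"),
                   min_rating=3, limit=25):
    """Album matches any keyword, rated min_rating or better,
    limited to the `limit` least often played tracks."""
    matches = [
        t for t in library
        if any(k.lower() in t.album.lower() for k in album_keywords)
        and t.rating >= min_rating
    ]
    # sorted() is stable, so ties keep their library order.
    return sorted(matches, key=lambda t: t.play_count)[:limit]
```

"Auto-update after each play" then just means bumping play_count and calling smart_playlist again, so a track drops off the list once it is no longer among the least played.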

Comment Abuse potential, race to bottom (Score 1) 293

Customers would be able to see a map of 'risk zone' data for places they want to go, such as stores, restaurants and roads. They could then plan the day 'with an eye toward how risky such endeavors may be,' according to the patent application.

Want to drive a competitor out of business? Stage some "risky" things in his area.

And who gets to decide what's risky anyway? This could blow up tiny incidents to something that causes massive droves of people to avoid a store.

And yes, while this is already somewhat possible with today's internet, we don't have a central authority who decides what's risky, and certainly not one with money invested in inventing riskiness.

Comment Re: WTF? (Score 3, Informative) 75

Depends on what you mean by fail, but nothing new ever really beats C++. Sure, fancy new languages keep popping up with features, but each one lacks portability or performance or control or higher level constructs or reliability or something else that C++ can provide. And eventually C++ gains the language feature anyway, but without sacrificing efficiency for it. Many languages have tried to take C++'s place, but so far none have gotten close. And even where other languages have a good hold (Java, C#, ObjC, etc), when you want an efficient library shared between those ecosystems, you'll write that in C++.

And no, C doesn't really count in the comparison, because it doesn't grow - while WG14 publishes new standards, it's still mostly C89 used in the wild, with a few extensions.

Comment Re:BTRFS is getting there (Score 1) 279

I'm curious - what sorts of data at home do you store that contain lots of duplication?

I should've qualified that. The home backup system is the part of it that I have here at home, but the data is from several servers around the world, plus my personal files. And of course there's other backup sites so it doesn't 100% rely on my house or connection. And I have since improved my part with a dedicated machine rather than VirtualBox, though still USB attached storage because I had the disks anyway.

Comment Re:BTRFS is getting there (Score 1) 279

Er, snapshots should be immutable. They're used as sources for backups and replication, allowing them to be mutable would defeat the main purpose.

zfs clone if you want a writable copy. What's wrong with that?

The problem with zfs clone is that "clones can only be created from a snapshot" which means that deleting a file from a clone does not delete the file from the underlying snapshot, so the space is never actually freed. So when I accidentally have a very large temporary file in my backup set, it's stuck taking up space until it cycles out of history.
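The space accounting can be illustrated with a toy model in Python. This is not ZFS or btrfs code and says nothing about their on-disk formats; the Pool class and its methods are hypothetical, and it only models one thing: a block stays allocated for as long as any dataset, snapshot or clone still references it.

```python
class Pool:
    """Toy copy-on-write pool: blocks are freed only when unreferenced."""

    def __init__(self):
        self.block_size = {}   # block id -> size
        self.datasets = {}     # dataset name -> {file name: block id}
        self._next_block = 0

    def write(self, dataset, filename, size):
        block = self._next_block
        self._next_block += 1
        self.block_size[block] = size
        self.datasets.setdefault(dataset, {})[filename] = block

    def copy_dataset(self, source, target):
        """Models both snapshot and clone: a new name for the same blocks."""
        self.datasets[target] = dict(self.datasets[source])

    def delete(self, dataset, filename):
        del self.datasets[dataset][filename]

    def used(self):
        referenced = {b for files in self.datasets.values()
                      for b in files.values()}
        return sum(self.block_size[b] for b in referenced)


pool = Pool()
pool.write("backup", "huge.tmp", 100)
pool.write("backup", "notes.txt", 1)
pool.copy_dataset("backup", "backup@monday")   # snapshot
pool.copy_dataset("backup@monday", "clone")    # writable clone of it

pool.delete("backup", "huge.tmp")
pool.delete("clone", "huge.tmp")
print(pool.used())   # the snapshot still references huge.tmp

# Only removing the file from the snapshot itself frees the space,
# which is exactly what an immutable snapshot does not allow.
pool.delete("backup@monday", "huge.tmp")
print(pool.used())
```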

Comment Re:BTRFS is getting there (Score 2) 279

I used to use ZFS on my hacky home backup solution (Linux in VirtualBox with USB storage - yes, I know), but it would corrupt the disks once per month or so. Switched to btrfs, and it just works.

Features that btrfs has over ZFS, and I use:
- Mutable snapshots. It is infuriating that ZFS's snapshots are immutable. Mind you, I very rarely modify snapshots, but I damn well want to be able to without having to dump+reload all data. This alone is reason enough that I'll never again use ZFS where btrfs is available.
- Offline on-demand deduplication. Being able to dedup files when I want is very nice. cp --reflink is also super.
- Sane hardware requirements. ZFS is designed for extremely high quality hardware (and lots of RAM) that doesn't lie to the OS, which is just not what most of us are running. btrfs is designed for everyday use.

Features that I miss from ZFS:
- Online live deduplication. But it's sooo sloooow and requires so much memory, that I don't miss it much.

Aside from that they're pretty equal in my experience. They both offer transparent compression, which is what I really want.
