Comment Re:edit distance, not just matching (Score 1) 82

"first check whether the lengths are equal": I suppose if the string length is marked at the beginning of the string, that makes sense. But if not (e.g. the end of the string is marked by a null byte), doesn't that just slow things down? Because you have to traverse the strings twice: once to measure their length, and once to check for identity. I suppose it depends on the constant factor.

"it might be better to compare characters at the end first, under the idea that similar strings are more likely to match at the beginning": I think that depends entirely on the nature of the strings. If they're from a language that uses suffixes, maybe; if they're from a language that uses prefixes (Bantu, Athabaskan), probably not. It also presumes that you have a pointer to the end of the strings, which depends on your data representation. For DNA, I would guess it makes no difference at all which end you start at.
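To make the front-versus-end question concrete, here is a rough Python sketch that counts how many character comparisons each direction needs before it can stop. The function name `char_comparisons` is made up for illustration. Note that Python strings store their length, so the length check is free here; with null-terminated strings it would cost the extra traversal mentioned above.

```python
def char_comparisons(a: str, b: str, from_end: bool = False) -> int:
    """Count character comparisons needed to decide whether a == b.

    Hypothetical helper for illustration only. Lengths are checked first;
    if they differ, no characters need to be compared at all.
    """
    if len(a) != len(b):
        return 0  # the length check alone settles it
    if from_end:
        indices = range(len(a) - 1, -1, -1)  # last character first
    else:
        indices = range(len(a))              # first character first
    count = 0
    for i in indices:
        count += 1
        if a[i] != b[i]:
            break  # first mismatch: the strings differ, stop here
    return count


# A pair that differs only in its ending, and a pair that differs only
# in its beginning (English words used as stand-ins).
print(char_comparisons("walked", "walker"), char_comparisons("walked", "walker", from_end=True))
print(char_comparisons("unhappy", "inhappy"), char_comparisons("unhappy", "inhappy", from_end=True))
```

Which direction wins depends on where the strings tend to differ, which is the point about suffixing versus prefixing languages.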

Comment Re:The algorithm isn't clever, but scales well. (Score 1) 82

This is one of many diffs (ahem) between DNA sequencing and natural language processing. Another is the alphabet size: DNA has an alphabet size of 4, while the alphabets (number of phonemes or graphemes) of natural languages range from a low of 11 (Rotokas and Mura) to a high of perhaps 140-something (!Xu, although the exact number depends on how you count things). Of course written Chinese and Japanese have much higher numbers of graphemes.

It's also the case that some writing systems don't mark word boundaries (Chinese and Thai, for example), in which case the begin/end shortcut won't work at all, which makes machine processing of such languages quite a bit harder. And of course word boundaries aren't usually indicated in fluent spoken language, except at phrasal pauses or the beginning/end of utterances.

And finally, languages have a finite number of wordforms, although that number can be very high in languages that have lots of inflectional morphology.

That said, variants of the algorithms used for DNA sequencing are also used in computational linguistics.
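One family of algorithms shared between the two fields is edit distance, the topic of this thread. As a sketch only, this is the textbook Levenshtein dynamic program (not necessarily what any particular sequencing tool or the article uses), and it works the same on a 4-letter DNA alphabet as on wordforms:

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and
    substitutions needed to turn s into t (textbook version)."""
    # prev[j] holds the distance between the processed prefix of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning i characters of s into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))        # DNA-style input
print(edit_distance("kitten", "sitting"))  # natural-language input
```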

Comment Re:Really editors? (Score 1) 45

One of the interesting things about a slide rule is that you appear to get fractionally more significant digits at the 1-end than you do at the 9-end. That is, it's much easier to read off 3 digits at the 1-end than it is at the 9-end. But that doesn't really represent any increase in accuracy at the 1-end, it's just (afaik) an artifact of the way base-10 works. The ratio 1.11/1.10 (3 significant digits), say, is 1.0090909..., while the ratio 0.99/0.98 (2 significant digits) is 1.010204...; the ratio of those two ratios is 0.99889807... That is, a difference of 1 in the 3rd digit of a number near 1.1 is nearly the same as a difference of 1 in the 2nd digit of a number near 0.99.

A mathematician could probably put that more accurately, but it's sort of intuitively obvious when you look at a slide rule.
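For anyone who wants to check the arithmetic, here is a small Python sketch. It also computes the physical gap between the two pairs of readings, using the fact that the main slide rule scales are logarithmic, so the position of x is proportional to log10(x). The function names are invented for this example.

```python
import math


def step_ratio(lower: float, upper: float) -> float:
    """Relative size of a one-unit step in the last digit read off."""
    return upper / lower


def scale_gap(lower: float, upper: float) -> float:
    """Distance between two readings on a log10 scale, in decades."""
    return math.log10(upper) - math.log10(lower)


near_low_end = step_ratio(1.10, 1.11)   # third digit changes by 1
near_high_end = step_ratio(0.98, 0.99)  # second digit changes by 1

print(f"{near_low_end:.7f}")                  # ratio at the 1-end
print(f"{near_high_end:.6f}")                 # ratio at the 9-end
print(f"{near_low_end / near_high_end:.8f}")  # ratio of the two ratios
print(f"{scale_gap(1.10, 1.11):.5f} vs {scale_gap(0.98, 0.99):.5f}")
```

The last line prints the two gaps in decades of the scale. They come out similar in size, which fits the observation that the extra digit at the 1-end is about as easy to read as the last digit at the 9-end.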

Comment Re:Oh, they're a big company, (Score 1) 527

I'm still on Win7, and at home I use LibreOffice. But the latest from MsOffice (on my work PC) is "We did(n't)...", like "We didn't find anything" when I do a search for some text in Outlook. This is like the stereotypical nurse. Why can't they just use passive voice? "Nothing found."

Comment Re:non-ASCII (Score 1) 211

Yeap, I understand: as a computational linguist I'm rare and definitely not the main target of these fontmongers. For instance, I do Python or fst programming in which I occasionally need to embed non-ASCII (and non-Latin) characters in the code, such as Bangla (Bengali) or Arabic script characters. In Python, there are ways of dealing with such characters without embedding the characters themselves (e.g. by referring to their code points)--but it's often simpler to just use the characters. For instance if I want to write out a multi-character affix it's easiest to do so using a string consisting of the characters. And in the fst programs (xfst/lexc and more recently sfst), afaik I have to write out the individual characters.
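A quick Python illustration of the two options, as a sketch: the same two-character Bangla string written as a literal, as `\u` escapes, and as `\N{...}` Unicode names. The choice of র + া (as in a suffix like -রা) is just an example, and the helper `code_points` is made up for this post.

```python
import unicodedata

# Three ways to write the same two-character string.
suffix_literal = "রা"                # the characters themselves
suffix_escaped = "\u09b0\u09be"      # by code point
suffix_named = "\N{BENGALI LETTER RA}\N{BENGALI VOWEL SIGN AA}"  # by name


def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(suffix_literal == suffix_escaped == suffix_named)
print(code_points(suffix_literal))
for ch in suffix_literal:
    print(unicodedata.name(ch))
```

All three spell the same string, but only the literal is readable at a glance, which is why a font that covers the script matters.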

Comment Re:That's messed up (Score 1) 203

Why do you think MightyMartian is a fake name? I thought everyone knew they had a climate crisis on Mars centuries (if not millennia) ago, and the subsequent drought is why they dug the canals. The CO2 in their atmosphere was a too-late attempt to increase the temperature before global cooling, er climate change, took over. The canals and the oases that Schiaparelli and Lowell saw in the late 19th century dried up during the early 1900s, giving rise to the dust storms that confronted the Soviet Mars 2 and 3, and American Mariner 9 probes in the early 1970s. MightyMartian is doubtless one of the survivors, sent to warn us about climate change. (The notion that Mars was attacking was due to a bug in their translator.)

Comment Re:Alaska (Score 1) 203

Yeah, I'm wondering about that premise in the original article: (all?) dry areas will get drier, and (all?) wet areas wetter. How do we know it won't be the other way around? This can only be the prediction of a model, and the predictions from models, even the short-term/local ones, thus far don't strike me as very good. In particular, predictions about winter snowfall and hurricane seasons seem to have been way off for the last 10-20 years.

"Anyone attempting to generate random numbers by deterministic means is, of course, living in a state of sin." -- John von Neumann