User Journal

Journal: Dr SimianOverlord, back from the dead

2006. Holy shit. Did I really use to be on Slashdot? Back then, I was working just outside of London - as the name suggests, using primates as a model of HIV infection. The funny thing with research is that you're busy all the time, but with significant downtime in between. I should have been reading the literature, catching up on my reports or presentations, or figuring out some other experiment I could slip in between my main ones. But I wasn't. I had discovered Slashdot.

Slashdot filled that entertainment gap for me, since I was too lazy to check more than one site, and I was an asshole who liked to give my opinions to other people. And boy, did I post a whole lot of crap, now that I come back to look over it. Most of it is just amusing to me now; I can't even remember posting half of it, nor can I really remember how I knew what I was posting about, since I don't know it any longer. Perhaps the daily wash of topic summaries that gave me that knowledge has been wiped away by the past few years.

I ended up leaving my job, where I was a lowly research assistant, to do a PhD. No point in pursuing research if you don't have one - you're following orders for the rest of your life. So now, four years later, I have the piece of paper and the sense of anticlimax. It isn't a big deal to anyone but my parents and employers, now that I've gone through the crucible. It was the same old shit, only far more complex, far more arduous, and demanding far more stamina than I could ever have realised (and I'm glad I didn't, because I wouldn't have done it).

I'm now the Internet's passing expert in a minute part of a minute field of human research; I've got some published research, and I'm at last getting the postdoc jobs. I went from HIV research right out to left field, into cutaneous research - specifically, proteins of the terminally differentiating keratinocytes that make up your skin surface. What I research won't cure cancer, and it won't really help with wound healing, inflammation, allergic responses, squamous cell carcinomas, the horror of skin diseases like psoriasis and atopic dermatitis, or anything else. Probably. But you never know with basic research. You just never know.

User Journal

Journal: Ask Slashdot: Reinstalling over a running system

I have used chroots several times to install new systems alongside running systems before, finishing by rebooting into the new partition. For the first time, however, I find myself wanting to reinstall over a running system, and to change architectures, too. This sounds impossible to my ears, so I'm calling on some help from my friends here. Any ideas on how to make this work?

Education

Journal: Why are Thais Poorly Educated? Their Textbooks Suck.

From a college-level textbook used in the introductory course "Life Skills":
  1. Dogs and cats can be said not to be able to think because when we tell them to go to the refrigerator to get us something to eat, they don't understand and don't comply.
  2. Computers can be said to think because computers can defeat humans in chess, but the computer doesn't really understand what it is doing.
  3. Therefore, it can be said that only humans and angels can think.

Let's test your understanding using a quiz from the book. Which of the following is true?

  1. A dog thinks when it puts shoes in the trash because the shoes stink.
  2. A boy doesn't want to go to school because he doesn't understand the teacher.
  3. A human defeats a computer in calculations.
  4. A fish comes to the surface of the water when it is called.

The answers? 2 and 3. 1 and 4 are false because animals don't think.

This isn't some dinky seminary, folks: it's the state open university. Wow. Just ... wow. You've got to love state religion.

Software

Journal: Time to PAUSE, and DFS Progress (of sorts)

I got a PAUSE account!

As of today I am "TTKCIAR" at pause.perl.org, a full-fledged member of the open source perl community, and capable of contributing software to CPAN, the central repository of perl modules, for others to download and use! Exciting day!

I have a bunch of code to contribute, too ... but there's one problem: most of it totally falls short of PAUSE's standards for contributed code.

Look at the KS module, for instance. KS is a repository of handy functions which I have accumulated over the last nine years. They're useful and mature functions, but scarcely documented and need to be broken out into category-specific submodules. It's not "the way" to have functions for converting Brinell Hardness units to Vickers Hardness units rubbing elbows with file-locking functions and networking functions and string manipulation functions, all in the same module. The unit conversion functions need to go into modules in the "Physics" namespace, and the network functions need to go into modules in the "Net" namespace, etc. The documentation needs to be brought up to PAUSE standards as well.
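
To illustrate the kind of split PAUSE expects, here's a rough sketch of what one category-specific submodule might look like once it's broken out of the grab-bag. The module and function names below are made up for illustration; they aren't the actual KS code:

  # Hypothetical sketch of one category-specific submodule split out of KS.
  # The names (KS::FileLock, lock_file, unlock_file) are illustrative only.
  package KS::FileLock;

  use strict;
  use warnings;
  use Fcntl qw(:flock);
  use Exporter 'import';

  our @EXPORT_OK = qw(lock_file unlock_file);

  # Open a file and take an exclusive advisory lock on it.
  # Returns the locked filehandle, or undef on failure.
  sub lock_file {
      my ($path) = @_;
      open(my $fh, '>>', $path) or return undef;
      unless (flock($fh, LOCK_EX)) {
          close $fh;
          return undef;
      }
      return $fh;
  }

  # Release the lock and close the handle.
  sub unlock_file {
      my ($fh) = @_;
      flock($fh, LOCK_UN);
      return close $fh;
  }

  1;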

It's work I knew I needed to do, but it was easy to put it off as long as I didn't have a PAUSE account. But now that I do, there's no more putting it off! I just need to find time to do it.

The first module I publish might be a relatively young one, a concurrency library called Dopkit. It's something I've been wanting to write for years, but I just finished writing and debugging it yesterday. There are many concurrency modules in CPAN already, but most of them carry considerable programming overhead and require that the programmer wrap their head around the way concurrency works. These are reasonable things to do, but I've often thought it would be nice if it could be made trivially easy for the programmer to make loop iterations run in parallel, without changing from the familiar loop construct. Dopkit ("Do Parallel Kit") provides functions that look and act like the familiar perl loop syntax -- do, for, foreach, and while -- and chops up the loop into parts which execute concurrently on however many cores the machine has. The idea is to put very few demands on the programmer, who needs only to load the module, create a dopkit object, and then use dop(), forp(), foreachp(), and whilefilep() where they'd normally use do, for, foreach, and while(defined(<SOMEFILE>)). There are some limitations to the implementation, so the programmer can't use Dopkit *everywhere* they'd normally use a loop, but within its limitations Dopkit is an easy and powerful way to get code running on multiple cores fast.
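
To give a sense of the intended flavor, here's a rough sketch of how a foreachp() call might read. The constructor and calling conventions shown are guesses at the eventual interface, not documented API:

  # Illustrative sketch only: Dopkit's real constructor and calling
  # conventions may differ from what is shown here.
  use strict;
  use warnings;
  use Parallel::Dopkit;    # assumed eventual CPAN name

  my $dk = Parallel::Dopkit->new();    # create a dopkit object

  my @images = glob('*.png');

  # Serial version:
  #   foreach my $file (@images) { process($file); }
  # Parallel version - same shape, but iterations are spread across cores:
  $dk->foreachp(sub {
      my ($file) = @_;
      process($file);
  }, @images);

  sub process {
      my ($file) = @_;
      print "processing $file\n";    # stand-in for real per-file work
  }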

Dopkit suffers from the same documentation deficit as KS, but at least it's already "categorized" -- as soon as I can get the documentation written, it should be published as Parallel::Dopkit. KS will take significant refactoring.

Most of the perl in my codecloset is embarrassingly primitive (I wrote most of it in my early days of perl, before I was very proficient with it), but there are a few other modules on my priority list to get into shape and publish. My "dy" utility has proven a tremendously useful tool over the years, but is in desperate need of rewriting. It started out life as a tiny throwaway script, and grew features organically without any design or regard for maintainability. I've been rewriting its functionality in two modules, FileID and DY (which should probably get renamed to something more verbose). When they're done, the "dy" utility itself should be trivially implementable as a light wrapper around these two modules. Another tool I use almost every day is "select", which is also in need of being rewritten as a module. I haven't started that one yet.

In other news, I stopped dorking around with FUSE and Linux drivers, and dug into the guts of my distributed filesystem project. Instead of worrying about how to make it available at the OS level for now, I've simply written a perl module for abstracting the perl filesystem API. As long as my applications use the methods from my FS::AnyFS module instead of perl's standard file functions, transitioning them from using the OS's "native" filesystems to the distributed filesystem should be seamless. This is only an interim measure. I want to make the DFS a full-fledged operating-system-level filesystem eventually, but right now that's getting in the way of development. Writing a Linux filesystem driver will come later. Right now I'm quite pleased to be spending my time on the code which actually stores and retrieves file data.
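
The shape of that abstraction is easy to sketch. The method names and constructor arguments below are placeholders for illustration, not FS::AnyFS's finalized interface:

  # Sketch of the abstraction idea; method names and arguments here are
  # placeholders, not the module's actual interface.
  use strict;
  use warnings;
  use FS::AnyFS;

  # Today this is backed by the OS's native filesystem ...
  my $fs = FS::AnyFS->new(backend => 'native');
  # ... and later the same application code could point at the DFS instead:
  # my $fs = FS::AnyFS->new(backend => 'dfs', cluster => 'storage01');

  my $fh = $fs->open('/data/report.txt', 'r') or die "open failed";
  while (defined(my $line = $fs->readline($fh))) {
      print $line;
  }
  $fs->close($fh);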

Questions posted by other Slashdot users focused my attention on how I expect to distinguish my DFS from the various other distributed filesystem projects out there (like BTRFS and HadoopFS). I want it to do a few core things that the others do not:

(1) I want it to utilize the Reed-Solomon algorithm so it can provide RAID5- and RAID6-like functionality. This will produce a data cluster which could lose any two or three (or however many the system administrators specify) servers without losing the ability to serve data, and without the need to store all data in triplicate or quadruplicate. BTRFS only provides RAID0-, RAID1-, and RAID10-style redundancy -- if you want the ability to lose two BTRFS servers without losing the ability to serve all your data, all data has to be stored in triplicate. That is not a limitation I'm willing to tolerate. Similarly, the other distributed filesystems have "special" nodes which the entire cluster depends on. These special servers represent SPOFs -- "Single Points Of Failure". If the "master" server goes down, the entire distributed filesystem is unusable. Avoiding SPOFs is a mostly-solved problem. For many applications (such as database and web servers), IPVS and Keepalived provide both load-balancing and rapid failover capability. There's no reason not to have similar rapid failover for the "special" nodes in a distributed filesystem.
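
To illustrate the core idea in miniature, here is a toy single-parity (RAID5-like) sketch in perl. Real Reed-Solomon coding is what lets the parity count grow beyond one lost server; this is just an illustration of the principle, not the DFS's actual encoder:

  # Toy single-parity demo: with one parity chunk, any ONE lost data
  # chunk can be rebuilt by XORing the survivors together.
  # Reed-Solomon generalizes this to N parity chunks / N lost servers.
  use strict;
  use warnings;

  my @chunks = ('hello world ', 'goodbye moon', 'lorem ipsum!');

  # Compute the parity chunk by XORing all of the data chunks together.
  my $parity = "\0" x length($chunks[0]);
  $parity ^= $_ for @chunks;

  # Simulate losing chunk 1, then rebuild it from the parity + survivors.
  my $rebuilt = $parity;
  $rebuilt ^= $chunks[$_] for (0, 2);

  print(($rebuilt eq $chunks[1]) ? "rebuilt OK\n" : "rebuild failed\n");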

(2) I want the filesystem to be continuously available. Adding storage, replacing hardware, or reallocating storage between multiple filesystem instances should not require interruption of service. This is a necessary feature if the filesystem is to be used for mission-critical applications expected to stay running 24x7. Fortunately I've done a lot of this sort of thing, and haven't needed to strain thus far to achieve it. (On a related note, I still chuckle at the memory of Brewster calling me in the middle of the night from Amsterdam in a near-panic, following The PetaBox's first deployment. The system kept connecting to The Archive's cluster in San Francisco and keeping itself in sync, and nothing Brewster could do would make it stop. The data cluster's software interpreted all of his attempts to turn the service off as "system failures", which it promptly auto-corrected and restored. It was a simple matter to tell the service to stop, but Brewster has a thing against documentation.)

(3) I want the filesystem to perform well with large numbers of small files. This is the hard part for filesystems in general, and it's something I've struggled with for years on production systems. None of the existing filesystems handle large sets of very small files very well, and redundancy schemes like RAID5 do not address the problem (and in some ways compound it -- as RAID5 arrays get larger, the minimum data that must be read/written for any operation also gets larger). In my experience, most real-life production systems have to deal with large numbers of small files. Just running stat() on a few million files is a disgustingly resource-intensive exercise. RAID1 helps, but the CPU quickly becomes the bottleneck. One of my strongest motivations for developing my own filesystem is to address this problem; I don't want to be struggling with it for the next ten years. I am tackling it in three ways:

  1. Filesystem metadata is replicated across multiple nodes, for concurrent read access.
  2. Filesystem metadata is stored in a much more compact format than the traditional inode permits. Many file attributes are inherited from the directory, and attribute data is only stored on a per-file basis when it differs from the directory's. This should improve utilization of main and cache memories.
  3. The filesystem API provides low-level methods for performing operations on files in batches (sketched below), and implementations of standard filesystem functions (such as stat()) could take advantage of these to provide superior performance. For instance, when stat() was called to return information about a file, the filesystem could provide that information for many of the files in the same directory. This information would be cached in the calling process's memory space by the library implementing stat() (with mechanisms in place for invalidating that cache should the filesystem metadata change), and subsequent calls to stat() would return locally cached information when possible. This wouldn't help in all situations, but it would help when the calling application was trying to stat() all of the files in a directory hierarchy -- a common case where high performance would be appreciated.
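
The batching idea in the third point is easy to sketch. The stat_dir() batch call and the cache layout below are hypothetical stand-ins, not the DFS's actual API:

  # Sketch of batched stat() with a per-directory metadata cache.
  # stat_dir() stands in for a single batched metadata request; here it
  # is faked with ordinary stat() calls so the sketch actually runs.
  use strict;
  use warnings;
  use File::Basename qw(dirname basename);

  my %dir_cache;    # directory path => { filename => metadata hashref }

  sub cached_stat {
      my ($path) = @_;
      my $dir = dirname($path);

      # On a cache miss, fetch metadata for the whole directory in one
      # round trip instead of one request per file.  (A real implementation
      # would also invalidate this cache when the directory's metadata changes.)
      $dir_cache{$dir} //= stat_dir($dir);

      return $dir_cache{$dir}{ basename($path) };
  }

  sub stat_dir {
      my ($dir) = @_;
      my %records;
      for my $entry (glob("$dir/*")) {
          my @st = stat($entry) or next;
          $records{ basename($entry) } = { size => $st[7], mtime => $st[9] };
      }
      return \%records;
  }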

I don't know how long it will take to implement such a system. What work I've already done is satisfying, but it just scratches the surface of what needs to be done, and I can barely find time to refactor and comment my perl modules, much less spend hard hours on design work! But I'll keep at it until it's done or until the industry comes up with something which renders it moot.

PC Games (Games)

Journal: Constant Moderator Points

I keep getting moderator points. I'll get ten or fifteen, then another set a day or so after the expiration date of the first set. This has been going on for over a week. I think I'm on my fourth set of mod points in a row (though I obviously didn't start counting immediately).

That, and the random redundant mods on my old posts continue.

PC Games (Games)

Journal: Amazon Music Store Breaks Banshee Plug-in?

A couple of days ago, I saw on a Banshee developer blog that an Amazon music store plug-in was intentionally broken by the Amazon Store. Does anyone know what this is about? There were no details in the blog post, and it wouldn't accept comments, so I couldn't ask.

User Journal

Journal: Bummed

I used to come to Slashdot looking for insightful comments on tech stories. I was waiting for Chrome OS to show up on the front page so that I could get some insight (there's not enough information about the OS to get anything really informative). I got nothing but trash. Anything that was insightful got modded "Troll." Half-thought-out posts with no information got modded "Insightful" and "Informative."

What a piece of shit this place has become. I have to hold my nose to read it. Why do I bother?

User Journal

Journal: Finally leaving Korea

I've given my notice at work and will be leaving Korea at the end of August. Five years here is more than enough. I'll be heading back to Thailand to work and get my M.Ed. I'm weighing the options for my education right now. I may enter a satellite program from a U.S. school, or I may hire a Thai tutor and just get my teacher certification from a Thai university. Then there's the easier but less-attractive third option of enrolling in an international (i.e. English) program through a Thai university.

I'll make the decision in a week or two after looking everything over.

Microsoft

Journal: How Does Bing Get Around robots.txt?

Having recently received a large number of visitors to my site who don't know what a FAQ is, I was combing through my access logs and found a growing number of people reaching my web site from Microsoft's Bing search engine. That's odd: I've got msnbot blocked in my robots.txt. A scan of the logs shows that msnbot (and its variants like msnbot-media, etc.) continues to check robots.txt, but nothing else. So how did Bing index my site?

Here are my current theories:

  • Ignore robots.txt and index anyway (unlikely)
  • Use semantic web technology like Google Wave's spell checker to infer things about my site based on other pages' links to it
  • Use information "phoned home" from users' browsing (Microsoft's EULAs allow for it)

Ignoring robots.txt is the most straightforward, but also the least likely. While they may "accidentally" index forbidden pages now and then (see the Bing forums - they do), even Microsoft's evil has its limits.

Besides that, a Ms. W. from Murphy and Associates, on behalf of MSN Live Search, contacted me over a year ago requesting that I allow msnbot to scan my site. I kindly said, "No thank you," listed just a few of the crimes against humanity (most in the 1990s) that directly affected me, and let her know that I couldn't be paid to help Microsoft in any way, including allowing their search engine to index my site. Quality over quantity.

I tried contacting Ms. W. after finding all of these Bing referrals, but either she is no longer with them or she would prefer not to get involved in my dispute with Microsoft. Nonetheless, I have made an effort to remind Microsoft about my policy, and that I am none too pleased that they are still indexing my site.

But if they aren't doing it through the straightforward method, then how?

Well, the recent Google Wave spell checker demonstration had me thinking of other uses of semantic web technologies. It seems to me that much can be inferred about blind spots on the web if you can get a grip on the context of the pages around them. So without indexing my pages, the links that use my site as a primary source can contribute to an inferred index of what my site contains. The text of the anchor tags that link to my site would be an excellent source of high-quality query keywords, linking directly to the most relevant information.

(Sergey, if you aren't working on such technology, I make no claim to the ideas here. They seem like a natural extension of your Wave spell checker work. Please just be sure to exclude links to any pages a site's robots.txt forbids. We don't want to start being evil, now.)

This approach is certainly doable. And even Microsoft techies can index links from indexed pages and put them into the results page without much trouble. And having no conscience to speak of, they would never think that maybe they should cross-check robots.txt to prevent unwanted indexing from happening.

That leads to the third potential method of gathering information: the EULA. Many people have discussed Microsoft's ever-changing End User License Agreements for their products, and how much information they allow Microsoft to gather about their users' working habits. Pretty much everything now "phones home" with information that Microsoft will tell you is meant to make your computing life better.

Well, with "permission" from millions of people to track their browsing habits (through IE, their "security" offerings, or even special proxy servers of partner ISPs), Microsoft wouldn't have to crawl any sites. They could just let their users browse the web and send back just the indexes. How much is currently understood about what Microsoft products are sending back to Redmond?

Of the three methods, the first strikes me as unlikely; the second is the most interesting, and most Googley; and the third strikes me as the most likely thing that Microsoft would do. It's all perfectly legitimate.

Well, it's legitimate except that I would rather not have Microsoft benefiting from any work that I do. I have tried to contact one of their agents to let them know I am none too pleased with the current situation, but I was ignored. And signing up for an MSN Live account to post on their Bing site isn't going to happen - I refuse their EULAs across the board.

So how can I get Microsoft to pay attention? For the time being, I'm rerouting all traffic with REFERER containing "bing.com" to Google. There are worse places I could send them, but I don't want to be cruel.
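
The redirect itself is trivial. Here's a sketch of the approach as a plain CGI handler; my actual setup may well do this at the web-server configuration level instead, and the point is only to show the REFERER check:

  #!/usr/bin/perl
  # Sketch: bounce anyone arriving via a bing.com referral over to Google.
  use strict;
  use warnings;

  my $referer = $ENV{HTTP_REFERER} || '';

  if ($referer =~ /bing\.com/i) {
      print "Status: 302 Found\r\n";
      print "Location: http://www.google.com/\r\n\r\n";
      exit;
  }

  # ... otherwise fall through and serve the page as usual ...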

So, what other ways of getting around robots.txt do you think Microsoft employs? And what other remedies are there to prevent Microsoft from using such circumventions?

In the meantime, here's the enjoyable Bing Bang.

User Journal

Journal: More Lazyweb -- Ten year old scriptable disk imaging system

About eight years ago, I was using a piece of software to replicate disks. It was not GPL, but it was free to use and designed for clusters. You either booted a floppy or did netboot and loaded scripts off of a server to image the system. It supported multicast, which was huge for me at that time. I can't remember the name (though I seem to recall a "B" -- hahaha) and Googling such generic terms gives me nothing but newer results.

Ring any bells?

Oh, and I figured out those jacks! They were three-pronged power plugs.

User Journal

Journal: Lazyweb, help me identify some jacks in a picture.

I got my hands on some pictures of a prototype keyboard computer, but there are two jacks on the back that I just don't get. One is labelled "TV" and has three holes in a triangle. There's also an s-video jack, so that's not it. The other jack is labelled "DE-RN" and I have no idea. Googling for it didn't bring up anything relevant at all.

HELP!!!!1111!!! ;)

User Journal

Journal: Oh, the embarrassment

Well, my friend wanted me to change her operating system. My other friend got his new Mac and has been raving about it non-stop to her, and "they" decided she'd had enough of Windows (despite neither of them being tech-smart in any real way). Since I didn't know anything about her hardware, I went to her place equipped with four options:

  1. A hackintosh disk. She's flying back to Canada and buying a new computer in a couple of months, and based on what my other friend has said, she's leaning toward a Mac. I thought that giving her a couple of months to try out the OS would be cool, even if it's completely pirated. (My reasoning? She'd probably get used to it and give Apple the sale so that they made money off the deal.)
  2. A hastily-burned version of the Ubuntu 9.04RC which I didn't have time to md5sum. I figured that she was more likely to find info about Ubuntu on-line and among friends than any other Linux variant. Since the thing is due out in a couple of weeks and I'm running it on one machine now with few problems, I figured there wouldn't be any issues.
  3. An Ubuntu 8.10 DVD I got out of a Linux Format magazine when I visited Thailand. It didn't have Asian language support built in, but she didn't want that anyway.
  4. An OEM Windows CD in case she changed her mind or something.

Backing up was a nightmare. I guess there must be tools for it, but finding all the places her documents, music, and photos were stored was tedious. It was made simpler when she told me that she only wanted to save a couple of document folders and that everything else could go. I still backed up as much of the stuff as I had space for on a couple of flash drives, then I double-checked by asking to make sure that I had backed up exactly what she wanted.

OS X installed but wouldn't boot once installed, and since I don't know anything about it, I just went to option #2. The disk was corrupted (of course) and wouldn't boot. 8.10 installed, but I was so worried about the stupid PulseAudio problem for her that I immediately upgraded to 9.04 (she got download speeds of 4000+ KB/sec from the local mirror ... wow). She watches American TV via eMule normally, so I installed Miro and taught her how to set it up with TVRSS. Then I went to dinner and asked her to play around for a little while so that she would have some questions for me to answer when I got back.

The first question when I got through the door was "Where are my photos?"

"Well I saved as much as I had space for, and they're in the Photos directory. You can use the photo manager application to work with them." I'd already imported them.

"Not those. I had about 2000 photos of all my vacations."

There was a blank stare from me and a sinking feeling in my stomach as I meekly muttered "You didn't ask me to save any photos ...."

She swore that she did, but that didn't matter. She got past it quickly and I think she truly forgave me. There were two points at which I could have avoided all her pain, though:

  1. When I was staring at Picasa, I thought "I could back everything up online for her. Nah. She said it wasn't important."
  2. I could have gone with Ubuntu first and set up dual boot. If I'd done that, all the data would have still been there.

I'm an idiot for thinking both times ... "Nah, she said it was OK." Never trust the user.
