Comment Re:But... (Score 1) 261

I saw a recent review of a smartphone that had two screens, one LCD and one eInk. Modern eInk displays can manage a high enough refresh rate for interactive use and don't drain the battery once the image is drawn. The screen that I'd love to see is eInk with a transparent OLED on top, so that text can be rendered on the eInk display and graphics / video overlaid on the OLED. The biggest problem with eInk is that the PPI is not high enough to make it colour yet. You get 1/3 of the resolution (or 1/4 if you want a dedicated black subpixel) once you add colour, so you're going to need at least 600PPI to make colour eInk plausible.
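A back-of-the-envelope sketch of that resolution loss (the three- and four-way subpixel splits are my assumptions about how a colour filter would be laid out):

    #include <cstdio>

    int main() {
        // Assumption: a colour filter spreads each colour pixel across three
        // monochrome subpixels (four if one is a dedicated black/white).
        const int panelPPI = 600;
        std::printf("RGB:  %d effective colour PPI\n", panelPPI / 3);  // 200
        std::printf("RGBW: %d effective colour PPI\n", panelPPI / 4);  // 150
        return 0;
    }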

The other problem they've had is that LCDs have ramped up their resolution. My first eBook reader had a 166PPI eInk display. Now LCDs are over 300PPI but the Kindle Paperwhite is only 212PPI, so text looks crisper on the LCD than on the eInk display, meaning that you're trading one set of annoyances for another rather than eInk being obviously superior. With real paper you get at least 300DPI (typically a lot more) and no backlight.

Comment Re:amazing (Score 1) 279

The problem here is latency. You're adding (at least) one cycle of latency for each hop. For neural network simulation, you need all of the neurones to fire in one cycle and then consume the result in the next cycle. If you have a small network of 100x100 fully connected neurones then the worst case (assuming wide enough network paths) with a rectangular arrangement is 198 cycles to get from corner to corner. That means that the neural network runs at around 1/200th the speed of the underlying substrate (i.e. your 200MHz FPGA can run a 1MHz neural network).

Your neurones also become very complex, as they all need to be network nodes with store and forward, and they are going to have to handle multiple inputs every cycle (consider a node in the middle: in the first cycle it can be signalled by 8 others, in the next by 12, and so on). The exact number depends on how you wire the network, but for a flexible implementation you need to allow for this.
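A minimal sketch of the latency arithmetic (assuming a 4-neighbour rectangular mesh and one store-and-forward hop per cycle):

    #include <cstdio>

    // Worst case on a rectangular mesh where each hop costs one cycle:
    // corner to corner is (width - 1) + (height - 1) hops.
    constexpr int worstCaseHops(int width, int height) {
        return (width - 1) + (height - 1);
    }

    int main() {
        const int hops = worstCaseHops(100, 100);   // 198 cycles, as above
        const double substrateMHz = 200.0;          // e.g. your FPGA clock
        std::printf("Worst-case latency: %d cycles\n", hops);
        std::printf("Effective network rate: ~%.2f MHz\n", substrateMHz / hops);
        return 0;
    }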

Comment Re:Good grief... (Score 1) 681

What's the justification for the compilation unit boundary? It seems like you could expose the layout of the struct (and therefore any compiler shenanigans) through other means within a compilation unit. offsetof comes to mind. :-)

That's the granularity at which you can do escape analysis accurately. One thing that my student explored was using different representations for the internal and public versions of the structure. Unless the pointer is marked volatile, or atomic operations occur that establish happens-before relationships affecting the pointer (and you have to assume that any function whose body you can't see contains such operations), C allows you to do a deep copy, work on the copy, and then copy the result back. He tried this to transform between column-major and row-major order for some image-processing workloads. He got a speedup for the computation step, but the cost of the copying outweighed it (a programmable virtualised DMA controller might change this).
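Roughly what the transformation looks like if you write it by hand (the image type and the blurColumns kernel are made up for the example; a compiler doing this automatically would emit the equivalent of the two transposing copy loops):

    #include <cstddef>
    #include <vector>

    // The public layout: row-major pixels, as other compilation units expect.
    struct Image {
        std::size_t width, height;
        std::vector<float> pixels;               // pixels[y * width + x]
    };

    // Stand-in kernel that walks the image column by column, so a
    // column-major copy gives it sequential memory accesses.
    static void blurColumns(std::vector<float>& col,
                            std::size_t width, std::size_t height) {
        for (std::size_t x = 0; x < width; ++x)
            for (std::size_t y = 1; y < height; ++y)
                col[x * height + y] =
                    0.5f * (col[x * height + y] + col[x * height + y - 1]);
    }

    void process(Image& img) {
        // Deep copy into the internal (column-major) representation...
        std::vector<float> col(img.pixels.size());
        for (std::size_t y = 0; y < img.height; ++y)
            for (std::size_t x = 0; x < img.width; ++x)
                col[x * img.height + y] = img.pixels[y * img.width + x];

        blurColumns(col, img.width, img.height);      // ...work on the copy...

        for (std::size_t y = 0; y < img.height; ++y)  // ...and copy it back.
            for (std::size_t x = 0; x < img.width; ++x)
                img.pixels[y * img.width + x] = col[x * img.height + y];
    }

The two transposing loops are exactly the copying cost that swallowed the speedup from the faster kernel.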

I suppose you could do that in C++ with template specialization. In fact, doesn't that happen today in C++11 and later, with movable types vs. copyable types in certain containers? Otherwise you couldn't have vector<unique_ptr<T> >. Granted, that specialization is based on a very specific trait, and without it the particular combination wouldn't even work.
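(For what it's worth, the canonical move-only case looks like this - a small sketch, using std::unique_ptr as the element type that is movable but not copyable:)

    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::unique_ptr<std::string>> v;

        auto p = std::make_unique<std::string>("hello");
        // v.push_back(p);          // won't compile: unique_ptr isn't copyable
        v.push_back(std::move(p));  // fine: the container only needs movability

        // Growing the vector relocates elements by move rather than by copy.
        v.emplace_back(std::make_unique<std::string>("world"));
        return 0;
    }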

The problem with C++ is that these decisions are made early. The fields of a collection are all visible (so that you can allocate it on the stack) and the algorithms are as well (so that you can inline them). These have nice properties for micro optimisation, but they mean that you miss macro optimisation opportunities.

To give a simple example, libstdc++ and libc++ use very different representations for std::string. The implementation in libstdc++ uses reference counting and lazy copying for the data. This made a lot of sense when most code was single-threaded and caches were very small, but it is now far from optimal. The libc++ implementation (and possibly the new libstdc++ one - they're breaking the ABI at the moment) uses the short-string optimisation, where small strings are embedded in the object (so they fit in a single cache line), and doesn't bother with the CoW trick (which costs cache-coherency bus traffic and doesn't buy much saving any more, especially now that people use std::move or std::shared_ptr in the places where the optimisation would matter).
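A heavily simplified sketch of the short-string layout (not the real libc++ encoding, which packs the size and the 'is short' flag far more cleverly):

    #include <cstddef>
    #include <cstring>

    // Toy SSO string: small strings live inside the object (and so in one
    // cache line, with no allocation); long strings spill to the heap.
    class SsoString {
        static constexpr std::size_t kInlineCapacity = 22;
        std::size_t size_;
        bool isShort_;
        union {
            char inlineBuf_[kInlineCapacity + 1];
            char* heapBuf_;
        };
    public:
        explicit SsoString(const char* s) : size_(std::strlen(s)) {
            isShort_ = size_ <= kInlineCapacity;
            if (isShort_) {
                std::memcpy(inlineBuf_, s, size_ + 1);
            } else {
                heapBuf_ = new char[size_ + 1];
                std::memcpy(heapBuf_, s, size_ + 1);
            }
        }
        ~SsoString() { if (!isShort_) delete[] heapBuf_; }
        SsoString(const SsoString&) = delete;            // copying elided here
        SsoString& operator=(const SsoString&) = delete;
        const char* c_str() const { return isShort_ ? inlineBuf_ : heapBuf_; }
        std::size_t size() const { return size_; }
    };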

In Objective-C (and other late-bound languages) this optimisation can be done at run time. For example, if you use NSRegularExpression with GNUstep, it uses ICU to implement it. ICU has a UText object that implements an abstract text thing and has a callback to fill a buffer with a run of characters. We have a custom NSString subclass and a custom UText callback which do the bridging. The abstract NSString class has a method for getting a range of characters. The default implementation gets them one at a time, but most subclasses can get a run at once. The version that wraps UText does this by invoking the callback to fill the UText buffer and then copying. The version that wraps in the other direction just uses this method to fill the UText buffer. This ends up being a lot more efficient than if we'd had to copy between two entirely different implementations of a string.
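In C++ terms the pattern is roughly this (the class and method names are mine, just to show the shape of the bridge described above):

    #include <cstddef>

    // Abstract string: subclasses must provide per-character access and may
    // override the bulk method if they can hand back a whole run cheaply.
    class AbstractString {
    public:
        virtual ~AbstractString() = default;
        virtual std::size_t length() const = 0;
        virtual char16_t characterAt(std::size_t i) const = 0;

        // Default: one character at a time (the slow path).
        virtual void getCharacters(char16_t* out, std::size_t start,
                                   std::size_t count) const {
            for (std::size_t i = 0; i < count; ++i)
                out[i] = characterAt(start + i);
        }
    };

    // A subclass wrapping an external buffer (standing in for the UText-backed
    // subclass): it can hand over a whole run in one go.
    class BufferString : public AbstractString {
        const char16_t* data_;
        std::size_t len_;
    public:
        BufferString(const char16_t* d, std::size_t n) : data_(d), len_(n) {}
        std::size_t length() const override { return len_; }
        char16_t characterAt(std::size_t i) const override { return data_[i]; }
        void getCharacters(char16_t* out, std::size_t start,
                           std::size_t count) const override {
            for (std::size_t i = 0; i < count; ++i)   // a memcpy in practice
                out[i] = data_[start + i];
        }
    };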

Similarly, objects in a typical JavaScript implementation have a number of different representations (something like a struct for properties that are on a lot of objects, something like an array for properties indexed by numbers, something like a linked list for rare properties) and will change between these representations dynamically over the lifetime of an object. This is something that, of course, you can do in C/C++, but the language doesn't provide any support for making it easy.
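A very rough sketch of the idea (hugely simplified: a real engine also has a struct-like 'hidden class' form for common named properties and migrates objects between forms at run time):

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // Toy dynamic object: integer-keyed properties go into a dense array,
    // everything else falls back to a map (the "rare properties" case).
    class DynamicObject {
        using Value = double;
        std::vector<Value> indexed_;             // array-like representation
        std::map<std::string, Value> named_;     // slow path for named props
    public:
        void setIndex(std::size_t i, Value v) {
            if (i >= indexed_.size()) indexed_.resize(i + 1);
            indexed_[i] = v;
        }
        void setNamed(const std::string& key, Value v) { named_[key] = v; }
        Value getIndex(std::size_t i) const { return indexed_.at(i); }
        Value getNamed(const std::string& key) const { return named_.at(key); }
    };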

Comment Re:Good grief... (Score 1) 681

Depends on whether they care about performance. To give a concrete example, look at AlphabetSoup, a project that started in Sun Labs (now Oracle Labs) to develop high-performance interpreters for late-bound dynamic languages on the JVM. A lot of the specialisation that it does has to do with efficiently using the branch predictor, but in their case it's more complicated because they also have to understand how the underlying JVM translates their constructs.

In general though, there are some constructs that are easy for a JVM to map efficiently to modern hardware and some that are hard. For example, pointer chasing through data is inefficient in any language and there's little that the JVM can do about it (if you're lucky, it might be able to insert prefetching hints after a lot of profiling). Cache coherency can still cause false sharing, so you want to make sure that fields of your classes that are accessed by different threads are far apart, and that ones accessed together are close - a JVM will sometimes do this for you (I had a student work on this, but I don't know if any commercial JVM does it).
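In C++ you can at least do that separation by hand; a minimal sketch (the 64-byte cache line is an assumption about the target - C++17 offers std::hardware_destructive_interference_size as a portable hint):

    #include <atomic>
    #include <cstddef>

    // Without the alignment, the two counters would typically share a cache
    // line and ping-pong between the cores touching them (false sharing).
    struct Counters {
        alignas(64) std::atomic<std::size_t> producedByThreadA{0};
        alignas(64) std::atomic<std::size_t> consumedByThreadB{0};
    };

    // Conversely, fields one thread uses together belong next to each other,
    // so they arrive in a single cache line:
    struct PerThreadState {
        std::size_t head;
        std::size_t cursor;
    };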

Comment Re:Good grief... (Score 1) 681

Heck, in C / C++, such a transformation is actually illegal

Actually, it isn't if the compiler can prove that the layout is not visible outside of the compilation unit. I did have a student work on this, but the performance gains were negligible in most C code because complex data structures tend to leak across compilation unit boundaries (this may be less true with LTO). Even then, if you can recognise data structures that are bad then you can probably teach programmers not to use them, or put them in a standard library where their implementations can be easily changed.
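To make the 'visible outside the compilation unit' condition concrete (the file layout and function names are invented for the example):

    // The struct's layout stays private to this translation unit, so a
    // compiler could, in principle, reorder or split the fields...
    struct Point { int x, y; };

    static Point points[1024];

    int sum_x() {
        int total = 0;
        for (int i = 0; i < 1024; i++)
            total += points[i].x;
        return total;
    }

    // ...but once a pointer to one of them reaches code the compiler can't
    // see, the layout has escaped and must stay exactly as declared.
    void log_point(const Point* p);   // defined in another compilation unit

    void report() {
        log_point(&points[0]);
    }

With LTO the second half becomes visible to the compiler again, which is why the problem may be smaller there.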

It's much more interesting in environments with on-the-fly compilation, because then you can adapt data structures to how they're actually used. Even then, you can do it outside of the compiler (for example, the NeXT implementations of the Objective-C collection classes would switch between a few different internal representations based on the data that you put in them).

Comment Re:amazing (Score 1) 279

We have a *lot* of neurons with a *lot* of connections.

The second part is the important one. Neurones in the human brain have an average of 7,000 connections to other neurones. That's basically impossible to do on a silicon die, where you only have two dimensions to play with and paths can't cross - you end up needing to build very complex networks-on-chip to get anywhere close.

Comment Re:To answer your question (Score 0) 279

The tick/tock really tells you a lot about Intel's focus as a company. They're primarily a company that builds fabs and spends a lot on developing new process technology. Designing new processor architectures is something that they do almost as an afterthought. It tells you something about the skill of their design teams that AMD was able to be competitive for so long in spite of being 1-2 process generations behind Intel.

Comment Re:To answer your question (Score 3, Informative) 279

The Mill is interesting, but has a lot of limitations that are likely to show up in general-purpose code (e.g. try writing a signal handler, context switcher, or stack unwinder for The Mill and you'll have a lot of fun).

As to Transmeta, the company that bought them was nVidia. Their Project Denver chips use a lot of the Transmeta ideas. They're particularly interesting in terms of history, as the project was several years along before they decided on the ISA (they spent a while trying to license the relevant patents from Intel to build an x86 chip, failed and went with ARMv8 - which may end up being a strategic error for Intel). Unlike the Transmeta chips, it has a hardware ARM decoder that generates horribly inefficient VLIW instructions from ARM code. This helps alleviate the startup penalty that the older Transmeta chips had, where they had to JIT compile every instruction sequence the first time they encountered it and then run it from their translation cache. The nVidia chips can run the code as soon as they pull it into the instruction cache and can profile it before doing the translation.

Comment Re: To answer your question (Score 4, Insightful) 279

Your request makes no sense. You can always fit more processing power in a big case with lots of cooling than in a small case with very limited airflow (and power constraints on the fans). And it's always going to be cheaper to produce chips that can consume more power and dissipate more heat than ones with similar performance but a lower power budget. The only reason that the prices have become so close is that laptop sales passed desktop sales some years ago and now the economies of scale are on the side of the mobile parts.

If you want a laptop with the power of a desktop, just wait a couple of years and you'll be able to buy a laptop with the power of this generation's desktops. Of course, desktops will be even faster by then.

Comment Re:Good grief... (Score 1) 681

Bullshit (and I say this as a compiler writer). Very few compilers do anything with data layout at all (some JVMs do, to a limited degree, because they live in a closed world) and none outside of a few research projects will replace one data structure with another. What compiler are you using that will replace an XOR linked list or a skip list with something more efficient?
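For anyone who hasn't met one, here's a minimal XOR linked list sketch - the pointer arithmetic is exactly the sort of thing that hides the links from the compiler (and from the prefetcher):

    #include <cstdint>

    // XOR linked list: each node stores prev XOR next, halving the link
    // overhead at the cost of needing both neighbours to traverse.
    struct Node {
        int value;
        std::uintptr_t link;   // uintptr_t(prev) ^ uintptr_t(next)
    };

    inline Node* xorStep(Node* prev, Node* cur) {
        return reinterpret_cast<Node*>(
            cur->link ^ reinterpret_cast<std::uintptr_t>(prev));
    }

    int sum(Node* head) {
        int total = 0;
        Node* prev = nullptr;
        for (Node* cur = head; cur != nullptr; ) {
            total += cur->value;
            Node* next = xorStep(prev, cur);
            prev = cur;
            cur = next;
        }
        return total;
    }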

The belief in the compiler as a magic box that can turn a crappy algorithm into a good one is one of the things that a computer science education is meant to disabuse students of.

Comment Re:Good grief... (Score 2) 681

No, it really doesn't. If anything it's more relevant in other languages. For example, the cost of moving values from integer to floating point register files is a significant determining factor in JavaScript compiler design. To take JavaScriptCore as an example, the typical instruction cache size was one of the key inputs into the design of the interpreter and baseline JIT - it's written in a portable macro assembly language with precisely two design goals: the interpreter must have precise control over stack layout (so that deoptimisation can work easily) and the interpreter must fit entirely in the instruction cache of a modern CPU. The baseline JIT works by constructing a sequence of (predictable, because they have static destinations) jumps to the relevant entry points into the interpreter for a bytecode sequence. Trying to do this without understanding a reasonable amount of computer architecture would lead to all sorts of issues.
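The 'predictable jumps with static destinations' point, reduced to a toy (this is a generic sketch of the technique, not JavaScriptCore's actual code): an interpreter pays for one hard-to-predict indirect branch per bytecode, whereas a baseline JIT can emit a straight-line sequence of direct calls to the same handler entry points.

    #include <cstdio>
    #include <vector>

    struct VM { std::vector<double> stack; };

    // Per-opcode handlers, shared between the interpreter and the JIT output.
    void opPushConst(VM& vm, double k) { vm.stack.push_back(k); }
    void opAdd(VM& vm) {
        double b = vm.stack.back(); vm.stack.pop_back();
        vm.stack.back() += b;
    }
    void opPrint(VM& vm) { std::printf("%g\n", vm.stack.back()); }

    // Hand-written stand-in for what a baseline JIT would emit for the
    // bytecode "push 1; push 2; add; print": direct calls, no dispatch loop,
    // so every branch target is static and trivially predicted.
    void compiledSequence(VM& vm) {
        opPushConst(vm, 1.0);
        opPushConst(vm, 2.0);
        opAdd(vm);
        opPrint(vm);
    }

    int main() {
        VM vm;
        compiledSequence(vm);   // prints 3
        return 0;
    }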

Comment Re:Good grief... (Score 4, Insightful) 681

Understanding how a transistor works requires quantum mechanics, but 'transistors are tiny magical switches' is enough to be able to understand how to build them up into gates, how to assemble gates into arithmetic, logic, and memory circuits, how to assemble those into pipelines, and so on.

Eventually you need quantum mechanics (or relativity, or both) to understand how anything works, but understanding the electron transfer involved in combustion is not essential to understanding how a car works. Computer science is all about building abstractions.

Comment Re:Good grief... (Score 2) 681

There are different degrees of knowledge. I don't think anyone can be a competent programmer without understanding things like caches, TLBs, and pipelines (and, in particular, branch prediction). These things have significant impacts on the performance of code - often a factor of ten. Trying to write software for some hypothetical abstract machine, rather than a real modern processor, leaves you with something that has the CPU gently warming the room while it waits for data from RAM. For example, I've seen people who skipped that part of their education think that XOR linked lists and skip lists are still good data structures to use.

Comment Re:Good grief... (Score 4, Insightful) 681

You're paraphrasing Dijkstra, but missing his point. Astronomers, in general, know a heck of a lot about optics. His point wasn't to excuse ignorance of how computers work (he worked on the design of the STANTEC ZEBRA and wrote an incredibly scathing review of the IBM1620, for example, so clearly knew his way around the design process), it was to point out that this is a building block.

I'd consider any computer science curriculum that doesn't cover, at a minimum, logic gates up through building adders, the basics of pipelining, the memory hierarchy, and virtual memory translation to have seriously skimped on computer architecture. The better ones will include the design and simulation (on an FPGA if budgets permit) of a simple pipelined processor.

If you want to work on compilers or operating systems, to give just two examples, then you need a solid grasp of computer architecture.
