Comment Re:Microcode switching (Score 2) 161

This same myth keeps being repeated by people who don't really understand the details of how processors work internally.

Actually, YOU are wrong.

You cannot just change the decoder; the instruction set affects the internals a lot:

All the reasons you list could be "fixed in software".

No, they cannot. Or the software will be terribly slow, with something like a 2-10x slowdown.

The fact that silicon designed by Intel handles opcodes in a way that is a little better optimized toward being fed from an x86-compatible frontend is just a specific optimisation.

Opcodes are irrelevant. They are easy to translate. What matters are the differences in the semantics of the instructions.
X86 instructions update flags. This adds dependencies between instructions. Most RISC processors do not have flags at all.
These are the semantics of the instructions, and they differ between ISAs.
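
To make the flags point concrete, here is a minimal C sketch of what a translator targeting a flag-less backend has to emit for the x86 pair "sub eax, ebx; jz target". The struct and function names are invented for illustration; the point is that the flag writes are extra explicit work, and they are what creates the dependency between the two instructions:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t eax, ebx;
        bool zf, sf, cf, of;   /* emulated subset of EFLAGS */
    } guest_state;

    /* "sub eax, ebx": on x86 silicon the flag updates are free ALU outputs;
       a flag-less backend has to compute them with extra operations. */
    void translated_sub_eax_ebx(guest_state *g)
    {
        uint32_t a = g->eax, b = g->ebx, r = a - b;
        g->eax = r;
        g->zf = (r == 0);
        g->sf = (int32_t)r < 0;
        g->cf = a < b;                               /* unsigned borrow */
        g->of = (((a ^ b) & (a ^ r)) >> 31) != 0;    /* signed overflow */
    }

    /* "jz target": the branch consumes the zf produced above. */
    bool translated_jz_taken(const guest_state *g)
    {
        return g->zf;
    }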

Simply doing the same stuff with another RISCy back-end, i.e. interpreting the same ISA fed to the front-end, will simply require each x86 instruction to be executed as a different set of micro-instructions. (Some that are handled as a single ALU opcode on Intel's silicon might require a few more instructions, but that's about the only difference.)

The backend micro-instructions in x86 CPUs are different from the instructions in RISC CPUs. They differ in the small details I tried to explain.

You could switch the frontend and speak a completely different instruction set. Simply, if the two ISAs are radically different, the result wouldn't be as efficient as a chip designed with that ISA in mind. (You would need a much bigger and less efficient microcode, because of all the reasons you list. They won't STOP Intel from making a chip that speaks something else.)

Intel did this: they added an x86 decoder to their first Itanium chips. And they did not only add the frontend; they added some small pieces to the backend so that it could handle those strange x86 semantic cases nicely.
But the performance was still so terrible that nobody ever used it to run x86 code, and then they created a software translator that translated x86 code into Itanium code, which was faster, though still too slow.

Not only is this possible, but this was INDEED done.

There was an entire company called "Transmeta" whose business was centered around exactly that:
Their chip, the "Crusoe", was compatible with x86.
- But their chip was actually a VLIW chip, with the front-end being 100% pure software. Absolutely as remote from a pure x86 core as possible.

The backend of Crusoe was designed completely with x86 in mind; all the execution units contained the small quirks in a manner which made it easy to emulate x86 with it. The backend of Crusoe contains things like:

* 80-bit FPU,
* x86-compatible virtual memory page table format (one very important thing I forgot from my original list a couple of posts ago; memory accesses get VERY SLOW if you have to emulate virtual memory)
* support for partial register writes (to emulate 8- and 16-bit subregisters like al, ah, ax; see the sketch below)

All these were made to make binary translation from x86 easy and reasonably fast.
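
To illustrate the partial-register point, here is a minimal C sketch (invented names, not Transmeta's actual mechanism) of what writing AL, AH or AX means for a backend that only has full-width registers: every partial write becomes a read-modify-write merge of the old EAX value, unless the hardware supports sub-register writes directly.

    #include <stdint.h>

    /* Emulated EAX. Writing a subregister must preserve the untouched bits,
       which without hardware support costs an extra mask-and-merge per write. */
    uint32_t eax;

    void write_al(uint8_t v)  { eax = (eax & 0xFFFFFF00u) | v; }
    void write_ah(uint8_t v)  { eax = (eax & 0xFFFF00FFu) | ((uint32_t)v << 8); }
    void write_ax(uint16_t v) { eax = (eax & 0xFFFF0000u) | v; }

Hardware support for partial writes lets the translated code use a single operation instead of the mask-and-merge sequence above.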

Comment Re:This is a myth that is not true (Score 4, Informative) 161

Some of what you said is legitimate. Most of it is irrelevant, since it does not speak to the postulate. You're speaking of issues which will affect performance. So what? You'd have a less-performant processor in some cases, and it would be faster in others.

No.

1) If the condition codes work totally differently, they don't work.

2) The data paths needed for separate and combined FP and integer regs are so different that it makes absolutely NO sense to have them combined in a chip that runs the x86 ISA, even though it's possible.

3) If you don't have those x86-compatible address calculation units, you have to break most memory ops into more micro-ops OR even run them with microcode. Both are slow. And if you have a RISC chip, you want to have only the address calculation units you need for your simple base+offset addressing.

4) In the basic RISC pipeline there are two input operands and one output per instruction. There are no data paths for two results, so you cannot execute operations with multiple outputs, such as the x86 multiply which produces 2 values (low and high part of the result), unless you do something VERY SLOW (see the sketch after this list).

6) IF your RISC instruction set says you have only aligned memory operations, you design your LSU to handle only those, as it makes the LSUs much smaller, simpler and faster. But you need unaligned accesses for x86.

9) If your FPU calculates with a different bit width, it calculates wrongly.

And
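
A small C sketch of point 4 above (helper names invented): a single 32-bit x86 MUL writes both halves of the product into EDX:EAX, while a backend with one result per operation has to issue two separate multiplies, or something slower, to get the same two values.

    #include <stdint.h>

    /* What one x86 MUL produces in a single instruction: both halves. */
    typedef struct { uint32_t lo, hi; } mul_result;

    mul_result x86_mul32(uint32_t a, uint32_t b)
    {
        uint64_t p = (uint64_t)a * b;
        return (mul_result){ .lo = (uint32_t)p, .hi = (uint32_t)(p >> 32) };
    }

    /* What a one-result-per-op backend has to do instead: two operations. */
    uint32_t mul_lo(uint32_t a, uint32_t b) { return a * b; }
    uint32_t mul_hi(uint32_t a, uint32_t b) { return (uint32_t)(((uint64_t)a * b) >> 32); }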

Comment Re:isn't x86 RISC by now? (Score 1) 161

After AMD lost the license to manufacture Intel i486 processors, together with other people, they were forced to design their own chip from the ground up. So they basically used one of the 29k RISC processors and put an x86 frontend on it.

This was their plan, but it ended up being much harder than they originally thought, and the K5 came out much later, much different and much slower than planned. There are quite a lot of things that have to be done differently (some of them are explained in another post of mine).

Comment This is a myth that is not true (Score 5, Informative) 161

That is correct. Every time this comes up I like to spark a debate over what I perceive as the uselessness of referring to an "instruction set architecture" because that is a bullshit, meaningless term and has been ever since we started making CPUs whose external instructions are decomposed into RISC micro-ops. You could switch out the decoder, leave the internal core completely unchanged, and have a CPU which speaks a different instruction set. It is not an instruction set architecture. That's why the architectures themselves have names. For example, K5 and up can all run x86 code, but none of them actually have logic for each x86 instruction. All of them are internally RISCy. Are they x86-compatible? Obviously. Are they internally x86? No, nothing is any more.

This same myth keeps being repeated by people who don't really understand the details of how processors work internally.

You cannot just change the decoder; the instruction set affects the internals a lot:

1) Condition handling is totally different in different instruction sets. This affects the backend a lot. X86 has a flags register; many other architectures have predicate registers, some of them multiple predicate registers holding different conditions.

2) There are totally different numbers of general purpose and floating point registers. The register renamer makes this a smaller difference, but then there is the fact that most RISCs use the same registers for both FPU and integer, while X86 has separate registers for both. And this totally separates them; the internal buses between the register files and function units in the processor are done very differently.

3) Memory addressing modes are very different. X86 still does relatively complex address calculations in a single micro-operation, so it has more complex address calculation units (see the sketch after this list).

4) Whether there are operations with more than 2 inputs or more than 1 output has quite a big impact on what kind of internal buses are needed and how many register read and write ports are needed.

5) There are a LOT of more complex instructions in the X86 ISA which are not split into micro-ops but handled via microcode. The microcode interpreter is totally missing on pure RISCs (but exists on some not-so-pure RISCs like POWER/PowerPC).

6) The instruction set dictates the memory alignment rules. Architectures with stricter alignment rules can have simpler load-store units.

7) The instruction set dictates the multicore memory ordering rules. This may affect the load-store units, caches and buses.

8) Some instructions have different bit widths in different architectures. For example x86 has N x N -> 2N wide multiply operations which most RISCs don't have. So x86 needs a bigger/different multiplier than most RISCs.

9) X87 FPU values are 80 bits wide (truncated to 64 bits when storing/loading). Practically all other CPUs have a maximum of 64-bit wide FPU values (though some versions of POWER also have support for 128-bit FP numbers).
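
To make points 3 and 4 concrete, here is a rough C sketch (invented micro-op boundaries, not any vendor's actual decomposition) of what a single x86 read-modify-write instruction like "add [ebx + esi*4 + 16], eax" amounts to. An x86 core with a matching address calculation unit does the address in one micro-op; a backend with only base+offset addressing needs extra operations for the scaled index, on top of the load, the add with its flag updates, and the store.

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint32_t eax, ebx, esi; } regs;

    void translated_add_mem(regs *r, uint8_t *mem)
    {
        uint32_t addr = r->ebx + r->esi * 4 + 16; /* address generation: one AGU
                                                     micro-op on x86, shift+add+add
                                                     on a base+offset-only backend */
        uint32_t old, sum;
        memcpy(&old, mem + addr, 4);              /* load (may be unaligned on x86) */
        sum = old + r->eax;                       /* add, plus the flag updates that
                                                     x86 hardware produces for free  */
        memcpy(mem + addr, &sum, 4);              /* store */
    }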

Comment The article is bad - mfg technology dominates (Score 2) 161

They are seriously comparing some 90 nm process with much better Intel 32 nm and 45 nm processes.

They have just taken some random cores made on random (and incomparable) manufacturing technologies, thrown a couple of benchmarks at them and tried to declare universal results based on these.

A few facts about the benchmark setup and the cores:

1) They use an ancient version of GCC. ARM suffers from this much more than x86.
2) Bobcat is a relatively balanced core with no bad bottlenecks. Its mfg tech is cheap, not high performance, but relatively small/new.
3) Cortex A8 and A9 are really starved by bad cache design. The newer A7 and A12 would be similar in area and power consumption but much better in performance and performance/power. They are also manufactured on old, cheap mfg processes, which hurts them. Use modern manufacturing tech and the results are much better.
4) Their Loongson is made on ANCIENT technology. With modern mfg tech it would be many times better in performance/power.
5) The Cortex A15, even though made on a 32 nm process, is on a cheap process, not much better than Intel's 45 nm process and much worse than Intel's 32 nm. Also it's known to be a "power hog" design. Qualcomm's Krait has a similar performance level, but with much lower power.

Comment Kaveri is much better as PC chip (Score 2) 105

- Single-thread performance matters much more than multi-thread performance, and Kaveri has almost twice the single-thread performance of the Xbone and PS4 chips.

- Memory bandwidth is expensive. You either need a wide and expensive bus, or expensive low-capacity graphics DRAM which needs soldering and means you are limited to 4 GiB of memory (with the highest-capacity GDDR chips out there), with zero possibility of upgrading it later, or both (and MAYBE get 8 GiB of soldered memory). Though there have been rumours that Kaveri might support GDDR5, for configurations with only 4 GiB of soldered memory.

- And when you have that limited memory bandwidth, it does not make sense to waste die space on creating a monster GPU which is starved by the lack of bandwidth.

- ALL the mentioned chips are of the same generation. All support cache-coherent unified memory.

As a PC chip, Kaveri makes much more sense:

- Software that matters on the PC cannot use 8 threads. Kaveri is much faster at most software.
- The weaker GPU side, the ability to use cheap DDR3, and the narrower memory bus make the Kaveri chip and Kaveri-based systems cheaper to manufacture.
- The CPU can be socketed, with no need to be soldered, and the memory chips can use DIMMs instead of being soldered to the motherboard. This gives the ability to upgrade things, and lets system manufacturers easily create different configurations.

Comment Aluminium is the fuel, not water (Score 5, Informative) 85

Argh, again this kind of misleading headline that makes the people who only read the headline think a perpetual motion machine has finally been invented.

The energy comes from the aluminium, from aluminium "burning" into aluminium oxide.
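
For reference, the underlying reaction is the well-known aluminium-water reaction, which is strongly exothermic; the hydrogen is a by-product that carries part of the aluminium's chemical energy, not a free energy source:

    2 Al + 3 H2O -> Al2O3 + 3 H2 (+ heat)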

Putting the "converting water into hydrogen" part into the headline is misleading reporting.

Comment The world needs a good and open mobile OS (Score 2) 63

WP has crappy multitasking, and all your data are belong to Microsoft.

With iOS all your data are belong to Apple. And everything is controlled by Apple.

With Android all your data are belong to Google, and performance is bad.

With Symbian the user owns his data, but performance is bad, SW development is a real pain, and the UI is bad. RIP.

What is needed is an operating system that allows the user to own his data, has good performance, and allows the user to use the device the way he wants.
MeeGo/Mer gives this.

Comment Jolla does not mean a rescue boat - too small (Score 3, Informative) 200

I'm a Finn, so I know what "Jolla" means.

Jolla means a very small sailing boat - not meant for rescue, but meant for people who want to go sailing alone on a very small boat
(who either cannot afford a bigger boat or just like very small boats).

Jollas can't be used as rescue boats; they are too small for that.

Comment Triathlon bike (Score 1) 356

Tires and most components are like those of road racing bikes, but it has aerobars and bullhorn bars instead of drop bars.

And it has more aggressive geometry, a 78-degree seat angle instead of 73 degrees.

So it's faster than a road racing bike, but illegal to use in road races.

Simply the fastest way to travel under your own power; recumbents may be faster on a straight road, but they are too clumsy for city traffic.

Comment Re:Take your time, let software catch up. (Score 3, Informative) 149

Software isn't the bottleneck. Caches are *tiny* compared to the size of even single functions in modern programs, which means they get flooded repeatedly, which in turn means that you're pulling from main memory a lot more than you'd like.

Wrong.

The code size of an average function is much smaller than the instruction cache of any modern processor.
And then there are the L2 and L3 caches.

Instruction fetch needing to go to main memory is quite rare.
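
As a rough back-of-the-envelope check (typical figures, not taken from the article): a 32 KiB L1 instruction cache, at an average x86 instruction length of about 4 bytes, holds on the order of

    32 KiB / ~4 bytes per instruction ≈ 8,000 instructions

while most compiled functions are tens to a few hundred instructions long.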

And as for data... it depends totally on what the program does.

Multi-core CPUs aren't (as a rule) fully independent - they share caches and share I/O lines, which in turn means that the effective capacity is slashed as a function of the number of active cores. Cheaper ones even share(d) the FPU, which was stupid.

None of the CPUs sharing an FPU between multiple HW threads are cheap.

Sun's Niagara I had a slow shared FPU, but the chip was not cheap.

AMD Bulldozer, which usually has sucky performance, sucks less on code which uses the shared FPU.

FPU operations just have long latencies and there are always lots of data dependencies, so in practice you cannot utilize the FPU well from one thread; you need to feed it instructions from multiple threads.

Intel uses HyperThreading for this; on AMD Bulldozer it's CMT with a shared FPU per module.
GPUs are barrel processors for the same reason.
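
A small generic C illustration of the latency point (not tied to any particular CPU): with one accumulator every add waits for the previous one, so the loop runs at one add per FP-add latency; several independent accumulators, like instructions from several hardware threads, keep the FPU busy.

    #include <stddef.h>

    /* One accumulator: a single dependency chain, limited by add latency. */
    double sum_serial(const double *x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Four independent accumulators: four chains in flight, the same trick
       SMT/CMT plays with instructions from other threads.
       (Assumes n is a multiple of 4 to keep the sketch short.) */
    double sum_four_chains(const double *x, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }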

The bottleneck problem is typically solved by increasing the size of the on-chip caches OR by adding an external cache between main memory and the CPU.

Much more often the bottleneck is between the levels of the chip's caches.
The big outer-level caches are slow, and processors quite often spend short stretches waiting for data coming from them. And if you increase the size of the last-level caches, you make them even slower.

One of the reasons for Bulldozer's sucky performance is that it has small L1 caches (so it needs to fetch data from the L2 cache often), but a big and slow L2 cache. So this relatively long L2 latency happens quite often.

External cache... has not been used for about 10 years by Intel or AMD. It's either slow or expensive, and usually both. Now that even internal caches can easily be made with sizes over 10 megabytes, an external cache has to be very expensive in order to compete with internal caches, and even then it only makes sense for some server workloads.
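
For a sense of scale, typical load-to-use latencies on desktop cores of that era are roughly (ballpark figures only, they vary a lot between designs): L1 around 4 cycles, L2 around 10-15 cycles, a big last-level cache around 30-40 cycles, and main memory at around 60-100 ns, which on a ~3 GHz core is roughly

    3 cycles/ns x 70 ns ≈ 200 cycles.

That gap between levels is why a small, frequently-missing L1 in front of a slow L2, as on Bulldozer, hurts so much.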

After that, it depends on whether the bottleneck is caused by bus contention or by slow RAM. Bus contention would require memory to be banked with each bank on an independent local bus. Slow RAM would require either faster RAM or smarter (PIM) RAM. (Smart RAM is RAM that is capable of performing very common operations internally without requiring the CPU. It's unpopular with manufacturers because they like cheap interchangeable parts and smart RAM is neither cheap nor interchangeable.)

Smart RAM is a dream, and a research topic in universities. It's uncommon because it does not (yet) exist.

And most problems/algorithms are not solvable by "simple" smart RAM that can only operate on data located near each other. And if you try to make it even smarter, you end up making it costlier and slower; it becomes just a chip with a multicore processor and memory on the same die.

There are some computational tasks where smart RAM would improve performance by a great margin, but for the >90% of other problems it has very little use.

Comment End of a failed experiment, nobody loses (Score 1) 246

When Nokia decided to open source Symbian, they did not understand how the open source software/development model works.
You cannot take a closed-source codebase and think that when you open the source, suddenly thousands of other people will come to help you and do your work for you.

The number of people outside Nokia and a few other big phone manufacturers who actually downloaded the Symbian OS source code and successfully compiled it themselves can probably be counted on the fingers of one hand.

Symbian OS was a piece of terribly written code that used very strange coding practices, and getting it to compile was quite hard. Just giving this kind of code to the public does not create any kind of "open source momentum". People are not interested in contributing to it when they cannot get any benefit from their changes, and when making those changes is so hard.

So, in the end, nobody loses when the source is closed.
