Will We Ever Get Rid Of ASCII?
GeZ asks: "When will Unicode finally replace ASCII? When will 7-bit-encoded text finally disappear? When will 'extended' chars (like 'é' or 'ß', etc.) be recognized as 'alphanumerics', letting us use all the characters we want for file names, function names, and DNS names? Most top-level modern apps and standards use Unicode, so it deserves to be integrated at the lowest level, now. I really think old ASCII is too limited and fragmented to be useful. Using metachars in an ASCII file (a la HTML entities) is a boring way to solve the problem. A perfect integration with OSes (and base libraries) will "magically" make nearly all apps Unicode compliant, no? Yes, text chars will be encoded on 16 bits instead of 7 or 8, which would double text file size, but is that really troublesome, given today's storage media?" Do any of you think that Unicode will completely replace ASCII, or are there reasons why it's still in use as the primary way to represent text characters?
Ideo scripts need characters. (Score:2)
Who really needs 65536 different characters?
again, a char is supposed to be the same size as a byte
Nowhere in the C standard does it say that bytes must be 8 bits. Some C/C++ compilers for DSP architectures set char = short = int = long = 32 bits and still comply with the standard. There's also the wchar_t data type.
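A quick way to check this on any given compiler (a minimal sketch; the printed values vary by platform and are not fixed by the standard):

#include <limits.h> /* CHAR_BIT: bits in a char, at least 8 but not necessarily 8 */
#include <stdio.h>
#include <wchar.h>  /* declares wchar_t */

int main(void)
{
    printf("bits per char:   %d\n", CHAR_BIT);
    printf("sizeof(wchar_t): %u bytes\n", (unsigned)sizeof(wchar_t));
    return 0;
}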
Re:Worse (Score:2)
Unicode is one of those "semi-proprietary" standards -- the documents aren't available for free (be they from ISO or from the Unicode Consortium), though there are no legal barriers to making an implementation -- it's just that the sheer size of the table makes the job of creating fonts unreasonably huge. OTOH, the tables necessary for determining what the characters are can be had for free.
The problem, however, is different -- people already use their own charsets, and those charsets were designed to reflect the structure of their language, or just to be most convenient for it, and are sometimes quite different from the part of Unicode that is supposed to cover the same language. Suppose that, instead of trying to _convert_ everything to Unicode, people adopted a reasonable way (ISO 2022 isn't reasonable) to label which charset and which language appear where in their strings. An implementation could then use all known charsets, and programs unconcerned with charset-dependent operations could ignore the whole thing and treat text as a sequence of bytes, until charset-specific procedures are called to process/display/compare/convert/input/... the text -- that is where the "real" size and mapping of characters would emerge. Those procedures can be language-dependent, replaceable, and expandable if they implement an easy mechanism for mapping charset/language names to sets of procedures. Unicode could be one possible charset in such a system, and UTF-8 one possible encoding, but neither would be "the" thing that everyone is supposed to support and be aware of. At most, some programs would have to know what the label delimiters look like.
It could be a very easy solution to the real problem, but it requires agreement on how charsets/languages should be labeled (their "real" names should be used, to keep the scheme expandable; how those labels should be separated from "normal" text remains an open question).
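To make the idea concrete, here is a hypothetical sketch in C -- the struct fields and the registry are invented for illustration, since the post deliberately leaves the label syntax open:

#include <stddef.h>

/* A labeled run of text: charset/language names plus raw bytes.
   Code that doesn't care about text just passes the bytes through. */
struct text_run {
    const char *charset;            /* e.g. "koi8-r", "iso8859-1", "utf-8" */
    const char *language;           /* e.g. "ru", "en"; may be NULL */
    const unsigned char *bytes;
    size_t len;
};

/* Registry entry mapping a charset name to its processing procedures;
   new charsets are supported by registering more entries. */
struct charset_ops {
    const char *name;
    size_t (*char_count)(const unsigned char *bytes, size_t len);
    int (*compare)(const struct text_run *a, const struct text_run *b);
};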
Switching to ASCII might be faster (Score:1)
Transliterating to Roman letters and using ASCII (especially since more people already speak English as a second language than any other language in the world) may simply be easier and faster.
OK... (Score:2)
Your explanation helps a lot - not that I have any use for such info, but I was curious about it. Between your explanation and the byte conversion array the other guy gave, I should be able to figure it out further.
Now, what does this say about EBCDIC and ASCII -- which came first? It sounds like ASCII came first, but what's the real answer?
Thank you! (Score:2)
Worse (Score:2)
Well, mebbe not. I am waiting to hear more comments from non-English slashdotters on this subject, the comments so far reflect a definite world view -- the English world.
Non-English slashdotters who use iso8859-1 most likely see Unicode (or UTF-8) as a good thing, because the first 256 characters of Unicode are the same as iso8859-1, and they don't give a damn about everything else. Non-English slashdotters who use other local encodings/charsets (like me, whose native language is Russian, with koi8-r as the charset used on unixlike systems) see Unicode as a monstrosity, forced on them by a bunch of dumbasses at the Unicode Consortium, ISO, and software vendors that benefit from every incompatibility that can force people to upgrade.
If charset/language labeling were standardized, everyone would be able to use their own charset, and all software not directly involved in text editing/displaying would keep working as it did before. Instead, by a STUPID decision of the "standards bodies", priority is given to sticking "should support Unicode/UTF-8" into every standard, in place of "should pass the data as a stream of bytes, regardless of the actual size of characters, their encoding, and their possible meaning, except for the special characters involved in the protocol" -- which would actually accomplish something.
Re:Never kill ASCII, please! (Score:1)
Re:Ideo scripts need characters. (Score:1)
8 bits - char
16 bits - int
32 bits - long int
I've also heard of an 80-bit integer, but I'm sure that's rumor and hearsay.
When the pack animals stampede, it's time to soak the ground with blood to save the world. We fight, we die, we break our cursed bonds.
Re:No, you haven't (Score:1)
In fact, I'm going to continue my lobby for a separate language -- American! Heck, we never use 'bobby', 'lorry', or 'wc'. Not to mention 'football' means something different in American than in English or any other language. We'll stick with our 'coney dogs' and other fun local slang.
[/stupidity]
We should at least make English the national language, and make it a crime to conduct schools and businesses in other languages [/John Rocker]
Guess I can't win... though I'd rather not. ASCII isn't going away anytime soon, but Unicode is a Good Thing TM.
ZZT (Score:1)
OT, but check out www.planetzztpp.com -- they're working on a Linux version!
Re:Forward Into the Past! (Score:1)
Before converting, we need an interface... (Score:2)
But someone, please, tell me the easiest way to type ü (u-umlaut) in Windows? One of the things I do on my Mac that shocks people is typing foreign characters right in the flow (opt-u, u is u-umlaut; opt-u, e is e-umlaut; etc.). I think one of the reasons no one wants to move from ASCII to anything else is that it's rather hard to type in anything else.
Just my 2¢.
Re:I think unicode would be best, due to utf-8 (Score:1)
Re:I think unicode would be best, due to utf-8 (Score:1)
I'm confused. ASCII is 7-bit, right? If UTF-8 is 8-bit, then I don't see how the two are always interchangeable, unless UTF-8 always zeroes the high bit. If that's the case, why not UTF-7?
Re:Before converting, we need an interface... (Score:1)
Never kill ASCII, please! (Score:2)
The limitations of ASCII make searching texts and code a lot easier. I _like_ restrictions on function and variable names.
Of course, something like is_ascii might be enough for such a backwards-compatibility hack.
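A minimal sketch of such a check (the name is_ascii is just a placeholder):

#include <stddef.h>

/* Returns 1 if every byte is in the 7-bit ASCII range, 0 otherwise. */
static int is_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] > 0x7f)
            return 0;
    return 1;
}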
I think unicode would be best, due to utf-8 (Score:3)
Since UTF-8 is a variable-length encoding of Unicode, it can look exactly like ASCII to an ASCII machine.
The best part is that UTF-8 requires no change: ASCII programs can read UTF-8 (as long as it sticks to the ASCII range), and UTF-8 programs can read ASCII. Therefore all Unicode programs can read and write ASCII, and all ASCII programs can read and write a Unicode subset.
To top it off, if a file does use the extended Unicode stuff (code points beyond ASCII), it will just look like line noise to an ASCII machine, and like a normal document in whatever language to a Unicode machine.
There won't be any file size increase for ASCII characters, but extended characters need one or more additional bytes.
In conclusion, Unicode will completely replace ASCII, and almost no one (in English-speaking countries, at least) will notice.
Example:
ASCII 'A' == 65, or 1000001.
Unicode/UTF-8 'A' == 65, or 1000001.
There won't be any problems here.
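To see why, dump the bytes of a UTF-8 string: ASCII characters keep their ASCII values, while anything beyond U+007F becomes a multi-byte sequence. A minimal sketch (the escape sequence hard-codes the two UTF-8 bytes for 'é'):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char s[] = "A\xc3\xa9";   /* "Aé" in UTF-8 */

    for (size_t i = 0; i < strlen((const char *)s); i++)
        printf("byte %u: 0x%02x\n", (unsigned)i, s[i]);
    /* prints 0x41 ('A', same as ASCII 65), then 0xc3 0xa9 for 'é' */
    return 0;
}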
ASCII to EBCDIC conversion (Score:1)
/* Map each of the 256 byte values to its EBCDIC equivalent;
   bytes with no sensible mapping fall back to 0x4b ('.'). */
static const unsigned char
_atoe_[] = {
0x4b, 0x01, 0x02, 0x03, 0x37, 0x2d, 0x2e, 0x2f,
0x16, 0x05, 0x25, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
0x10, 0x11, 0x12, 0x13, 0x3c, 0x3d, 0x32, 0x26,
0x18, 0x19, 0x3f, 0x27, 0x4b, 0x4b, 0x4b, 0x4b,
0x40, 0x5a, 0x7f, 0x7b, 0x5b, 0x6c, 0x50, 0x7d,
0x4d, 0x5d, 0x5c, 0x4e, 0x6b, 0x60, 0x4b, 0x61,
0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7,
0xf8, 0xf9, 0x7a, 0x5e, 0x4c, 0x7e, 0x6e, 0x6f,
0x7c, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7,
0xc8, 0xc9, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6,
0xd7, 0xd8, 0xd9, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6,
0xe7, 0xe8, 0xe9, 0x4b, 0xe0, 0x4b, 0x5f, 0x6d,
0x79, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
0x88, 0x89, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96,
0x97, 0x98, 0x99, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6,
0xa7, 0xa8, 0xa9, 0xc0, 0x6a, 0xd0, 0xa1, 0x07,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b,
0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b, 0x4b
};
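For completeness, a usage sketch -- ascii_to_ebcdic is an illustrative wrapper, not part of the original post:

#include <stddef.h>

/* Convert a buffer in place by indexing the _atoe_ table with each byte. */
static void ascii_to_ebcdic(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = _atoe_[buf[i]];
}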
Re:Never kill ASCII, please! (Score:1)
Look at EBCDIC. Still used for terminals in businesses. Look in a Wards or a Home Depot. They're everywhere.
Teacher: "So the government wanted IBM to make an encryption standard, and IBM did. It was named..."
Student: "EBCDIC?"
Answer: The US of A (Score:1)
Of course, I haven't got the numbers to back this up.
--Bud
Why would we want to? (Score:1)
Cheers,
Rick Kirkland
No, you haven't (Score:1)
Re:I think unicode would be best, due to utf-8 (Score:2)
Not quite right. Unicode itself is fixed-size (16 bits); UTF-8 is a variable-length encoding of Unicode which, _if_ the text consists entirely of the 7-bit ASCII subset, will look exactly like ASCII. Other characters take two bytes (0x80 through 0x7FF) or three bytes (0x800 through 0xFFFF).
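A minimal encoder sketch covering those ranges (the function name and signature are invented for illustration):

#include <stddef.h>

/* Encode one code point (up to U+FFFF) as UTF-8; returns the byte count. */
static size_t utf8_encode(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80) {            /* 7-bit ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {    /* U+0080..U+07FF: two bytes */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    } else {                    /* U+0800..U+FFFF: three bytes */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
}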
Magically compliant... (Score:2)
No.
Remember, there's a large amount of plain ol' text lying around. Heck, all of the web (including Slashdot) is essentially just ASCII with SGML entities. Nobody will suggest converting all of this to straight Unicode.
This is why there's UTF-8, a variable-length version of Unicode that's essentially backwards-compatible.
But that's not the whole problem. You mention implementing Unicode/UTF-8 in libraries and OSes to get "magical compliance." No such luck. A lot of code out there (including some of my own) assumes that byte == char, so people use char * and do pointer arithmetic to parse. That's fine with 8-bit text, but what happens when you move to 16-bit text, or, with UTF-8, to variable-length chars? Things break.
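A concrete illustration of the byte == char assumption going wrong, assuming the string literal below is UTF-8 encoded:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "h\xc3\xa9llo";  /* "héllo": 'é' is two bytes in UTF-8 */

    printf("bytes: %u\n", (unsigned)strlen(s));  /* 6 bytes, not 5 characters */
    /* Any parser that steps with s[i] or p++ to advance "one character"
       lands in the middle of the two-byte sequence for 'é'. */
    return 0;
}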
However, getting good, solid implementations of UTF-8 into core libraries and OSes will help a lot. Right now there really isn't one standard API for handling UTF-8 text. The new glibc has a good implementation, but if you want to write portable code, that's a problem -- you don't have glibc on all systems (e.g. *BSD, Solaris...).
But the day will soon come when programs that are not Unicode/UTF-8 compliant are in the minority.
-Geoff
ASCII is good... (Score:1)
ASCII uses 7 or 8 bits (usually 8 these days); Unicode uses 16 -- twice as much. For things like embedded systems, where memory can be in short supply, there's no need to double the space used for text storage.
Yes, you can do clever tricks like compress the data, and make up your own encoding scheme. And yes, memory is (relatively) cheap these days, but even so...
Maybe... (Score:1)
It would be a mistake. (Score:1)
Switching to Unicode would be like putting a gun to your head and pulling the trigger.
When the pack animals stampede, it's time to soak the ground with blood to save the world. We fight, we die, we break our cursed bonds.
One thing I have wondered (slightly offtopic)... (Score:2)
bigger -is- better (Score:1)
There is an opportunity (or was -- some vendors have missed it), when converting an OS from 32-bit to 64-bit, to also build in support for Unicode. Let's face it, the current multi-byte encoding schemes (1, 2, or 3 bytes per character, varying) are a pain, while Unicode is a breeze to use. Try writing some internationalized Java if you don't believe me.
Internationalization (Score:1)
The problem is that doing so involves programming effort, and with the quick development cycles of today, that makes use of Unicode really unlikely unless it's a very international company.
-L
Re:Magically compliant... (Score:1)
BTW, '+' is an abomination -- it's not actually in the spec (go check if you don't believe me). It was in an internet-draft that didn't make it to the RFC stage, but some 'early adopter' foisted it upon us, so most browsers support it.
There has been some work on DNS standards to include Unicode names, which would then be used in URLs, although the proposal there is somewhat different (essentially '-xx' instead of '%xx'). See http://search.ietf.org/internet-drafts/draft-osca
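For reference, a minimal sketch of the %xx escaping being discussed (the helper name is invented):

#include <stdio.h>

/* Percent-encode a single byte as %XX, RFC 2396 style. */
static void pct_encode(unsigned char c, char out[4])
{
    sprintf(out, "%%%02X", c);   /* e.g. 0xFC ('ü' in Latin-1) -> "%FC" */
}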
Forward Into the Past! (Score:1)
Build computers that address memory in 16-bit chunks!
char == 16 bits
short == 2 chars == 32 bits
long == 2 shorts == 64 bits
int == pointer == 32/64 bits, depending on model of CPU
This is exactly the way C worked on PDP-11s, etc. All existing code would recompile just fine, but would "magically" start using Unicode instead of ASCII. Yeah, the table used by isascii and its friends would suddenly grow to 64K entries (remember, those are 16-bit bytes, not 8-bit!), but memory's cheap and getting cheaper. Such a table would cost less today than the 256-byte version did in 1970.
Re:Magically compliant... (Score:1)
A less critical point that I've seen in my own code: some of it will only process data below 0x7f (i.e., within the first 7 bits), and usually it processes a subset of those values and ignores the rest. While this wouldn't break under Unicode, it would ignore everything outside the ASCII range.
Re:Worse (Score:1)
While in principle I feel there is a genuine need for a globally unified character set, I've heard that Unicode is proprietary. Is this true? If so, how does it affect attempts to support it in, for example, Linux?
Re:One thing I have wondered (slightly offtopic).. (Score:2)
BTW, my pseudo-values for the high-order nybbles follow from the zone punches that were overpunched. The top row was the "Y" zone, then came the "X" zone, and then the "zero" zone.
Most Windows programmers would switch if... (Score:1)
One of the few things that Microsoft actually did reasonably well was to build Unicode support into Windows NT. It's possible to write a program where your char is a Unicode char without too much trouble. Unfortunately, such a program will not run on Windows 95/98, which don't offer much in the way of Unicode support.
So...
Once Windows 9x dies, Unicode will become vastly more prevalent, at least on M$ platforms.
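In practice that means Win32's TCHAR machinery: the same source compiles to wide characters when UNICODE is defined and to plain chars when it isn't. A minimal, Windows-only sketch:

#include <windows.h>
#include <tchar.h>   /* TCHAR and _T() switch between char and wchar_t */

int main(void)
{
    /* One source, two builds: wide chars on NT, 8-bit chars on 9x. */
    const TCHAR *msg = _T("Hello, Unicode");
    MessageBox(NULL, msg, _T("Demo"), MB_OK);
    return 0;
}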
Re:ASCII to EBCDIC conversion (Score:1)
Guy asks about conversion between ASCII and EBCDIC (chart/table), so I put one up... off-topic my arse!!!
[/bitch-and-moan]
Re:Never kill ASCII, PERIOD! (Score:1)
I'd never use an OS that wasn't ascii-based at the lowest (character) level.
---
script-fu: hash bang slash bin bash