Communications

Journal: Why would you use MS Word to make online posts or comments?

The <textarea> box was invented for a reason, people! No matter how heavily you format your content in Word, you're going to end up with near-plaintext in the end anyhow. There is no need to use an outside editor for your fucking posts and replies, even if you're only using it to spell-check. Plenty of browser extensions and standalone programs add spell-checking to web forms, so bad spelling and/or grammar doesn't fly as an excuse.

What annoys me most is the typographic symbols that come along for the ride, such as “these” curly quotes, – dashes, and things like that. You see, in cases where you want "proper" quotations (i.e. you don't want to use the neutral quotation characters like I just did), there is the <q> tag: the browser puts the angled quotation marks before and after the <q> tag's contents for you. Another misconception is that apostrophes are supposed to be an angled single quotation mark; they're not. Use the damn &apos; entity if you want to get semantic about it (keep in mind it is defined for XML/XHTML and HTML5, not HTML 4), or just use ' as it's the proper character.

Edit: the “proper” Unicode characters to be used are U+2018/U+2019 for single quotes and U+201C/U+201D for double quotes. A grave accent (`, U+0060) used as an opening quote could probably be replaced with U+2018, and an acute accent (U+00B4) used as a closing quote with U+2019.
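If you'd rather strip the typographic characters out of submitted text than argue about them, something like the following would do it. This is only a sketch: the function name `plain_punctuation` and the choice of which characters map to what are my own assumptions, not anything a particular forum package does.

```python
# Map typographic punctuation to neutral ASCII characters.
# The table is an illustrative assumption; extend or trim it to taste.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2026": "...",  # horizontal ellipsis
})


def plain_punctuation(text: str) -> str:
    """Return text with the characters above replaced by ASCII."""
    return text.translate(SMART_TO_PLAIN)


print(plain_punctuation("\u201cthese\u201d \u2013 it\u2019s\u2026"))
```

The translation runs on decoded text (a `str`), so it only helps once the bytes have been decoded with the right character set in the first place.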

Why do I complain about the usage of these characters? Well, that requires an understanding of character sets, and of the difference between the 8-bit ISO-8859-* and windows-* character sets on one side and Unicode with its UTF encodings on the other. Unicode keeps the ISO-8859-1 layout for its first 256 code points, which means U+0080 through U+009F are reserved for control characters (translated: the damn question mark with a black box around it most of the time, or an empty box in other font families). However, crap character sets such as windows-1252 (CP1252), in the probable effort to keep every character to a single 8-bit byte instead of the two or more bytes UTF-8 needs for anything outside ASCII, use those codes for many of the aforementioned characters. For example, the ellipsis is U+2026 (the HTML entity would be &#8230;), but most people know it as "Alt+0133" (byte 0x85 in windows-1252, which then gets written as the HTML entity &#133;). If that number were a Unicode code point, it'd be U+0085. However, a quick look at the Unicode charts shows that U+0085 is a control character (NEL, "next line"): an unprintable character with a semantic meaning of its own.
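To see the mismatch for yourself, here is a small sketch using Python's built-in codecs. The variable names are mine; the byte 0x85 is the Alt+0133 ellipsis from the example above.

```python
import unicodedata

raw = b"\x85"  # the byte that Alt+0133 produces under windows-1252

as_cp1252 = raw.decode("cp1252")   # what the Word user meant
as_latin1 = raw.decode("latin-1")  # byte value taken as the code point

print(f"U+{ord(as_cp1252):04X}")        # U+2026, the ellipsis
print(f"U+{ord(as_latin1):04X}")        # U+0085
print(unicodedata.category(as_latin1))  # Cc, i.e. a control character
print(as_cp1252.encode("utf-8"))        # b'\xe2\x80\xa6', three bytes
```

Same byte, two different characters, depending entirely on which character set the reader assumes.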

So, why does this matter in the least? As I pointed out, Unicode and windows-1252 disagree about what the codes 0x80 through 0x9F mean, which is exactly where windows-1252 keeps its angled quotes, dashes, and ellipsis. The "solution" I've seen when it comes to windows-* data is a simple $data =~ s/[\x80-\xFF]/?/g;, and frankly, it annoys the piss out of me. I use the UTF-8 character set by default as the rest of my operating system works that way (even URIs work via UTF-8 regardless of the site's actual character set[s] used), but this quickly turns to annoyance when sites such as, you guessed it, these forums, don't define the character set used in their content. Even something as hacked together as vBulletin can easily be edited nearly anywhere to add the line "header('Content-Type: text/html; charset=windows-1252');" to avoid confusion when rendering the page. In fact, one can just as easily add a <meta /> tag like "<meta http-equiv='Content-Type' content='text/html; charset=windows-1252' />" to the <head>, so availability of a programming language is not an issue. In XML, it's as simple as adding the "encoding='utf-8'" attribute to the <?xml version='1.0'?> declaration.
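A less destructive alternative to replacing every high byte with a question mark is to decode the submission properly. The sketch below is one possible approach, not what any particular forum software does: the function name `decode_submitted` and the "try UTF-8 first, fall back to windows-1252" policy are assumptions of mine.

```python
def decode_submitted(data: bytes) -> str:
    """Decode user-submitted bytes, preferring UTF-8.

    Assumption: anything that is not valid UTF-8 was typed or pasted
    as windows-1252. The five bytes windows-1252 leaves undefined
    become U+FFFD instead of raising an error.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace")


# Curly quotes pasted from Word as windows-1252 bytes survive intact.
print(decode_submitted(b"\x93quoted\x94"))
```

Once the text is a proper Unicode string it can be stored as UTF-8 and served with a matching charset declaration, and nobody's quotes turn into question marks.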

In short, there are two things everyone should do.

  1. DON'T use Microsoft Word or any other word processing program to write internet content. Its auto-corrections turn common characters such as ", ', ..., and - into typographic equivalents that live in the 0x80-0x9F range of windows-1252, so when you copy/paste the result into a web form, you're gambling on whether or not the web developer bothered to do the following:
  2. DO specify a page's character set. If you're too lazy to actually convert and store user-submitted data in a character set such as UTF-8, then specify your character set as windows-1252, as it's up there among the top three character sets in use (US-ASCII, the entire ISO-8859 family + incompatible derivatives, Unicode/UTF-8). Otherwise, you'll also need to add the accept-charset="charset" attribute to your <form>s as well as the "Content-Type: text/html; charset=charset" header. Browsers submit form data in whatever character set they believe the page uses, and for a page that declares nothing that guess is frequently windows-1252, so don't skip the attribute. If you have output buffering (or equivalent) enabled, you can simply rewrite all instances of the <form> tag to include the accept-charset attribute, saving yourself the time of manually editing every form if they're hard-coded in the first place.
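The output-buffering trick from item 2 can be sketched roughly like this. It's a regex pass over the finished HTML, so it's a quick hack under stated assumptions rather than a real HTML parser: the name `add_accept_charset` is hypothetical, and an attribute value containing a literal > will confuse it.

```python
import re

# Match an opening <form ...> tag that has no accept-charset attribute yet.
FORM_TAG = re.compile(
    r"<form\b(?![^>]*\baccept-charset\s*=)([^>]*)>",
    re.IGNORECASE,
)


def add_accept_charset(html: str, charset: str = "utf-8") -> str:
    """Insert accept-charset into every <form> tag that lacks one."""
    return FORM_TAG.sub(
        lambda m: f'<form accept-charset="{charset}"{m.group(1)}>',
        html,
    )


page = '<form action="/post" method="post"><textarea></textarea></form>'
print(add_accept_charset(page))
```

Forms that already carry an accept-charset attribute are left alone, so running the pass twice does no harm.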

This was originally posted here (Something Awful archives account required).
