The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?
Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?
I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.
But it won't. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.
So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for six months in a submarine. I swear I will.
And one more thing:
IT'S NOT THAT HARD.
In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article.
Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I'm really just trying to set a minimum bar here so that everyone can understand what's going on and can write code that has a hope of working with text in any language other than the subset of English that doesn't include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time, so today it's character sets.
A Historical Perspective
The easiest way to understand this stuff is to go chronologically.
You probably think I'm going to talk about very old character sets like EBCDIC here. Well, I won't. EBCDIC is not relevant to your life. We don't have to go that far back in time.
Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
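If you want to see this for yourself, here is a minimal C sketch (my illustration, not something from the original article) that prints a few ASCII characters, their code values, and that spare high bit, which is always zero for genuine ASCII:

#include <stdio.h>

int main(void) {
    /* A few ASCII characters: every code fits in 7 bits, so the high
       bit of an 8-bit byte is always free. */
    char samples[] = { ' ', 'A', 'a', '~' };
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)samples[i];
        printf("'%c' = %3d (0x%02X), high bit = %d\n", c, c, c, (c & 0x80) ? 1 : 0);
    }
    return 0;
}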
And all was good, assuming you were an English speaker.
Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.
Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.
Meanwhile, in Asia, even crazier things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into eight bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.
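To see why moving backwards was the painful part, here is a rough sketch in C. It is not Windows' actual AnsiNext/AnsiPrev, and the lead-byte ranges are Shift-JIS-style values used purely for illustration; the point is that stepping forward only needs to look at the current byte, while stepping backward has to rescan from the start of the string:

/* Illustrative only: Shift-JIS-style lead-byte ranges. */
static int is_lead_byte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF);
}

/* Moving forward is easy: look at the current byte to see how far to skip. */
const char *dbcs_next(const char *s) {
    return s + (is_lead_byte((unsigned char)*s) ? 2 : 1);
}

/* Moving backward is not: a trail byte can look just like an ordinary
   single-byte character, so the only safe approach is to walk forward from
   the beginning and remember where the previous character started. */
const char *dbcs_prev(const char *start, const char *s) {
    const char *prev = start;
    for (const char *p = start; p < s; p = dbcs_next(p))
        prev = p;
    return prev;
}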
But still, most people just pretended that a byte was a character and a character was 8 bits, and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.
Unicode
Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.
In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.
Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:
A -> 0100 0001
In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.
In Unicode, the letter A is a platonic ideal. It's just floating in heaven:
A
This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from "a" in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter's shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don't have to worry about it. They've figured it all out already.
Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or by visiting the Unicode web site.
There is no real limit on the number of letters that Unicode can define, and in fact they have gone beyond 65,536, so not every Unicode letter can really be squeezed into two bytes, but that was a myth anyway.
OK, so say we have a string:
Hello
which, in Unicode, corresponds to these five code points:
U+0048 U+0065 U+006C U+006C U+006F.
Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message.
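Just to underline that point, here is a tiny C sketch (mine, not the article's) that treats those five code points as nothing more than numbers in an array; no decision has been made yet about bytes:

#include <stdio.h>

int main(void) {
    /* The code points for "Hello", stored as plain integers. */
    unsigned int hello[] = { 0x0048, 0x0065, 0x006C, 0x006C, 0x006F };
    for (int i = 0; i < 5; i++)
        printf("U+%04X ", hello[i]);
    printf("\n");
    return 0;
}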
Encodings
That's where encodings come in.
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Right? Not so fast! Couldn't it also be:
48 00 65 00 6C 00 6C 00 6F 00 ?
Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the baroque convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
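Here is a small C sketch of what that looks like in practice. It is only an illustration (real code would use a library): it dumps "Hello" as UCS-2 with a byte order mark, first big-endian, then little-endian, so you can see where the FE FF / FF FE and the swapped bytes come from:

#include <stdio.h>

/* Print a byte order mark followed by each 16-bit code unit, byte by byte. */
static void dump_ucs2(const unsigned short *s, int n, int big_endian) {
    unsigned short bom = 0xFEFF;
    for (int i = 0; i <= n; i++) {
        unsigned short u = (i == 0) ? bom : s[i - 1];
        unsigned char hi = (unsigned char)(u >> 8), lo = (unsigned char)(u & 0xFF);
        if (big_endian)
            printf("%02X %02X ", hi, lo);
        else
            printf("%02X %02X ", lo, hi);
    }
    printf("\n");
}

int main(void) {
    unsigned short hello[] = { 0x0048, 0x0065, 0x006C, 0x006C, 0x006F };
    dump_ucs2(hello, 5, 1);  /* FE FF 00 48 00 65 00 6C 00 6C 00 6F */
    dump_ucs2(hello, 5, 0);  /* FF FE 48 00 65 00 6C 00 6C 00 6F 00 */
    return 0;
}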
For a while it seemed like that might be good enough, but programmers were complaining. "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn't have minded guzzling twice the number of bytes. But those Californian wimps couldn't bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who's going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.
Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings.)
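Here is a minimal sketch, in C, of the encoding rules just described. It is my illustration, not code from the article, and it covers sequences up to four bytes (the original design allowed up to six, but no assigned code point needs more than four):

#include <stdio.h>

/* Encode one Unicode code point as UTF-8; returns the number of bytes written. */
static int utf8_encode(unsigned long cp, unsigned char out[4]) {
    if (cp < 0x80) {                       /* 0-127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {               /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {             /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                               /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void) {
    /* "Hello" plus the Arabic letter Ain (U+0639) from earlier. */
    unsigned long cps[] = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x639 };
    for (int i = 0; i < 6; i++) {
        unsigned char buf[4];
        int n = utf8_encode(cps[i], buf);
        for (int j = 0; j < n; j++)
            printf("%02X ", buf[j]);
    }
    printf("\n");  /* prints: 48 65 6C 6C 6F D8 B9 -- the ASCII bytes are untouched */
    return 0;
}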
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-bytes methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those Unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek encoding, or the Hebrew ANSI encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
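To make the question-mark behavior concrete, here is a short C sketch (mine, for illustration). It relies on the fact that Latin-1 happens to match the first 256 Unicode code points exactly, so anything above U+00FF simply has no slot and degrades to ?:

#include <stdio.h>

/* Convert one Unicode code point to Latin-1, substituting '?' when the
   encoding has no equivalent character. */
static unsigned char to_latin1(unsigned long cp) {
    return (cp <= 0xFF) ? (unsigned char)cp : '?';
}

int main(void) {
    /* "Hello", then é (U+00E9), then the Arabic letter Ain (U+0639). */
    unsigned long cps[] = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0xE9, 0x639 };
    for (int i = 0; i < 7; i++)
        putchar(to_latin1(cps[i]));
    putchar('\n');  /* prints Helloé? -- the é assumes a Latin-1 terminal; the Ain becomes ? */
    return 0;
}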
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form
Content-Type: text/plain; charset="UTF-8"
For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself, not in the HTML itself, but as one of the response headers that are sent before the HTML page.
This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages, all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn't exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don't.
For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it, like so: L"Hello".
When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That's the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.
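The article doesn't show the conversion code itself, but on Windows the step from a wide (UCS-2/UTF-16) string to UTF-8 could look roughly like this sketch using the Win32 WideCharToMultiByte call; take it as an illustration of the idea rather than CityDesk's actual implementation:

#include <windows.h>
#include <wchar.h>
#include <stdio.h>

int main(void) {
    /* A wide-character literal, as described above. */
    const wchar_t *wide = L"Hello";

    /* First ask how many bytes the UTF-8 form needs (including the
       terminating null), then perform the conversion.  Error handling
       is omitted to keep the sketch short. */
    int needed = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    char utf8[256];  /* plenty for this example */
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, needed, NULL, NULL);

    printf("wide chars = %u, UTF-8 bytes (with null) = %d, text = %s\n",
           (unsigned)wcslen(wide), needed, utf8);
    return 0;
}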
This article is getting rather long, and I can't possibly cover everything there is to know about character encodings and Unicode, but I hope that if you've read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.
Source: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/