<!-- This HTML file has been created by texi2html 1.27
from library.texinfo on 3 March 1994 -->
<TITLE>The GNU C Library - Extended Characters</TITLE>
<P>Go to the <A HREF="library_5.html" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_5.html">previous</A>, <A HREF="library_7.html" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_7.html">next</A> section.<P>
<H1><A NAME="SEC66" HREF="library_toc.html#SEC66" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_toc.html#SEC66">Extended Characters</A></H1>
<P>
A number of languages use character sets that are larger than the range
of values of type <CODE>char</CODE>. Japanese and Chinese are probably the
most familiar examples.
<P>
The GNU C library includes support for two mechanisms for dealing with
extended character sets: multibyte characters and wide characters. This
chapter describes how to use these mechanisms, and the functions for
converting between them.
<A NAME="IDX330"></A>
<P>
The behavior of the functions in this chapter is affected by the current
locale for character classification--the <CODE>LC_CTYPE</CODE> category; see
section <A HREF="library_7.html#SEC79" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_7.html#SEC79">Categories of Activities that Locales Affect</A>. This choice of locale selects which multibyte
code is used, and also controls the meanings and characteristics of wide
character codes.
<P>
<H2><A NAME="SEC67" HREF="library_toc.html#SEC67" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_toc.html#SEC67">Introduction to Extended Characters</A></H2>
<P>
You can represent extended characters in either of two ways:
<P>
<UL>
<LI>
As <DFN>multibyte characters</DFN>, which can be embedded in an ordinary
string, an array of <CODE>char</CODE> objects. Their advantage is that many
programs and operating systems can handle occasional multibyte
characters scattered among ordinary ASCII characters, without any
change.
<P>
<A NAME="IDX331"></A>
<LI>
As <DFN>wide characters</DFN>, which are like ordinary characters except that
they occupy more bits. The wide character data type, <CODE>wchar_t</CODE>,
has a range large enough to hold extended character codes as well as
old-fashioned ASCII codes.
<P>
An advantage of wide characters is that each character is a single data
object, just like ordinary ASCII characters. There are a few
disadvantages:
<P>
<UL>
<LI>
Each existing program must be modified and recompiled to make it use
wide characters.
<P>
<LI>
Files of wide characters cannot be read by programs that expect ordinary
characters.
</UL>
</UL>
<P>
Typically, you use the multibyte character representation as part of the
external program interface, such as reading or writing text to files.
However, it's usually easier to manipulate strings containing extended
characters internally as arrays of <CODE>wchar_t</CODE>
objects, since the uniform representation makes most editing operations
easier. If you do use multibyte characters for files and wide
characters for internal operations, you need to convert between them
when you read and write data.
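<P>
As a rough illustration of this read, edit, write pattern, the sketch below
converts a multibyte string to wide characters with <CODE>mbstowcs</CODE>,
reverses it one character at a time, and converts the result back with
<CODE>wcstombs</CODE>. The helper name <CODE>reverse_multibyte</CODE> is
made up for this example, and the conversions follow whatever
<CODE>LC_CTYPE</CODE> locale is in effect when it is called.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Reverse the characters of the multibyte string SRC into DST, which
   has room for DST_SIZE bytes.  Returns 0 on success, -1 on failure.
   reverse_multibyte is a hypothetical helper, not a library function.  */
int
reverse_multibyte (const char *src, char *dst, size_t dst_size)
{
  assert (src != NULL && dst != NULL);

  /* A multibyte string never has more characters than bytes, so
     strlen (src) + 1 wide characters is always enough room.  */
  size_t max_chars = strlen (src) + 1;
  wchar_t *wide = malloc (max_chars * sizeof (wchar_t));
  if (wide == NULL)
    return -1;

  /* External (multibyte) to internal (wide) representation.  */
  size_t n = mbstowcs (wide, src, max_chars);
  if (n == (size_t) -1)
    {
      free (wide);
      return -1;                /* SRC contained an invalid sequence.  */
    }

  /* Edit the text one whole character at a time.  */
  for (size_t i = 0; i < n / 2; i++)
    {
      wchar_t tmp = wide[i];
      wide[i] = wide[n - 1 - i];
      wide[n - 1 - i] = tmp;
    }

  /* Internal (wide) back to external (multibyte) representation.  */
  size_t written = wcstombs (dst, wide, dst_size);
  free (wide);
  if (written == (size_t) -1 || written >= dst_size)
    return -1;                  /* Unconvertible, or DST too small.  */
  return 0;
}
```

Reversing the raw bytes instead would split any character that occupies
more than one byte, which is why the edit is done on the
<CODE>wchar_t</CODE> array.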
<P>
If your system supports extended characters, then it supports them both
as multibyte characters and as wide characters. The library includes
functions you can use to convert between the two representations.
These functions are described in this chapter.
<P>
<H2><A NAME="SEC68" HREF="library_toc.html#SEC68" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_toc.html#SEC68">Locales and Extended Characters</A></H2>
<P>
A computer system can support more than one multibyte character code,
and more than one wide character code. The user controls the choice of
codes through the current locale for character classification
(see section <A HREF="library_7.html#SEC76" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_7.html#SEC76">Locales and Internationalization</A>). Each locale specifies a particular multibyte
character code and a particular wide character code. The choice of locale
influences the behavior of the conversion functions in the library.
<P>
Some locales support neither wide characters nor nontrivial multibyte
characters. In these locales, the library conversion functions still
work, even though what they do is basically trivial.
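<P>
A small sketch of the trivial case, assuming the standard <CODE>"C"</CODE>
locale is one of these locales (it is on typical systems). The helper
<CODE>first_char_length_in_c_locale</CODE> is hypothetical; it selects the
locale itself only to keep the demonstration self-contained, which a real
program would normally do once at startup.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Return the number of bytes that make up the first character of the
   N bytes at S under the "C" locale, 0 if that character is the null
   character, or -1 if S does not start with a valid character.
   This helper is hypothetical, not a library function.  */
int
first_char_length_in_c_locale (const char *s, size_t n)
{
  wchar_t wc;

  assert (s != NULL);
  setlocale (LC_CTYPE, "C");    /* Select a locale with a trivial code.  */
  mbtowc (NULL, NULL, 0);       /* Reset any internal shift state.  */
  return mbtowc (&wc, s, n);
}
```

For an ASCII letter the conversion simply reports a length of one byte.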
<P>
If you select a new locale for character classification, the internal
shift state maintained by these functions can become confused, so it's
not a good idea to change the locale while you are in the middle of
processing a string.
<P>
<A NAME="IDX332"></A>
<H2><A NAME="SEC69" HREF="library_toc.html#SEC69" tppabs="http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_toc.html#SEC69">Multibyte Characters</A></H2>
<P>
In the ordinary ASCII code, a sequence of characters is a sequence of
bytes, and each character is one byte. This is very simple, but a
single byte allows for only 256 distinct characters.
<P>
In a <DFN>multibyte character code</DFN>, a sequence of characters is a
sequence of bytes, but each character may occupy one or more consecutive
bytes of the sequence.
<A NAME="IDX333"></A>
<P>
There are many different ways of designing a multibyte character code;
different systems use different codes. To specify a particular code
means designating the <DFN>basic</DFN> byte sequences--those which represent
a single character--and what characters they stand for. A code that a
computer can actually use must have a finite number of these basic
sequences, and typically none of them is more than a few bytes
long.
<P>
These sequences need not all have the same length. In fact, many of
them are just one byte long. Because the basic ASCII characters in the
range from <CODE>0</CODE> to <CODE>0177</CODE> are so important, they stand for
themselves in all multibyte character codes. That is to say, a byte
whose value is <CODE>0</CODE> through <CODE>0177</CODE> is always a character in
itself. The characters which are more than one byte must always start
with a byte in the range from <CODE>0200</CODE> through <CODE>0377</CODE>.
<P>
The byte value <CODE>0</CODE> can be used to terminate a string, just as it
is often used in a string of ASCII characters.
<P>
Specifying the basic byte sequences that represent single characters
automatically gives meanings to many longer byte sequences, as more than
one character. For example, if the two-byte sequence <CODE>0205 061</CODE>
stands for the Greek letter alpha, then <CODE>0205 061 0101</CODE> must stand
for an alpha followed by an <SAMP>`A'</SAMP> (ASCII code 0101), and <CODE>0205 061
0205 061</CODE> must stand for two alphas in a row.
<P>
If any byte sequence can have more than one meaning as a sequence of
characters, then the multibyte code is ambiguous--and no good. The
codes that systems actually use are all unambiguous.
<P>
In most codes, there are certain sequences of bytes that have no meaning
as a character or characters. These are called <DFN>invalid</DFN>.
<P>
The simplest possible multibyte code is a trivial one:
<P>
<BLOCKQUOTE>
The basic sequences consist of single bytes.
</BLOCKQUOTE>
<P>
This particular code is equivalent to not using multibyte characters at
all. It has no invalid sequences. But it can handle only 256 different
characters.
<P>
Here is another possible code which can handle 9376 different
characters:
<P>
<BLOCKQUOTE>
The basic sequences consist of
<P>
<UL>
<LI>
single bytes with values in the range <CODE>0</CODE> through <CODE>0237</CODE>.
<P>
<LI>
two-byte sequences, in which both of the bytes have values in the range
from <CODE>0240</CODE> through <CODE>0377</CODE>.
</UL>
</BLOCKQUOTE>
<P>
This code or a similar one is used on some systems to represent Japanese
characters. The invalid sequences are those which consist of an odd
number of consecutive bytes in the range from <CODE>0240</CODE> through
<CODE>0377</CODE>.
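<P>
The figure of 9376 and the rule for invalid sequences can be checked with
a short sketch. Both function names below are invented for this example,
and the scanner assumes the text is given as an array of bytes with an
explicit length.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the N bytes at S under the two-byte code
   described above.  Returns -1 if the bytes contain an invalid
   sequence.  count_chars_two_byte_code is a hypothetical helper.  */
long
count_chars_two_byte_code (const unsigned char *s, size_t n)
{
  long count = 0;
  size_t i = 0;

  assert (s != NULL || n == 0);
  while (i < n)
    {
      if (s[i] <= 0237)
        i += 1;                 /* A single byte is a whole character.  */
      else if (i + 1 < n && s[i + 1] >= 0240)
        i += 2;                 /* Two bytes in 0240..0377 form one.  */
      else
        return -1;              /* A high byte without a partner.  */
      count++;
    }
  return count;
}

/* Number of basic sequences: 160 single bytes (0 through 0237) plus
   96 * 96 two-byte pairs (0240 through 0377 is 96 values).  */
long
two_byte_code_size (void)
{
  long singles = 0237 + 1;
  long high = 0377 - 0240 + 1;
  return singles + high * high;
}
```

A run of three consecutive high bytes is rejected because the third one
has no partner, which matches the odd-number rule stated above.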
<P>
Here is another multibyte code which can handle more distinct extended
characters--in fact, almost thirty million:
<P>
<BLOCKQUOTE>
The basic sequences consist of
<P>
<UL>
<LI>
single bytes with values in the range <CODE>0</CODE> through <CODE>0177</CODE>.
<P>
<LI>
sequences of up to four bytes in which the first byte is in the range
from <CODE>0200</CODE> through <CODE>0237</CODE>, and the remaining bytes are in the
range from <CODE>0240</CODE> through <CODE>0377</CODE>.
</UL>
</BLOCKQUOTE>
<P>
In this code, any sequence that starts with a byte in the range
from <CODE>0240</CODE> through <CODE>0377</CODE> is invalid.
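<P>
The "almost thirty million" figure can be reproduced with a few lines of
arithmetic. This sketch assumes that "up to four bytes" means sequences of
two, three, or four bytes; the function name
<CODE>four_byte_code_size</CODE> is made up for the example.

```c
#include <assert.h>

/* Count the basic sequences of the code described above, assuming the
   multi-byte sequences are two, three, or four bytes long.
   four_byte_code_size is a hypothetical helper.  */
long
four_byte_code_size (void)
{
  const long singles = 0177 + 1;          /* 128 one-byte characters.  */
  const long leads = 0237 - 0200 + 1;     /* 32 possible first bytes.  */
  const long trails = 0377 - 0240 + 1;    /* 96 possible later bytes.  */
  long total = singles;
  long combos = leads;

  for (int len = 2; len <= 4; len++)
    {
      combos *= trails;         /* One more trailing byte.  */
      total += combos;
    }

  assert (total < 30000000L);   /* "Almost thirty million".  */
  return total;
}
```

Under that assumption the total is 128 + 32 * (96 + 96 * 96 + 96 * 96 * 96),
which is 28,609,664.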
<P>
And here is another variant which has the advantage that removing the
last byte or bytes from a valid character can never produce another
valid character. (This property is convenient when you want to search
strings for particular characters.)
<P>
<BLOCKQUOTE>
The basic sequences consist of
<P>
<UL>
<LI>
single bytes with values in the range <CODE>0</CODE> through <CODE>0177</CODE>.
<P>
<LI>
two-byte sequences in which the first byte is in the range from
<CODE>0200</CODE> through <CODE>0207</CODE>, and the second byte is in the range
from <CODE>0240</CODE> through <CODE>0377</CODE>.
<P>
<LI>
three-byte sequences in which the first byte is in the range from
<CODE>0210</CODE> through <CODE>0217</CODE>, and the other bytes are in the range
from <CODE>0240</CODE> through <CODE>0377</CODE>.
<P>
<LI>
four-byte sequences in which the first byte is in the range from
<CODE>0220</CODE> through <CODE>0227</CODE>, and the other bytes are in the range
from <CODE>0240</CODE> through <CODE>0377</CODE>.
</UL>
</BLOCKQUOTE>
<P>
The list of invalid sequences for this code is long and not worth
stating in full; examples of invalid sequences include <CODE>0240</CODE> and
<CODE>0220 0300 065</CODE>.
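<P>
Because the first byte alone determines the length of each character in
this variant, a validity check is easy to sketch. The names
<CODE>char_length</CODE> and <CODE>is_valid_variant_code</CODE> are
hypothetical, and the text is assumed to be an array of bytes with an
explicit length.

```c
#include <assert.h>
#include <stddef.h>

/* Return the total length of the character whose first byte is LEAD,
   or 0 if LEAD cannot start a character.  Hypothetical helper.  */
static size_t
char_length (unsigned char lead)
{
  if (lead <= 0177)
    return 1;
  if (lead >= 0200 && lead <= 0207)
    return 2;
  if (lead >= 0210 && lead <= 0217)
    return 3;
  if (lead >= 0220 && lead <= 0227)
    return 4;
  return 0;
}

/* Return 1 if the N bytes at S form a valid sequence of characters
   in the variant code described above, and 0 otherwise.  */
int
is_valid_variant_code (const unsigned char *s, size_t n)
{
  size_t i = 0;

  assert (s != NULL || n == 0);
  while (i < n)
    {
      size_t len = char_length (s[i]);
      if (len == 0 || len > n - i)
        return 0;               /* Bad first byte, or cut short.  */
      for (size_t k = 1; k < len; k++)
        if (s[i + k] < 0240)
          return 0;             /* Later bytes must be 0240..0377.  */
      i += len;
    }
  return 1;
}
```

Both examples of invalid sequences given above are rejected by this check:
<CODE>0240</CODE> cannot start a character, and <CODE>0220</CODE> promises
four bytes that never arrive.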
<P>
The number of <EM>possible</EM> multibyte codes is astronomical. But a