unicode.php
<?php

/******************************************************************************
*
* Filename:     Unicode.php
*
* Description:  Provides functions for handling Unicode strings in PHP without
*               needing to configure the non-default mbstring extension
*
* Author:       Evan Hunter
*
* Date:         27/7/2004
*
* Project:      JPEG Metadata
*
* Revision:     1.10
*
* Changes:      1.00 -> 1.10 : Added the following functions:
*                              smart_HTML_Entities
*                              smart_htmlspecialchars
*                              HTML_UTF16_UnEscape
*                              HTML_UTF8_UnEscape
*                              changed HTML_UTF8_Escape and HTML_UTF16_Escape to
*                              use smart_htmlspecialchars, so that characters which
*                              were already escaped would remain intact
*
*
* URL:          http://electronics.ozhiker.com
*
* License:      This file is part of the PHP JPEG Metadata Toolkit.
*
*               The PHP JPEG Metadata Toolkit is free software; you can
*               redistribute it and/or modify it under the terms of the
*               GNU General Public License as published by the Free Software
*               Foundation; either version 2 of the License, or (at your
*               option) any later version.
*
*               The PHP JPEG Metadata Toolkit is distributed in the hope
*               that it will be useful, but WITHOUT ANY WARRANTY; without
*               even the implied warranty of MERCHANTABILITY or FITNESS
*               FOR A PARTICULAR PURPOSE.  See the GNU General Public License
*               for more details.
*
*               You should have received a copy of the GNU General Public
*               License along with the PHP JPEG Metadata Toolkit; if not,
*               write to the Free Software Foundation, Inc., 59 Temple
*               Place, Suite 330, Boston, MA  02111-1307  USA
*
*               If you require a different license for commercial or other
*               purposes, please contact the author: evan@ozhiker.com
*
******************************************************************************/

// TODO: UTF-16 functions have not been tested fully

/******************************************************************************
*
* Unicode UTF-8 Encoding Functions
*
* Description:  UTF-8 is a Unicode encoding system in which extended characters
*               use only the upper half (128 values) of the byte range, thus it
*               allows the use of normal 7-bit ASCII text.
*               7-Bit ASCII will pass straight through UTF-8 encoding/decoding without change
*
*
* The encoding is as follows:
* Unicode Value          :  Binary representation (x=data bit)
*--------------------------------------------------------------------------------
* U-00000000 - U-0000007F:  0xxxxxxx    <- This is 7-bit ASCII
* U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
* U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
* U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
* U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
* U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
*--------------------------------------------------------------------------------
*
******************************************************************************/

/******************************************************************************
*
* Unicode UTF-16 Encoding Functions
*
* Description:  UTF-16 is a Unicode encoding system that uses 16 bit values for
*               representing characters.
*               It also has an extended set of characters available by the use
*               of surrogate pairs, which are a pair of 16 bit values, giving a
*               total data length of 20 useful bits.
*
*
* The encoding is as follows:
* Unicode Value    :  Binary representation (x=data bit)
*--------------------------------------------------------------------------------
* U-000000 - U-00D7FF:  xxxxxxxx xxxxxxxx
* U-00D800 - U-00DBFF:  Not available - used for high surrogate pairs
* U-00DC00 - U-00DFFF:  Not available - used for low surrogate pairs
* U-00E000 - U-00FFFF:  xxxxxxxx xxxxxxxx
* U-010000 - U-10FFFF:  110110ww wwxxxxxx 110111xx xxxxxxxx  ( wwww = (uni-0x10000)/0x10000 )
*--------------------------------------------------------------------------------
*
* Surrogate pair Calculations
*
* $hi = ($uni - 0x10000) / 0x400 + 0xD800;
* $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
*
*
* $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
*
*******************************************************************************/

/******************************************************************************
*
* Function:     UTF8_fix
*
* Description:  Checks a string for badly formed Unicode UTF-8 coding and
*               returns the same string containing only the parts which
*               were properly formed UTF-8 data.
*
* Parameters:   utf8_text - a string with possibly badly formed UTF-8 data
*
* Returns:      output - the well formed UTF-8 version of the string
*
******************************************************************************/

function UTF8_fix( $utf8_text )
{
        // Initialise the current position in the string
        $pos = 0;

        // Create a string to accept the well formed output
        $output = "";

        // Cycle through each group of bytes, ensuring the coding is correct
        while ( $pos < strlen( $utf8_text ) )
        {
                // Retrieve the current numerical character value
                $chval = ord( $utf8_text[$pos] );

                // Check what the first character is - it will tell us how many bytes the
                // Unicode value covers
                if ( ( $chval >= 0x00 ) && ( $chval <= 0x7F ) )
                {
                        // 1 Byte UTF-8 Unicode (7-Bit ASCII) Character
                        $bytes = 1;
                }
                else if ( ( $chval >= 0xC0 ) && ( $chval <= 0xDF ) )
                {
                        // 2 Byte UTF-8 Unicode Character
                        $bytes = 2;
                }
                else if ( ( $chval >= 0xE0 ) && ( $chval <= 0xEF ) )
                {
                        // 3 Byte UTF-8 Unicode Character
                        $bytes = 3;
                }
                else if ( ( $chval >= 0xF0 ) && ( $chval <= 0xF7 ) )
                {
                        // 4 Byte UTF-8 Unicode Character
                        $bytes = 4;
                }
                else if ( ( $chval >= 0xF8 ) && ( $chval <= 0xFB ) )
                {
                        // 5 Byte UTF-8 Unicode Character
                        $bytes = 5;
                }
                else if ( ( $chval >= 0xFC ) && ( $chval <= 0xFD ) )
                {
                        // 6 Byte UTF-8 Unicode Character
                        $bytes = 6;
                }
                else
                {
                        // Invalid Code - skip character and do nothing
                        $bytes = 0;
                        $pos++;
                }

                // Check that there is enough data remaining to read
                if ( ( $pos + $bytes - 1 ) < strlen( $utf8_text ) )
                {
                        // Cycle through the number of bytes specified,
                        // copying them to the output string
                        while ( $bytes > 0 )
                        {
                                $output .= $utf8_text[$pos];
                                $pos++;
                                $bytes--;
                        }
                }
                else
                {
                        break;
                }
        }

        // Return the result
        return $output;
}

/******************************************************************************
* End of Function:     UTF8_fix
******************************************************************************/

/******************************************************************************
*
* Function:     UTF16_fix
*
* Description:  Checks a string for badly formed Unicode UTF-16 coding and
*               returns the same string containing only the parts which
*               were properly formed UTF-16 data.
*
* Parameters:   utf16_text - a string with possibly badly formed UTF-16 data
*               MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
*                           False will cause processing as Little Endian UTF-16 (Intel, LSB first)
*
* Returns:      output - the well formed UTF-16 version of the string
*
******************************************************************************/

function UTF16_fix( $utf16_text, $MSB_first )
{
        // Initialise the current position in the string
        $pos = 0;

        // Create a string to accept the well formed output
        $output = "";

        // Cycle through each group of bytes, ensuring the coding is correct
        while ( $pos < strlen( $utf16_text ) )
        {
                // Retrieve the current numerical character value
                $chval1 = ord( $utf16_text[$pos] );

                // Skip over character just read
                $pos++;

                // Check if there is another character available
                if ( $pos < strlen( $utf16_text ) )
                {
                        // Another character is available - get it for the second half of the UTF-16 value
                        $chval2 = ord( $utf16_text[$pos] );
                }
                else
                {
                        // Error - no second byte to this UTF-16 value - end processing
                        break;
                }

                // Skip over character just read
                $pos++;

                // Calculate the 16 bit unicode value
                if ( $MSB_first )
                {
                        // Big Endian
                        $UTF16_val = $chval1 * 0x100 + $chval2;
                }
                else
                {
                        // Little Endian
                        $UTF16_val = $chval2 * 0x100 + $chval1;
                }

                if ( ( ( $UTF16_val >= 0x0000 ) && ( $UTF16_val <= 0xD7FF ) ) ||
                     ( ( $UTF16_val >= 0xE000 ) && ( $UTF16_val <= 0xFFFF ) ) )
                {
                        // Normal Character (Non Surrogate pair)
                        // Add it to the output
                        $output .= chr( $chval1 ) . chr( $chval2 );
                }
                else if ( ( $UTF16_val >= 0xD800 ) && ( $UTF16_val <= 0xDBFF ) )
                {
                        // High surrogate of a surrogate pair
                        // Now we need to read the low surrogate
                        // Check if there are another 2 bytes available
                        // ($pos already points past the high surrogate, so only
                        // the bytes at $pos and $pos + 1 are needed)
                        if ( ( $pos + 1 ) < strlen( $utf16_text ) )
                        {
                                // Another 2 bytes are available - get them
                                $chval3 = ord( $utf16_text[$pos] );
                                $chval4 = ord( $utf16_text[$pos + 1] );

                                // Calculate the second 16 bit unicode value
                                if ( $MSB_first )
                                {
                                        // Big Endian
                                        $UTF16_val2 = $chval3 * 0x100 + $chval4;
                                }
                                else