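utf8_encode
(PHP 3 >= 3.0.6, PHP 4, PHP 5)
utf8_encode -- Encodes an ISO-8859-1 string to UTF-8
Description
string utf8_encode ( string data )
This function encodes the string data to UTF-8 and returns the encoded version. UTF-8 is a standard mechanism used by Unicode for encoding wide character values into a byte stream. UTF-8 is transparent to plain ASCII characters, is self-synchronizing (meaning a program can work out where characters start in the byte stream) and can be used with normal string comparison functions for sorting and similar operations. PHP encodes UTF-8 characters in up to four bytes, like this:
Table 1. UTF-8 encoding
| bytes | bits | representation |
|---|---|---|
| 1 | 7 | 0bbbbbbb |
| 2 | 11 | 110bbbbb 10bbbbbb |
| 3 | 16 | 1110bbbb 10bbbbbb 10bbbbbb |
| 4 | 21 | 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb |
Each b represents a bit that can be used to store character data.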
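As a rough illustration of the table, here is a minimal sketch of what the conversion amounts to for ISO-8859-1 input. Because every ISO-8859-1 byte is a code point below 256, only the one-byte and two-byte rows are ever needed. The function name latin1_to_utf8_sketch is made up for this example and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): encodes ISO-8859-1 bytes by hand.
function latin1_to_utf8_sketch($data) {
    $out = '';
    $len = strlen($data);
    for ($i = 0; $i < $len; $i++) {
        $code = ord($data[$i]);
        if ($code < 0x80) {
            // 1-byte row: 0bbbbbbb, plain ASCII passes through unchanged
            $out .= $data[$i];
        } else {
            // 2-byte row: 110bbbbb 10bbbbbb
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

// 0xE9 is "é" in ISO-8859-1
echo bin2hex(latin1_to_utf8_sketch("\xE9")), "\n"; // prints c3a9
```

The three-byte and four-byte rows only matter for code points from U+0800 upward, which cannot occur in ISO-8859-1 data.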
28-Sep-2006 04:30
In reply to Cundle:
Note: The BOM is completely unnecessary in UTF-8. UTF-8 is interpreted the same way regardless of endianness, e.g. Λ (U+039B, GREEK CAPITAL LETTER LAMDA) is represented as the octets 0xCE, 0x9B, always in that order.
Extra note: UTF-16 and UCS-2 are different in this respect. The same letter would be encoded as 0x03 0x9B on big-endian (e.g. Motorola) architecture, but 0x9B 0x03 on little-endian (e.g. Intel) architecture.
But in any case, there's nothing wrong with putting a BOM at the beginning of a UTF-8 encoded file. It is just treated as U+FEFF Zero Width No-Break Space.
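The byte orders described in this note can be checked with a few lines. This is only a sketch: the UTF-8 bytes are computed by hand from the two-byte pattern, and pack() with the 'n' and 'v' formats stands in for a big-endian and a little-endian 16-bit encoder.

```php
<?php
$codepoint = 0x039B; // GREEK CAPITAL LETTER LAMDA

// UTF-8: 110bbbbb 10bbbbbb, one fixed byte order
$utf8 = chr(0xC0 | ($codepoint >> 6)) . chr(0x80 | ($codepoint & 0x3F));

// 16-bit code unit written big-endian ('n') and little-endian ('v')
$utf16be = pack('n', $codepoint);
$utf16le = pack('v', $codepoint);

echo bin2hex($utf8), ' ', bin2hex($utf16be), ' ', bin2hex($utf16le), "\n";
// prints: ce9b 039b 9b03
```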
James Cundle
18-Jul-2006 10:33
I had some difficulty finding a way to easily write UTF-8 files with the byte order mark included. This is the simple solution I have come up with:
<?php
function writeUTF8File($filename,$content) {
$dhandle=fopen($filename,"w");
# Now UTF-8 - Add byte order mark
fwrite($dhandle, pack("CCC",0xef,0xbb,0xbf));
fwrite($dhandle,$content);
fclose($dhandle);
}
?>
When you read the file back in using fopen, the BOM will also be there. To remove it, I also wrote the following function:
<?php
function removeBOM($str=""){
if(substr($str, 0,3) == pack("CCC",0xef,0xbb,0xbf)) {
$str=substr($str, 3);
}
return $str;
}
?>
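For anyone who wants to try the two functions together, here is a compact round trip along the same lines. It is a sketch only: it assumes PHP 5 (for file_put_contents), writes to a throwaway temporary file, and uses the literal "\xEF\xBB\xBF", which is the same three bytes as pack("CCC",0xef,0xbb,0xbf).

```php
<?php
$bom  = "\xEF\xBB\xBF";
$path = tempnam(sys_get_temp_dir(), 'bom'); // temporary file for the demo

// Write the BOM followed by the content, then read the raw bytes back
file_put_contents($path, $bom . "hello");
$raw = file_get_contents($path);

// Strip the BOM only if the data really starts with it
$text = (strncmp($raw, $bom, 3) === 0) ? substr($raw, 3) : $raw;

unlink($path);
echo strlen($raw), ' ', $text, "\n"; // prints: 8 hello
```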
rocketman
16-Mar-2006 08:46
If you are looking for a function to replace special characters with the hex UTF-8 value (e.g. for Webservice-Security/WSS4J compliance) you might use this:
$txt = "Gr\xF6\xDFe"; // "Größe" as ISO-8859-1 bytes
$utf8 = '';
$max = strlen($txt);
for ($i = 0; $i < $max; $i++) {
    if ($txt[$i] == "&") {
        $neu = "&#x26;";
    }
    elseif ((ord($txt[$i]) < 32) or (ord($txt[$i]) > 127)) {
        // utf8_encode() gives the UTF-8 bytes, urlencode() turns each byte into %XX
        $neu = urlencode(utf8_encode($txt[$i]));
        // rewrite every %XX as a hexadecimal character reference
        $neu = preg_replace('#\%(..)\%(..)\%(..)#','&#x\1;&#x\2;&#x\3;',$neu);
        $neu = preg_replace('#\%(..)\%(..)#','&#x\1;&#x\2;',$neu);
        $neu = preg_replace('#\%(..)#','&#x\1;',$neu);
    }
    else {
        $neu = $txt[$i];
    }
    $utf8 .= $neu;
} // for $i
$textnew = $utf8;
In this example $textnew will be "Gr&#xC3;&#xB6;&#xC3;&#x9F;e"
mailing at jcn50 dot com
21-Jan-2006 02:40
I recommend using this alternative for every language:
$new=mb_convert_encoding($s,"UTF-8","auto");
Don't forget to set all your pages to "utf-8" encoding, otherwise just use HTML entities.
jcn50.
migueldiaz at gennio dot com
14-Dec-2005 01:23
Here's my function to know if one string is encoded in UTF8.
If we encode in UTF8 a string or text file that is already encoded in UTF8, we can expect to find the byte 0x83 (the character 'ƒ', ALT+159) in the final string.
<?php
function isUTF8($string)
{
$string_utf8 = utf8_encode($string);
if( strpos($string_utf8,"\x83",0) !== false ) // "\x83" is 'ƒ' (ALT+159)
return true; // the original string was utf8
else
return false; // otherwise
}
?>
regards
Miguel Díaz
05-Nov-2005 06:34
// Reads a file story.txt ascii (as typed on keyboard)
// converts it to Georgian character using utf8 encoding
// if I am correct(?) just as it should be when typed on Georgian computer
// it outputs it as an html file
//
// http://www.comweb.nl/keys_to_georgian.html
// http://www.comweb.nl/keys_to_georgian.php
// http://www.comweb.nl/story.txt
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<TITLE>keys to unicode code</TITLE>
// this meta tag is needed
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
// note the Sylfaen font seems to be installed by default on Windows XP
// It supports Georgian
<style TYPE="text/css">
<!--
body {font-family:sylfaen; }
-->
</style>
</HEAD>
<BODY>
<?
$eng=array(97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,87,82,84,83,
67,74,90);
$geo=array(4304,4305,4330,4307,4308,4324,4306,4336,4312,4335,4313,
4314,4315,4316,4317,4318,4325,4320,4321,4322,4323,4309,
4332,4334,4327,4310,4333,4326,4311,4328,4329,4319,4331,
91,93,59,39,44,46,96);
$fc=file("story.txt");
foreach($fc as $line)
{
$spacestart=1;
for ($i=0; $i<strlen($line); $i+=1)
{
$character=ord(substr($line,$i,1));
$found=0;
for ($k=0; $k<count($eng); $k+=1)
{
if ($eng[$k]==$character)
{
print code2utf( $geo[$k] );
$found=1;
}
}
if ($found==0)
{
if ($character==126 || $character==32 || $character==10 || $character==9)
{
if ($character==9) { print ' '; }
if ($character==10) { print "<BR>\n"; }
if ($character==32)
{
if ($spacestart==1) {print '&nbsp;'; } else { print " "; }
}
if ($character==126){ print "~"; }
} else
{
print substr($line,$i,1);
}
}
if ($character!=32) { $spacestart=0; }
}
}
/**
* Function converts a Unicode code point number into the corresponding UTF-8 character.
* Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
*
* @param int $num
* @return utf8char
*/
function code2utf($num)
{
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
?>
</BODY>
</HTML>
Janci
04-Nov-2005 08:00
I was searching for a function similar to Javascript's unescape(). In most cases it is OK to use the urldecode() function, but not if you've got non-ASCII characters in the strings. Javascript's escape() converts them into %uXXXX sequences that urldecode() cannot handle.
I googled the net and found a function which actually converts these sequences into HTML entities (&#xxx;) that your browser can show correctly. If you're OK with that, the function can be found here: http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
But it was not OK with me because I needed a string in my charset to make some comparisons and other stuff. So I have modified the above function and, in conjunction with the code2utf() function mentioned in some other note here, I have managed to achieve my goal:
<?php
/**
* Function converts a Javascript escaped string back into a string with specified charset (default is UTF-8).
* Modified function from http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
*
* @param string $source escaped with Javascript's escape() function
* @param string $iconv_to destination character set, used as the second parameter of the iconv function. Default is UTF-8.
* @return string
*/
function unescape($source, $iconv_to = 'UTF-8') {
$decodedStr = '';
$pos = 0;
$len = strlen ($source);
while ($pos < $len) {
$charAt = substr ($source, $pos, 1);
if ($charAt == '%') {
$pos++;
$charAt = substr ($source, $pos, 1);
if ($charAt == 'u') {
// we got a unicode character
$pos++;
$unicodeHexVal = substr ($source, $pos, 4);
$unicode = hexdec ($unicodeHexVal);
$decodedStr .= code2utf($unicode);
$pos += 4;
}
else {
// we have an escaped ascii character
$hexVal = substr ($source, $pos, 2);
$decodedStr .= chr (hexdec ($hexVal));
$pos += 2;
}
}
else {
$decodedStr .= $charAt;
$pos++;
}
}
if ($iconv_to != "UTF-8") {
$decodedStr = iconv("UTF-8", $iconv_to, $decodedStr);
}
return $decodedStr;
}
/**
* Function converts a Unicode code point number into the corresponding UTF-8 character.
* Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
*
* @param int $num
* @return utf8char
*/
function code2utf($num){
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
?>
aktionimskript at gmx dot net
01-Sep-2005 11:52
If you want to pass variables as parameters to a Flash file, I prefer to convert the string with utf8_encode() [or preg_replace, or iconv] and then encode it with urlencode():
<?php
$yourstring="yourstring";
$str_utf8=utf8_encode($yourstring);
$str_encoded=urlencode($str_utf8);
echo "<script language='javascript'>";
echo "parameterForFlash='".$str_encoded."';";
echo "</script>";
?>
now you can use the variable (parameterForFlash) in your javascript (plugindetection), that writes the flash object/embed.
suttichai at ceforce dot com
29-May-2005 03:26
I use this function to convert Thai text (ISO-8859-11) to UTF-8. In my case it works properly. Please try this function if you have a problem converting the ISO-8859-11 charset to UTF-8.
function iso8859_11toUTF8($string) {
if ( ! preg_match('/[\xa1-\xff]/', $string) ) // nothing above 0xA0, nothing to convert
return $string;
$iso8859_11 = array(
"\xa1" => "\xe0\xb8\x81",
"\xa2" => "\xe0\xb8\x82",
"\xa3" => "\xe0\xb8\x83",
"\xa4" => "\xe0\xb8\x84",
"\xa5" => "\xe0\xb8\x85",
"\xa6" => "\xe0\xb8\x86",
"\xa7" => "\xe0\xb8\x87",
"\xa8" => "\xe0\xb8\x88",
"\xa9" => "\xe0\xb8\x89",
"\xaa" => "\xe0\xb8\x8a",
"\xab" => "\xe0\xb8\x8b",
"\xac" => "\xe0\xb8\x8c",
"\xad" => "\xe0\xb8\x8d",
"\xae" => "\xe0\xb8\x8e",
"\xaf" => "\xe0\xb8\x8f",
"\xb0" => "\xe0\xb8\x90",
"\xb1" => "\xe0\xb8\x91",
"\xb2" => "\xe0\xb8\x92",
"\xb3" => "\xe0\xb8\x93",
"\xb4" => "\xe0\xb8\x94",
"\xb5" => "\xe0\xb8\x95",
"\xb6" => "\xe0\xb8\x96",
"\xb7" => "\xe0\xb8\x97",
"\xb8" => "\xe0\xb8\x98",
"\xb9" => "\xe0\xb8\x99",
"\xba" => "\xe0\xb8\x9a",
"\xbb" => "\xe0\xb8\x9b",
"\xbc" => "\xe0\xb8\x9c",
"\xbd" => "\xe0\xb8\x9d",
"\xbe" => "\xe0\xb8\x9e",
"\xbf" => "\xe0\xb8\x9f",
"\xc0" => "\xe0\xb8\xa0",
"\xc1" => "\xe0\xb8\xa1",
"\xc2" => "\xe0\xb8\xa2",
"\xc3" => "\xe0\xb8\xa3",
"\xc4" => "\xe0\xb8\xa4",
"\xc5" => "\xe0\xb8\xa5",
"\xc6" => "\xe0\xb8\xa6",
"\xc7" => "\xe0\xb8\xa7",
"\xc8" => "\xe0\xb8\xa8",
"\xc9" => "\xe0\xb8\xa9",
"\xca" => "\xe0\xb8\xaa",
"\xcb" => "\xe0\xb8\xab",
"\xcc" => "\xe0\xb8\xac",
"\xcd" => "\xe0\xb8\xad",
"\xce" => "\xe0\xb8\xae",
"\xcf" => "\xe0\xb8\xaf",
"\xd0" => "\xe0\xb8\xb0",
"\xd1" => "\xe0\xb8\xb1",
"\xd2" => "\xe0\xb8\xb2",
"\xd3" => "\xe0\xb8\xb3",
"\xd4" => "\xe0\xb8\xb4",
"\xd5" => "\xe0\xb8\xb5",
"\xd6" => "\xe0\xb8\xb6",
"\xd7" => "\xe0\xb8\xb7",
"\xd8" => "\xe0\xb8\xb8",
"\xd9" => "\xe0\xb8\xb9",
"\xda" => "\xe0\xb8\xba",
"\xdf" => "\xe0\xb8\xbf",
"\xe0" => "\xe0\xb9\x80",
"\xe1" => "\xe0\xb9\x81",
"\xe2" => "\xe0\xb9\x82",
"\xe3" => "\xe0\xb9\x83",
"\xe4" => "\xe0\xb9\x84",
"\xe5" => "\xe0\xb9\x85",
"\xe6" => "\xe0\xb9\x86",
"\xe7" => "\xe0\xb9\x87",
"\xe8" => "\xe0\xb9\x88",
"\xe9" => "\xe0\xb9\x89",
"\xea" => "\xe0\xb9\x8a",
"\xeb" => "\xe0\xb9\x8b",
"\xec" => "\xe0\xb9\x8c",
"\xed" => "\xe0\xb9\x8d",
"\xee" => "\xe0\xb9\x8e",
"\xef" => "\xe0\xb9\x8f",
"\xf0" => "\xe0\xb9\x90",
"\xf1" => "\xe0\xb9\x91",
"\xf2" => "\xe0\xb9\x92",
"\xf3" => "\xe0\xb9\x93",
"\xf4" => "\xe0\xb9\x94",
"\xf5" => "\xe0\xb9\x95",
"\xf6" => "\xe0\xb9\x96",
"\xf7" => "\xe0\xb9\x97",
"\xf8" => "\xe0\xb9\x98",
"\xf9" => "\xe0\xb9\x99",
"\xfa" => "\xe0\xb9\x9a",
"\xfb" => "\xe0\xb9\x9b"
);
$string=strtr($string,$iso8859_11);
return $string;
}
Suttichai Mesaard-www.ceforce.com
bisqwit at iki dot fi
20-May-2005 04:15
For reference, it may be insightful to point out that:
utf8_encode($s)
is actually identical to:
recode_string('latin1..utf8', $s)
and:
iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.
If your string to be converted to utf-8 is something other than iso-8859-1 (such as iso-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() rather than trying to devise complex str_replace statements.
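To make the difference concrete, here is a small sketch using iconv(), assuming the iconv extension is available. The byte 0xB3 is "ł" in iso-8859-2 but "³" in iso-8859-1, so the source charset you name decides which UTF-8 bytes come out.

```php
<?php
$byte = "\xB3";

// Declared as iso-8859-2: LATIN SMALL LETTER L WITH STROKE (U+0142)
$asLatin2 = iconv('ISO-8859-2', 'UTF-8', $byte);

// Declared as iso-8859-1: SUPERSCRIPT THREE (U+00B3)
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

echo bin2hex($asLatin2), ' ', bin2hex($asLatin1), "\n"; // prints: c582 c2b3
```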
JF Sebastian
09-Apr-2005 06:54
The following Perl regular expression tests if a string is well-formed Unicode UTF-8 (Broken up after each | since long lines are not permitted here. Please join as a single line, no spaces, before use.):
^([\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
\xe0[\xa0-\xbf][\x80-\xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
\xed[\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
\xf0[\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
\xf4[\x80-\x8f][\x80-\xbf]{2})*$
NOTE: This strictly follows the Unicode standard 4.0, as described in chapter 3.9, table 3-6, "Well-formed UTF-8 byte sequences" ( http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703 ).
ISO-10646, a super-set of Unicode, uses UTF-8 (there called "UCS", see http://www.unicode.org/faq/utf_bom.html#1 ) in a relaxed variant that supports a 31-bit space encoded into up to six bytes instead of Unicode's 21 bits in up to four bytes. To check for ISO-10646 UTF-8, use the following Perl regular expression (again, broken up, see above):
^([\x00-\x7f]|
[\xc0-\xdf][\x80-\xbf]|
[\xe0-\xef][\x80-\xbf]{2}|
[\xf0-\xf7][\x80-\xbf]{3}|
[\xf8-\xfb][\x80-\xbf]{4}|
[\xfc-\xfd][\x80-\xbf]{5})*$
The following function may be used with above expressions for a quick UTF-8 test, e.g. to distinguish ISO-8859-1-data from UTF-8-data if submitted from a <form accept-charset="utf-8,iso-8859-1" method=..>.
function is_utf8($string) {
return (preg_match('/[insert regular expression here]/', $string) === 1);
}
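For convenience, here is one way the first (Unicode 4.0) expression might be assembled into a complete function. This is a sketch: the name is_wellformed_utf8 is made up here, the alternatives are joined with string concatenation instead of one long line, and the D modifier is added so that $ only matches at the very end of the string. Very long inputs may run into PCRE limits, in which case preg_match() returns false and the function reports false.

```php
<?php
// Hypothetical wrapper around the Unicode 4.0 table 3-6 expression.
function is_wellformed_utf8($string) {
    $pattern = '/^(?:'
        . '[\x00-\x7f]'                      // ASCII
        . '|[\xc2-\xdf][\x80-\xbf]'          // 2 bytes, no overlongs
        . '|\xe0[\xa0-\xbf][\x80-\xbf]'      // 3 bytes, no overlongs
        . '|[\xe1-\xec][\x80-\xbf]{2}'
        . '|\xed[\x80-\x9f][\x80-\xbf]'      // excludes surrogates
        . '|[\xee-\xef][\x80-\xbf]{2}'
        . '|\xf0[\x90-\xbf][\x80-\xbf]{2}'   // 4 bytes, no overlongs
        . '|[\xf1-\xf3][\x80-\xbf]{3}'
        . '|\xf4[\x80-\x8f][\x80-\xbf]{2}'   // up to U+10FFFF
        . ')*$/D';
    return preg_match($pattern, $string) === 1;
}

var_dump(is_wellformed_utf8("Gr\xC3\xB6\xC3\x9Fe")); // bool(true)
var_dump(is_wellformed_utf8("Gr\xF6\xDFe"));         // bool(false), ISO-8859-1 bytes
```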
http://iubito.free.fr
10-Mar-2005 03:57
Here's a function I made to know if one string or textfile is already encoded in UTF8 :
<?php
/**
* Returns <kbd>true</kbd> if the string or array of string is encoded in UTF8.
*
* Example of use. If you want to know if a file is saved in UTF8 format :
* <code> $array = file('one file.txt');
* $isUTF8 = isUTF8($array);
* if (!$isUTF8) --> we need to apply utf8_encode() to be in UTF8
* else --> we are in UTF8 :)
* </code>
* @param mixed A string, or an array from a file() function.
* @return boolean
*/
function isUTF8($string)
{
if (is_array($string))
{
$enc = implode('', $string);
return (substr($enc, 0, 3) === "\xEF\xBB\xBF"); // true only if the file starts with the UTF-8 BOM (239, 187, 191)
}
else
{
return (utf8_encode(utf8_decode($string)) == $string);
}
}
?>
Denis G.
24-Feb-2005 09:32
Snippet to convert ISO-8859-1 coded text to UTF-8:
$text = preg_replace_callback('/[\x80-\xff]/', function ($m) { $c = ord($m[0]); return pack("C*", ($c >> 6) | 0xc0, ($c & 0x3f) | 0x80); }, $text);
anonymous at anonymous dot com
25-Jan-2005 06:49
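A few bugs in your example code: the limits for two-byte and three-byte characters must be 2048 and 65536, not 1024 and 32768. Corrected: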
function code2utf($num){
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
schofei at yahoo dot de
11-Jan-2005 07:23
regarding the above code2utf function...
> romans at void dot lv
> 02-Oct-2002 09:59
> Here is optimized function which converts
> binary UTF symbol code into unicoded string....
Thanks for providing your nice conversion code, however due to some missing parentheses 4-byte UTF-8 chars are not converted properly.
Here is a corrected version of the code2utf function:
function code2utf($num){
if($num<128)return chr($num);
if($num<1024)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<32768)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
regards
Scho Fei
hrpeters (at) gmx (dot) net
14-Dec-2004 02:46
// Validate Unicode UTF-8 Version 4
// This function takes as reference the table 3.6 found at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
// It also flags overlong bytes as error
function is_validUTF8($str)
{
// values of -1 represent disallowed values for the first bytes in current UTF-8
static $trailing_bytes = array (
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
);
$ups = unpack('C*', $str);
if (!($aCnt = count($ups))) return true; // Empty string *is* valid UTF-8
for ($i = 1; $i <= $aCnt;)
{
if (!($tbytes = $trailing_bytes[($b1 = $ups[$i++])])) continue;
if ($tbytes == -1) return false;
$first = true;
while ($tbytes > 0 && $i <= $aCnt)
{
$cbyte = $ups[$i++];
if (($cbyte & 0xC0) != 0x80) return false;
if ($first)
{
switch ($b1)
{
case 0xE0:
if ($cbyte < 0xA0) return false;
break;
case 0xED:
if ($cbyte > 0x9F) return false;
break;
case 0xF0:
if ($cbyte < 0x90) return false;
break;
case 0xF4:
if ($cbyte > 0x8F) return false;
break;
default:
break;
}
$first = false;
}
$tbytes--;
}
if ($tbytes) return false; // incomplete sequence at EOS
}
return true;
}
Mark AT modernbill DOT com
10-Nov-2004 03:56
If you haven't guessed already: when decoding with utf8_decode(), if the UTF-8 character has no representation in the ISO-8859-1 codepage, a ? will be returned. You might want to wrap a function around this to make sure you aren't saving a bunch of ???? into your database.
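One possible shape for such a wrapper is a check that runs before decoding. The sketch below is only an illustration: fits_latin1 is a made-up name, and it relies on PCRE's u modifier, which makes preg_match() return false for malformed UTF-8.

```php
<?php
// Hypothetical check: true only if $utf8 is valid UTF-8 and every
// code point is in the ISO-8859-1 range U+0000..U+00FF.
function fits_latin1($utf8) {
    // 1 = found a code point above U+00FF, 0 = none, false = invalid UTF-8
    return preg_match('/[^\x{0}-\x{FF}]/u', $utf8) === 0;
}

var_dump(fits_latin1("Gr\xC3\xB6\xC3\x9Fe")); // bool(true)
var_dump(fits_latin1("\xE2\x82\xAC"));        // bool(false), the euro sign is U+20AC
```

A caller can then refuse to decode, or fall back to some other representation, when the check fails.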
Aidan Kehoe <php-manual at parhasard dot net>
30-Aug-2004 10:05
Here's some code that addresses the issue that Steven describes in the previous comment;
<?php
/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
the UTF-8 encoding of the non-control characters that Windows-1252 places
at the equivalent code points. */
$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
"\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
"\xc2\x83" => "\xc6\x92", /* LATIN SMALL LETTER F WITH HOOK */
"\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
"\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
"\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
"\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
"\xc2\x88" => "\xcb\x86", /* MODIFIER LETTER CIRCUMFLEX ACCENT */
"\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
"\xc2\x8a" => "\xc5\xa0", /* LATIN CAPITAL LETTER S WITH CARON */
"\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
"\xc2\x8c" => "\xc5\x92", /* LATIN CAPITAL LIGATURE OE */
"\xc2\x8e" => "\xc5\xbd", /* LATIN CAPITAL LETTER Z WITH CARON */
"\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
"\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
"\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
"\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
"\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
"\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
"\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
"\xc2\x98" => "\xcb\x9c", /* SMALL TILDE */
"\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
"\xc2\x9a" => "\xc5\xa1", /* LATIN SMALL LETTER S WITH CARON */
"\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
"\xc2\x9c" => "\xc5\x93", /* LATIN SMALL LIGATURE OE */
"\xc2\x9e" => "\xc5\xbe", /* LATIN SMALL LETTER Z WITH CARON */
"\xc2\x9f" => "\xc5\xb8" /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
);
function cp1252_to_utf8($str) {
global $cp1252_map;
return strtr(utf8_encode($str), $cp1252_map);
}
?>
steven -at- acko -dot- net
18-Aug-2004 05:45
Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft's Windows-1252 codepage contains ISO-8859-1, but it includes several characters in the range 0x80-0x9F whose codepoints in Unicode do not match the byte's value (in Unicode, codepoints U+0080 - U+009F are C1 control characters).
utf8_encode() simply assumes the byte's integer value is the codepoint number in Unicode.
E.g. in 1252, byte 0x80 is the euro sign, which is U+20AC. The same goes for curly quotes, em dashes, etc.
utf8_encode() will convert 0x80 into U+0080 (a control character) rather than U+20AC.
To convert 1252 to UTF-8, use iconv, recode or mbstring.
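A quick way to see the difference, assuming the iconv extension is available and knows the charset name 'Windows-1252' (some builds may want 'CP1252' instead):

```php
<?php
$byte = "\x80"; // euro sign in Windows-1252

// Correct source charset: U+20AC
$fromCp1252 = iconv('Windows-1252', 'UTF-8', $byte);

// Treated as ISO-8859-1, which is what utf8_encode() assumes: U+0080
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

echo bin2hex($fromCp1252), ' ', bin2hex($fromLatin1), "\n"; // prints: e282ac c280
```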
Net Raven
25-Jun-2004 03:58
I often need to convert multi-language text sent to me for use in websites and other apps into UTF8 so I can insert it into source code and databases.
I knocked up a small web page with its charset set to UTF8, then set it up so I can paste from the original doc (e.g. Word or Excel) and have the page return the UTF8 encoded version.
Of course the browser will convert the Unicode to UTF8 for you as part of the submit (I use IE5 or better for this), then all you have to do in the PHP is encode the UTF8 so the browser will show it in its raw form.
It's a bit bulky but I just convert ALL characters to HTML numbered entities (brute force and ignorance does it again).
I've used this to encode everything from Hebrew to Japanese without problems.
<?php
header("Content-Type: text/html; charset=utf-8");
// The pasted text arrives as UTF-8 bytes because the form page itself is UTF-8
$code = isset($_POST['code']) ? $_POST['code'] : '';
?>
<html>
<head>
<title>UTF8 ENCODER PAGE</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<form method=post action="?seed=<?=time()?>">
Original Unicode<br />
<textarea name="code" cols="80" rows="10"><?=$code?></textarea><br />
Encoded UTF8<br />
<textarea name="encd" cols="80" rows="10"><?php
// one numbered entity per byte of the submitted UTF-8 text
for ($i = 0; $i < strlen($code); $i++) {
echo '&#'.ord(substr($code,$i,1)).';';
}
?></textarea><br />
<input type="submit" value="encode">
</form>
</body>
</html>
lorro at lorro dot wigner dot bme dot hu
06-Apr-2004 10:12
Good news is that utf8_encode (like UTF-8) passes '<', '>', '/', '\'', '"', etc., so you are free to utf8_encode complete blocks of html text that includes tags.
Bad news is that utf8_encode is not idempotent: utf8_encode(utf8_encode($str)) != utf8_encode($str) whenever $str contains bytes above 0x7F, which is most of the cases. What you can do is write utf8_ensure like:
function utf8_ensure($str) {
return seems_utf8($str)? $str: utf8_encode($str);
}
Comes in handy when your view library tries to encode the same text multiple times.
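The double-encoding effect is easy to reproduce. This sketch uses iconv('ISO-8859-1', 'UTF-8', ...) as a stand-in for utf8_encode(), which another note on this page describes as identical, and assumes the iconv extension is available.

```php
<?php
$latin1 = "\xE9"; // "é" in ISO-8859-1

$once  = iconv('ISO-8859-1', 'UTF-8', $latin1); // correct UTF-8
$twice = iconv('ISO-8859-1', 'UTF-8', $once);   // each UTF-8 byte encoded again

echo bin2hex($once), ' ', bin2hex($twice), "\n"; // prints: c3a9 c383c2a9
```

Pure ASCII strings come out unchanged no matter how many times they are encoded, which is why the problem often goes unnoticed at first.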
bmorel at ssi dot fr
18-Feb-2004 05:22
Here is an improved version of that function, compatible with 31-bit encoding scheme of Unicode 3.x :
<?php
function seems_utf8($Str) {
for ($i=0; $i<strlen($Str); $i++) {
if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
else return false; # Does not match any model
for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
?>
bmorel at ssi dot fr
16-Feb-2004 04:28
Here is a simple function that can help, if you want to know if a string could be UTF-8 or not :
<?php
function seems_utf8($Str) {
for ($i=0; $i<strlen($Str); $i++) {
if (ord($Str[$i]) < 0x80) $n=0; # 0bbbbbbb
elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
elseif ((ord($Str[$i]) & 0xF0) == 0xF0) $n=3; # 1111bbbb
else return false; # Does not match any model
for ($j=0; $j<$n; $j++) { # n octets that match 10bbbbbb follow ?
if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80)) return false;
}
}
return true;
}
?>
Karen
02-Oct-2003 03:33
Re the previous post about converting GB2312 code to Unicode code which displayed the following function:
<?
// Program by sadly (www.phpx.com)
function gb2unicode($gb)
{
if(!trim($gb))
return $gb;
$filename="gb2312.txt";
$tmp=file($filename);
$codetable=array();
while(list($key,$value)=each($tmp))
$codetable[hexdec(substr($value,0,6))]=substr($value,9,4);
$utf="";
while($gb)
{
if (ord(substr($gb,0,1))>127)
{
$this=substr($gb,0,2);
$gb=substr($gb,2,strlen($gb));
$utf.="&#x".$codetable[hexdec(bin2hex($this))-0x8080].";";
}
else
{
$gb=substr($gb,1,strlen($gb));
$utf.=substr($gb,0,1);
}
}
return $utf;
}
?>
I found that a small change was needed in the code to properly handle latin characters embedded in the middle of gb2312 text, as when the text includes a URL or email address. Just reverse the two lines in the part of the statement above that handles ord values not greater than 127.
Change:
$gb=substr($gb,1,strlen($gb));
$utf.=substr($gb,0,1);
to:
$utf.=substr($gb,0,1);
$gb=substr($gb,1,strlen($gb));
In the original function, the first latin character was dropped and it was not converting the first non-latin character after the latin text (everything was shifted one character too far to the right). Reversing those two lines makes it work correctly in every example I have tried.
Also, the source of the gb2312.txt file needed for this to work has changed. You can find it a couple places:
http://tcl.apache.org/sources/tcl/tools/encoding/gb2312.txt
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
artem at w510 dot tm dot odessa dot ua
03-Jun-2003 10:10
Loading variables in flash
You can lose a lot of hours if your charset is not iso-8859-1 and you can't see your characters in flash.
You must use iconv instead, with your codepage
(for example windows-1251 for Ukrainian, Russian)
$fw = fopen("flash_input.txt", "w");
if( $fw )
{
$utf = iconv("windows-1251","UTF-8",$variable_value);
$out = 'variable_name='.$utf;
fputs($fw, $out);
fclose($fw);
}
and no urlencode is needed if you save data in file!
mualem_i at hotmail dot com
22-May-2003 09:12
Hebrew!! What a language. I had some trouble placing the Hebrew in a javascript based drop down menu, the text appeared as UTF8 so I made this function to overcome the problem (Not talking about efficiency)
function rtf_heb($string)
{
$array = explode(" ", $string);
foreach ($array as $VAL)
{
$VAL = str_replace("א","",$VAL);
$VAL = str_replace("ב","",$VAL);
$VAL = str_replace("ג","",$VAL);
$VAL = str_replace("ד","",$VAL);
$VAL = str_replace("ה","",$VAL);
$VAL = str_replace("ו","",$VAL);
$VAL = str_replace("ז","",$VAL);
$VAL = str_replace("ח","",$VAL);
$VAL = str_replace("ט","",$VAL);
$VAL = str_replace("י","",$VAL);
$VAL = str_replace("כ","",$VAL);
$VAL = str_replace("ל","",$VAL);
$VAL = str_replace("מ","",$VAL);
$VAL = str_replace("נ","",$VAL);
$VAL = str_replace("ס","",$VAL);
$VAL = str_replace("ע","",$VAL);
$VAL = str_replace("פ","",$VAL);
$VAL = str_replace("צ","",$VAL);
$VAL = str_replace("ק","",$VAL);
$VAL = str_replace("ר","",$VAL);
$VAL = str_replace("ש","",$VAL);
$VAL = str_replace("ת","",$VAL);
$VAL = str_replace("ך","",$VAL);
$VAL = str_replace("ף","",$VAL);
$VAL = str_replace("ן","",$VAL);
$VAL = str_replace("ם","",$VAL);
$VAL = str_replace("ץ","",$VAL);
$VAL = str_replace(";","",$VAL);
$send_VAR .= $VAL." ";
}
return $send_VAR;
}
RoyLaw at 263 dot Net
19-May-2003 07:16
Here is a function for converting GB2312 code to Unicode code. It may be useful for programming on XML/WML in non-English languages.
<?
// Program by sadly (www.phpx.com)
function gb2unicode($gb)
{
if(!trim($gb))
return $gb;
$filename="gb2312.txt";
$tmp=file($filename);
$codetable=array();
while(list($key,$value)=each($tmp))
$codetable[hexdec(substr($value,0,6))]=substr($value,9,4);
$utf="";
while($gb)
{
if (ord(substr($gb,0,1))>127)
{
$this=substr($gb,0,2);
$gb=substr($gb,2,strlen($gb));
$utf.="&#x".$codetable[hexdec(bin2hex($this))-0x8080].";";
}
else
{
$gb=substr($gb,1,strlen($gb));
$utf.=substr($gb,0,1);
}
}
return $utf;
}
?>
This function requires a code list of gb2312,you can download it at
ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/GB/GB2312.TXT
sunish_mv at rediffmail dot com
04-Apr-2003 02:50
Here I have a class that will convert an ISCII (Indian Standard Code for Information Interchange) Devanagari (Hindi) string to a Unicode string.
<?php
class iscii2utf8 {
var $map;
function iscii2utf8() {
$this->map = array (
"a0" => '63' ,
"a1" => '2305' ,
"a2" => '2306' ,
"a3" => '2307' ,
"a4" => '2309' ,
"a5" => '2310' ,
"a6" => '2311' ,
"a7" => '2312' ,
"a8" => '2313' ,
"a9" => '2314' ,
"aa" => '2315' ,
"ab" => '2318' ,
"ac" => '2319' ,
"ad" => '2320' ,
"ae" => '2317' ,
"af" => '2322' ,
"b0" => '2323' ,
"b1" => '2324' ,
"b2" => '2321' ,
"b3" => '2325' ,
"b4" => '2326' ,
"b5" => '2327' ,
"b6" => '2328' ,
"b7" => '2329' ,
"b8" => '2330' ,
"b9" => '2331' ,
"ba" => '2332' ,
"bb" => '2333' ,
"bc" => '2334' ,
"bd" => '2335' ,
"be" => '2336' ,
"bf" => '2337' ,
"c0" => '2338' ,
"c1" => '2339' ,
"c2" => '2340' ,
"c3" => '2341' ,
"c4" => '2342' ,
"c5" => '2343' ,
"c6" => '2344' ,
"c7" => '2345' ,
"c8" => '2346' ,
"c9" => '2347' ,
"ca" => '2348' ,
"cb" => '2349' ,
"cc" => '2350' ,
"cd" => '2351' ,
"ce" => '2399' ,
"cf" => '2352' ,
"d0" => '2353' ,
"d1" => '2354' ,
"d2" => '2355' ,
"d3" => '2356' ,
"d4" => '2357' ,
"d5" => '2358' ,
"d6" => '2359' ,
"d7" => '2360' ,
"d8" => '2361' ,
"d9" => '63' ,
"da" => '2366' ,
"db" => '2367' ,
"dc" => '2368' ,
"dd" => '2369' ,
"de" => '2370' ,
"df" => '2371' ,
"e0" => '2374' ,
"e1" => '2375' ,
"e2" => '2376' ,
"e3" => '2373' ,
"e4" => '2378' ,
"e5" => '2379' ,
"e6" => '2380' ,
"e7" => '2377' ,
"e8" => '2381' ,
"e9" => '63' ,
"ea" => '2404' ,
"eb" => '63' ,
"ec" => '63' ,
"ed" => '63' ,
"ee" => '63' ,
"ef" => '63' ,
"f0" => '63' ,
"f1" => '2406' ,
"f2" => '2407' ,
"f3" => '2408' ,
"f4" => '2409' ,
"f5" => '2410' ,
"f6" => '2411' ,
"f7" => '2412' ,
"f8" => '2413' ,
"f9" => '2414' ,
"fa" => '2415' ,
"fb" => '63' ,
"fc" => '63' ,
"fd" => '63' ,
"fe" => '63' ,
"ff" => '63' ,);
}
function code2utf($num){
//Returns the utf string corresponding to the unicode value
//courtesy - romans@void.lv
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
return '';
}