文字

html_entity_decode

(PHP 4 >= 4.3.0, PHP 5)

html_entity_decodeConvert all HTML entities to their applicable characters

说明

string html_entity_decode ( string $string [, int $flags = ENT_COMPAT | ENT_HTML401 [, string $encoding = ini_get("default_charset") ]] )

html_entity_decode() is the opposite of htmlentities() in that it converts all HTML entities in the string to their applicable characters.

More precisely, this function decodes all the entities (including all numeric entities) that a) are necessarily valid for the chosen document type — i.e., for XML, this function does not decode named entities that might be defined in some DTD — and b) whose character or characters are in the coded character set associated with the chosen encoding and are permitted in the chosen document type. All other entities are left as is.

参数

string

The input string.

flags

A bitmask of one or more of the following flags, which specify how to handle quotes and which document type to use. The default is ENT_COMPAT | ENT_HTML401.

Available flags constants
Constant Name Description
ENT_COMPAT Will convert double-quotes and leave single-quotes alone.
ENT_QUOTES Will convert both double and single quotes.
ENT_NOQUOTES Will leave both double and single quotes unconverted.
ENT_HTML401 Handle code as HTML 4.01.
ENT_XML1 Handle code as XML 1.
ENT_XHTML Handle code as XHTML.
ENT_HTML5 Handle code as HTML 5.
encoding

An optional argument defining the encoding used when converting characters.

If omitted, the default value of the encoding varies depending on the PHP version in use. In PHP 5.6 and later, the default_charset configuration option is used as the default value. PHP 5.4 and 5.5 will use UTF-8 as the default. Earlier versions of PHP use ISO-8859-1.

Although this argument is technically optional, you are highly encouraged to specify the correct value for your code if you are using PHP 5.5 or earlier, or if your default_charset configuration option may be set incorrectly for the given input.

支持以下字符集:

支持的字符集列表
字符集 别名 描述
ISO-8859-1 ISO8859-1 西欧,Latin-1
ISO-8859-5 ISO8859-5 Little used cyrillic charset (Latin/Cyrillic).
ISO-8859-15 ISO8859-15 西欧,Latin-9。增加欧元符号,法语和芬兰语字母在 Latin-1(ISO-8859-1) 中缺失。
UTF-8   ASCII 兼容的多字节 8 位 Unicode。
cp866 ibm866, 866 DOS 特有的西里尔编码。本字符集在 4.3.2 版本中得到支持。
cp1251 Windows-1251, win-1251, 1251 Windows 特有的西里尔编码。本字符集在 4.3.2 版本中得到支持。
cp1252 Windows-1252, 1252 Windows 特有的西欧编码。
KOI8-R koi8-ru, koi8r 俄语。本字符集在 4.3.2 版本中得到支持。
BIG5 950 繁体中文,主要用于中国台湾省。
GB2312 936 简体中文,中国国家标准字符集。
BIG5-HKSCS   繁体中文,附带香港扩展的 Big5 字符集。
Shift_JIS SJIS, 932 日语
EUC-JP EUCJP 日语
MacRoman   Mac OS 使用的字符串。
''   An empty string activates detection from script encoding (Zend multibyte), default_charset and current locale (see nl_langinfo() and setlocale() ), in this order. Not recommended.

Note: 其他字符集没有认可。将会使用默认编码并抛出异常。

返回值

Returns the decoded string.

更新日志

版本 说明
5.6.0 The default value for the encoding parameter was changed to be the value of the default_charset configuration option.
5.4.0 Default encoding changed from ISO-8859-1 to UTF-8.
5.4.0 The constants ENT_HTML401 , ENT_XML1 , ENT_XHTML and ENT_HTML5 were added.

范例

Example #1 Decoding HTML entities

<?php
$orig 
"I'll \"walk\" the <b>dog</b> now" ;

$a  htmlentities ( $orig );

$b  html_entity_decode ( $a );

echo 
$a // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now

echo  $b // I'll "walk" the <b>dog</b> now
?>

注释

Note:

You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string, that's because the '&nbsp;' entity is not ASCII code 32 (which is stripped by trim() ) but ASCII code 160 (0xa0) in the default ISO 8859-1 encoding.

参见

  • htmlentities() - Convert all applicable characters to HTML entities
  • htmlspecialchars() - Convert special characters to HTML entities
  • get_html_translation_table() - 返回使用 htmlspecialchars 和 htmlentities 后的转换表
  • urldecode() - 解码已编码的 URL 字符串

用户评论:

[#1] txnull [2015-08-25 08:06:37]

Use the following to decode all entities:
<?php html_entity_decode($string, ENT_QUOTES | ENT_XML1, 'UTF-8') ?>

I've checked these special entities: 
- double quotes (&#34;)
- single quotes (&#39; and &apos;) 
- non printable chars (e.g. &#13;)
With other $flags some or all won't be decoded.

It seems that ENT_XML1 and ENT_XHTML are identical when decoding.

[#2] Benjamin [2013-04-05 11:07:20]

The following function decodes named and numeric HTML entities and works on UTF-8. Requires iconv.

function decodeHtmlEnt($str) {
$ret = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
$p2 = -1;
for(;;) {
$p = strpos($ret, '&#', $p2+1);
if ($p === FALSE)
break;
$p2 = strpos($ret, ';', $p);
if ($p2 === FALSE)
break;

if (substr($ret, $p+2, 1) == 'x')
$char = hexdec(substr($ret, $p+3, $p2-$p-3));
else
$char = intval(substr($ret, $p+2, $p2-$p-2));

//echo "$char\n";
$newchar = iconv(
'UCS-4', 'UTF-8',
chr(($char>>24)&0xFF).chr(($char>>16)&0xFF).chr(($char>>8)&0xFF).chr($char&0xFF) 
);
//echo "$newchar<$p<$p2<<\n";
$ret = substr_replace($ret, $newchar, $p, 1+$p2-$p);
$p2 = $p + strlen($newchar);
}
return $ret;
}

[#3] Victor [2011-12-14 06:22:21]

We were having very peculiar behavior regarding foreign characters such as e-acute.

However, it was only showing up as a problem when extracting those characters out of our mysql database and when being displayed through a proxy server of ours that handles dns issues.

As other users have made a note of, the default character setting wasn't what they were expecting it to be when they left theirs blank.

When we changed our default_charset to "UTF-8", our problems and needs for using functions like these were no longer necessary in handling foreign characters such as e-acute. Good enough for us!

[#4] Martin [2011-06-26 04:37:00]

If you need something that converts &#[0-9]+ entities to UTF-8, this is simple and works:

<?php

echo $output;
?>

[#5] neurotic dot neu at gmail dot com [2010-08-10 12:25:39]

This is a safe rawurldecode with utf8 detection:

<?php
function utf8_rawurldecode($raw_url_encoded){
    
$enc rawurldecode($raw_url_encoded);
    if(
utf8_encode(utf8_decode($enc))==$enc){;
        return 
rawurldecode($raw_url_encoded);
    }else{
        return 
utf8_encode(rawurldecode($raw_url_encoded));
    }
}
?>

[#6] Free at Key dot no [2010-07-01 05:51:46]

Handy function to convert remaining HTML-entities into human readable chars (for entities which do not exist in target charset):

<?php
function cleanString($in,$offset=null)
{
    
$out trim($in);
    if (!empty(
$out))
    {
        
$entity_start strpos($out,'&',$offset);
        if (
$entity_start === false)
        {
            
// ideal
            
return $out;    
        }
        else
        {
            
$entity_end strpos($out,';',$entity_start);
            if (
$entity_end === false)
            {
                 return 
$out;
            }
            
// zu lang um eine entity zu sein
            
else if ($entity_end $entity_start+7)
            {
                 
// und weiter gehts
                 
$out cleanString($out,$entity_start+1);
            }
            
// gottcha!
            
else
            {
                 
$clean substr($out,0,$entity_start);
                 
$subst substr($out,$entity_start+1,1);
                 
// &scaron; => "s" / &#353; => "_"
                 
$clean .= ($subst != "#") ? $subst "_";
                 
$clean .= substr($out,$entity_end+1);
                 
// und weiter gehts
                 
$out cleanString($clean,$entity_start+1);
            }
        }
    }
    return 
$out;
}
?>

[#7] Matt Robinson [2009-09-06 14:11:04]

I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'. 

If you don't want a UTF-8 string, you'll need to convert it afterward with something like utf8_decode(), iconv(), or mb_convert_encoding(). 

If you're producing XML, which doesn't recognise most HTML entities:

When producing a UTF-8 document (the default), then htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8') (because you only need to escape < and > and & unless you're printing inside the XML tags themselves).

Otherwise, either convert all the named entities to numeric ones, or declare the named entities in the document's DTD. The full list of 252 entities can be found in the HTML 4.01 Spec, or you can cut and paste the function from my site (http://inanimatt.com/php-convert-entities.php).

[#8] marion at figmentthinking dot com [2009-03-10 05:11:47]

I just ran into the:
Bug #27626 html_entity_decode bug - cannot yet handle MBCS in html_entity_decode()!

The simple solution if you're still running PHP 4 is to wrap the html_entity_decode() function with the utf8_decode() function.

<?php
$string 
'&nbsp;';
$utf8_encode utf8_encode(html_entity_decode($string));
?>


By default html_entity_decode() returns the ISO-8859-1 character set, and by default utf8_decode()...

http://us.php.net/manual/en/function.utf8-decode.php
"Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1"

[#9] jl dot garcia at gmail dot com [2009-03-04 15:33:37]

I created this function to filter all the text that goes in or comes out of the database.

<?php
function filter_string($string$nohtml=''$save='') {
    if(!empty(
$nohtml)) {
        
$string trim($string);
        if(!empty(
$save)) $string htmlentities(trim($string), ENT_QUOTES'ISO-8859-15');
        else 
$string html_entity_decode($stringENT_QUOTES'ISO-8859-15');
    }
    if(!empty(
$save)) $string mysql_real_escape_string($string);
    else 
$string stripslashes($string);
    return(
$string);
}
?>

[#10] kae at verens dot com [2008-05-09 06:11:07]

the references to 'chr()' in the example unhtmlentities() function should be changed to unichr, using the example unichr() function described in the 'chr' reference (http://php.net/chr).

the reason for this is characters such as &#x20AC; which do not break down into an ASCII number (that's the Euro, by the way).

[#11] me at richardsnazell dot com [2008-01-21 04:19:15]

I had a problem getting the 'TM' trademark symbol to display correctly in an email subject line. Using html_entity_decode() with different charsets didn't work, but directly replacing the entity with it's ASCII equivalent did:

$subject = str_replace('&trade;', chr(153), $subject);

[#12] jojo [2006-11-03 20:27:32]

The decipherment does the character encoded by the escape function of JavaScript. 
When the multi byte is used on the page, it is effective. 

javascript escape('aa????aa') ..... 'aa%u3042%u3042aa'
php  jsEscape_decode('aa%u3042%u3042aa')..'aa????aa'

<?php
function jsEscape_decode($jsEscaped,$outCharCode='SJIS'){
    
$arrMojis explode("%u",$jsEscaped);
    for (
$i 1;$i count($arrMojis);$i++){
        
$c substr($arrMojis[$i],0,4);
        
$cc mb_convert_encoding(pack('H*',$c),$outCharCode,'UTF-16');
        
$arrMojis[$i] = substr_replace($arrMojis[$i],$cc,0,4);
    }
    return 
implode('',$arrMojis);
}
?>

[#13] grvg (at) free (dot) fr [2006-07-29 09:44:44]

Here is the ultimate functions to convert HTML entities to UTF-8?:
The main function is?htmlentities2utf8
Others are helper functions

<?php
function chr_utf8($code)
    {
        if (
$code 0) return false;
        elseif (
$code 128) return chr($code);
        elseif (
$code 160// Remove Windows Illegals Cars
        
{
            if (
$code==128$code=8364;
            elseif (
$code==129$code=160// not affected
            
elseif ($code==130$code=8218;
            elseif (
$code==131$code=402;
            elseif (
$code==132$code=8222;
            elseif (
$code==133$code=8230;
            elseif (
$code==134$code=8224;
            elseif (
$code==135$code=8225;
            elseif (
$code==136$code=710;
            elseif (
$code==137$code=8240;
            elseif (
$code==138$code=352;
            elseif (
$code==139$code=8249;
            elseif (
$code==140$code=338;
            elseif (
$code==141$code=160// not affected
            
elseif ($code==142$code=381;
            elseif (
$code==143$code=160// not affected
            
elseif ($code==144$code=160// not affected
            
elseif ($code==145$code=8216;
            elseif (
$code==146$code=8217;
            elseif (
$code==147$code=8220;
            elseif (
$code==148$code=8221;
            elseif (
$code==149$code=8226;
            elseif (
$code==150$code=8211;
            elseif (
$code==151$code=8212;
            elseif (
$code==152$code=732;
            elseif (
$code==153$code=8482;
            elseif (
$code==154$code=353;
            elseif (
$code==155$code=8250;
            elseif (
$code==156$code=339;
            elseif (
$code==157$code=160// not affected
            
elseif ($code==158$code=382;
            elseif (
$code==159$code=376;
        }
        if (
$code 2048) return chr(192 | ($code >> 6)) . chr(128 | ($code 63));
        elseif (
$code 65536) return chr(224 | ($code >> 12)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code 63));
        else return 
chr(240 | ($code >> 18)) . chr(128 | (($code >> 12) & 63)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code 63));
    }

    
// Callback for preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $str);
    
function html_entity_replace($matches)
    {
        if (
$matches[2])
        {
            return 
chr_utf8(hexdec($matches[3]));
        } elseif (
$matches[1])
        {
            return 
chr_utf8($matches[3]);
        }
        switch (
$matches[3])
        {
            case 
"nbsp": return chr_utf8(160);
            case 
"iexcl": return chr_utf8(161);
            case 
"cent": return chr_utf8(162);
            case 
"pound": return chr_utf8(163);
            case 
"curren": return chr_utf8(164);
            case 
"yen": return chr_utf8(165);
            
//... etc with all named HTML entities
        
}
        return 
false;
    }
    
    function 
htmlentities2utf8 ($string// because of the html_entity_decode() bug with UTF-8
    
{
        
$string preg_replace_callback('~&(#(x?))?([^;]+);~''html_entity_replace'$string);
        return 
$string;
    }
?>

[#14] florianborn (at) yahoo (dot) de [2005-07-20 03:43:11]

Note that

<?php

 
echo urlencode(html_entity_decode("&nbsp;"));

?>


will output "%A0" instead of "+".

[#15] php dot net at c dash ovidiu dot tk [2005-03-18 00:37:29]

Quick & dirty code that translates numeric entities to UTF-8.

<?php

    
function replace_num_entity($ord)
    {
        
$ord $ord[1];
        if (
preg_match('/^x([0-9a-f]+)$/i'$ord$match))
        {
            
$ord hexdec($match[1]);
        }
        else
        {
            
$ord intval($ord);
        }
        
        
$no_bytes 0;
        
$byte = array();

        if (
$ord 128)
        {
            return 
chr($ord);
        }
        elseif (
$ord 2048)
        {
            
$no_bytes 2;
        }
        elseif (
$ord 65536)
        {
            
$no_bytes 3;
        }
        elseif (
$ord 1114112)
        {
            
$no_bytes 4;
        }
        else
        {
            return;
        }

        switch(
$no_bytes)
        {
            case 
2:
            {
                
$prefix = array(31192);
                break;
            }
            case 
3:
            {
                
$prefix = array(15224);
                break;
            }
            case 
4:
            {
                
$prefix = array(7240);
            }
        }

        for (
$i 0$i $no_bytes$i++)
        {
            
$byte[$no_bytes $i 1] = (($ord & (63 pow(2$i))) / pow(2$i)) & 63 128;
        }

        
$byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

        
$ret '';
        for (
$i 0$i $no_bytes$i++)
        {
            
$ret .= chr($byte[$i]);
        }

        return 
$ret;
    }

    
$test 'This is a &#269;&#x5d0; test&#39;';

    echo 
$test "<br />\n";
    echo 
preg_replace_callback('/&#([0-9a-fx]+);/mi''replace_num_entity'$test);

?>

[#16] daniel at brightbyte dot de [2004-11-13 18:12:00]

This function seems to have to have two limitations (at least in PHP 4.3.8):

a) it does not work with multibyte character codings, such as UTF-8
b) it does not decode numeric entity references

a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, than convert to UTF-8 again. But that's quite ugly and detroys all characters not present in Latin-1.

b) can be solved rather nicely using the following code:

<?php
function decode_entities($text) {
    
$texthtml_entity_decode($text,ENT_QUOTES,"ISO-8859-1"); #NOTE: UTF-8 does not work!
    
$textpreg_replace('/&#(\d+);/me',"chr(\\1)",$text); #decimal notation
    
$textpreg_replace('/&#x([a-f0-9]+);/mei',"chr(0x\\1)",$text);  #hex notation
    
return $text;
}
?>


HTH

[#17] aidan at php dot net [2004-09-14 00:57:35]

This functionality is now implemented in the PEAR package PHP_Compat.

More information about using this function without upgrading your version of PHP can be found on the below link:

http://pear.php.net/package/PHP_Compat

上一篇: 下一篇: