XssHtml: A Whitelist-Based XSS Filter Class for Rich Text

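La la la, I went to Beijing for the Honor 6 launch event. It really is a nice phone, so let me shamelessly plug it here. Everyone who attended received an Honor 6, so here are my impressions: the CPU is genuinely strong and benchmarks high; the price is reasonable, with 2000 continuing Honor's usual value for money; the perks are great, since Chinanet in my dorm is now free to use; the camera is really good, with panorama mode capturing the whole Bird's Nest clearly, and quick capture is impressive too, as pressing volume-down twice with the screen off takes a shot in 0.6 seconds; and the touchscreen feels great, scrolling without any stutter.

OK. Before leaving for Beijing I submitted an article to FreeBuf, so let me post it on the blog as well.

I covered rich-text XSS in some detail in an earlier article (http://www.freebuf.com/articles/web/30201.html), which described the XSS filters used by several open-source applications and how to bypass them. I had also summarized the weaknesses of those filters, and I have now built an XSS filter class on a whitelist mechanism, hoping to prevent rich-text XSS more thoroughly.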

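The shortcomings of existing XSS filters can be summarized as follows:

Tags are filtered by blacklist, but the list is incomplete, missing tags such as <svg>, <object>, and <input>.
Attributes are filtered by blacklist, but the list is incomplete, missing attributes such as onfocus and onfocusin.
Pseudo-protocols are not fully handled, for example <a href=javascript:alert(1)>. Sometimes the filter only strips keywords such as script, which can always be bypassed with character encoding.
Keyword filtering is too naive. For example, replacing script with an empty string lets scrscriptipt through. Likewise, decoding character entities back to their original characters in a single pass lets nested character entities through.
IE quirks are poorly understood. For example, expression can have \ inserted in the middle, and under IE7 /**/ can be inserted, to bypass the filter.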
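
Two of the pitfalls above are easy to reproduce. The following is a minimal sketch, not code from any real filter: the function name naive_strip is made up for illustration. It shows how single-pass keyword removal and single-pass entity decoding can each be bypassed.

```php
<?php
// Hypothetical naive filter: removes the keyword "script" in a single pass.
function naive_strip($input) {
    return str_ireplace('script', '', $input);
}

// Removing the inner "script" glues the outer pieces back together.
$bypass = naive_strip('<scrscriptipt>alert(1)</scrscriptipt>');
echo $bypass, "\n";

// A keyword check on the raw string misses an entity-encoded pseudo-protocol...
$href = 'java&#115;cript:alert(1)';
$rawHasKeyword = stripos($href, 'javascript:') !== false;
// ...while decoding the entities reveals the real value.
$decoded = html_entity_decode($href, ENT_QUOTES, 'UTF-8');
echo $decoded, "\n";

// Decoding once is not enough when the entity itself is nested:
// the result still contains an entity that a second decode would expand.
$nested = html_entity_decode('java&amp;#115;cript:alert(1)', ENT_QUOTES, 'UTF-8');
echo $nested, "\n";
```

Both cases come down to the same mistake: the filter transforms the input once and then trusts the result without checking it again.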

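The rich-text editors offered to ordinary users generally provide only common features, such as images (emoticons), hyperlinks, bold, italic, font size, font face, color, and horizontal rules. So we can write a rich-text filter on the whitelist principle: filter the features editors use most, and discard every other tag and attribute, which filters out XSS.

So the design of my XssHtml class is as follows: first use strip_tags to remove tags that are outside the whitelist or malformed, then load the HTML into a DOM with the DOMDocument class. Walk the DOM, delete attributes outside the whitelist, and check every href link, prepending http:// to any that does not start with http:// or https://.

Finally, export the filtered DOM back to HTML and return it.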
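
The three steps can be sketched in isolation as follows. This is a simplified illustration, not the class itself: the one-tag whitelist and the shortened attribute list are assumptions chosen to keep the example small.

```php
<?php
// Step 1: strip_tags drops tags outside the allow-list but keeps their text
// content, and leaves attributes on the surviving tags untouched.
$dirty = '<p onclick="alert(1)">hi</p><script>alert(2)</script>';
$step1 = strip_tags($dirty, '<p>');
// $step1 is now: <p onclick="alert(1)">hi</p>alert(2)

// Step 2: load into a DOM and delete every attribute outside the allow-list.
$allowedAttr = array('title', 'href', 'class'); // shortened list, for illustration only
$dom = new DOMDocument();
@$dom->loadHTML($step1);
foreach ($dom->getElementsByTagName('*') as $node) {
    $remove = array();
    foreach ($node->attributes as $attr) {
        if (!in_array($attr->nodeName, $allowedAttr)) {
            $remove[] = $attr->nodeName;
        }
    }
    // Remove after the loop so the attribute map is not modified while iterating.
    foreach ($remove as $name) {
        $node->removeAttribute($name);
    }
}

// Step 3: export, then strip the wrapper tags (html, body) that DOMDocument added.
$clean = trim(strip_tags($dom->saveHTML(), '<p>'));
echo $clean, "\n";
```

Step 1 alone is clearly not enough, since the onclick handler survives strip_tags; it is the attribute pass over the DOM that removes it.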

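This approach has several advantages:

The class is simple to use: create the object and call one method to get the filtered result.
It is whitelist-based: anything not explicitly allowed is discarded, so cases I did not anticipate are still covered by default.
It uses PHP's built-in DOMDocument class to process the HTML, which copes well with irregular markup.
It is object-oriented, so if you later want to support another tag, the tag-specific code can call the methods that are already written.

There is one drawback: the filter does not cover IE6 and below. IE6 has too many odd quirks, and handling them would seriously hurt the filter's effectiveness and performance, so I did not take IE6-specific behavior into account.

Overall, this should be good news for the many programmers who are not familiar with security.

The class is not long, so here it is: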

<?php
/**
 * PHP rich-text XSS filter class
 *
 * @package XssHtml
 * @version 1.0.0 
 * @link http://phith0n.github.io/XssHtml
 * @since 20140621
 * @copyright (c) Phithon All Rights Reserved
 *
 */

#
# Written by Phithon <root@leavesongs.com> in 2014 and placed in
# the public domain.
#
# Written by phithon <root@leavesongs.com> on 2014-06-21
# From: XDSEC <www.xdsec.org> & 离别歌 <www.leavesongs.com>
# Usage: 
# <?php
# require('xsshtml.class.php');
# $html = '<html code>';
# $xss = new XssHtml($html);
# $html = $xss->getHtml();
# ?\>
# 
# Requirements:
# PHP Version > 5.0
# Browsers: IE7+ or any other browser; XSS in IE6 and below cannot be prevented
# More usage options: http://phith0n.github.io/XssHtml

class XssHtml {
    private $m_dom;
    private $m_xss;
    private $m_ok;
    private $m_AllowAttr = array('title', 'src', 'href', 'id', 'class', 'style', 'width', 'height', 'alt', 'target', 'align');
    private $m_AllowTag = array('a', 'img', 'br', 'strong', 'b', 'code', 'pre', 'p', 'div', 'em', 'span', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'table', 'ul', 'ol', 'tr', 'th', 'td', 'hr', 'li', 'u');
    /**
     * Constructor
     *
     * @param string $html Text to filter
     * @param string $charset Text encoding, utf-8 by default
     * @param array $AllowTag Allowed tags. Keep the default if unsure; it already covers most features. Do not add dangerous tags.
     */
    public function __construct($html, $charset = 'utf-8', $AllowTag = array()){
        $this->m_AllowTag = empty($AllowTag) ? $this->m_AllowTag : $AllowTag;
        $this->m_xss = strip_tags($html, '<' . implode('><', $this->m_AllowTag) . '>');
        if ($this->m_xss === '') { // strict check: empty() would also discard the valid input "0"
            $this->m_ok = FALSE;
            return ;
        }
        $this->m_xss = "<meta http-equiv=\"Content-Type\" content=\"text/html;charset={$charset}\">" . $this->m_xss;
        $this->m_dom = new DOMDocument();
        $this->m_dom->strictErrorChecking = FALSE;
        $this->m_ok = @$this->m_dom->loadHTML($this->m_xss);
    }

    /**
     * Return the filtered content
     */
    public function getHtml()
    {
        if (!$this->m_ok) {
            return '';
        }
        $nodeList = $this->m_dom->getElementsByTagName('*');
        for ($i = 0; $i < $nodeList->length; $i++){
            $node = $nodeList->item($i);
            if (in_array($node->nodeName, $this->m_AllowTag)) {
                if (method_exists($this, "__node_{$node->nodeName}")) {
                    call_user_func(array($this, "__node_{$node->nodeName}"), $node);
                }else{
                    call_user_func(array($this, '__node_default'), $node);
                }
            }
        }
        return strip_tags($this->m_dom->saveHTML(), '<' . implode('><', $this->m_AllowTag) . '>');
    }

    // Return the URL unchanged if it starts with http:// or https://; otherwise prepend http://.
    private function __true_url($url){
        if (preg_match('#^https?://.+#is', $url)) {
            return $url;
        }else{
            return 'http://' . $url;
        }
    }

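    // Return the node's style value with backslashes, "&#", CSS comment markers,
    // and any spelling of "expression" replaced by spaces.
    private function __get_style($node){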
        if ($node->attributes->getNamedItem('style')) {
            $style = $node->attributes->getNamedItem('style')->nodeValue;
            $style = str_replace('\\', ' ', $style);
            $style = str_replace(array('&#', '/*', '*/'), ' ', $style);
            $style = preg_replace('#e.*x.*p.*r.*e.*s.*s.*i.*o.*n#Uis', ' ', $style);
            return $style;
        }else{
            return '';
        }
    }

    private function __get_link($node, $att){
        $link = $node->attributes->getNamedItem($att);
        if ($link) {
            return $this->__true_url($link->nodeValue);
        }else{
            return '';
        }
    }

    private function __setAttr($dom, $attr, $val){
        if (!empty($val)) {
            $dom->setAttribute($attr, $val);
        }
    }

    private function __set_default_attr($node, $attr, $default = '')
    {
        $o = $node->attributes->getNamedItem($attr);
        if ($o) {
            $this->__setAttr($node, $attr, $o->nodeValue);
        }else{
            $this->__setAttr($node, $attr, $default);
        }
    }

    // Remove attributes outside the whitelist, then re-set the sanitized style, title, id and class.
    private function __common_attr($node)
    {
        $list = array();
        foreach ($node->attributes as $attr) {
            if (!in_array($attr->nodeName, 
                $this->m_AllowAttr)) {
                $list[] = $attr->nodeName;
            }
        }
        foreach ($list as $attr) {
            $node->removeAttribute($attr);
        }
        $style = $this->__get_style($node);
        $this->__setAttr($node, 'style', $style);
        $this->__set_default_attr($node, 'title');
        $this->__set_default_attr($node, 'id');
        $this->__set_default_attr($node, 'class');
    }

    private function __node_img($node){
        $this->__common_attr($node);

        $this->__set_default_attr($node, 'src');
        $this->__set_default_attr($node, 'width');
        $this->__set_default_attr($node, 'height');
        $this->__set_default_attr($node, 'alt');
        $this->__set_default_attr($node, 'align');

    }

    private function __node_a($node){
        $this->__common_attr($node);
        $href = $this->__get_link($node, 'href');

        $this->__setAttr($node, 'href', $href);
        $this->__set_default_attr($node, 'target', '_blank');
    }

    private function __node_embed($node){
        $this->__common_attr($node);
        $link = $this->__get_link($node, 'src');

        $this->__setAttr($node, 'src', $link);
        $this->__setAttr($node, 'allowscriptaccess', 'never');
        $this->__set_default_attr($node, 'width');
        $this->__set_default_attr($node, 'height');
    }

    private function __node_default($node){
        $this->__common_attr($node);
    }
}

?>
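
To see what the URL and style checks do to hostile input, the logic of __true_url and __get_style can be copied into standalone functions and run directly. The names demo_true_url and demo_clean_style are made up for this sketch; the bodies mirror the private methods above.

```php
<?php
// Standalone copy of the __true_url logic (demo_true_url is a made-up name).
function demo_true_url($url) {
    if (preg_match('#^https?://.+#is', $url)) {
        return $url;
    }
    // Anything else, including javascript: URLs, is turned into an http:// URL.
    return 'http://' . $url;
}

// Standalone copy of the __get_style clean-up (demo_clean_style is a made-up name).
function demo_clean_style($style) {
    $style = str_replace('\\', ' ', $style);                    // backslash escapes
    $style = str_replace(array('&#', '/*', '*/'), ' ', $style); // entities, CSS comments
    return preg_replace('#e.*x.*p.*r.*e.*s.*s.*i.*o.*n#Uis', ' ', $style);
}

$a = demo_true_url('javascript:alert(1)');                  // pseudo-protocol is neutralized
$b = demo_true_url('https://www.leavesongs.com/');          // normal link is left alone
$c = demo_clean_style('width: expr/**/ession(alert(1))');   // comment-split expression is removed
$d = demo_clean_style('color: red');                        // harmless style is left alone

echo $a, "\n", $b, "\n", $c, "\n", $d, "\n";
```

One consequence of this design is that relative links are prefixed with http:// as well, so the check effectively assumes that editors produce absolute URLs.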

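For detailed usage, see http://phith0n.github.io/XssHtml/, which has full instructions.

I have also set up a test page on my own host that uses this class. I hope someone can find bugs and help improve the filter. The address is http://xsshtml.leavesongs.com/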
