ngx_http_html_sanitize_module - It sanitizes HTML with whitelisted elements, whitelisted attributes, and whitelisted CSS properties. It is based on Google's gumbo-parser (github /google/gumbo-parser) as the HTML5 parser and hackers-painters' katana-parser (github /hackers-painters/katana-parser) as the inline CSS parser.
The following example nginx configuration is based on https://dev.w3.org/html5/html-author/#the-elements:
server {
listen 8888;
location = /sanitize {
# Explicitly set utf-8 encoding
add_header Content-Type "text/html; charset=UTF-8";
client_body_buffer_size 10M;
client_max_body_size 10M;
html_sanitize on;
# Check https://dev.w3.org/html5/html-author/#the-elements
# Root Element
html_sanitize_element html;
# Document Metadata
html_sanitize_element head title base link meta style;
# Scripting
html_sanitize_element script noscript;
# Sections
html_sanitize_element body section nav article aside h1 h2 h3 h4 h5 h6 header footer address;
# Grouping Content
html_sanitize_element p hr br pre dialog blockquote ol ul li dl dt dd;
# Text-Level Semantics
html_sanitize_element a q cite em strong small mark dfn abbr time progress meter code var samp kbd sub sup span i b bdo ruby rt rp;
# Edits
html_sanitize_element ins del;
# Embedded Content
html_sanitize_element figure img iframe embed object param video audio source canvas map area;
# Tabular Data
html_sanitize_element table caption colgroup col tbody thead tfoot tr td th;
# Forms
html_sanitize_element form fieldset label input button select datalist optgroup option textarea output;
# Interactive Elements
html_sanitize_element details command bb menu;
# Miscellaneous Elements
html_sanitize_element legend div;
html_sanitize_attribute *.style;
html_sanitize_attribute a.href a.hreflang a.name a.rel;
html_sanitize_attribute col.span col.width colgroup.span colgroup.width;
html_sanitize_attribute data.value del.cite del.datetime;
html_sanitize_attribute img.align img.alt img.border img.height img.src img.width;
html_sanitize_attribute ins.cite ins.datetime li.value ol.reversed ol.start ol.type ul.type;
html_sanitize_attribute table.align table.bgcolor table.border table.cellpadding table.cellspacing table.frame table.rules table.sortable table.summary table.width;
html_sanitize_attribute td.abbr td.align td.axis td.colspan td.headers td.rowspan td.valign td.width;
html_sanitize_attribute th.abbr th.align th.axis th.colspan th.rowspan th.scope th.sorted th.valign th.width;
html_sanitize_style_property color font-size;
html_sanitize_url_protocol http https tel;
html_sanitize_url_domain *.google.com google.com;
html_sanitize_iframe_url_protocol http https;
html_sanitize_iframe_url_domain facebook.com *.facebook.com;
}
}
It is then recommended to sanitize HTML5 with the following command:
$ curl -X POST -d "<h1>Hello World </h1>" "http://127.0.0.1:8888/sanitize?element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0"
<h1>Hello World </h1>
This query string, element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0, sets one mode per check; the individual parameters are explained below.
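The query string above can also be assembled programmatically instead of by hand. The following is a minimal Python sketch, not part of the module: the helper names `build_sanitize_url` and `sanitize` are hypothetical, and the endpoint address is the one from the example configuration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Mode values copied from the curl example above.
MODES = {
    "element": 2,
    "attribute": 1,
    "style_property": 1,
    "style_property_value": 1,
    "url_protocol": 1,
    "url_domain": 0,
    "iframe_url_protocol": 1,
    "iframe_url_domain": 0,
}


def build_sanitize_url(base="http://127.0.0.1:8888/sanitize", modes=MODES):
    """Return the endpoint URL with the mode parameters appended."""
    return f"{base}?{urlencode(modes)}"


def sanitize(html, url=None):
    """POST raw HTML to the endpoint and return the sanitized body.

    Hypothetical helper: it needs the nginx server from the example
    configuration to be running on 127.0.0.1:8888.
    """
    request = Request(url or build_sanitize_url(),
                      data=html.encode("utf-8"), method="POST")
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Building the URL with `urlencode` avoids the shell pitfall of an unquoted `&`, which would otherwise cut the URL short and run curl in the background.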
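With ngx_http_html_sanitize_module, we can specify through directives and the querystring whether to output HTML5 elements, attributes, and inline CSS properties, as follows:
Disable elements:
If we do not want to output any element, we can do this:
$ curl -X POST -d "<h1>h1</h1>" http://127.0.0.1:8888/sanitize?element=0
Enable elements:
If we want to output any element, we can do this: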
$ curl -X POST -d "<h1>h1</h1><h7>h7</h7>" http://127.0.0.1:8888/sanitize?element=1
<h1>h1</h1><h7>h7</h7>
Enable whitelisted elements:
If we want to output only whitelisted elements, we can do this:
$ curl -X POST -d "<h1>h1</h1><h7>h7</h7>" http://127.0.0.1:8888/sanitize?element=2
<h1>h1</h1>
Disable attributes:
If we do not want to output any attribute, we can do this:
$ curl -X POST -d "<h1 ha=\"ha\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=0"
<h1>h1</h1>
Enable attributes:
If we want to output any attribute, we can do this:
$ curl -X POST -d "<h1 ha=\"ha\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1"
<h1 ha="ha">h1</h1>
Enable whitelisted attributes:
If we want to output only whitelisted attributes, we can do this:
$ curl -X POST -d "<img src=\"/\" ha=\"ha\" />" "http://127.0.0.1:8888/sanitize?element=1&attribute=2"
<img src="/" />
Disable style properties:
If we do not want to output any style property, we can do this:
# This will not output any style property
$ curl -X POST -d "<h1 style=\"color:red;\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=0"
<h1>h1</h1>
Enable style properties:
If we want to output any style property, we can do this:
$ curl -X POST -d "<h1 style=\"color:red;text-align:center;\">h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=1"
<h1 style="color:red;text-align:center">h1</h1>
Enable whitelisted style properties:
If we want to output only whitelisted style properties, we can do this:
$ curl -X POST -d "<h1 style=\"color:red;text-align:center;\" >h1</h1>" "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=2"
<h1 style="color:red;">h1</h1>
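The three modes shown in the examples above (0 = output nothing, 1 = output everything, 2 = output only whitelisted names) can be summarised in a small Python model. This is an illustrative sketch of the selection rule only, assuming the mode values behave as in the curl examples; it is not the module's implementation and says nothing about how text content is handled. The function name `filter_names` is hypothetical.

```python
def filter_names(names, mode, whitelist=frozenset()):
    """Return the names (elements, attributes, or style properties)
    that survive a given mode.

    mode 0: output nothing
    mode 1: output everything
    mode 2: output only names found in the whitelist
    """
    if mode == 0:
        return []
    if mode == 1:
        return list(names)
    if mode == 2:
        return [name for name in names if name in whitelist]
    raise ValueError(f"unknown mode: {mode}")


# Mirrors the style property examples: only "color" is whitelisted.
print(filter_names(["color", "text-align"], 2, {"color", "font-size"}))
```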
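Currently, the implementation of ngx_http_html_sanitize_module is based on gumbo-parser and katana-parser. We combine them and run them on nginx as a central web service, maintained by professional security staff, to eliminate language-level differences. If we want higher performance (see the benchmark below), it is recommended to write a language-level library wrapper on top of the pure C libraries to avoid the overhead of network transfer.
Tested with wrk -s benchmarks/shot.lua -d 60s "http://127.0.0.1:8888" on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 64GB of memory.
| Name | Size | Average latency | QPS |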
|---|---|---|---|
| hacker_news.html | 30KB | 9.06ms | 2921.82 |
| baidu.html | 76KB | 13.41ms | 1815.75 |
| arabic_newspapers.html | 78KB | 16.58ms | 1112.70 |
| bbc.html | 115KB | 17.96ms | 993.12 |
| xinhua.html | 323KB | 33.37ms | 275.39 |
| google.html | 336KB | 26.78ms | 351.54 |
| yahoo.html | 430KB | 29.16ms | 323.04 |
| wikipedia.html | 511KB | 57.62ms | 160.10 |
| html5_spec.html | 7.7MB | 1.63s | 2.00 |
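To compare the rows on a common scale, the table can be converted into approximate data throughput (document size multiplied by QPS). The Python sketch below only rearranges the figures from the table; the throughput values are derived estimates, not measured results, and the conversion assumes 1 MB = 1024 KB.

```python
# (file, size in KB, QPS) copied from the benchmark table above.
BENCHMARKS = [
    ("hacker_news.html", 30, 2921.82),
    ("baidu.html", 76, 1815.75),
    ("arabic_newspapers.html", 78, 1112.70),
    ("bbc.html", 115, 993.12),
    ("xinhua.html", 323, 275.39),
    ("google.html", 336, 351.54),
    ("yahoo.html", 430, 323.04),
    ("wikipedia.html", 511, 160.10),
    ("html5_spec.html", 7.7 * 1024, 2.00),
]


def throughput_mb_per_s(size_kb, qps):
    """Estimate sanitized input volume per second in MB (1 MB = 1024 KB)."""
    return size_kb * qps / 1024


for name, size_kb, qps in BENCHMARKS:
    print(f"{name:<24} {throughput_mb_per_s(size_kb, qps):7.1f} MB/s")
```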
The tips for optimizing performance were learned from an On-CPU Flamegraph.

syntax: html_sanitize on | off
default: html_sanitize on
context: location
Specifies whether to enable the HTML sanitize handler in the location context.
syntax: html_sanitize_hash_max_size size
default: html_sanitize_hash_max_size 2048
context: location
Sets the maximum size of the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol, and iframe_url_domain hash tables.
syntax: html_sanitize_hash_bucket_size size
default: html_sanitize_hash_bucket_size 32|64|128
context: location
Sets the bucket size for the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol, and iframe_url_domain hash tables. The default value depends on the size of the processor's cache line.
syntax: html_sanitize_element element ...
default: -
context: location
Sets the whitelisted HTML5 elements. The whitelist takes effect when the querystring element is set to whitelist mode. For example:
html_sanitize_element html head body;
syntax: html_sanitize_attribute attribute ...
default: -
context: location
Sets the whitelisted HTML5 attributes. The whitelist takes effect when the querystring attribute is set to whitelist mode. For example:
html_sanitize_attribute a.href h1.class;
PS: each attribute must be written as element.attribute; the forms *.attribute (asterisk prefix) and element.* (asterisk suffix) are also supported.
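The three accepted entry forms (element.attribute, *.attribute, and element.*) suggest a lookup along the following lines. This is a minimal Python sketch under the assumption that an attribute is accepted when any of the three forms is present in the whitelist; the module itself stores its whitelists in hash tables, and its exact lookup order is not described here. The function name `attribute_allowed` is hypothetical.

```python
def attribute_allowed(element, attribute, whitelist):
    """Return True if element.attribute matches a whitelist entry.

    Assumed semantics: an exact entry, a "*.attribute" entry, or an
    "element.*" entry each allow the attribute.
    """
    candidates = (
        f"{element}.{attribute}",  # exact entry, e.g. a.href
        f"*.{attribute}",          # this attribute on any element, e.g. *.style
        f"{element}.*",            # any attribute on this element, e.g. img.*
    )
    return any(candidate in whitelist for candidate in candidates)


whitelist = {"a.href", "*.style", "img.*"}
print(attribute_allowed("a", "href", whitelist))    # exact entry
print(attribute_allowed("h1", "class", whitelist))  # no matching entry
```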
syntax: html_sanitize_style_property property ...
default: -
context: location
Sets the whitelisted CSS properties. The whitelist takes effect when the querystring style_property is set to whitelist mode. For example:
html_sanitize_style_property color background-color;
syntax: html_sanitize_url_protocol [protocol] ...
default: -
context: location
Sets the allowed URL protocols for linkable attributes. The check applies only when the URL is absolute rather than relative and the URL protocol check is enabled through the querystring url_protocol. For example:
html_sanitize_url_protocol http https tel;
syntax: html_sanitize_url_domain domain ...
default: -
context: location
Sets the allowed URL domains for linkable attributes. The check applies only when the URL is absolute rather than relative and both the URL protocol check and the URL domain check are enabled through the querystring url_protocol and the querystring url_domain. For example:
html_sanitize_url_domain *.google.com google.com;
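The protocol and domain checks described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's code: relative URLs are treated as exempt (the text says the checks apply only to absolute URLs), `*.google.com` is assumed to match any subdomain but not the bare domain, and protocol-relative URLs such as `//host/path` are not handled specially. The function name `url_allowed` is hypothetical.

```python
from urllib.parse import urlsplit


def url_allowed(url, protocols, domains, check_domain=True):
    """Return True if the URL passes the protocol and domain whitelists."""
    parts = urlsplit(url)
    if not parts.scheme:
        # Relative URL: assumed exempt, since the checks target absolute URLs.
        return True
    if parts.scheme.lower() not in protocols:
        return False
    if not check_domain:
        return True
    host = (parts.hostname or "").lower()
    for pattern in domains:
        if pattern.startswith("*."):
            # "*.google.com" is assumed to match "mail.google.com"
            # but not "google.com" or "evilgoogle.com".
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


protocols = {"http", "https", "tel"}
domains = ["*.google.com", "google.com"]
print(url_allowed("https://mail.google.com/inbox", protocols, domains))
print(url_allowed("javascript:alert(1)", protocols, domains))
```

Listing both `*.google.com` and `google.com`, as the directive example does, covers the subdomains and the bare domain under these assumed semantics.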
syntax: html_sanitize_iframe_url_protocol [protocol] ...
default: -
context: location
The same as html_sanitize_url_protocol, but applied only to the iframe.src attribute:
html_sanitize_iframe_url_protocol http https tel;
syntax: html_sanitize_iframe_url_domain domain ...
default: -
context: location
The same as html_sanitize_url_domain, but applied only to the iframe.src attribute:
html_sanitize_iframe_url_domain *.facebook.com facebook.com;
The linkable attribute is the following:
The querystring of the request URL controls the internal behavior of ngx_http_html_sanitize_module.
value: 0 or 1
default: 0
context: querystring
Specifies whether to append <!DOCTYPE> to the response body.
value: 0 or 1
default: 0
context: querystring
Specifies whether to append <html></html> to the response body.
value: 0 or 1
default: 0
context: querystring
Specifies whether to allow <script></script>.
value: 0 or 1
default: 0
context: querystring
Specifies whether to allow <style></style>.
value: 0, 1, or 2
default: 0
context: querystring
Specifies the mode of gumbo-parser, with the value as follows:
value: [0, 150)
default: 38(GUMBO_TAG_DIV)
context: querystring
Specifies the context of gumbo-parser, with a value taken from the file tag_enum.h.
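value: 0, 1, or 2
default: 0
context: querystring
Specifies the output mode for elements: 0 outputs no elements, 1 outputs any element, and 2 outputs only whitelisted elements.
value: 0, 1, or 2
default: 0
context: querystring
Specifies the output mode for attributes: 0 outputs no attributes, 1 outputs any attribute, and 2 outputs only whitelisted attributes.
value: 0, 1, or 2
default: 0
context: querystring
Specifies the output mode for CSS properties: 0 outputs no style properties, 1 outputs any style property, and 2 outputs only whitelisted style properties.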
value: 0 or 1
default: 0
context: querystring
Specifies the output mode for CSS property values, with the value as follows:
value: 0 or 1
default: 0
context: querystring
Specifies whether to check the URL protocol of linkable attributes. The value is as follows:
value: 0 or 1
default: 0
context: querystring
Specifies whether to check the URL domain of linkable attributes when the url_protocol check is enabled. The value is as follows:
value: 0 or 1
default: 0
context: querystring
The same as url_protocol, but applied only to iframe.src.
value: 0 or 1
default: 0
context: querystring
The same as url_domain, but applied only to iframe.src.
Project address: github /youzan/ngx_http_html_sanitize_module

