htmlparser2 is a fast and forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.
npm install htmlparser2
const parser = new htmlparser.Parser(handler /*: Object */, options /*?: Object */);
var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    // Called for every opening tag, with its name and attributes.
    onopentag: function(name, attribs){
        if(name === "script" && attribs.type === "text/javascript"){
            console.log("JS! Hooray!");
        }
    },
    // Called for every run of text between tags.
    ontext: function(text){
        console.log("-->", text);
    },
    // Called for every closing tag.
    onclosetag: function(tagname){
        if(tagname === "script"){
            console.log("That's it?!");
        }
    }
}, {decodeEntities: true});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
parser.end();
Output:
--> Xyz
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!
While the Parser interface closely resembles Node.js streams, it's not a 100% match. Use the WritableStream interface to process a streaming input:
const fs = require("fs");
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const parserStream = new WritableStream({
    ontext(text) {
        console.log("Streaming:", text);
    },
});
// Pipe the file into the parser; "finish" fires once all input has been consumed.
const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));
The DomHandler produces a DOM (document object model) that can be manipulated using the DomUtils helper.
const htmlparser2 = require("htmlparser2");
// htmlString holds the HTML source to parse.
const dom = htmlparser2.parseDocument(htmlString);
The DomHandler, while still bundled with this module, was moved to its own module. Have a look at that for further information.
const feed = htmlparser2.parseFeed(content, options);
Note: While the provided feed handler works for most feeds, you might want to use danmactough/node-feedparser, which is much better tested and actively maintained.
After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.
At the time of writing, the latest versions of all supported parsers show the following performance characteristics on Travis CI (please note that Travis doesn't guarantee equal conditions for all tests):
gumbo-parser : 34.9208 ms/file ± 21.4238
html-parser : 24.8224 ms/file ± 15.8703
html5 : 419.597 ms/file ± 264.265
htmlparser : 60.0722 ms/file ± 384.844
htmlparser2-dom: 12.0749 ms/file ± 6.49474
htmlparser2 : 7.49130 ms/file ± 5.74368
hubbub : 30.4980 ms/file ± 16.4682
libxmljs : 14.1338 ms/file ± 18.6541
parse5 : 22.0439 ms/file ± 15.3743
sax : 49.6513 ms/file ± 26.6032
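The handler object accepts callback keys such as onopentag, ontext, onclosetag, onreset, and onend. Note: only functions may be used as values; otherwise the parser will fail.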
The parser provides the following methods:
write: Parses a chunk of data and calls the corresponding callbacks.
end: Parses the end of the buffer and clears the stack, calls onend.
reset: Resets buffer & stack, calls onreset.
parseComplete: Resets the parser, parses the data & calls end.
The parser accepts the following options:
xmlMode: Indicates whether special tags (<script> and <style>) should get special treatment and if "empty" tags (e.g. <br>) can have children. If false, the content of special tags will be text only. For feeds and other XML content (documents that don't consist of HTML), set this to true. Default: false.
decodeEntities: If set to true, entities within the document will be decoded. Defaults to true.
lowerCaseTags: If set to true, all tags will be lowercased. If xmlMode is disabled, this defaults to true.
lowerCaseAttributeNames: If set to true, all attribute names will be lowercased. This has noticeable impact on speed, so it defaults to false.
recognizeCDATA: If set to true, CDATA sections will be recognized as text even if the xmlMode option is not enabled. NOTE: If xmlMode is set to true then CDATA sections will always be recognized as text.
recognizeSelfClosing: If set to true, self-closing tags will trigger the onclosetag event even if xmlMode is not set to true. NOTE: If xmlMode is set to true then self-closing tags will always be recognized.
Project repository: github.com/fb55/htmlparser2

