
htmlparser2: a fast and forgiving HTML/XML/RSS parser

Date: 12-14

htmlparser2 is a fast and forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.

Installation

npm install htmlparser2

Usage

const parser = new htmlparser.Parser(handler /*: Object */, options /*?: Object */);

Example

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
  onopentag: function(name, attribs){
    if(name === "script" && attribs.type === "text/javascript"){
      console.log("JS! Hooray!");
    }
  },
  ontext: function(text){
    console.log("-->", text);
  },
  onclosetag: function(tagname){
    if(tagname === "script"){
      console.log("That's it?!");
    }
  }
}, {decodeEntities: true});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
parser.end();

Output:

--> Xyz 
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!

Usage with streams

While the Parser interface closely resembles Node.js streams, it's not a 100% match. Use the WritableStream interface to process a streaming input:

const fs = require("fs"); // needed for createReadStream below
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const parserStream = new WritableStream({
  ontext(text) {
    console.log("Streaming:", text);
  },
});

const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));

Getting a DOM

The DomHandler produces a DOM (document object model) that can be manipulated using the DomUtils helper.

const htmlparser2 = require("htmlparser2");

// htmlString is the HTML source to parse, as a string
const dom = htmlparser2.parseDocument(htmlString);

The DomHandler, while still bundled with this module, was moved to its own module. Have a look at that for further information.

Parsing RSS/RDF/Atom Feeds

const feed = htmlparser2.parseFeed(content, options);

Note: While the provided feed handler works for most feeds, you might want to use danmactough/node-feedparser, which is much better tested and actively maintained.

Performance

After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on Travis CI (please note that Travis doesn't guarantee equal conditions for all tests):

gumbo-parser   : 34.9208 ms/file ± 21.4238
html-parser    : 24.8224 ms/file ± 15.8703
html5          : 419.597 ms/file ± 264.265
htmlparser     : 60.0722 ms/file ± 384.844
htmlparser2-dom: 12.0749 ms/file ± 6.49474
htmlparser2    : 7.49130 ms/file ± 5.74368
hubbub         : 30.4980 ms/file ± 16.4682
libxmljs       : 14.1338 ms/file ± 18.6541
parse5         : 22.0439 ms/file ± 15.3743
sax            : 49.6513 ms/file ± 26.6032

Events

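The handler object may define any of the following keys. Note that only functions are valid values; anything else will cause the parser to fail: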

  • onopentag(name /*: string */, attributes /*: { [attributeName: string]: string } */)
  • onopentagname(name /*: string */)
  • onattribute(name /*: string */, value /*: string */)
  • ontext(text /*: string */)
  • onclosetag(name /*: string */)
  • onprocessinginstruction(name /*: string */, data /*: string */)
  • oncomment(data /*: string */)
  • oncommentend()
  • oncdatastart()
  • oncdataend()
  • onerror(error /*: Error */)
  • onreset()
  • onend()

Methods

write (alias: parseChunk)

Parses a chunk of data and calls the corresponding callbacks.

end (alias: done)

Parses the end of the buffer and clears the stack, calls onend.

reset

Resets buffer & stack, calls onreset.

parseComplete

Resets the parser, parses the data & calls end.

Option: xmlMode

Indicates whether special tags (<script> and <style>) should get special treatment and whether "empty" tags (e.g. <br>) can have children. If false, the content of special tags will be text only.

For feeds and other XML content (documents that don't consist of HTML), set this to true. Default: false.

Option: decodeEntities

If set to true, entities within the document will be decoded. Defaults to true.

Option: lowerCaseTags

If set to true, all tags will be lowercased. If xmlMode is disabled, this defaults to true.

Option: lowerCaseAttributeNames

If set to true, all attribute names will be lowercased. This has noticeable impact on speed, so it defaults to false.

Option: recognizeCDATA

If set to true, CDATA sections will be recognized as text even if the xmlMode option is not enabled. NOTE: If xmlMode is set to true then CDATA sections will always be recognized as text.

Option: recognizeSelfClosing

If set to true, self-closing tags will trigger the onclosetag event even if xmlMode is not set to true. NOTE: If xmlMode is set to true then self-closing tags will always be recognized.

Project home: GitHub, fb55/htmlparser2
