Example: Scraping Web Page Content with HtmlAgilityPack and XPath

Date: 12-15

This article uses a simple example to show how to scrape a web page with HtmlAgilityPack and XPath. The same approach can be used for tasks such as data collection.

First, install the latest version of HtmlAgilityPack from NuGet. At the time of writing, the latest stable version is 1.8.5.

The example scrapes an article from Autohome's Chejiahao (汽车之家车家号). The URL is https://chejiahao.autohome.com.cn/info/2606070#pvareaid=2808158

To make the results easy to see, I created a WinForms application and used a few simple controls to display the data.

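To use HtmlAgilityPack you first need to understand XPath syntax. Looking at the page source, we find that the article body is inside a div whose class is article-content example, so we need to locate that div node within the whole HTML document. In XPath this is written //div[@class='article-content example']. Note that the predicate compares the entire value of the class attribute as a string, so it must be class='article-content example'. Writing class='article-content' will not find the node.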
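To make the point about the class attribute concrete, here is a minimal sketch that runs three XPath expressions against a small inline HTML string. The sample markup and the ClassMatchDemo and Matches names are made up for illustration. The third expression is a common XPath 1.0 idiom for matching a single class name among several. It is not something the original code uses, but it can be handy if the page adds or reorders class names.

```csharp
using System;
using HtmlAgilityPack;

public static class ClassMatchDemo
{
    // Hypothetical sample markup, modeled on the class value described above.
    public const string SampleHtml =
        "<html><body><div class='article-content example'><p>Hello</p></div></body></html>";

    // Returns true if the XPath expression selects at least one node in the HTML.
    public static bool Matches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches.
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }

    public static void Main()
    {
        // Compares against the whole attribute value: matches.
        Console.WriteLine(Matches(SampleHtml, "//div[@class='article-content example']")); // True

        // Only one of the two class names: the strings differ, so no match.
        Console.WriteLine(Matches(SampleHtml, "//div[@class='article-content']")); // False

        // Token-style match: pad the class list with spaces and look for " article-content ".
        Console.WriteLine(Matches(SampleHtml,
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' article-content ')]")); // True
    }
}
```

The token-style expression is longer, but it keeps working if the site later changes the class list to something like "example article-content".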

The following simple example gets the page HTML, the page title, and the text of the p tags:

// Requires: using HtmlAgilityPack; (from the HtmlAgilityPack NuGet package)
private void button1_Click(object sender, EventArgs e)
{
    string url = tbUrl.Text.Trim();

    // Download and parse the page, then show the raw HTML.
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument hd = hw.Load(url);
    richTextBox1.Text = hd.Text;

    // Page title.
    var title = hd.DocumentNode.SelectSingleNode("//title");
    if (title != null)
    {
        tbTitle.Text = title.InnerText;
    }

    // Article body: collect the text of every non-empty p under the content div.
    string article = string.Empty;
    var contentDiv = hd.DocumentNode.SelectSingleNode("//div[@class='article-content example']");
    if (contentDiv != null)
    {
        // SelectNodes returns null when nothing matches, hence the null check.
        var pList = contentDiv.SelectNodes(".//p");
        if (pList != null)
        {
            foreach (var p in pList)
            {
                if (!string.IsNullOrWhiteSpace(p.InnerText))
                {
                    article += p.InnerText + Environment.NewLine;
                }
            }
        }
    }
    richTextBox2.Text = article.Trim();
}

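The code first finds the div node with class='article-content example', then selects the p nodes under it (.//p matches p descendants at any depth, not only direct children). It loops over those p nodes, takes the text of each one, and displays the result in richTextBox2.

[Figure: result of running the first example]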
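The same extraction logic can also be tried without a form or a network request. The following is a minimal sketch that works on an inline HTML string. The ParagraphExtractor and ExtractParagraphs names and the sample markup are hypothetical. Depending on the page, InnerText can contain undecoded character entities such as &amp;, so the sketch passes the text through HtmlEntity.DeEntitize, which the original code does not do.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphExtractor
{
    // Returns the non-empty paragraph texts found under the content div.
    public static List<string> ExtractParagraphs(string html)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var contentDiv = doc.DocumentNode.SelectSingleNode("//div[@class='article-content example']");
        if (contentDiv == null) return result;

        // SelectNodes returns null (not an empty list) when nothing matches.
        var pList = contentDiv.SelectNodes(".//p");
        if (pList == null) return result;

        foreach (var p in pList)
        {
            // Decode entities such as &amp; and drop surrounding whitespace.
            string text = HtmlEntity.DeEntitize(p.InnerText).Trim();
            if (text.Length > 0) result.Add(text);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample markup for illustration only.
        string html =
            "<div class='article-content example'>" +
            "<p>Tom &amp; Jerry</p><p>  </p><p>Second</p>" +
            "</div>";

        foreach (var line in ExtractParagraphs(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

Returning a list instead of writing straight into a control keeps the parsing separate from the UI, which makes it easier to check against saved HTML.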

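What if we want to collect both the images and the text, in the order they appear? Looking at the page's HTML again, we can see that the content div has two kinds of child divs. One is <div ahe-role="image" class="ahe__area ahe__block ahe__image">, which contains an image. The other is <div ahe-role="text" class="ahe__area ahe__block ahe__text">, which contains paragraph text.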

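So we can iterate over the child nodes of the content div and use each child's ahe-role attribute to decide whether it holds an image or text. For an image, we select the img tags under that node and read the image URL from them. Note that this page lazy-loads its images, so the real image URL is in the img tag's data-original attribute. For paragraph text, we take the text of the p tags under that child node. The implementation is as follows: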

// Requires: using System.Linq; using HtmlAgilityPack;
private void button2_Click(object sender, EventArgs e)
{
    string url = tbUrl.Text.Trim();
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument hd = hw.Load(url);
    richTextBox1.Text = hd.Text;
    var title = hd.DocumentNode.SelectSingleNode("//title");
    if (title != null)
    {
        tbTitle.Text = title.InnerText;
    }
    string article = string.Empty;
    var contentDiv = hd.DocumentNode.SelectSingleNode("//div[@class='article-content example']");
    if (contentDiv != null)
    {
        foreach (var child in contentDiv.ChildNodes)
        {
            // Walk the children of the content div and look for the ahe-role attribute.
            var role = child.Attributes.FirstOrDefault(x => x.Name == "ahe-role");
            if (role != null)
            {
                // ahe-role="image" means this node contains an image.
                if (role.Value == "image")
                {
                    var images = child.SelectNodes(".//img");
                    if (images != null)
                    {
                        foreach (var image in images)
                        {
                            // Images are lazy-loaded, so the real URL is in data-original.
                            var imageUrlAttr = image.Attributes.FirstOrDefault(x => x.Name == "data-original");
                            if (imageUrlAttr != null)
                            {
                                article += imageUrlAttr.Value + Environment.NewLine;
                            }
                        }
                    }
                }
                // ahe-role="text" means this node contains paragraph text.
                else if (role.Value == "text")
                {
                    var pList = child.SelectNodes(".//p");
                    if (pList != null)
                    {
                        foreach (var p in pList)
                        {
                            if (!string.IsNullOrWhiteSpace(p.InnerText))
                            {
                                article += p.InnerText + Environment.NewLine;
                            }
                        }
                    }
                }
            }
        }
    }
    richTextBox2.Text = article.Trim();
}

[Figure: result of running the second example]
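As a variation on the handler above, the ordered walk can be sketched as a UI-free function that returns the items in document order. The OrderedExtractor, Extract, and GetImageUrl names, the sample markup, and the example.com URL are all hypothetical. Falling back to src when data-original is missing is my own assumption about how you might want to treat images that are not lazy-loaded; the original code simply skips them.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OrderedExtractor
{
    // Prefer the lazy-load attribute; fall back to src (assumption, see lead-in).
    public static string GetImageUrl(HtmlNode img)
    {
        string lazy = img.GetAttributeValue("data-original", "");
        return lazy.Length > 0 ? lazy : img.GetAttributeValue("src", "");
    }

    // Returns image URLs and paragraph texts in the order they appear on the page.
    public static List<string> Extract(string html)
    {
        var items = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Direct child divs of the content div that carry an ahe-role attribute.
        var blocks = doc.DocumentNode.SelectNodes(
            "//div[@class='article-content example']/div[@ahe-role]");
        if (blocks == null) return items;

        foreach (var block in blocks)
        {
            string role = block.GetAttributeValue("ahe-role", "");
            string xpath = role == "image" ? ".//img" : role == "text" ? ".//p" : null;
            if (xpath == null) continue; // unknown role: skip the block

            var nodes = block.SelectNodes(xpath);
            if (nodes == null) continue;

            foreach (var node in nodes)
            {
                string value = role == "image" ? GetImageUrl(node) : node.InnerText.Trim();
                if (value.Length > 0) items.Add(value);
            }
        }
        return items;
    }

    public static void Main()
    {
        // Hypothetical sample markup that mimics the structure described in the article.
        string html =
            "<div class='article-content example'>" +
            "<div ahe-role='text'><p>First paragraph</p></div>" +
            "<div ahe-role='image'><img src='placeholder.gif' data-original='https://example.com/a.jpg'></div>" +
            "<div ahe-role='text'><p>Second paragraph</p></div>" +
            "</div>";

        foreach (var item in Extract(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

GetAttributeValue takes a default value, so it avoids the separate null check that looking the attribute up through Attributes requires.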

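When you write real code, choose extraction rules that fit the actual HTML layout of the target page. With this approach you can easily scrape useful information from web pages and build crawler-style features.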
