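Crawling JD.com product information with a C# crawler

July 18, 2019 | 移动技术网 IT编程

Preface

A small project of mine needed the IDs of all products on JD.com (京东), so I wrote a simple crawler in C#.

The HTML is not parsed with regular expressions; instead, the crawler relies on the open-source library HtmlAgilityPack.

Let's go through the details.

I. Downloading the page HTML

First we write a shared method that downloads the HTML of a page.

Before writing it, look at the request headers a browser sends to JD.com, because the crawler has to send similar headers with its own requests.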

// Requires: using System; using System.IO; using System.Net; using System.Text;
public static string DownloadHtml(string url, Encoding encode)
{
    string html = string.Empty;
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30 * 1000; // 30 seconds
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";
        request.ContentType = "text/html; charset=utf-8";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            if (response.StatusCode == HttpStatusCode.OK)
            {
                // Decode the body with the caller-supplied encoding
                using (StreamReader sr = new StreamReader(response.GetResponseStream(), encode))
                {
                    html = sr.ReadToEnd();
                }
            }
        }
    }
    catch (Exception)
    {
        // Any network or read failure yields null, so callers must check the result
        html = null;
    }
    return html;
}

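As the code above shows, we use WebRequest to fetch the page. Before sending the request, we set request headers that match what a browser sends to JD.com.

The headers set here are minimal, but they are enough for the request to succeed. You can also imitate a browser more closely, for example by setting cookies.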
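As a minimal sketch of the cookie idea, the helper below attaches a cookie to a request through `HttpWebRequest.CookieContainer`. The class name `RequestSketch`, the method `BuildRequest`, and the cookie name and value are hypothetical and only illustrate the mechanism; which cookies JD.com actually expects is not covered by this article. The method only builds the request and does not send it.

```csharp
using System;
using System.Net;

public static class RequestSketch
{
    // Builds (but does not send) a GET request that carries one cookie.
    // cookieName/cookieValue are placeholders, not real JD.com cookies.
    public static HttpWebRequest BuildRequest(string url, string cookieName, string cookieValue)
    {
        Uri uri = new Uri(url);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Timeout = 30 * 1000; // 30 seconds, as in the download method
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";

        // Cookies are attached through a CookieContainer; the cookie is scoped
        // to the host of the requested URL and to the root path.
        request.CookieContainer = new CookieContainer();
        request.CookieContainer.Add(new Cookie(cookieName, cookieValue, "/", uri.Host));
        return request;
    }
}
```

Calling `RequestSketch.BuildRequest("https://www.jd.com/allsort.aspx", "demo", "1")` returns a request that can then be sent with `GetResponse()` exactly as in the download method.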

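II. Parsing the HTML

Collecting all product information takes two steps:

(1) From the category index page, collect the URL of every product category.

(2) From each category URL, collect the individual products.

1. Getting the product categories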

try
{
    // HttpHelper.DownloadUrl is the author's helper (not shown); presumably it wraps the download method above
    string html = HttpHelper.DownloadUrl(@"http://www.jd.com/allsort.aspx");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Each <dd> holds the links of one top-level category
    string goodClass = @"//*[@class='items']/dl/dd";
    HtmlNodeCollection ddNodes = doc.DocumentNode.SelectNodes(goodClass);
    if (ddNodes != null) // SelectNodes returns null when nothing matches
    {
        foreach (HtmlNode node in ddNodes)
        {
            // Direct <a> children of the <dd> are the sub-category links
            HtmlNodeCollection links = node.SelectNodes("./a");
            if (links == null)
            {
                continue;
            }
            foreach (HtmlNode link in links)
            {
                string sortUrl = link.GetAttributeValue("href", string.Empty);
                if (!string.IsNullOrWhiteSpace(sortUrl) && sortUrl.Contains("cat="))
                {
                    // The href is assumed to be protocol-relative ("//list.jd.com/...");
                    // InsertSort (not shown) stores the category URL
                    InsertSort("https:" + sortUrl);
                }
            }
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}

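The code above uses HtmlAgilityPack to parse the HTML. It is an open-source .NET library and can be installed as a NuGet package.

(1) Download the HTML of http://www.jd.com/allsort.aspx and load it into an HtmlDocument.

(2) Select the nodes to get the node collection of each top-level category.

(3) From each top-level category node, read the sub-category nodes and extract the category URL.

The nodes contain a lot of other information as well, which you can extract according to your own needs.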
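Step (3) comes down to filtering and normalizing `href` values. The sketch below isolates that logic so it can be checked without downloading anything. `CategoryUrlSketch` and `NormalizeCategoryUrl` are hypothetical names, and the assumption that category links contain `cat=` and are usually protocol-relative is taken from the crawler code above, not from any JD.com documentation.

```csharp
using System;

public static class CategoryUrlSketch
{
    // Returns an absolute category URL, or null if the href is not a category link.
    public static string NormalizeCategoryUrl(string href)
    {
        // Only links that carry a "cat=" parameter are treated as category pages.
        if (string.IsNullOrWhiteSpace(href) || !href.Contains("cat="))
        {
            return null;
        }
        href = href.Trim();

        // Protocol-relative link: prepend a scheme, as the crawler does.
        if (href.StartsWith("//", StringComparison.Ordinal))
        {
            return "https:" + href;
        }
        // Already absolute: keep unchanged.
        if (href.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
            href.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
        {
            return href;
        }
        // Anything else (for example a site-relative path) is skipped in this sketch.
        return null;
    }
}
```

Unlike the loop above, this version does not prepend the scheme to links that are already absolute, which avoids producing malformed URLs if the page mixes both forms.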

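2. Getting the product details

(1) First load the category page and collect the link of every result page in that category.

After downloading the HTML, pretty-print it to inspect the URL of each page and the rule by which the URLs are built. Then use HtmlAgilityPack to select the required node and derive the URL of every page.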

try
{
    string html = HttpHelper.DownloadUrl(@"https://list.jd.com/list.html?cat=1620,11158,11964");
    HtmlDocument productDoc = new HtmlDocument();
    productDoc.LoadHtml(html);
    // The pager element holds the total number of pages in this category
    HtmlNode pageNode = productDoc.DocumentNode.SelectSingleNode(@"//*[@id='J_topPage']/span/i");
    if (pageNode != null)
    {
        int pageNum = Convert.ToInt32(pageNode.InnerText);
        for (int i = 1; i <= pageNum; i++)
        {
            // category is the current category record from the surrounding code (not shown)
            string pageUrl = string.Format("{0}&page={1}", category.Url, i).Replace("&page=1&", string.Format("&page={0}&", i));
            try
            {
                List<ProductInfo> productInfo = GetPageProduct(pageUrl);
            }
            catch (Exception ex)
            {
                // A failure on one page should not stop the remaining pages
                Console.WriteLine(ex.Message);
            }
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
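The URL construction inside the loop above can be pulled out into a small helper, sketched below. It assumes, as the loop does, that the category URL already contains a query string and that pages are addressed by appending `&page=N`. `PagingSketch` and `BuildPageUrls` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PagingSketch
{
    // Builds one URL per result page, numbered from 1 to pageCount inclusive.
    // Assumes categoryUrl already has a query string (for example "...?cat=1620,11158,11964").
    public static List<string> BuildPageUrls(string categoryUrl, int pageCount)
    {
        List<string> urls = new List<string>();
        for (int page = 1; page <= pageCount; page++)
        {
            urls.Add(string.Format("{0}&page={1}", categoryUrl, page));
        }
        return urls;
    }
}
```

Keeping this step separate from the download makes it easy to verify the page numbering before any request is sent.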

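(2) From each page link, collect the products on that page.

Download each page; all the product information we need can be found in the page itself.

First select the node collection of the products. Then take a single product node, analyze its HTML to find where the required fields are located, and extract them.

The code below extracts the product ID, the title, and the price.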

// Body of GetPageProduct(string url); category is the current category record (not shown)
List<ProductInfo> productInfoList = new List<ProductInfo>();
try
{
    string html = HttpHelper.DownloadUrl(url);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    HtmlNodeCollection productNodeList = doc.DocumentNode.SelectNodes("//*[@id='plist']/ul/li");
    if (productNodeList == null || productNodeList.Count == 0)
    {
        return productInfoList;
    }
    foreach (HtmlNode node in productNodeList)
    {
        // Re-parse each <li> on its own so the "//" XPaths below match only this product
        HtmlDocument docChild = new HtmlDocument();
        docChild.LoadHtml(node.OuterHtml);
        ProductInfo productInfo = new ProductInfo()
        {
            CategoryId = category.Id
        };

        HtmlNode urlNode = docChild.DocumentNode.SelectSingleNode("//*[@class='p-name']/a");
        if (urlNode == null)
        {
            continue;
        }
        // The product ID is the file name of the detail-page URL (".../<id>.html")
        string newUrl = urlNode.Attributes["href"].Value;
        newUrl = !newUrl.StartsWith("http") ? "http:" + newUrl : newUrl;
        string sid = Path.GetFileName(newUrl).Replace(".html", "");
        productInfo.ProductId = long.Parse(sid);

        HtmlNode titleNode = docChild.DocumentNode.SelectSingleNode("//*[@class='p-name']/a/em");
        if (titleNode == null)
        {
            continue;
        }
        productInfo.Title = titleNode.InnerText;

        // Skip items without a price node; the price value itself is filled in by the batch call below
        HtmlNode priceNode = docChild.DocumentNode.SelectSingleNode("//*[@class='p-price']/strong/i");
        if (priceNode == null)
        {
            continue;
        }
        productInfoList.Add(productInfo);
    }
    // Fetch the prices for the whole page in one request
    GetGoodsPrice(productInfoList);
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
return productInfoList;
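The ID extraction in the code above (file name of the detail-page link with `.html` removed) can be written as a helper that fails softly instead of throwing on an unexpected link. This is a sketch under the same assumption as the crawler, namely that product links end in `<id>.html`. `SkuIdSketch` and `TryExtractSkuId` are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class SkuIdSketch
{
    // Tries to read the numeric product ID from a link such as "//item.jd.com/1234567.html".
    // Returns false instead of throwing when the link does not match that shape.
    public static bool TryExtractSkuId(string href, out long skuId)
    {
        skuId = 0;
        if (string.IsNullOrWhiteSpace(href))
        {
            return false;
        }
        // Drop any query string or fragment before looking at the file name.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;

        // File name is everything after the last '/'.
        string fileName = path.Substring(path.LastIndexOf('/') + 1);
        if (fileName.EndsWith(".html", StringComparison.OrdinalIgnoreCase))
        {
            fileName = fileName.Substring(0, fileName.Length - ".html".Length);
        }
        // Digits only: no sign, whitespace, or separators allowed.
        return long.TryParse(fileName, NumberStyles.None, CultureInfo.InvariantCulture, out skuId);
    }
}
```

With this helper, a product whose link has an unexpected form can be skipped with `continue`, rather than aborting the whole page through the outer `catch`.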

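Getting the image URLs and prices requires a careful look at the HTML to find the pattern. The price, for example, cannot be read from each product node on its own.

The code for fetching the prices in a batch is as follows: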

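// Body of GetGoodsPrice(List<ProductInfo> productInfoList)
try
{
    // One request covers every product on the page: skuIds=J_<id>%2cJ_<id>...
    string priceUrl = string.Format("http://p.3.cn/prices/mgets?callback=jQuery1069298&type=1&area=1_72_4137_0&skuIds={0}&pdbp=0&pdtk=&pdpin=&pduid=1945966343&_=1469022843655", string.Join("%2c", productInfoList.Select(c => string.Format("J_{0}", c.ProductId))));
    string html = HttpHelper.DownloadUrl(priceUrl);
    if (string.IsNullOrWhiteSpace(html))
    {
        return productInfoList;
    }
    // The response is JSONP: strip the callback wrapper "jQuery1069298( ... )" to get plain JSON
    html = html.Substring(html.IndexOf("(") + 1);
    html = html.Substring(0, html.LastIndexOf(")"));
    // CommodityPrice (not shown) maps the JSON fields id and p; JsonConvert comes from Json.NET
    List<CommodityPrice> priceList = JsonConvert.DeserializeObject<List<CommodityPrice>>(html);
    foreach (ProductInfo c in productInfoList)
    {
        CommodityPrice match = priceList.FirstOrDefault(p => p.Id.Equals(string.Format("J_{0}", c.ProductId)));
        if (match != null) // products without a returned price are left unchanged
        {
            c.Price = match.P;
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
return productInfoList;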
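Two string steps in the price code are easy to get wrong: building the ID list for the query string and stripping the JSONP callback wrapper. The sketch below isolates both so they can be checked without a network call. `PriceRequestSketch`, `BuildSkuIdsParameter`, and `UnwrapJsonp` are hypothetical names, the `J_` prefix casing is an assumption, and the sample response in the usage note is made up for illustration, not captured from the real service.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PriceRequestSketch
{
    // Joins product IDs into the form used by the price request: "J_1%2CJ_2%2C...".
    // "%2C" is the percent-encoded comma.
    public static string BuildSkuIdsParameter(IEnumerable<long> productIds)
    {
        return string.Join("%2C", productIds.Select(id => "J_" + id));
    }

    // Strips a JSONP wrapper such as callbackName( ... ); and returns the inner JSON text.
    // Returns null when the text does not look like a JSONP response.
    public static string UnwrapJsonp(string jsonp)
    {
        if (string.IsNullOrWhiteSpace(jsonp))
        {
            return null;
        }
        int start = jsonp.IndexOf('(');
        int end = jsonp.LastIndexOf(')');
        if (start < 0 || end <= start)
        {
            return null;
        }
        return jsonp.Substring(start + 1, end - start - 1);
    }
}
```

For a made-up response `cb([{"id":"J_1","p":"9.90"}]);`, `UnwrapJsonp` returns the bracketed JSON array, which can then be passed to a JSON deserializer as in the code above. Checking for a null result avoids the exception that `Substring` would throw on a malformed response.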

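That is a simple crawler for JD.com product information. You can extend it to parse more data according to your own needs.

Summary

That is all for this article. I hope it is a useful reference for your study or work. If you have questions, feel free to leave a comment, and thank you for supporting 移动技术网.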
