A Detailed Guide to Jsoup, a Java Crawler Tool

July 22, 2019 | 移动技术网 IT编程

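jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

The main features of jsoup are:

1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license, so it can be used safely in commercial projects.

jsoup can load an HTML document from a string, a URL, or a local file, and in each case produces a Document object.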
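Of the three sources, parsing a string is the easiest to try because it needs no network or file. The sketch below is only an illustration: the class name ParseFromString, the helper titleOf, and the sample HTML are made up for this example, and it assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFromString {
    // Parses an in-memory HTML string and returns the text of its <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Hypothetical sample page, not taken from any real site.
        String html = "<html><head><title>Demo page</title></head>"
                + "<body><p>Hello, jsoup</p></body></html>";

        System.out.println(titleOf(html));
        // body().text() returns the visible text of the body with tags stripped.
        System.out.println(Jsoup.parse(html).body().text());
    }
}
```

The same Document type comes back from the URL and file variants, so code written against it does not need to care where the HTML came from.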

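Put simply, jsoup first fetches the HTML source of a page and then parses it. Its built-in selectors, which cover the vast majority of needs, let us pull the data we care about out of the page. It is a powerful HTML parser, but only relatively so: the page it sees is a static one. If you want data from dynamically generated pages, you need other Java crawling techniques, which I will cover from time to time in later posts. Now let's look at how to use jsoup in practice.

I. Fetching the page

jsoup provides the parse() method for parsing HTML pages. Parsing gives us the DOM object of the whole page (a Document), and through that object you can get whatever data you need from the page. There are many ways to obtain the page; here are a few:

① Using jsoup's own connect() method

String htmlPage = Jsoup.connect("https://www.baidu.com").get().toString();

The only argument this method needs is the URL as a String, but be sure to include the protocol (http:// or https://) to avoid problems. This method solves a lot for us: we can extract the jsoup HTML-parsing code into a general-purpose utility class and then get at many different pages just by changing the parameters we concatenate into the URL. For example, product links on JD and Taobao follow a fixed pattern, so product details can be fetched by changing the third-party product ID.

String url = "https://item.jd.com/11476104681.html";

can be replaced with

String url = "https://item.jd.com/" + skuId + ".html";

By changing the product ID you can get some basic data from the page. The product image and title are easy to obtain, while the price is handled specially and has to be fetched dynamically. I won't go into that here; it will be explained later.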
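The URL concatenation can be wrapped in a small helper so that malformed IDs are caught before any request is made. This is a minimal sketch under stated assumptions: the class ProductUrls and the method jdItemUrl are hypothetical names, and the rule that a SKU ID consists only of digits is an assumption based on the single example ID shown here, not something JD documents.

```java
public class ProductUrls {
    private static final String JD_ITEM_PREFIX = "https://item.jd.com/";

    // Builds the product page URL for a SKU ID.
    // Assumption: SKU IDs are purely numeric, as in the example 11476104681.
    static String jdItemUrl(String skuId) {
        if (skuId == null || !skuId.matches("\\d+")) {
            throw new IllegalArgumentException("skuId must be numeric: " + skuId);
        }
        return JD_ITEM_PREFIX + skuId + ".html";
    }

    public static void main(String[] args) {
        System.out.println(jdItemUrl("11476104681"));
    }
}
```

Validating the ID first keeps stray characters such as slashes or query strings from silently changing which page is requested.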

② Fetching the static page directly with HttpClient

Here is part of a utility class that fetches pages with HttpClient:


import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.ParseException;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;
/**
 * HTTP request utility class.
 * Note: DefaultHttpClient is deprecated as of HttpClient 4.3; it is kept here as in the original code.
 * @author luolong
 * @since 20150513
 *
 */
public class HttpClientUtils {
  /**
   * Sends a POST request.
   * @param url request URL.
   * @param params form parameters.
   * @return the response body as a String, or null if the request failed
   */
  public static String post(String url, Map<String, String> params) {
    DefaultHttpClient httpClient = new DefaultHttpClient();
    String body = null;

    HttpPost post = postForm(url, params);

    body = invoke(httpClient, post);

    httpClient.getConnectionManager().shutdown();

    return body;
  }

  /**
   * Sends a GET request.
   * @param url request URL.
   * @return the response body as a String, or null if the request failed
   */
  public static String get(String url) {
    DefaultHttpClient httpClient = new DefaultHttpClient();
    String body = null;

    HttpGet get = new HttpGet(url);
    body = invoke(httpClient, get);

    httpClient.getConnectionManager().shutdown();

    return body;
  }
  /**
   * Executes the request and reads the response body.
   * @param httpClient DefaultHttpClient.
   * @param request the GET or POST request to execute.
   * @return the response body as a String, or null if the request failed
   */
  private static String invoke(DefaultHttpClient httpClient,
      HttpUriRequest request) {

    HttpResponse response = sendRequest(httpClient, request);
    String body = parseResponse(response);

    return body;
  }

  /**
   * Reads the response entity into a String.
   * @param response the HTTP response; may be null if sending failed
   * @return the body, or null if there was no response or reading failed
   */
  @SuppressWarnings({ "deprecation", "unused" })
  private static String parseResponse(HttpResponse response) {
    // sendRequest() returns null when the request throws, so guard against it here.
    if (response == null) {
      return null;
    }
    HttpEntity entity = response.getEntity();

    String charset = EntityUtils.getContentCharSet(entity);

    String body = null;
    try {
      body = EntityUtils.toString(entity);
    } catch (ParseException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }

    return body;
  }

  private static HttpResponse sendRequest(DefaultHttpClient httpClient,
      HttpUriRequest request) {
    HttpResponse response = null;

    try {
      response = httpClient.execute(request);
    } catch (ClientProtocolException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    return response;
  }

  @SuppressWarnings("deprecation")
  private static HttpPost postForm(String url, Map<String, String> params) {

    HttpPost httpPost = new HttpPost(url);
    List<NameValuePair> nvps = new ArrayList<NameValuePair>();

    // Convert the parameter map into URL-encoded form fields.
    Set<String> keySet = params.keySet();
    for (String key : keySet) {
      nvps.add(new BasicNameValuePair(key, params.get(key)));
    }
    try {
      httpPost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
    } catch (UnsupportedEncodingException e) {
      e.printStackTrace();
    }

    return httpPost;
  }
}

Calling get() returns the HTML page as a String:

String content = HttpClientUtils.get(url);

Alternatively, you can download the page to a local file and then parse that HTML document:

File input = new File(filePath);
Document doc = Jsoup.parse(input, "UTF-8", url);
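A String fetched this way still has to be handed to jsoup before it can be queried. The sketch below shows one way to do that with Jsoup.parse(html, baseUri); passing the page URL as the base URI lets jsoup turn relative links into absolute ones. The class ParseFetchedHtml, the helper firstAbsoluteLink, and the inline HTML are hypothetical stand-ins, with the literal string taking the place of the value an HTTP GET would return so that the example runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseFetchedHtml {
    // Parses HTML that was fetched separately and returns the first link,
    // resolved against baseUri. Returns "" when the page contains no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by an HTTP GET of the page.
        String content = "<div><a href=\"/more/\">More</a></div>";
        System.out.println(firstAbsoluteLink(content, "https://www.baidu.com/"));
    }
}
```

The file-based Jsoup.parse(input, "UTF-8", url) call takes its third argument for the same reason: it is the base URI used to resolve relative links.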

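II. Parsing the page to get the data you need

Once you have the page's DOM object, the rest is simple: you only need to operate on this DOM object to get at all the static content of the page. Dynamically loaded content is not included; it will be covered later.

First, here is a piece of the Baidu home page source: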

 </form>
    <div id="m"></div>
   </div>
   </div>
   <div id="u">
   <a class="toindex" href="/" rel="external nofollow" >百度首页</a>
   <a href="javascript:;" rel="external nofollow" name="tj_settingicon" class="pf">设置<i class="c-icon c-icon-triangle-down"></i></a>
   <a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3a%2f%2fwww.baidu.com%2f" rel="external nofollow" rel="external nofollow" name="tj_login" class="lb" onclick="return false;">登录</a>
   </div>
   <div id="u1">
   <a href="http://news.baidu.com" rel="external nofollow" name="tj_trnews" class="mnav">新闻</a>
   <a href="http://www.hao123.com" rel="external nofollow" name="tj_trhao123" class="mnav">hao123</a>
   <a href="http://map.baidu.com" rel="external nofollow" name="tj_trmap" class="mnav">地图</a>
   <a href="http://v.baidu.com" rel="external nofollow" name="tj_trvideo" class="mnav">视频</a>
   <a href="http://tieba.baidu.com" rel="external nofollow" name="tj_trtieba" class="mnav">贴吧</a>
   <a href="http://xueshu.baidu.com" rel="external nofollow" name="tj_trxueshu" class="mnav">学术</a>
   <a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3a%2f%2fwww.baidu.com%2f" rel="external nofollow" rel="external nofollow" name="tj_login" class="lb" onclick="return false;">登录</a>
   <a href="http://www.baidu.com/gaoji/preferences.html" rel="external nofollow" name="tj_settingicon" class="pf">设置</a>
   <a href="http://www.baidu.com/more/" rel="external nofollow" name="tj_briicon" class="bri" style="display: block;">更多产品</a>
   </div>
  </div>
  </div> 
  <div class="s_tab" id="s_tab"> 
  <b>网页</b>
  <a href="http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'news'})">新闻</a>
  <a href="http://tieba.baidu.com/f?kw=&fr=wwwt" rel="external nofollow" wdfield="kw" onmousedown="return c({'fm':'tab','tab':'tieba'})">贴吧</a>
  <a href="http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'zhidao'})">知道</a>
  <a href="http://music.baidu.com/search?fr=ps&ie=utf-8&key=" rel="external nofollow" wdfield="key" onmousedown="return c({'fm':'tab','tab':'music'})">音乐</a>
  <a href="http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'pic'})">图片</a>
  <a href="http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'video'})">视频</a>
  <a href="http://map.baidu.com/m?word=&fr=ps01000" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'map'})">地图</a>
  <a href="http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'wenku'})">文库</a>
  <a href="//www.baidu.com/more/" rel="external nofollow" onmousedown="return c({'fm':'tab','tab':'more'})">更多»</a> 
  </div> 
  <div class="qrcodecon"> 
  <div id="qrcode"> 
   <div class="qrcode-item qrcode-item-1"> 
   <div class="qrcode-img"></div> 
   <div class="qrcode-text"> 
    <p><b>手机百度</b></p> 
   </div> 
   </div> 
  </div> 
  </div> 
  <div id="ftcon"> 
  <div class="ftcon-wrapper">
   <div id="ftconw">
   <p id="lh"><a id="setf" href="//www.baidu.com/cache/sethelp/help.html" rel="external nofollow" onmousedown="return ns_c({'fm':'behs','tab':'favorites','pos':0})" target="_blank">把百度设为主页</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_about'})" href="http://home.baidu.com" rel="external nofollow" >关于百度</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_about_en'})" href="http://ir.baidu.com" rel="external nofollow" >about  baidu</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_tuiguang'})" href="http://e.baidu.com/?refer=888" rel="external nofollow" >百度推广</a></p>
   <p id="cp">©2017 baidu <a href="http://www.baidu.com/duty/" rel="external nofollow" onmousedown="return ns_c({'fm':'behs','tab':'tj_duty'})">使用百度前必读</a> <a href="http://jianyi.baidu.com/" rel="external nofollow" class="cp-feedback" onmousedown="return ns_c({'fm':'behs','tab':'tj_homefb'})">意见反馈</a> 京icp证030173号 <i class="c-icon-icrlogo"></i> <a id="jgwab" target="_blank" href="http://www.beian.gov.cn/portal/registersysteminfo?recordcode=11000002000001" rel="external nofollow" >京公网安备11000002000001号</a> <i class="c-icon-jgwablogo"></i></p>
   </div>
  </div>
  </div> 
  <div id="wrapper_wrapper"> 
  </div> 
 </div> 
 <div class="c-tips-container" id="c-tips-container"></div>

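Next, here are some commonly used jsoup methods for locating specific elements in the DOM object:

Method                                  Description
getElementsByClass()                    Locates elements by class attribute; returns all elements that have that class
getElementsByTag()                      Locates elements by tag name; returns all elements with that tag name
getElementById()                        Locates an element by its id; this is a precise lookup, because ids are rarely duplicated within a page
getElementsByAttributeValue()           Locates elements by attribute name and attribute value; also returns a collection of matches
getElementsByAttributeValueMatching()   Locates elements whose attribute value matches a regular expression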
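The methods in the table can be tried on a small fragment without touching the network. The following sketch is an illustration only: the class LookupMethodsDemo, the method describe, and the HTML constant are hypothetical, and the fragment is a trimmed stand-in for the id="u1" block above with English link labels.

```java
import java.util.Arrays;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LookupMethodsDemo {
    // Trimmed, hypothetical stand-in for the id="u1" block (labels in English).
    static final String HTML =
            "<div id=\"u1\">"
          + "<a href=\"http://news.baidu.com\" name=\"tj_trnews\" class=\"mnav\">News</a>"
          + "<a href=\"http://map.baidu.com\" name=\"tj_trmap\" class=\"mnav\">Map</a>"
          + "<a href=\"http://www.baidu.com/more/\" name=\"tj_briicon\" class=\"bri\">More</a>"
          + "</div>";

    // Runs each lookup method from the table once and collects the results as strings.
    static List<String> describe(String html) {
        Document doc = Jsoup.parse(html);

        Element u1 = doc.getElementById("u1");                     // single element, by id
        Elements anchors = u1.getElementsByTag("a");               // all <a> under #u1
        Elements navLinks = doc.getElementsByClass("mnav");        // all elements with class mnav
        Elements mapLink = doc.getElementsByAttributeValue("name", "tj_trmap");
        Elements trLinks = doc.getElementsByAttributeValueMatching("name", "^tj_tr");

        return Arrays.asList(
                String.valueOf(anchors.size()),
                String.valueOf(navLinks.size()),
                mapLink.attr("href"),   // attribute of the first match
                trLinks.text());        // texts of all matches, joined by spaces
    }

    public static void main(String[] args) {
        for (String line : describe(HTML)) {
            System.out.println(line);
        }
    }
}
```

Note that getElementById() returns a single Element (or null when the id is absent), while the other four return an Elements collection that may be empty.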

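For example, say I want to get the title 百度首页 (Baidu home page). First we have to find out where it lives. Inspecting the source shows that it is a child element of the div with id="u", so we first fetch that element by its id and then get the title from it. The concrete steps are:

// Fetch the page
String startPage = "https://www.baidu.com";

Document document = Jsoup.connect(startPage).userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36").get();

// Locate the parent element
Element parentElement = document.getElementById("u");

// Locate the target element: the first <a> under the parent
Element titleElement = parentElement.getElementsByTag("a").get(0);

// Extract the data we want
String title = titleElement.text();

System.out.println(title);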

Or suppose I need the text 手机百度 (Mobile Baidu) from the page:

String startPage = "https://www.baidu.com";

Document document = Jsoup.connect(startPage).userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36").get();

Element elementById = document.getElementById("qrcode");

// getAllElements() includes the element itself at index 0, so index 1 is its first descendant
String text = elementById.getAllElements().get(0).getAllElements().get(1).getElementsByTag("b").text();

System.out.println(text);
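Both lookups can also be written with the CSS selectors mentioned at the start of the article, which avoids counting positions in getAllElements(). The sketch below is one possible variant, not the approach used above: the class SelectorDemo, the helper firstText, and the HTML constant are hypothetical, and the fragment is a trimmed copy of the Baidu source with the two texts translated to English so that it runs offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Trimmed, hypothetical copy of the relevant parts of the fragment (texts in English).
    static final String HTML =
            "<div id=\"u\"><a class=\"toindex\" href=\"/\">Baidu Home</a></div>"
          + "<div id=\"qrcode\"><div class=\"qrcode-item qrcode-item-1\">"
          + "<div class=\"qrcode-img\"></div>"
          + "<div class=\"qrcode-text\"><p><b>Mobile Baidu</b></p></div>"
          + "</div></div>";

    // Returns the text of the first element matching cssQuery, or "" if nothing matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.select(cssQuery).first();
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        System.out.println(firstText(HTML, "#u a.toindex"));
        System.out.println(firstText(HTML, "#qrcode .qrcode-text b"));
        // A selector that matches nothing yields "" instead of a NullPointerException.
        System.out.println(firstText(HTML, "#missing b").isEmpty());
    }
}
```

The null check matters on live pages: if the site changes its markup, getElementById() or first() returns null, and an unguarded call on the result fails.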

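That is all it takes to write a very simple crawler. jsoup is powerful, and it is well suited to crawling static pages that have no dynamically loaded content.

Thanks for reading; I hope this helps.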
