
Fetching Web Page Content with a Java Crawler

July 22, 2019 | 移动技术网 IT编程

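This article shows, by example, how to fetch the content of a web page with a Java crawler. It is shared here for reference. The details follow.

I have recently been looking into web crawling with Java and have just gotten started, so I want to share what I have learned.
Two approaches are given below: one uses a package provided by Apache, the other uses only what ships with Java.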

The code is as follows:

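// Method 1
// Uses the package provided by Apache, which is simple and convenient.
// It requires the following jars:
//   commons-codec-1.4.jar
//   commons-httpclient-3.1.jar
//   commons-logging-1.0.4.jar
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public static String createHttpClient(String url, String param) {
  HttpClient client = new HttpClient();
  String response = null;
  String keyword = null;
  PostMethod postMethod = new PostMethod(url);
//  try {
//   if (param != null)
//    keyword = new String(param.getBytes("GB2312"), "ISO-8859-1");
//  } catch (UnsupportedEncodingException e1) {
//   e1.printStackTrace();
//  }
  // NameValuePair[] data = { new NameValuePair("keyword", keyword) };
  // // Put the form values into the PostMethod
  // postMethod.setRequestBody(data);
  // The block above fetches the page with a form parameter. I commented it out;
  // uncomment it if you want to experiment.
  try {
   int statusCode = client.executeMethod(postMethod);
   if (statusCode != 200) {
    System.err.println("Unexpected HTTP status: " + statusCode);
   }
   // Decode the raw bytes directly. Note that GB2312 must match the
   // encoding of the page you are fetching.
   response = new String(postMethod.getResponseBody(), "GB2312");
   // Remove character entities and HTML tags from the page
   String p = response.replaceAll("&[a-zA-Z]{1,10};", "")
     .replaceAll("<[^>]*>", "");
   System.out.println(p);
  } catch (Exception e) {
   e.printStackTrace();
  } finally {
   // Return the connection to the client's connection manager
   postMethod.releaseConnection();
  }
  // The raw page is returned; the stripped text is only printed
  return response;
}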
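// Method 2
// Uses the URL class that ships with Java to fetch the page content
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public String getPageContent(String strUrl, String strPostRequest,
   int maxLength) {
  // Holds the page as it is read
  StringBuilder buffer = new StringBuilder();
  System.setProperty("sun.net.client.defaultConnectTimeout", "5000");
  System.setProperty("sun.net.client.defaultReadTimeout", "5000");
  try {
   URL newUrl = new URL(strUrl);
   HttpURLConnection hConnect = (HttpURLConnection) newUrl
     .openConnection();
   // Extra data for a POST request
   if (strPostRequest != null && strPostRequest.length() > 0) {
    hConnect.setDoOutput(true);
    OutputStreamWriter out = new OutputStreamWriter(hConnect
      .getOutputStream());
    out.write(strPostRequest);
    out.flush();
    out.close();
   }
   // Read the content (decoded with the platform default charset)
   BufferedReader rd = new BufferedReader(new InputStreamReader(
     hConnect.getInputStream()));
   int ch;
   // Stop at end of stream, or after maxLength characters when maxLength > 0
   for (int length = 0; (ch = rd.read()) > -1
     && (maxLength <= 0 || length < maxLength); length++)
    buffer.append((char) ch);
   String s = buffer.toString();
   // Strings are immutable, so the result of replaceAll must be assigned
   s = s.replaceAll("&[a-zA-Z]{1,10};", "").replaceAll("<[^>]*>", "");
   System.out.println(s);
   rd.close();
   hConnect.disconnect();
   // The raw page is returned; the stripped text is only printed
   return buffer.toString().trim();
  } catch (Exception e) {
   // return "Error: failed to read the page!";
   e.printStackTrace();
   return null;
  }
}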
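
Both methods above strip markup with the same pair of replaceAll calls. As a minimal sketch, that step can be pulled into one small helper so it is easy to try on a fixed string without touching the network. The class name HtmlText and the method name stripMarkup are my own and are not part of the original code. Note that entities are deleted rather than decoded, and that this regex approach leaves the contents of script and style elements in place.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the original article.
final class HtmlText {
    // Named character entities such as &nbsp; or &amp;
    private static final Pattern ENTITY = Pattern.compile("&[a-zA-Z]{1,10};");
    // Anything that looks like a tag: '<', a run of non-'>' characters, '>'
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes entities first, then tags, in the same order as the two methods above.
    static String stripMarkup(String html) {
        String withoutEntities = ENTITY.matcher(html).replaceAll("");
        return TAG.matcher(withoutEntities).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "Hello,world!" (the &nbsp; is dropped, not turned into a space)
        System.out.println(stripMarkup("<p>Hello,&nbsp;<b>world</b>!</p>"));
    }
}
```

Precompiling the two patterns avoids recompiling the regular expressions on every page, which matters once a crawler processes many pages.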

Then write a test class:

public static void main(String[] args) {
  // The URL needs an explicit scheme
  String url = "http://www.jb51.net";
  String keyword = "移动技术网";
  CreateHttpClient p = new CreateHttpClient();
  // Method 1
  String response = p.createHttpClient(url, keyword);
  // p.getPageContent(url, "post", 100500); // Method 2
}

Now check the console: the content of the page should have been printed.
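
If the console shows garbled characters instead, the hard-coded GB2312 probably does not match the page's real encoding. One way to avoid guessing is to read the charset from the Content-Type response header and fall back to a default when it is missing. The sketch below shows only that parsing step. The class name CharsetSniffer and the method name fromContentType are my own, and it assumes the server actually sends a charset parameter, which not every site does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the original article.
final class CharsetSniffer {
    // Captures the value of a charset parameter, with or without quotes
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value,
    // or the fallback when the header is absent, has no charset, or names an unknown one.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("text/html; charset=utf-8",
                StandardCharsets.ISO_8859_1);
        System.out.println(cs);
    }
}
```

In the second method, the header value is available from hConnect.getContentType(), and the resulting Charset can be passed as the second argument to the InputStreamReader constructor.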

I hope this article is helpful for your Java programming.
