
A script for batch-scraping the articles on a list page

December 12, 2017 | 移动技术网 IT编程

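Some people treat a scraping program like treasure, and to this day some are still selling them for money. Shame on them! The code below may well be rough, mind you.
The script below does not save anything to a database. With everything else in place that part is simple, so if you need it, please write it yourself, and extend the rest as you see fit. Copy the code into a page and run it to see the result.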

dim url,list_pagecode,array_articleid,i,articleid
dim content_pagecode,content_tempcode
dim content_categoryid,content_categoryname,borderid,classid,bordername,classname
dim articletitle,articleauthor,articlefrom,articlecontent

url = "http://www.webasp.net/article/class/1.htm"
list_pagecode = gethttppage(url)
list_pagecode = regexptext(list_pagecode,"打印</th></tr>","</table><table border=0 cellpadding=5",0)
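list_pagecode = regexptext(list_pagecode,"<td align=left><a href='../","'><img border=0 src='../images/authortype0.gif'",1) 'collect the article links on this list page, each followed by a comma
array_articleid = split(list_pagecode,",") 'build an array of article ids; the trailing comma leaves one empty element at the end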

for i=0 to ubound(array_articleid)-1    'stop before the empty last element left by the trailing comma
    articleid = array_articleid(i)    'article id
    content_pagecode = gethttppage("http://www.webasp.net/article/"&articleid)    'fetch the article page

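    '=========category names and ids: start=======================
    content_tempcode = regexptext(content_pagecode,"<a href=""/article/"">技术教程</a> >> ",">> 内容</td>",0)    'breadcrumb fragment; not used below, the lookups that follow still search the whole page
    content_categoryid = regexptext(content_pagecode,"<a href='../class","/'>",1)
    borderid = split(content_categoryid,",")(0)    'top-level category id
    classid = split(content_categoryid,",")(1)    'subcategory id
        '==========check whether the top-level category exists: start===============
        'if it does not exist, insert it into the database

        '==========check whether the top-level category exists: end===============
    'response.write(borderid & "," & classid & "<br />")
    content_categoryname = regexptext(content_pagecode,"/'>","</a>",1)
    bordername = split(content_categoryname,",")(0)    'top-level category name
    classname = split(content_categoryname,",")(1)    'subcategory name
        '==========check whether the subcategory exists: start===============
        'if it does not exist, insert it into the database

        '==========check whether the subcategory exists: end===============
    '=========category names and ids: end=======================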

    '=========article title and content: start=============================
    articletitle = regexptext(content_pagecode,"<tr><td align=center bgcolor=#dee2f5><strong>","</strong></td></tr>",0)
    articleauthor = regexptext(content_pagecode,"<tr><td><span class=blue>作者：</span>","</td></tr>",0)
    articlefrom = regexptext(content_pagecode,"<tr><td><span class=blue>来源：</span>","</td></tr>",0)
    articlecontent = regexptext(content_pagecode,"<tr><td class=content style=""word-wrap: break-word"" id=zoom>","</td></tr>"&vbcrlf&"        </table>"&vbcrlf&"    </td></tr></table>",0)
    '=========article title and content: end=============================
    response.write(articletitle& "<br /><br />")
    response.flush()
next
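
The loop above only prints each title. For readers who want the database step that the author leaves out, here is one possible sketch. It assumes an already opened ADODB.Connection passed in as conn and a table article with the columns title, author, source and content. Those names, and the savearticle helper itself, are hypothetical and not part of the original script.

```vbscript
'Hypothetical helper: the table, column and parameter names are assumptions.
const advarwchar = 202            'ADO DataTypeEnum: adVarWChar
const adlongvarwchar = 203        'ADO DataTypeEnum: adLongVarWChar
const adparaminput = 1            'ADO ParameterDirectionEnum: adParamInput
const adexecutenorecords = 128    'ADO ExecuteOptionEnum: adExecuteNoRecords

sub savearticle(conn, title, author, source, body)
    dim cmd
    set cmd = server.createobject("adodb.command")
    set cmd.activeconnection = conn
    cmd.commandtext = "insert into article (title, author, source, content) values (?, ?, ?, ?)"
    'Values are bound to the ? placeholders in order, so scraped text never becomes part of the SQL string.
    '& "" turns an empty (no match) result into an empty string.
    cmd.parameters.append cmd.createparameter("title", advarwchar, adparaminput, 255, left(title & "", 255))
    cmd.parameters.append cmd.createparameter("author", advarwchar, adparaminput, 255, left(author & "", 255))
    cmd.parameters.append cmd.createparameter("source", advarwchar, adparaminput, 255, left(source & "", 255))
    cmd.parameters.append cmd.createparameter("content", adlongvarwchar, adparaminput, len(body & "") + 1, body & "")
    cmd.execute , , adexecutenorecords
    set cmd = nothing
end sub
```

Inside the loop it could be called as savearticle conn, articletitle, articleauthor, articlefrom, articlecontent, just before response.flush().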

The helper functions used above:

function gethttppage(url)
    'check for the same component that is created below
    if(isobjinstalled("msxml2.xmlhttp") = false)then
        response.write "<br><br>The server does not support the msxml2.xmlhttp component"
        err.clear
        response.end
    end if
    on error resume next
    dim http
    set http=server.createobject("msxml2.xmlhttp")
    http.open "GET",url,false    'false = synchronous request
    http.send()
    if(http.readystate<>4)then
        set http=nothing
        exit function
    end if
    gethttppage=bytestobstr(http.responsebody,"gb2312")    'decode the response bytes as gb2312
    set http=nothing
    if(err.number<>0)then
        response.write "<br><br>Failed to fetch the page content"
        'response.end
        err.clear
    end if
end function

function bytestobstr(codebody,codeset)
    dim objstream
    set objstream = server.createobject("adodb.stream")
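    objstream.type = 1    'adTypeBinary: write the raw bytes first
    objstream.mode = 3    'adModeReadWrite
    objstream.open
    objstream.write codebody
    objstream.position = 0
    objstream.type = 2    'adTypeText: read the same bytes back as text
    objstream.charset = codeset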
    bytestobstr = objstream.readtext
    objstream.close
    set objstream = nothing
end function

'================================================
'Purpose: check whether a component is installed
'Returns: true  ---- installed
'         false ---- not installed
'================================================
function isobjinstalled(objname)
    on error resume next
    isobjinstalled = false
    err = 0
    dim testobj
    set testobj = server.createobject(objname)
    if(0 = err)then isobjinstalled = true
    set testobj = nothing
    err = 0
end function

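'Returns the text found between strstart and strend for every match in strng.
'n = 1: each match is followed by a comma; any other n: the matches are simply concatenated.
'strstart and strend go into the pattern unescaped, so they act as regular-expression fragments.
function regexptext(strng,strstart,strend,n)
    dim regex,match,matches,retstr
    set regex = new regexp
    regex.pattern = strstart&"([\s\S]*?)"&strend    '[\s\S] matches any character, line breaks included
    regex.ignorecase = true
    regex.global = true
    set matches = regex.execute(strng)
    for each match in matches
        if(n=1)then
            retstr = retstr & regex.replace(match.value,"$1") & ","
        else
            retstr = retstr & regex.replace(match.value,"$1")
        end if
    next
    regexptext = retstr
    set regex=nothing
end function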
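
To see how the n argument changes the result, the short example below runs regexptext over a made-up fragment of markup. The sample string and the variable names are hypothetical, and the snippet has to sit on the same page as regexptext. The outputs shown in the comments assume that the capture group in regexptext is ([\s\S]*?), that is, any character including line breaks.

```vbscript
dim sample, ids, names, parts
sample = "<a href='../1001.htm'>First</a><a href='../1002.htm'>Second</a>"    'made-up markup

ids = regexptext(sample, "<a href='../", "'>", 1)    'n = 1: a comma follows every match
response.write ids & "<br />"                        '1001.htm,1002.htm,

names = regexptext(sample, "'>", "</a>", 0)          'n = 0: matches joined with no separator
response.write names & "<br />"                      'FirstSecond

parts = split(ids, ",")
response.write ubound(parts) & "<br />"              '2: the last element is the empty string after the trailing comma
```

That empty last element is why the main loop only runs to ubound(array_articleid)-1.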
