抓取(爬取)网上信息的脚本程序,俗称网络蜘蛛。
powershell中自带了这样的两个命令,【invoke-webrequest】和【invoke-restmethod】,但这两个命令有时候会乱码。
现在转帖分享, 某个【歪果仁】写的脚本。来源于 墙外出处:
核心代码
function read-htmlpage { param ([parameter(mandatory=$true, position=0, valuefrompipeline=$true)][string] $uri) # invoke-webrequest and invoke-restmethod can't work properly with utf-8 response so we need to do things this way. [net.httpwebrequest]$webrequest = [net.webrequest]::create($uri) [net.httpwebresponse]$webresponse = $webrequest.getresponse() $reader = new-object io.streamreader($webresponse.getresponsestream()) $response = $reader.readtoend() $reader.close() # create the document class [mshtml.htmldocumentclass] $doc = new-object -com "htmlfile" $doc.ihtmldocument2_write($response) # returns a htmldocumentclass instance just like invoke-webrequest parsedhtml $doc #powershell 传教士 转帖并修改的文章 2016-01-01, 允许再次转载,但必须保留名字和出处,否则追究法律责任 }
原文函数
function read-htmlpage { param ([parameter(mandatory=$true, position=0, valuefrompipeline=$true)][string] $uri) # invoke-webrequest and invoke-restmethod can't work properly with utf-8 response so we need to do things this way. [net.httpwebrequest]$webrequest = [net.webrequest]::create($uri) [net.httpwebresponse]$webresponse = $webrequest.getresponse() $reader = new-object io.streamreader($webresponse.getresponsestream()) $response = $reader.readtoend() $reader.close() # create the document class [mshtml.htmldocumentclass] $doc = new-object -com "htmlfile" $doc.ihtmldocument2_write($response) # returns a htmldocumentclass instance just like invoke-webrequest parsedhtml $doc }
powershell function you can use for reading utf8 encoded html pages content. the built in invoke-webrequest and invoke-restmethod fail miserably.
您可能感兴趣的文章:
如您对本文有疑问或者有任何想说的,请点击进行留言回复,万千网友为您解惑!
网友评论