Chinese characters in a URL are percent-encoded as UTF-8 (URL encoding), so every character becomes a sequence of encoded bytes.
If you copy and paste such a URL, what you get is not the Chinese text but the encoded bytes, for example:
https://www.baidu.com/s?wd=%e7%bc%96%e7%a8%8b%e5%90%a7
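We can verify this round trip ourselves: `urllib.parse.unquote` decodes the percent-encoded bytes back into Chinese characters, and `urllib.parse.quote` performs the encoding. A minimal sketch (note that Python emits uppercase hex digits, while the hand-written URL above happens to use lowercase; the two forms are equivalent):

```python
from urllib.parse import quote, unquote

encoded = "%e7%bc%96%e7%a8%8b%e5%90%a7"
print(unquote(encoded))   # decode the bytes back to the original Chinese text
print(quote("编程吧"))     # encode the text; same bytes, uppercase hex digits
```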
We can also do this conversion in Python with urllib.parse.urlencode:
import urllib.parse  # note: "import urllib.parse.urlencode" is invalid; import the module

url = "http://www.baidu.com/s?"
wd = {"wd": "编程吧"}
out = urllib.parse.urlencode(wd)
print(out)
The output is: wd=%E7%BC%96%E7%A8%8B%E5%90%A7
import urllib.parse
import urllib.request

url = "http://www.baidu.com/s?"
keyword = input("please input query: ")
wd = {"wd": keyword}
wd = urllib.parse.urlencode(wd)
fullurl = url + wd  # url already ends with "?", so just append the encoded query
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
request = urllib.request.Request(fullurl, headers=headers)
response = urllib.request.urlopen(request)
html = response.read()
print(html)
For a Tieba crawler (here, the 编程吧 forum) that pages through results, we can observe the pattern in the page URLs:
page 1: http://tieba.baidu.com/f?kw=%e7%bc%96%e7%a8%8b&ie=utf-8&pn=0
page 2: http://tieba.baidu.com/f?kw=%e7%bc%96%e7%a8%8b&ie=utf-8&pn=50
page 3: http://tieba.baidu.com/f?kw=%e7%bc%96%e7%a8%8b&ie=utf-8&pn=100
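The pattern is that each page advances `pn` by 50 (50 posts per page), so page n maps to pn = (n - 1) * 50. A small sketch of a URL builder for this pattern (the helper name `tieba_page_url` is my own, not from the original code):

```python
from urllib.parse import urlencode

def tieba_page_url(kw, page, base="http://tieba.baidu.com/f?"):
    # page 1 -> pn=0, page 2 -> pn=50, page 3 -> pn=100, ...
    params = {"kw": kw, "ie": "utf-8", "pn": (page - 1) * 50}
    return base + urlencode(params)

for p in range(1, 4):
    print(tieba_page_url("编程", p))
```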
import urllib.request
import urllib.parse


def loadpage(url, filename):
    """
    Send a request to the given url.
    url: the address to fetch
    filename: the name of the file being processed (used for logging)
    """
    print("Downloading", filename)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    html = response.read()
    return html


def writepage(html, filename):
    """
    Write the html content to a local file.
    html: the response body returned by the server
    """
    print("Saving", filename)
    with open(filename, "wb") as f:
        f.write(html)
    print("-" * 30)


def tiebaspider(url, beginpage, endpage):
    """
    Crawler scheduler: builds the full url for each page and processes it.
    """
    for page in range(beginpage, endpage + 1):
        pn = (page - 1) * 50
        filename = "page_" + str(page) + ".html"
        fullurl = url + "&pn=" + str(pn)
        html = loadpage(fullurl, filename)
        writepage(html, filename)


if __name__ == "__main__":
    kw = input("please input query: ")
    beginpage = int(input("start page: "))
    endpage = int(input("end page: "))
    url = "http://tieba.baidu.com/f?"
    key = urllib.parse.urlencode({"kw": kw})
    fullurl = url + key
    tiebaspider(fullurl, beginpage, endpage)
The output is:
please input query: 编程吧
start page: 1
end page: 5
Downloading page_1.html
Saving page_1.html
------------------------------
Downloading page_2.html
Saving page_2.html
------------------------------
Downloading page_3.html
Saving page_3.html
------------------------------
Downloading page_4.html
Saving page_4.html
------------------------------
Downloading page_5.html
Saving page_5.html
------------------------------
For a GET request, the query parameters are carried in the query string of the URL.
For a POST request, the parameters are carried in the form data (web form) of the request body.
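In urllib this distinction is controlled entirely by the `data` argument: if `data` is provided the request is a POST, otherwise it is a GET. A minimal sketch that builds both kinds of request without actually sending them (the URLs are just placeholders):

```python
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"wd": "编程吧"})

# GET: parameters are appended to the URL's query string; no request body
get_req = urllib.request.Request("http://www.baidu.com/s?" + params)

# POST: the same parameters are sent as bytes in the request body
post_req = urllib.request.Request("http://www.baidu.com/s",
                                  data=params.encode("utf-8"))

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```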
POST http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionfrom=null HTTP/1.1
Host: fanyi.youdao.com
Connection: keep-alive
Content-Length: 254
Accept: application/json, text/javascript, */*; q=0.01
Origin: http://fanyi.youdao.com
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Referer: http://fanyi.youdao.com/
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,en-CA;q=0.6
Cookie: OUTFOX_SEARCH_USER_ID=-1071824454@10.169.0.83; OUTFOX_SEARCH_USER_ID_NCOO=848207426.083082; JSESSIONID=aaaiykbb5lz2t6ro6rcgw; ___rl__test__cookies=1546662813170

# The line below is the form data (this is the important part)
i=love&from=auto&to=auto&smartresult=dict&client=fanyideskweb&salt=15466628131726&sign=63253c84e50c70b0125b869fd5e2936d&ts=1546662813172&bv=363eb5a1de8cfbadd0cd78bd6bd43bee&doctype=json&version=2.1&keyfrom=fanyi.web&action=fy_by_realtime&typoresult=false
i=love
doctype=json
version=2.1
keyfrom=fanyi.web
action=fy_by_realtime
typoresult=false
import urllib.request
import urllib.parse

# This url comes from packet capture; it is not the url shown in the browser
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionfrom=null"

# Full headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
}

# User input
key = input("please input english: ")

# Simulate the form data that Youdao Translate sends.
# This is the form data POSTed to the web server: a POST submits data to the
# server and the server returns a response based on it, whereas a GET sends
# no request body.
formdata = {
    "i": key,
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "fy_by_realtime",
    "typoresult": "false"
}

# Encode the form data
data = urllib.parse.urlencode(formdata).encode("utf-8")

# With data and headers we can build a POST request: if the data argument
# is present the request is a POST, otherwise it is a GET
request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
html = response.read()
print(html)
The output is:
please input english: hello
b'{"type":"en2zh_cn","errorcode":0,"elapsedtime":1,"translateresult":[[{"src":"hello","tgt":"\xe4\xbd\xa0\xe5\xa5\xbd"}]]}\n'
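The response is raw UTF-8 bytes, so the translation appears as escape sequences like \xe4\xbd\xa0. Decoding the bytes and parsing the JSON recovers the readable result. A sketch, using the response bytes shown above as sample input:

```python
import json

# Sample response bytes as returned by response.read() above
raw = b'{"type":"en2zh_cn","errorcode":0,"elapsedtime":1,"translateresult":[[{"src":"hello","tgt":"\xe4\xbd\xa0\xe5\xa5\xbd"}]]}\n'

# Decode the UTF-8 bytes, then parse the JSON structure
result = json.loads(raw.decode("utf-8"))

# The translation sits in a nested list of {"src": ..., "tgt": ...} dicts
print(result["translateresult"][0][0]["tgt"])  # 你好
```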