
Matching an image URL in a web page with Python's re module

June 12, 2019 | 移动技术网 IT编程


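I recently wrote a Python program that grabs the background image from the Bing home page, http://cn.bing.com/, and sets it as my desktop wallpaper. While matching the image URL with a regular expression, I ran into a match that kept failing.

The image URL to extract is shown in the figure:

[Figure: the image URL in the page source]

At first I used this pattern:

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

No matter what I tried, it never matched. I then saved the page source, opened it in Notepad++, and ran the same pattern through Notepad++'s regex search, where it matched right away, as shown:

[Figure: the pattern matching in Notepad++'s regex search]

Next I wrote a test script that stores the line containing the image URL in a string, and that string matched immediately. In the code below, data fails to match while line matches.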

# -*- coding: utf-8 -*-
import re

# Read the saved page source; bytes that cannot be decoded are dropped.
with open('bing.html', 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()

# The single line of the page source that contains the image URL.
line = r'''bnp.internal.close(0,0,60056); } });;g_img={url: "https://az12410.vo.msecnd.net/homepage/app/2016hw/binghalloween_bkgimg.jpg",id:'bgdiv',d:'200',cn'''

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

# data spans many lines, so this match fails.
if re.match(reg, data):
    m1 = reg.findall(data)
    print(m1[0])
else:
    print("data not match .")

print(20 * '-')

# line contains no newline, so the same pattern matches.
if re.match(reg, line):
    m2 = reg.findall(line)
    print(m2[0])
else:
    print("line not match .")

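So line and data must differ in some way. The difference is that data is multi-line and contains newline characters, while line is a single line with none. I then added a newline to the string line, and line stopped matching as well.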
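The experiment can be reproduced without the saved page. The following is a minimal sketch for Python 3; the string `line` is a shortened, made-up stand-in for the real page source, and `multi` is a hypothetical variant with a newline placed ahead of `g_img`.

```python
import re

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

# Shortened, made-up stand-in for the real line of page source.
line = 'close(0,0,60056); } });;g_img={url: "https://example.com/bg.jpg",id:\'bgdiv\''

# Hypothetical variant: the same text with a newline ahead of g_img.
multi = line.replace(';;', ';;\n')

single_hit = reg.match(line)
multi_hit = reg.match(multi)

print(single_hit.group(1))  # the URL captured from the one-line string
print(multi_hit)            # None: the leading '.*' stops at the newline
```

Where the newline sits matters in this sketch: it has to come before `g_img`, because `match()` starts at position 0 and the leading `.*` must reach `g_img` without crossing a line break.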

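At this point the cause is clear. By default '.' matches any character except a newline, and re.match only matches from the start of the string, so the leading '.*' can never get past the first line of data. The problem is in this statement:

re.compile('.*g_img={url: "(http.*?jpg)"')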

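Later, reading the Python documentation, I found that re.compile() takes an optional second argument, flags. Its value is built from constants defined in the re module. The constants are: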

re.DEBUG
Display debug information about compiled expression.

re.I
re.IGNORECASE
Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

re.L
re.LOCALE
Make \w, \W, \b, \B, \s and \S dependent on the current locale.

re.M
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.S
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

re.U
re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.

re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
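To make the flag descriptions concrete, here is a small sketch for Python 3 that exercises four of them. The sample string `text` and the patterns are invented for illustration and do not come from the Bing page.

```python
import re

text = "first line\nSecond LINE\n"

# re.I / re.IGNORECASE: 'line' also matches 'LINE'.
ignorecase_hits = re.findall(r'line', text, re.I)

# re.M / re.MULTILINE: '^' also matches right after each newline.
default_starts = re.findall(r'^\w+', text)
multiline_starts = re.findall(r'^\w+', text, re.M)

# re.S / re.DOTALL: '.' also matches a newline, so '.*' can span lines.
default_dot = re.match(r'first.*LINE', text)
dotall_dot = re.match(r'first.*LINE', text, re.S)

# re.X / re.VERBOSE: whitespace and '#' comments in the pattern are ignored.
verbose_hit = re.search(r'''
    (\w+)   # a word
    \s+     # whitespace between the words
    LINE    # the literal text LINE
''', text, re.X)

print(ignorecase_hits)
print(default_starts, multiline_starts)
print(default_dot, dotall_dot is not None)
print(verbose_hit.group(1))
```

Flags can be combined with the bitwise OR operator, for example `re.I | re.S`.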

What we need here is re.S, which makes '.' match every character, including newlines. Change the regular expression to

reg = re.compile('.*g_img={url: "(http.*?jpg)"', re.S)

and the problem is solved.
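The fix can be checked against a multi-line string. This is a sketch for Python 3 in which `data` is a made-up stand-in for the downloaded page. The second pattern, `scan`, is an alternative that is not part of the original program: it drops the leading `.*` and relies on `search()` to scan the whole string, so no flag is needed.

```python
import re

# Made-up, multi-line stand-in for the downloaded page source.
data = (
    '<html>\n'
    '<script>\n'
    'foo();;g_img={url: "https://example.com/bg.jpg",id:\'bgdiv\'};\n'
    '</script>\n'
    '</html>\n'
)

# The fix from the text: with re.S the leading '.*' may cross newlines.
fixed = re.compile('.*g_img={url: "(http.*?jpg)"', re.S)
m = fixed.match(data)
url_from_fix = m.group(1) if m else None

# Alternative (an assumption, not the author's code): no leading '.*';
# search() looks for the first occurrence anywhere in the string.
scan = re.compile(r'g_img=\{url: "(http.*?jpg)"')
m = scan.search(data)
url_from_search = m.group(1) if m else None

print(url_from_fix)
print(url_from_search)
```

Both approaches extract the same URL here. The `search()` variant avoids making the regex engine consume the whole document with a greedy `.*` before backtracking, which may matter on large pages.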

