
Parsing HTML with BeautifulSoup in Python

March 9, 2020 | 移动技术网, IT Programming


Summary

Beautiful Soup is a Python library for extracting data from HTML and XML files. It parses HTML or XML data into Python objects so that the data can be processed conveniently from Python code.

Environment used in this article

CentOS 7.5
Python 2.7
beautifulsoup4

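Using Beautiful Soup

The basic job of Beautiful Soup is to search for and edit HTML tags.

Basic concepts: object types

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. These objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.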

Object type	Description
BeautifulSoup	The entire content of the document
Tag	An HTML tag
NavigableString	The text contained in a tag
Comment	A special kind of NavigableString, used when the text inside a tag is an HTML comment
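
The four object types can be observed directly on a tiny document. This is a minimal sketch, assuming Python 3 and the built-in "html.parser" backend (chosen here only so that no extra parser has to be installed); the one-line markup is a made-up example.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical one-line document: a tag holding plain text and an HTML comment.
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

tag = soup.p              # Tag
text = tag.contents[0]    # NavigableString
note = tag.contents[1]    # Comment, a subclass of NavigableString

kinds = [type(obj).__name__ for obj in (soup, tag, text, note)]
print(kinds)
print(isinstance(note, NavigableString))
```

Because Comment subclasses NavigableString, code that collects text from a tag may want to check for comments explicitly.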

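Installation and import

# Beautiful Soup
pip install beautifulsoup4

# Parsers
pip install lxml
pip install html5lib

# Initialization
from bs4 import BeautifulSoup

# Option 1: pass an open file handle (put the path of your HTML file in the quotes)
soup = BeautifulSoup(open(""), 'lxml')

# Option 2: pass the markup as a string
resp = "<html>data</html>"
soup = BeautifulSoup(resp, 'lxml')

# soup is an object of type BeautifulSoup
print(type(soup))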
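
Parsing from a file can be sketched as follows. The example assumes Python 3 and writes its own temporary file so that it is self-contained; the file content and the use of "html.parser" are illustrative assumptions, not part of the original setup.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small HTML file to parse (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write("<html><head><title>demo</title></head><body><p>data</p></body></html>")
    path = f.name

# Pass the open file handle to BeautifulSoup; the with-block closes it afterwards.
with open(path, encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "html.parser")

os.remove(path)

title_text = soup.title.string
print(title_text)
```

Opening the file in a with-block avoids leaving the handle open, which a bare open() call inside the constructor would do.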

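Searching and filtering tags

Basic methods

There are two basic search methods, find_all() and find(). find_all() returns a list of every tag that matches the filter, while find() returns only the first match (or None when nothing matches).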

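import re

soup = BeautifulSoup(resp, 'lxml')

# Return the first tag named "a"
soup.find("a")

# Return a list of all "a" tags
soup.find_all("a")

## find_all() can be abbreviated by calling the object directly
soup("a")

# Find all tags whose name starts with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Find all tags whose name is in the list
soup.find_all(["a", "p"])

# Find tags named "p" whose class attribute is "title"
soup.find_all("p", "title")

# Find tags whose id attribute is "link2"
soup.find_all(id="link2")

# Find tags that have an id attribute, whatever its value
soup.find_all(id=True)

# Filter on several attributes at once
soup.find_all(href=re.compile("elsie"), id='link1')

# Attributes whose names are not valid keyword arguments (such as data-*) go in attrs
soup.find_all(attrs={"data-foo": "value"})

# Find text that contains "sisters"
soup.find(string=re.compile("sisters"))

# Limit the number of results
soup.find_all("a", limit=2)

# Use a custom matching function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

# Apply a custom matching function to a single attribute
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

# find_all() on a tag searches all of its descendants. To search only the
# tag's direct children, pass recursive=False

soup.find_all("title", recursive=False)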
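
To make the return values concrete, here is a small sketch that runs several of these filters against a shortened, made-up version of the "three sisters" markup. It assumes Python 3 and the built-in "html.parser" backend.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample markup (hypothetical, no whitespace between tags).
html_doc = (
    '<p class="title"><b>story</b></p>'
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_id = soup.find("a")["id"]                          # find() returns one Tag
all_ids = [a["id"] for a in soup.find_all("a")]          # find_all() returns a list
b_names = [t.name for t in soup.find_all(re.compile("^b"))]
titled = soup.find_all("p", "title")                     # tag name plus CSS class
kept = [a["id"] for a in soup.find_all(href=lambda h: h and "lacie" not in h)]
limited = soup.find_all("a", limit=2)
missing = soup.find("table")                             # None when nothing matches

print(first_id, all_ids, b_names, len(titled), kept, len(limited), missing)
```

Since find() returns None on a miss, chaining attribute access directly onto it raises an error for absent tags; checking the result first is the safer pattern.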

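Extended methods

find_parents()	All ancestor nodes
find_parent()	The first (nearest) ancestor node
find_next_siblings()	All following sibling nodes
find_next_sibling()	The first following sibling node
find_previous_siblings()	All preceding sibling nodes
find_previous_sibling()	The first preceding sibling node
find_all_next()	All following elements
find_next()	The first following element
find_all_previous()	All preceding elements
find_previous()	The first preceding element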
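
A few of the methods in the table above can be tried on a small document. This is a sketch under the same assumptions as before (Python 3, "html.parser", made-up markup); each method accepts the same filters as find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links inside one paragraph, no whitespace between them.
html_doc = (
    '<body><p class="story">'
    '<a class="sister" id="link1">elsie</a>'
    '<a class="sister" id="link2">lacie</a>'
    '<a class="sister" id="link3">tillie</a>'
    '</p></body>'
)
soup = BeautifulSoup(html_doc, "html.parser")
link2 = soup.find(id="link2")

parent_class = link2.find_parent("p")["class"]                  # nearest matching ancestor
ancestors = [t.name for t in link2.find_parents(["p", "body"])]  # nearest first
next_ids = [t["id"] for t in link2.find_next_siblings("a")]
prev_id = link2.find_previous_sibling("a")["id"]
next_a = link2.find_next("a")["id"]                              # next in document order
earlier = [t["id"] for t in link2.find_all_previous("a")]

print(parent_class, ancestors, next_ids, prev_id, next_a, earlier)
```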

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

html_doc = """
<html>
<head>
 <title>the dormouse's story</title>
</head>
<body>
 <p class="title"><b>the dormouse's story</b></p>

 <p class="story">
  once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">lacie</a>
  and
  <a href="http://example.com/tillie" class="sister" id="link3">tillie</a>;
  and they lived at the bottom of a well.
 </p>

 <p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# All "a" tags
soup.select("a")

# Descend level by level
soup.select("body a")
soup.select("html head title")

# Direct children of a tag
soup.select("head > title")
soup.select("p > #link1")

# All following siblings of the matched tag
soup.select("#link1 ~ .sister")

# The first sibling immediately following the matched tag
soup.select("#link1 + .sister")

# Select by class name
soup.select(".sister")
soup.select("[class~=sister]")

# Select by id
soup.select("#link1")
soup.select("a#link1")

# Select by several ids
soup.select("#link1,#link2")

# Select by the presence of an attribute
soup.select('a[href]')

# Select by attribute value (prefix, suffix, substring)
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

# Return at most one match (still a list)
soup.select(".sister", limit=1)

# Return only the first match (a single tag, or None)
soup.select_one(".sister")
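
The following sketch shows what a handful of these selectors return on a shortened, made-up version of the sample document. It assumes Python 3, the built-in "html.parser" backend, and a Beautiful Soup release recent enough to ship CSS selector support through the soupsieve package.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (hypothetical).
html_doc = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

def ids(tags):
    """Helper for this example only: collect the id attribute of each tag."""
    return [t["id"] for t in tags]

children = ids(soup.select("p > a"))              # direct children
later = ids(soup.select("#link1 ~ .sister"))      # all following siblings
adjacent = ids(soup.select("#link1 + .sister"))   # immediately following sibling
suffix = ids(soup.select('a[href$="tillie"]'))    # attribute value ends with
first = soup.select_one(".sister")["id"]          # single Tag
nothing = soup.select_one("table")                # None when nothing matches

print(children, later, adjacent, suffix, first, nothing)
```

select() always returns a list, even with limit=1, whereas select_one() returns a single tag or None.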

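Tag object methods

Tag attributes

soup = BeautifulSoup('<p class="body strikeout" id="1">extremely bold</p><p class="body strikeout" id="2">extremely bold2</p>', 'lxml')
# Get all p tag objects
tags = soup.find_all("p")
# Get the first p tag object
tag = soup.p
# Type of the tag object
type(tag)
# Tag name
tag.name
# Tag attributes, as a dictionary
tag.attrs
# Value of the tag's class attribute
tag['class']
# Text contained in the tag, as a NavigableString
tag.string

# Iterate over all strings inside the tag
for string in tag.strings:
    print(repr(string))

# Iterate over all strings inside the tag, stripped of surrounding whitespace and skipping blank ones
for string in tag.stripped_strings:
    print(repr(string))

# Get all text in the tag and its descendants, joined into one Unicode string
tag.get_text()
## Join the pieces with "|"
tag.get_text("|")
## Join the pieces with "|", stripping whitespace and dropping empty strings
tag.get_text("|", strip=True)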
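
A short sketch of how these accessors behave when a tag mixes text and a child tag. It assumes Python 3 and "html.parser"; the markup is a made-up variation of the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a paragraph that contains both text and a nested <b>.
soup = BeautifulSoup(
    '<p class="body strikeout" id="1">extremely <b>bold</b></p>', "html.parser"
)
tag = soup.p

classes = tag["class"]                    # class is multi-valued, so this is a list
tag_id = tag["id"]                        # ordinary attributes are plain strings
title = tag.get("title")                  # .get() returns None for a missing attribute
only_string = tag.string                  # None: the tag has more than one child
full_text = tag.get_text()
joined = tag.get_text("|", strip=True)
pieces = list(tag.stripped_strings)

print(classes, tag_id, title, only_string, full_text, joined, pieces)
```

Note that .string is only useful when a tag has a single child; for mixed content, get_text() or .strings is the more reliable choice.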

Getting child nodes

tag.contents # list of the tag's direct children
tag.children # iterator over the tag's direct children
for child in tag.children:
    print(child)

tag.descendants # recursively yields all descendant nodes
for child in tag.descendants:
    print(child)
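
The difference between direct children and descendants can be seen on a small nested document. This is a sketch assuming Python 3 and "html.parser"; the markup is invented for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical nested markup.
soup = BeautifulSoup(
    "<div><p>one <b>two</b></p><span>three</span></div>", "html.parser"
)
div = soup.div

# .children / .contents only look one level down.
direct = [child.name for child in div.children]

# .descendants walks the whole subtree in document order, strings included.
all_nodes = [
    node.name if isinstance(node, Tag) else str(node)
    for node in div.descendants
]

print(direct)
print(all_nodes)
```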

Getting parent nodes

tag.parent # the tag's direct parent
tag.parents # iterates over all ancestors of the element, nearest first

for parent in tag.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Getting sibling nodes

# The next sibling element
tag.next_sibling

# All siblings after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
    print(repr(sibling))

# The previous sibling element
tag.previous_sibling

# All siblings before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
    print(repr(sibling))
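
One detail worth illustrating: whitespace between tags is itself a text node, so .next_sibling is often a string rather than the next tag. The sketch below assumes Python 3 and "html.parser", with made-up markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links separated by a newline.
soup = BeautifulSoup('<p><a id="x">1</a>\n<a id="y">2</a></p>', "html.parser")
first = soup.find(id="x")

raw_next = first.next_sibling                       # the "\n" text node
next_tag_id = first.find_next_sibling("a")["id"]    # skips over the text node
following = [repr(s) for s in first.next_siblings]

print(repr(raw_next), next_tag_id, following)
```

When only tags are of interest, find_next_sibling() and find_previous_sibling() are usually more convenient than the raw attributes.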

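Traversing elements

Beautiful Soup treats every tag and string as an "element". Elements are ordered from top to bottom as they appear in the HTML, and the traversal attributes below step through them one at a time.

# The element right after the current tag
tag.next_element

# All elements after the current tag
for element in tag.next_elements:
    print(repr(element))

# The element right before the current tag
tag.previous_element
# All elements before the current tag
for element in tag.previous_elements:
    print(repr(element))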
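
The difference between sibling navigation and element traversal can be sketched as follows, assuming Python 3 and "html.parser" with invented markup. The next element of a tag is whatever was parsed immediately after its opening tag, which is usually its own first child.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: two inline tags inside a paragraph.
soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
b = soup.b

sibling = b.next_sibling.name        # the next node at the same level: <i>
element = str(b.next_element)        # the next parsed node: the string inside <b>

order = [
    e.name if isinstance(e, Tag) else str(e)
    for e in b.next_elements
]

print(sibling, element, order)
```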

Modifying tag attributes

soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'lxml')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1

tag.string = "new link text."
print(tag)
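
The result of these assignments can be checked by serializing the tag. This is a minimal sketch assuming Python 3 and "html.parser"; the del statement is an addition for illustration and is not part of the listing above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"          # rename the tag
tag["class"] = "verybold"        # overwrite an existing attribute
tag["id"] = "1"                  # add a new attribute
tag.string = "new link text."    # replace the tag's contents

after_edit = str(tag)

del tag["class"]                 # attributes can be removed like dictionary keys
after_delete = str(tag)

print(after_edit)
print(after_delete)
```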

Modifying tag content (NavigableString)

soup = BeautifulSoup('<b class="boldest">extremely bold</b>', 'lxml')
tag = soup.b
tag.string = "new link text."

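Adding tag content (NavigableString)

from bs4 import NavigableString

soup = BeautifulSoup("<a>foo</a>", 'lxml')
tag = soup.a
tag.append("bar")
tag.contents

# Alternatively, build the NavigableString explicitly

new_string = NavigableString("bar")
tag.append(new_string)
print(tag)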

Adding a comment (Comment)

A comment is a special kind of NavigableString object, so it can also be added with the append() method.

from bs4 import Comment
soup = BeautifulSoup("<a>foo</a>", 'lxml')
tag = soup.a
new_comment = soup.new_string("nice to see you.", Comment)
tag.append(new_comment)
print(tag)

Adding a tag (Tag)

There are two ways to add a tag: inside a given tag (the append method), or at a specific position (the insert, insert_before, and insert_after methods).

The append method

soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "link text."
tag.append(new_tag)
print(tag)

The insert method inserts an object (a Tag or a NavigableString) at the given position in the current tag's list of children.

html = '<b><a href="http://example.com/">i linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.a
tag.contents
tag.insert(1, "but did not endorse ")
tag.contents

The insert_before() and insert_after() methods add an element as a sibling immediately before or after the current tag.

html = '<b><a href="http://example.com/">i linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.new_tag("i")
tag.string = "don't"
soup.b.insert_before(tag)
soup.b
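
The effect of insert() and insert_after() on the tree can be checked by printing the result. This sketch assumes Python 3 and "html.parser"; the em tag and its text are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<b><a href="http://example.com/">i linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# insert() places the new node at an index in the tag's list of children.
a.insert(1, "but did not endorse ")
children = [str(c) for c in a.contents]

# insert_after() places the new node next to an existing one, as its sibling.
note = soup.new_tag("em")
note.string = "note"
a.i.insert_after(note)

result = str(soup.b)
print(children)
print(result)
```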

wrap() wraps the given element in a tag and returns the new wrapper; unwrap() does the opposite and replaces a tag with its contents.

```python
# Wrap
soup = BeautifulSoup("<p>i wish i was bold.</p>", 'lxml')
soup.p.string.wrap(soup.new_tag("b"))
# Output: <b>i wish i was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# Output: <div><p><b>i wish i was bold.</b></p></div>

# Unwrap
markup = '<a href="http://example.com/">i linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# Output: <a href="http://example.com/">i linked to example.com</a>
```

Removing tags

html = '<b><a href="http://example.com/">i linked to <i>example.com</i></a></b>'

# Remove all children of the current tag
soup = BeautifulSoup(html, 'lxml')
soup.b.clear()

# Remove the current tag and all its children from the soup, and return the removed tag
soup = BeautifulSoup(html, 'lxml')
b_tag = soup.b.extract()
b_tag
soup

# Remove the current tag and all its children from the soup and destroy them; nothing is returned
soup = BeautifulSoup(html, 'lxml')
soup.b.decompose()

# Replace the current tag with the given element
soup = BeautifulSoup(html, 'lxml')
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "don't"
tag.replace_with(new_tag)
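
The practical difference between extract() and decompose() is whether the removed node remains usable. A minimal sketch, assuming Python 3 and "html.parser":

```python
from bs4 import BeautifulSoup

html = '<b><a href="http://example.com/">i linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag and hands it back, so it can be reused elsewhere.
i_tag = soup.i.extract()
extracted = str(i_tag)
after_extract = str(soup)

# decompose() removes the tag and discards it; there is no return value to keep.
soup.a.decompose()
after_decompose = str(soup)

print(extracted, i_tag.parent)
print(after_extract)
print(after_decompose)
```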

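Other methods

Output

# Pretty-printed output
tag.prettify()
tag.prettify("latin-1")

After parsing, Beautiful Soup holds the whole document as Unicode, and HTML entities are converted to the corresponding Unicode characters. When the document is converted back to a string, those characters are encoded as UTF-8, so the original HTML entities are not restored.
When converting to Unicode, Beautiful Soup can also turn "smart quotes" into HTML or XML entities.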

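Document encoding

After parsing, the document is held as Unicode. Beautiful Soup uses its automatic encoding detection sub-library to identify the encoding of the document and convert it to Unicode.

# html should be a byte string here; original_encoding is None when the markup is already a str
soup = BeautifulSoup(html, 'lxml')
soup.original_encoding

# The document encoding can also be specified manually
soup = BeautifulSoup(html, 'lxml', from_encoding="iso-8859-8")
soup.original_encoding

# To speed up automatic detection, some encodings can be excluded in advance
soup = BeautifulSoup(markup, 'lxml', exclude_encodings=["iso-8859-7"])

When Beautiful Soup outputs a document, the default output encoding is UTF-8, regardless of the encoding of the input document.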
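
The round trip from encoded bytes to Unicode and back can be sketched like this. The example assumes Python 3 and "html.parser", and the Latin-1 input is a made-up byte string rather than data from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in Latin-1, where the accented "e" is the single byte 0xE9.
markup = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the bytes.
soup = BeautifulSoup(markup, "html.parser", from_encoding="latin-1")

text = soup.p.string                    # decoded to a Unicode string
utf8_bytes = soup.encode()              # output defaults to UTF-8
latin1_bytes = soup.encode("latin-1")   # another output encoding can be requested

print(soup.original_encoding)
print(repr(text), utf8_bytes, latin1_bytes)
```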

Document parsers

Beautiful Soup currently supports the "lxml", "html5lib", and "html.parser" parsers. The same markup can produce different trees depending on the parser, especially when the markup is invalid.

soup = BeautifulSoup("<a><b /></a>", "lxml")
soup
# Output: <html><body><a><b></b></a></body></html>
soup = BeautifulSoup("<a></p>", "lxml")
soup
# Output: <html><body><a></a></body></html>
soup = BeautifulSoup("<a></p>", "html5lib")
soup
# Output: <html><head></head><body><a><p></p></a></body></html>
soup = BeautifulSoup("<a></p>", "html.parser")
soup
# Output: <a></a>

Reference documentation
https://www.crummy.com/software/beautifulsoup/bs4/doc.zh
