C#: Extracting the Meaningful Content of a Web Page with Regular Expressions

December 12, 2017 | 移动技术网 IT编程

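One of the more important stages in a search engine is extracting the meaningful content from a web page. Put simply, this means stripping the HTML markup from the HTML text and keeping only the part we would see when opening the document in a browser such as IE (images are not considered here).
The markup in an HTML document falls into four groups, namely comments, script, style, and all other tags, and each group is removed in turn: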
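1. Remove comments. HTML comments are delimited by <!-- and -->, so a lazy match between the two delimiters is enough:
output = Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);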
2. Remove script (and noscript) blocks:
output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
output = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
3. Remove style blocks:
output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
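4. Remove all other HTML tags and decode the common entities:
// Turn <br> (in any case, with or without a slash) into a newline before the tags are stripped.
result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
// Strip every remaining tag; [\s\S] matches any character, including newlines.
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
// Decode entities only after stripping, so escaped markup such as &lt;b&gt; survives as text.
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
// Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
result = result.Replace("&amp;", "&");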
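The entity replacements in step 4 handle only a handful of named entities. As a possible alternative, and only a sketch, the framework's own decoder `System.Net.WebUtility.HtmlDecode` can be applied after the tags have been stripped. The class and method names below (`EntityDemo`, `StripTagsThenDecode`) are made up for illustration and are not part of the original article.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    // Hypothetical helper: strips tags first, then decodes character
    // references in a single pass with the framework decoder.
    public static string StripTagsThenDecode(string html)
    {
        // Remove anything that looks like a tag, including multi-line tags.
        string noTags = Regex.Replace(html, @"<[\s\S]*?>", string.Empty);

        // Decode named and numeric references such as &lt; or &#169;.
        // The input is decoded once, so "&amp;amp;" becomes "&amp;", not "&".
        return WebUtility.HtmlDecode(noTags);
    }
}
```

One difference from the manual replacements is worth keeping in mind: `HtmlDecode` turns `&nbsp;` into the non-breaking space character U+00A0 rather than an ordinary space, so a further replacement may be needed if plain spaces are wanted.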
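As the code above shows, I use the RegexOptions.Singleline option. It matters because it makes "." (the dot) match newline characters as well. Script and style blocks usually span several lines, so without this option the patterns above fail to match on most real pages and the markup is left in place.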
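The effect of the option is easy to check on a script block that contains line breaks. The sketch below reuses the script pattern from step 2; the `SinglelineDemo` class and its `singleline` flag are hypothetical and exist only to toggle the option for comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern as in step 2 of the article.
    private const string ScriptPattern = @"<script[^>]*?>.*?</script>";

    // Removes script blocks, with or without RegexOptions.Singleline.
    public static string RemoveScript(string html, bool singleline)
    {
        RegexOptions options = RegexOptions.IgnoreCase;
        if (singleline)
        {
            // With Singleline, "." also matches '\n', so ".*?" can cross lines.
            options |= RegexOptions.Singleline;
        }
        return Regex.Replace(html, ScriptPattern, string.Empty, options);
    }
}
```

For the input `"a<script>\nvar x = 1;\n</script>b"`, calling `RemoveScript` with `singleline` set to `true` yields `"ab"`, while passing `false` returns the input unchanged, because `.*?` cannot get past the first newline.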
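HTML has grown quite complex by now, and the list above covers only the most important kinds of markup. I will add more tag-removal patterns as development of ROST WebSpider continues.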
Below is a C# class that extracts the meaningful content from an HTML string:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Extracts the visible text from an HTML string by removing
// comments, scripts, styles, and all remaining tags.
class HtmlExtract
{
    #region private attributes
    private readonly string _strHtml;
    #endregion

    #region public methods
    public HtmlExtract(string inStrHtml)
    {
        // Treat null as empty input so the Regex calls below do not throw.
        _strHtml = inStrHtml ?? string.Empty;
    }

    public string ExtractText()
    {
        string result = _strHtml;
        result = RemoveComment(result);
        result = RemoveScript(result);
        result = RemoveStyle(result);
        result = RemoveTags(result);
        return result.Trim();
    }
    #endregion

    #region private methods
    private string RemoveComment(string input)
    {
        // Remove HTML comments; Singleline lets "." cross line breaks.
        return Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    private string RemoveStyle(string input)
    {
        // Remove all style blocks together with their content.
        return Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    private string RemoveScript(string input)
    {
        // Remove script and noscript blocks together with their content.
        string result = input;
        result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return result;
    }

    private string RemoveTags(string input)
    {
        string result = input;
        // Turn <br> (any case, with or without a slash) into a newline first.
        result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
        // Strip every remaining tag; [\s\S] matches any character, including newlines.
        result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
        // Decode entities only after stripping, so escaped markup such as &lt;b&gt; survives as text.
        result = result.Replace("&nbsp;", " ");
        result = result.Replace("&quot;", "\"");
        result = result.Replace("&lt;", "<");
        result = result.Replace("&gt;", ">");
        // Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
        result = result.Replace("&amp;", "&");
        return result;
    }
    #endregion
}
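To see the whole pipeline on a small page, the sketch below inlines the four steps in one static helper so that it can be compiled on its own. The name `HtmlTextDemo` and the sample markup in the note that follows are invented for illustration; the tag-stripping pattern `<[^>]*>` is a simpler stand-in chosen here, not the author's pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTextDemo
{
    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Hypothetical one-shot helper chaining the four steps from the article.
    public static string ExtractText(string html)
    {
        // 1. Comments.
        string s = Regex.Replace(html, @"<!--.*?-->", string.Empty, Opts);
        // 2. Script blocks and their content.
        s = Regex.Replace(s, @"<script[^>]*?>.*?</script>", string.Empty, Opts);
        // 3. Style blocks and their content.
        s = Regex.Replace(s, @"<style[^>]*?>.*?</style>", string.Empty, Opts);
        // 4. Line breaks become newlines, then all remaining tags are dropped.
        s = Regex.Replace(s, @"<br\s*/?>", "\n", Opts);
        s = Regex.Replace(s, @"<[^>]*>", string.Empty);
        // Decode a few common entities, with &amp; handled last.
        s = s.Replace("&nbsp;", " ").Replace("&lt;", "<")
             .Replace("&gt;", ">").Replace("&amp;", "&");
        return s.Trim();
    }
}
```

Given a page such as `<html><head><style>p { color: red; }</style></head><body><!-- nav --><p>Hello<br>world &amp; all</p><script>` followed by a multi-line script body and the closing tags, `ExtractText` returns the two lines `Hello` and `world & all`.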
