Converting HTML Code to Text
小牛哥 · 2004-09-25 · via 博客园 - 小牛哥

When scraping HTML pages, you need to filter out the HTML markup and keep only the text from the page source. A regular expression can handle this:
VB.NET

    ''' -----------------------------------------------------------------------------
    ''' <summary>
    ''' Removes all HTML tags.
    ''' </summary>
    ''' <param name="HTML">The HTML source.</param>
    ''' <returns>The input text with all HTML tags removed.</returns>
    ''' <history>
    '''     [Administrator]    2004-9-25    Created
    ''' </history>
    ''' -----------------------------------------------------------------------------
    Public Function ParseTags(ByVal HTML As String) As String
        ' Match every HTML tag with a regular expression and replace it with an
        ' empty string, returning the text with the tags filtered out.
        ' Regex.Replace is a Shared method, so no Regex instance is needed.
        Return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "")
    End Function

C#

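        /// <summary>
        /// Removes all HTML tags.
        /// </summary>
        /// <param name="HTML">The HTML source.</param>
        /// <returns>The input text with all HTML tags removed.</returns>
        public string ParseTags(string HTML)
        {
            // Match every HTML tag and replace it with an empty string.
            return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "");
        }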

A simple example follows:
VB.NET

    Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        ' Build a sample HTML page in memory.
        Dim oStringBuilder As System.Text.StringBuilder

        oStringBuilder = New System.Text.StringBuilder
        oStringBuilder.Append(ControlChars.CrLf + "<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Transitional//EN"">")
        oStringBuilder.Append(ControlChars.CrLf + "<HTML>")
        oStringBuilder.Append(ControlChars.CrLf + "    <HEAD>")
        oStringBuilder.Append(ControlChars.CrLf + "        <title>WebForm1</title>")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""GENERATOR"" content=""Microsoft Visual Studio .NET 7.1"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""CODE_LANGUAGE"" content=""Visual Basic .NET 7.1"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""vs_defaultClientScript"" content=""JavaScript"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""vs_targetSchema"" content=""http://schemas.microsoft.com/intellisense/ie5"">")
        oStringBuilder.Append(ControlChars.CrLf + "    </HEAD>")
        oStringBuilder.Append(ControlChars.CrLf + "    <body MS_POSITIONING=""GridLayout"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <form id=""Form1"" method=""post"" runat=""server"">")
        oStringBuilder.Append(ControlChars.CrLf + "            <FONT face=""宋体"">测试</FONT>")
        oStringBuilder.Append(ControlChars.CrLf + "        </form>")
        oStringBuilder.Append(ControlChars.CrLf + "    </body>")
        oStringBuilder.Append(ControlChars.CrLf + "</HTML>")

        ' Strip the tags and write the remaining text to the response.
        Response.Write(ParseTags(oStringBuilder.ToString))
    End Sub

C#

        private void Page_Load(object sender, System.EventArgs e)
        {
            // Same line break as VB's ControlChars.CrLf, without referencing Microsoft.VisualBasic.
            const string CrLf = "\r\n";

            // Build a sample HTML page in memory.
            System.Text.StringBuilder oStringBuilder;
            oStringBuilder = new System.Text.StringBuilder();
            oStringBuilder.Append(CrLf + "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">");
            oStringBuilder.Append(CrLf + "<HTML>");
            oStringBuilder.Append(CrLf + "  <HEAD>");
            oStringBuilder.Append(CrLf + "    <title>WebForm1</title>");
            oStringBuilder.Append(CrLf + "    <meta name=\"GENERATOR\" content=\"Microsoft Visual Studio .NET 7.1\">");
            oStringBuilder.Append(CrLf + "    <meta name=\"CODE_LANGUAGE\" content=\"Visual Basic .NET 7.1\">");
            oStringBuilder.Append(CrLf + "    <meta name=\"vs_defaultClientScript\" content=\"JavaScript\">");
            oStringBuilder.Append(CrLf + "    <meta name=\"vs_targetSchema\" content=\"http://schemas.microsoft.com/intellisense/ie5\">");
            oStringBuilder.Append(CrLf + "  </HEAD>");
            oStringBuilder.Append(CrLf + "  <body MS_POSITIONING=\"GridLayout\">");
            oStringBuilder.Append(CrLf + "    <form id=\"Form1\" method=\"post\" runat=\"server\">");
            oStringBuilder.Append(CrLf + "      <FONT face=\"宋体\">测试</FONT>");
            oStringBuilder.Append(CrLf + "    </form>");
            oStringBuilder.Append(CrLf + "  </body>");
            oStringBuilder.Append(CrLf + "</HTML>");

            // Strip the tags and write the remaining text to the response.
            Response.Write(ParseTags(oStringBuilder.ToString()));
        }

The output is:

WebForm1 测试
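
The `<[^>]*>` pattern in ParseTags removes only the tags themselves. The contents of script and style elements, character entities such as `&amp;`, and the whitespace between tags all remain in the result. The following is a minimal sketch of one way to handle those cases, not part of the original post. The names `HtmlText` and `ToPlainText` are hypothetical, and the sketch assumes a .NET version that provides `System.Net.WebUtility` (4.0 or later). Unlike ParseTags, it replaces each tag with a space so that neighboring words do not run together. Regular expressions are not a full HTML parser, so malformed markup may still slip through.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not from the original post.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // Remove <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", "", RegexOptions.Singleline);

        // Same pattern as ParseTags, but each tag becomes a space
        // (ParseTags replaces it with an empty string).
        text = Regex.Replace(text, @"<[^>]*>", " ");

        // Turn entities such as &amp; back into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace (including line breaks) into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        string html = "<html><head><title>WebForm1</title><style>b{color:red}</style></head>"
                    + "<body><p>Tom &amp; Jerry</p><script>var x = 1 < 2;</script></body></html>";

        // Prints: WebForm1 Tom & Jerry
        Console.WriteLine(HtmlText.ToPlainText(html));
    }
}
```

The script and style pass runs first on purpose: once the tags are gone there is no way to tell script text from page text.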