Converting HTML Code to Text
小牛哥 · 2004-09-25 · via 博客园 - 小牛哥

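When scraping an HTML page, you need to filter out the HTML markup and keep only the text contained in the HTML source. A regular expression can solve this problem: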
VB.NET

    ''' -----------------------------------------------------------------------------
    ''' <summary>
    ''' Removes all HTML tags.
    ''' </summary>
    ''' <param name="HTML">The HTML source.</param>
    ''' <returns>The text with all HTML tags removed.</returns>
    ''' <remarks>
    ''' </remarks>
    ''' <history>
    '''     [Administrator]    2004-9-25    Created
    ''' </history>
    ''' -----------------------------------------------------------------------------
    Public Function ParseTags(ByVal HTML As String) As String
        ' Use a regular expression to match and remove every HTML tag,
        ' and return the text that is left.
        Return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "")
    End Function

C#

        /// <summary>
        /// Removes all HTML tags.
        /// </summary>
        /// <param name="HTML">The HTML source.</param>
        /// <returns>The text with all HTML tags removed.</returns>
        public string ParseTags(string HTML)
        {
            return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "");
        }

A simple example follows:
VB.NET

    Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim oStringBuilder As System.Text.StringBuilder

        oStringBuilder = New System.Text.StringBuilder
        oStringBuilder.Append(ControlChars.CrLf + "<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Transitional//EN"">")
        oStringBuilder.Append(ControlChars.CrLf + "<HTML>")
        oStringBuilder.Append(ControlChars.CrLf + "    <HEAD>")
        oStringBuilder.Append(ControlChars.CrLf + "        <title>WebForm1</title>")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""GENERATOR"" content=""Microsoft Visual Studio .NET 7.1"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""CODE_LANGUAGE"" content=""Visual Basic .NET 7.1"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""vs_defaultClientScript"" content=""JavaScript"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""vs_targetSchema"" content=""http://schemas.microsoft.com/intellisense/ie5"">")
        oStringBuilder.Append(ControlChars.CrLf + "    </HEAD>")
        oStringBuilder.Append(ControlChars.CrLf + "    <body MS_POSITIONING=""GridLayout"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <form id=""Form1"" method=""post"" runat=""server"">")
        oStringBuilder.Append(ControlChars.CrLf + "            <FONT face=""宋体"">测试</FONT>")
        oStringBuilder.Append(ControlChars.CrLf + "        </form>")
        oStringBuilder.Append(ControlChars.CrLf + "    </body>")
        oStringBuilder.Append(ControlChars.CrLf + "</HTML>")
        ' Strip the tags and write the remaining text to the response.
        Response.Write(ParseTags(oStringBuilder.ToString))
    End Sub

C#

        private void Page_Load(object sender, System.EventArgs e)
        {
            // ControlChars.CrLf needs a reference to the Microsoft.VisualBasic assembly; it is equivalent to "\r\n".
            System.Text.StringBuilder oStringBuilder;
            oStringBuilder = new System.Text.StringBuilder();
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "<HTML>");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "  <HEAD>");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "    <title>WebForm1</title>");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "    <meta name=\"GENERATOR\" content=\"Microsoft Visual Studio .NET 7.1\">");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "    <meta name=\"CODE_LANGUAGE\" content=\"Visual Basic .NET 7.1\">");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "    <meta name=\"vs_defaultClientScript\" content=\"JavaScript\">");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "    <meta name=\"vs_targetSchema\" content=\"http://schemas.microsoft.com/intellisense/ie5\">");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "  </HEAD>");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "  <body MS_POSITIONING=\"GridLayout\">");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "    <form id=\"Form1\" method=\"post\" runat=\"server\">");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "      <FONT face=\"宋体\">测试</FONT>");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "    </form>");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "  </body>");
            oStringBuilder.Append(Microsoft.VisualBasic.ControlChars.CrLf + "</HTML>");
            // Strip the tags and write the remaining text to the response.
            Response.Write(ParseTags(oStringBuilder.ToString()));
        }

The output is:

WebForm1 测试
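
The string returned by ParseTags still contains the line breaks and indentation that sat between the tags. The output reads as a single line presumably because the browser collapses that whitespace when it renders the response. If the text is used outside a browser, the whitespace may need to be collapsed explicitly. The following is only a minimal sketch of one way to do that, not part of the original code: the names `HtmlText` and `ToPlainText` are hypothetical, and it assumes that collapsing every whitespace run to a single space is acceptable.

```csharp
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Same tag-stripping pattern used by ParseTags in the post.
    private static readonly Regex Tags = new Regex("<[^>]*>");

    // Any run of whitespace, including CR/LF and indentation.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Hypothetical helper: strips tags, then collapses each whitespace run
    // to a single space and trims the ends.
    public static string ToPlainText(string html)
    {
        string withoutTags = Tags.Replace(html, "");
        return Whitespace.Replace(withoutTags, " ").Trim();
    }
}
```

Called on the page built in Page_Load, `HtmlText.ToPlainText(oStringBuilder.ToString())` would return `WebForm1 测试` with no surrounding line breaks.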