strip invalid xml characters - Net205 Blog
Net205 Blog · 2009-03-17 · via 博客园 - Net205 Blog

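Today a colleague ran into XML that contained the special character "", which caused XML parsing to fail. His IE7 reported a parse error and so did my FF3, but my IE6 displayed the document normally and only showed a warning in the status bar.
So I looked for related material online and found that the W3C specification does not allow these special characters in XML.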

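For XML, we usually escape only the following characters (that is, avoid emitting them raw):
"<"      "&lt;"
">"      "&gt;"
"\""     "&quot;"
"\'"     "&apos;"
"&"      "&amp;"
In fact, these characters are allowed in node text without escaping when they are wrapped in <![CDATA[ ]]>.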
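
The five replacements listed above can be sketched in C# as follows. This is a minimal illustration, not a full serializer; the class name XmlEscapeSketch and the method name Escape are made up for this example.

```csharp
using System;

static class XmlEscapeSketch
{
    // Hypothetical helper: replaces the five characters from the table
    // above with their entities.
    // "&" is handled first; otherwise the "&" produced by the later
    // replacements (for example in "&lt;") would be escaped a second time.
    public static string Escape(string s)
    {
        return s.Replace("&", "&amp;")
                .Replace("<", "&lt;")
                .Replace(">", "&gt;")
                .Replace("\"", "&quot;")
                .Replace("'", "&apos;");
    }
}
```

For example, XmlEscapeSketch.Escape("1 < 2 & 3") returns "1 &lt; 2 &amp; 3". Escaping of this kind only deals with markup characters; it does nothing about the control characters discussed below.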

Assuming your ASP is not trying to add any non-printable characters to the XML, it usually suffices to filter and replace characters as follows:
    For any text node child of an element:
      "<"  becomes  "&lt;"
      "&"  becomes  "&amp;"

    For any attribute value:
      "<"  becomes  "&lt;"
      "&"  becomes  "&amp;"
      '"'  becomes  '&quot;' (if you are using quote(") to delimit the attribute value)
      "'"  becomes  "&apos;" (if you are using apostrophe(') to delimit the attribute value)

However, the W3C standard restricts XML to the following characters; only these can be used validly:
http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
http://www.ivoa.net/forum/apps-samp/0808/0197.htm

XML processors MUST accept any character in the range specified for Char.

Character Range
Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
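
The Char production above translates almost directly into a range check. The following is a sketch under that reading of the production; the names XmlCharSketch, IsXmlChar and IndexOfInvalid are invented for this example and are not taken from the original post.

```csharp
using System;

static class XmlCharSketch
{
    // True if the Unicode code point cp is matched by the Char production.
    public static bool IsXmlChar(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Index of the first UTF-16 position in s that holds a character
    // outside Char, or -1 if every character is allowed.
    public static int IndexOfInvalid(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // a valid pair encodes #x10000-#x10FFFF, which Char allows
            }
            else if (!IsXmlChar(s[i]))
            {
                return i; // control character, lone surrogate, #xFFFE or #xFFFF
            }
        }
        return -1;
    }
}
```

A .NET string stores UTF-16 code units, so the sketch skips over well-formed surrogate pairs and treats any remaining surrogate as invalid.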

The following hexadecimal character ranges, for example, are not allowed in XML; even wrapping them in <![CDATA[ ]]> does not save them.
\x00-\x08
\x0b-\x0c
\x0e-\x1f

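According to the Character Range definition, the three ranges above are not the only ones to exclude; some other characters cannot be used in XML either, such as #xD800-#xDFFF (the surrogate blocks named in the comment on the Char production). I do not know what these characters look like, and ordinary applications rarely produce them, so I do not exclude them for now. Add the extra handling yourself if you need it.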

A simple fix in C#:
string content = "slei20sk<O?`";
content = Regex.Replace(content, "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "*");
Response.Write(content);
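
The simple replacement above leaves #xD800-#xDFFF, #xFFFE and #xFFFF untouched. One possible extension is sketched below; it assumes the input is an ordinary .NET (UTF-16) string, and the names XmlStripSketch and Strip are hypothetical. Surrogates that form a valid pair are kept, because together they encode a character in #x10000-#x10FFFF, which Char allows.

```csharp
using System;
using System.Text.RegularExpressions;

static class XmlStripSketch
{
    // Matches, on UTF-16 code units:
    //   - the three control ranges from the simple version, plus #xFFFE and #xFFFF
    //   - a high surrogate that is not followed by a low surrogate
    //   - a low surrogate that is not preceded by a high surrogate
    static readonly Regex Invalid = new Regex(
        @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]" +
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +
        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]");

    // Replaces every match with the given replacement ("" removes it).
    public static string Strip(string content, string replacement)
    {
        return Invalid.Replace(content, replacement);
    }
}
```

Calling XmlStripSketch.Strip(content, "*") mirrors the "*" replacement used in the simple version, while passing "" removes the offending characters entirely.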


Sample code from the web (it distinguishes XML 1.0 from XML 1.1; pay particular attention to the differences between the two versions):
http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/
W3C has defined a set of illegal characters for use in XML. You can find information about them here:
XML 1.0(http://www.w3.org/TR/2006/REC-xml-20060816/#charsets) | XML 1.1(http://www.w3.org/TR/xml11/#charsets)

Here is a function to remove these characters from a specified XML file:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace XMLUtils
{
    class Standards
    {
        /// <summary>
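        /// Strips textual hex references (such as "#x7F") to characters that the
        /// XML specification restricts or discourages. The patterns match the
        /// literal text "#x..."; they do not match raw characters.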
        /// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
        /// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
        /// </summary>
        /// <param name="filePath">Full path to the File</param>
        /// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
        private void StripIllegalXMLChars(string filePath, string XMLVersion)
        {
            //Remove illegal character sequences
            string tmpContents = File.ReadAllText(filePath, Encoding.UTF8);

            string pattern = String.Empty;
            switch (XMLVersion)
            {
                case "1.0":
                    // Matches #x7F, #x80-#x84, #x86-#x9F, #xFDD0-#xFDDF and #xnFFFE/#xnFFFF (n = 1..10 hex)
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]|9[0-9A-F])";
                    break;
                case "1.1":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
                    break;
                default:
                    throw new Exception("Error: Invalid XML Version!");
            }

            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
            if (regex.IsMatch(tmpContents))
            {
                tmpContents = regex.Replace(tmpContents, String.Empty);
                File.WriteAllText(filePath, tmpContents, Encoding.UTF8);
            }
            tmpContents = string.Empty;
        }
    }
}

For completeness, here is a similar check from MSDN:
http://msdn.microsoft.com/en-us/library/k1y7hyy9(vs.71).aspx
internal void CheckUnicodeString(String value)
{
    for (int i = 0; i < value.Length; ++i)
    {
        // #xFFFE and #xFFFF are not valid characters
        if (value[i] > 0xFFFD)
        {
            throw new Exception("Invalid Unicode");
        }
        // Below #x20 only tab, line feed and carriage return are allowed
        else if (value[i] < 0x20 && value[i] != '\t' && value[i] != '\n' && value[i] != '\r')
        {
            throw new Exception("Invalid Xml Characters");
        }
    }
}

Appendix: ASCII table:
http://www.asciitable.com
http://code.cside.com/3rdpage/us/unicode/converter.html