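Loading an XML file larger than 100 MB (admittedly not an everyday case) makes XmlDocument's load-everything-into-memory model unfriendly: it is slow and memory-hungry.
Switching to XmlReader feels like trading a bicycle for a supercar, but I hit a few pitfalls along the way, so I am recording them here.
First the source code, covering reading and writing in both DOM and SAX-style modes.
DOM mode: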
```csharp
// Required namespaces
using System;
using System.Xml;

/// <summary>
/// Create an XML file in DOM mode.
/// </summary>
/// <param name="path">Output file path.</param>
public void CreateXml_Dom(string path)
{
    XmlDocument xmlDocw = new XmlDocument();
    // XML declaration
    var xmldecl = xmlDocw.CreateXmlDeclaration("1.0", "utf-8", null);
    var root = xmlDocw.CreateElement("root");
    root.SetAttribute("Name", "李四");
    var test = xmlDocw.CreateElement("test");
    root.AppendChild(test);

    xmlDocw.AppendChild(xmldecl);
    xmlDocw.AppendChild(root);
    xmlDocw.Save(path);

    // A node can also be built from data read by an XmlReader (rdr):
    //var node = xmlDocw.ReadNode(rdr);
    //root.AppendChild(node);
    // Or read the outer XML and assign it as InnerXml:
    //string str = rdr.ReadOuterXml();
    //root.InnerXml = str;
}

/// <summary>
/// Read an XML file in DOM mode.
/// </summary>
/// <param name="path">Input file path.</param>
public void ReadXml_Dom(string path)
{
    XmlDocument xmlDocr = new XmlDocument();
    xmlDocr.Load(path);
    var root = xmlDocr.DocumentElement;
    string str = root.GetAttribute("Name");
    Console.WriteLine(str);
}
```
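SAX (Simple API for XML) mode, i.e. streaming. Strictly speaking, XmlReader and XmlWriter are a forward-only pull API rather than SAX's push callbacks, but the streaming idea is the same. The mistakes I ran into are marked in the comments: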
```csharp
// Required namespaces
using System;
using System.IO;
using System.Text;
using System.Xml;

/// <summary>
/// Create an XML file with XmlWriter.
/// </summary>
/// <param name="path">Output file path.</param>
public void CreateXml_Sax(string path)
{
    // A FileStream works fine:
    //FileStream stream = new FileStream(path, FileMode.Create);
    // A StringBuilder always ends up with encoding="utf-16" in the declaration:
    //StringBuilder stream = new StringBuilder();
    string xmlstr;
    using (MemoryStream stream = new MemoryStream())
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        // Encoding.UTF8 fails later: it emits a byte order mark (BOM)
        settings.Encoding = new UTF8Encoding(false);
        // Alternative: new XmlTextWriter(stream, new UTF8Encoding(false))
        using (XmlWriter xw = XmlWriter.Create(stream, settings))
        {
            // Write the XML declaration
            xw.WriteStartDocument();

            xw.WriteStartElement("root");
            xw.WriteAttributeString("Name", "张三");
            // Data read by an XmlReader (rdr) can be written directly:
            //xw.WriteNode(rdr);
            xw.WriteStartElement("test");
            xw.WriteEndElement();

            xw.WriteEndElement();

            xw.WriteEndDocument();
        }
        xmlstr = Encoding.UTF8.GetString(stream.ToArray());
    }
    XmlDocument xmlDocw = new XmlDocument();
    xmlDocw.LoadXml(xmlstr);
    xmlDocw.Save(path);
}

/// <summary>
/// Read an XML file with XmlReader.
/// </summary>
/// <param name="path">Input file path.</param>
public void ReadXml_Sax(string path)
{
    XmlReaderSettings rsettings = new XmlReaderSettings();
    rsettings.IgnoreComments = true;
    rsettings.IgnoreWhitespace = false;
    rsettings.CheckCharacters = false;
    // A reader from XmlReader.Create does not preserve \r\n in content:
    //using (XmlReader rdr = XmlReader.Create(path, rsettings))
    using (XmlTextReader rdr = new XmlTextReader(path))
    {
        rdr.WhitespaceHandling = WhitespaceHandling.Significant;
        string eleName = "";
        while (rdr.Read())
        {
            if (rdr.NodeType == XmlNodeType.Element)
            {
                // Element name
                eleName = rdr.Name;
                // Element depth
                int dp = rdr.Depth;
                // True for <element/> (no EndElement node follows), false for <element></element>
                bool isEmpty = rdr.IsEmptyElement;
                for (int i = 0; i < rdr.AttributeCount; i++)
                {
                    rdr.MoveToAttribute(i);
                    Console.WriteLine(rdr.Name + ":" + rdr.Value);
                }
                // Move back from the attributes to the owning element
                rdr.MoveToElement();
                // The whole element can be read here with ReadOuterXml (or XmlDocument.ReadNode).
                // Both advance the reader past the element, so loop on !rdr.EOF instead of
                // calling Read() again; otherwise the next node is skipped.
                //rdr.ReadOuterXml();
            }
            else if (rdr.NodeType == XmlNodeType.EndElement)
            {
                eleName = rdr.Name;
            }
        }
    }
}
```
Combining XmlReader with XmlDocument (or XmlWriter) to read a large XML file in pieces works very well.
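The combination described above can be sketched roughly as follows: stream through the file with an XmlReader and materialize only one child element at a time as a small DOM fragment. This is a minimal sketch, not the author's code; the class name `XmlSplitSketch`, the helper `ReadChildren`, and the sample XML are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlSplitSketch
{
    // Hypothetical helper: yields each direct child element of the root
    // as an XmlElement, without loading the whole document.
    public static IEnumerable<XmlElement> ReadChildren(XmlReader rdr)
    {
        var doc = new XmlDocument();      // only used as a node factory
        rdr.MoveToContent();              // position on the root element
        int childDepth = rdr.Depth + 1;
        rdr.Read();                       // step inside the root
        while (!rdr.EOF)
        {
            if (rdr.NodeType == XmlNodeType.Element && rdr.Depth == childDepth)
            {
                // ReadNode consumes the whole element and leaves the reader on
                // the node after it, so Read() must not be called again here.
                yield return (XmlElement)doc.ReadNode(rdr);
            }
            else
            {
                rdr.Read();
            }
        }
    }

    public static void Main()
    {
        // Sample data; in practice this would be XmlReader.Create(path).
        string xml = "<root><item id=\"1\"><v>a</v></item><item id=\"2\"/></root>";
        using (XmlReader rdr = XmlReader.Create(new StringReader(xml)))
        {
            foreach (XmlElement item in ReadChildren(rdr))
                Console.WriteLine(item.GetAttribute("id") + ": " + item.OuterXml);
        }
    }
}
```

Looping on `rdr.EOF` rather than `while (rdr.Read())` is the point of the sketch: it avoids skipping the sibling that follows each consumed element.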
Here are the problems I ran into:
1. After writing with XmlWriter, the XML declaration always says utf-16

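This only happens when the output target is a StringBuilder: .NET strings are UTF-16, so the writer ignores settings.Encoding and declares utf-16. Writing to a FileStream, MemoryStream, or another stream fixes it.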
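To make the difference concrete, here is a small sketch that writes the same document to a StringBuilder and to a MemoryStream with identical settings. The class and method names (`XmlDeclarationSketch`, `ViaStringBuilder`, `ViaMemoryStream`) are made up for this example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlDeclarationSketch
{
    // Writes a declaration plus an empty <root/> element.
    static void WriteRoot(XmlWriter xw)
    {
        xw.WriteStartDocument();
        xw.WriteStartElement("root");
        xw.WriteEndElement();
        xw.WriteEndDocument();
    }

    // Target is a StringBuilder: the declaration reports utf-16
    // even though UTF-8 was requested in the settings.
    public static string ViaStringBuilder()
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        var sb = new StringBuilder();
        using (XmlWriter xw = XmlWriter.Create(sb, settings))
            WriteRoot(xw);
        return sb.ToString();
    }

    // Target is a stream: the requested encoding is honored.
    public static string ViaMemoryStream()
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var ms = new MemoryStream())
        {
            using (XmlWriter xw = XmlWriter.Create(ms, settings))
                WriteRoot(xw);
            return Encoding.UTF8.GetString(ms.ToArray());
        }
    }

    public static void Main()
    {
        Console.WriteLine(ViaStringBuilder());
        Console.WriteLine(ViaMemoryStream());
    }
}
```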
2. (UTF-8) After switching to MemoryStream, loading the resulting XML string with XmlDocument.LoadXml throws an error
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8;

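It turned out that the default Encoding.UTF8 emits a byte order mark (BOM); use new UTF8Encoding(false) instead.
In the debugger's Watch window, xmlstr[0] was 65279 (U+FEFF, the BOM); after the change it became 60, i.e. '<'.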
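The BOM effect can be reproduced with a short sketch like the one below. It assumes the same write-to-MemoryStream-then-decode flow as the code earlier in the post; `BomSketch`, `WriteToString`, and `CanLoad` are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomSketch
{
    // Writes a tiny document to a MemoryStream with the given encoding,
    // then decodes the raw bytes back into a string (BOM included, if any).
    public static string WriteToString(Encoding enc)
    {
        var settings = new XmlWriterSettings { Encoding = enc };
        using (var ms = new MemoryStream())
        {
            using (XmlWriter xw = XmlWriter.Create(ms, settings))
            {
                xw.WriteStartDocument();
                xw.WriteStartElement("root");
                xw.WriteEndElement();
                xw.WriteEndDocument();
            }
            return Encoding.UTF8.GetString(ms.ToArray());
        }
    }

    // Returns false when XmlDocument.LoadXml rejects the string.
    public static bool CanLoad(string xml)
    {
        try { new XmlDocument().LoadXml(xml); return true; }
        catch (XmlException) { return false; }
    }

    public static void Main()
    {
        string withBom = WriteToString(Encoding.UTF8);
        string noBom = WriteToString(new UTF8Encoding(false));
        // First character code of each string, and whether LoadXml accepts it.
        Console.WriteLine((int)withBom[0] + " loads: " + CanLoad(withBom));
        Console.WriteLine((int)noBom[0] + " loads: " + CanLoad(noBom));
    }
}
```

Another option, if the string is not needed, is to rewind the stream and pass it to XmlDocument.Load(Stream), which as far as I know detects the BOM itself; that variant is not shown here.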


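3. By default XmlReader does not preserve carriage return plus line feed (\r\n) in content; in my case it came through as a single space. (A reader from XmlReader.Create normalizes line endings as the XML specification requires: \r\n becomes \n, and line breaks inside attribute values become spaces.)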

The second line break, typed as a literal CR/LF, is simply not read back: XmlDocument reads both line breaks, but XmlReader does not.

Along the way I kept hunting for a setting, such as IgnoreWhitespace, but none of them helped; the line break was still not read.
XmlReaderSettings rsettings = new XmlReaderSettings();
rsettings.IgnoreWhitespace = false;
I finally found the answer on Stack Overflow (Note 1): instead of XmlReader rdr = XmlReader.Create(path), use XmlTextReader, whose normalization is off by default.
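A quick way to see the difference is to read the same text node with both readers. The sketch below uses element content rather than a file; the names `NewlineSketch`, `ReadWithCreate`, and `ReadWithTextReader` are assumptions for the example.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NewlineSketch
{
    // Element content containing a literal CR LF pair.
    const string Xml = "<root>line1\r\nline2</root>";

    // XmlReader.Create applies XML end-of-line normalization (\r\n -> \n).
    public static string ReadWithCreate()
    {
        using (XmlReader rdr = XmlReader.Create(new StringReader(Xml)))
        {
            rdr.MoveToContent();
            return rdr.ReadElementContentAsString();
        }
    }

    // XmlTextReader leaves Normalization at its default (false),
    // so the original \r\n is kept.
    public static string ReadWithTextReader()
    {
        using (var rdr = new XmlTextReader(new StringReader(Xml)))
        {
            rdr.MoveToContent();
            return rdr.ReadElementContentAsString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReadWithCreate().Contains("\r\n"));      // CR LF kept?
        Console.WriteLine(ReadWithTextReader().Contains("\r\n"));  // CR LF kept?
    }
}
```

If only the original line endings matter and the rest of XmlReader.Create's behavior is wanted, converting \n back to \r\n after reading is another possible workaround.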

Note 1: the line-break issue: https://stackoverflow.com/questions/1793908/xmlreader-newline-n-instead-of-r-n
This is because the XmlTextReader has a normalization setting defaulted to false unlike XmlReader.Create which always normalizes newlines no matter what.