c# - Efficiently parsing a StreamReader with regular expressions

2019-06-20 22:55:21 Views: 280 Source: Internet

Tags: c# multithreading parallel-processing streamreader

I have the variable:

    StreamReader DebugInfo = GetDebugInfo();
    var text = DebugInfo.ReadToEnd();  // takes 10 seconds!!! because there are a lot of students

where text equals:

<student>
    <firstName>Antonio</firstName>
    <lastName>Namnum</lastName>
</student>
<student>
    <firstName>Alicia</firstName>
    <lastName>Garcia</lastName>
</student>
<student>
    <firstName>Christina</firstName>
    <lastName>SomeLattName</lastName>
</student>
... etc
.... many more students

What I am currently doing:

  StreamReader DebugInfo = GetDebugInfo();
  var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!!

  var mtch = Regex.Match(text , @"(?s)<student>.+?</student>");
  // keep parsing the file while there are more students
  while (mtch.Success)
  {
     AddStudent(mtch.Value); // parse text node into object and add it to corresponding node
     mtch = mtch.NextMatch();
  }

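The whole process takes about 25 seconds. Converting the StreamReader to text (var text = DebugInfo.ReadToEnd();) takes 10 seconds. The other part takes about 15 seconds. I was hoping I could do the two parts at the same time...

Edit

I would like to have something like: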

    const int bufferSize = 1024;

    var sb = new StringBuilder();

    Task.Factory.StartNew(() =>
    {
         Char[] buffer = new Char[bufferSize];
         int count = bufferSize;

         using (StreamReader sr = GetUnparsedDebugInfo())
         {

             while (count > 0)
             {
                 count = sr.Read(buffer, 0, bufferSize);
                 sb.Append(buffer, 0, count);
             }
         }

         var m = sb.ToString();
     });

     Thread.Sleep(100);

     // meanwhile, while the string is being built, start adding items

     var mtch = Regex.Match(sb.ToString(), @"(?s)<student>.+?</student>"); 

     // keep parsing the file while there are more nodes
     while (mtch.Success)
     {
         AddStudent(mtch.Value);
         mtch = mtch.NextMatch();
     }

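Edit 2

Summary

Sorry, I forgot to mention that the text is very similar to XML, but it is not XML. That is why I have to use regular expressions... In short, I think I could save time, because what I am doing now is converting the stream to a string and then parsing the string. Why not parse the stream directly with a regex? Or, if that is not possible, why not take a chunk of the stream and parse that chunk on a separate thread?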
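
The "take a chunk of the stream and parse that chunk" idea can be sketched on a single thread, without ReadToEnd(). This is only an illustration under assumptions: the class name StudentStreamParser, the method ReadRecords and the 4096-character default buffer are made up for this example and are not part of the original code. It appends each chunk to a small pending buffer, yields every complete record the regex finds, and keeps the unmatched tail (for example a record cut off at a chunk boundary) for the next read.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class StudentStreamParser
{
    // Same pattern as in the question: one lazy match per <student> record.
    private static readonly Regex StudentRecord =
        new Regex(@"(?s)<student>.+?</student>", RegexOptions.Compiled);

    // Yields each complete record as soon as enough of the stream has been read.
    public static IEnumerable<string> ReadRecords(TextReader reader, int bufferSize = 4096)
    {
        var buffer = new char[bufferSize];
        var pending = new StringBuilder();
        int count;

        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            pending.Append(buffer, 0, count);
            string text = pending.ToString();
            int consumed = 0;

            for (Match m = StudentRecord.Match(text); m.Success; m = m.NextMatch())
            {
                yield return m.Value;
                consumed = m.Index + m.Length;
            }

            // Drop what was matched; an incomplete trailing record stays pending.
            pending.Remove(0, consumed);
        }
    }
}
```

A caller could then write foreach (var record in StudentStreamParser.ReadRecords(GetDebugInfo())) AddStudent(record); so that parsing starts as soon as the first record arrives. The pending buffer is converted to a string on every chunk, which is cheap while records are small but would need rethinking if a single record could span many megabytes.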

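Answer:

Update:

This basic code reads a (roughly) 20 megabyte file in 0.75 seconds. At that rate my machine should process roughly 53.33 megabytes in the 2 seconds you mention. Further, 20,000,000 / 2,048 = 9,765.625 reads, and 0.75 / 9,765.625 = 0.0000768. That means you are reading 2,048 characters roughly every 77 microseconds (7.68 × 10^-5 seconds). You need to weigh the cost of context switching against the time per iteration to decide whether the added complexity of multithreading is worthwhile. At 7.68 × 10^-5 seconds per read, I see your reader thread sitting idle most of the time. That does not make sense to me. Just use a single loop on a single thread.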

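char[] buffer = new char[2048];
// Dispose the reader when done so the file handle is released.
using (StreamReader sr = new StreamReader(@"C:\20meg.bin"))
{
    while (sr.Read(buffer, 0, buffer.Length) != 0)
    {
        // do nothing: this only measures raw read time
    }
}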
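
To redo the arithmetic above for your own file and machine, something like the following could be used. It is a rough sketch: the class name ReadTiming and both method names are invented for this example. SecondsPerRead reproduces the division from the paragraph above, and TimeReads wraps the same empty read loop in a Stopwatch.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadTiming
{
    // Average seconds per Read call: total time divided by (total chars / buffer size).
    public static double SecondsPerRead(double totalChars, int bufferSize, double totalSeconds)
    {
        double reads = totalChars / bufferSize;
        return totalSeconds / reads;
    }

    // Runs an empty read loop over the reader and reports how many reads it took and how long.
    public static (long Reads, double Seconds) TimeReads(TextReader reader, int bufferSize)
    {
        var buffer = new char[bufferSize];
        long reads = 0;
        var watch = Stopwatch.StartNew();

        while (reader.Read(buffer, 0, buffer.Length) != 0)
        {
            reads++;
        }

        watch.Stop();
        return (reads, watch.Elapsed.TotalSeconds);
    }
}
```

With the figures quoted above, ReadTiming.SecondsPerRead(20000000, 2048, 0.75) works out to 0.0000768 seconds per read. If the per-read time you measure is similarly tiny, handing each chunk to another thread is unlikely to pay for its own overhead.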

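For a large operation like this, you want to use a forward-only, non-cached reader. It looks like your data is XML, so XmlTextReader is well suited to this. Here is some sample code. Hope this helps.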

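string firstName;
string lastName;
// GetDebugInfo() returns a StreamReader in the question, so wrap it in an XmlTextReader.
using (XmlTextReader reader = new XmlTextReader(GetDebugInfo()))
{
    while (reader.Read())
    {
        if (reader.IsStartElement() && reader.Name == "student")
        {
            // Move to <firstName>, then to its text node.
            reader.ReadToDescendant("firstName");
            reader.Read();
            firstName = reader.Value;
            // Move to <lastName>, then to its text node.
            reader.ReadToFollowing("lastName");
            reader.Read();
            lastName = reader.Value;
            AddStudent(firstName, lastName);
        }
    }
}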

I used the following XML:

<students>
    <student>
        <firstName>Antonio</firstName>
        <lastName>Namnum</lastName>
    </student>
    <student>
        <firstName>Alicia</firstName>
        <lastName>Garcia</lastName>
    </student>
    <student>
        <firstName>Christina</firstName>
        <lastName>SomeLattName</lastName>
    </student>
</students>

You may need to tweak it. This should run much, much faster.
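
The sample in the question has no single root element, unlike the XML I tested with. If the data is otherwise well-formed, one possible adjustment is an XmlReader created with ConformanceLevel.Fragment, which accepts several top-level elements. This is a sketch under that assumption; the class StudentFragmentReader and its ReadStudents method are names invented for illustration, and it will still fail on text that is not well-formed XML.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StudentFragmentReader
{
    // Reads (firstName, lastName) pairs from a sequence of <student> fragments.
    public static List<(string FirstName, string LastName)> ReadStudents(TextReader input)
    {
        // Fragment conformance allows more than one top-level element.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var students = new List<(string FirstName, string LastName)>();

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.ReadToFollowing("student"))
            {
                if (!reader.ReadToDescendant("firstName"))
                {
                    continue; // no <firstName> in this record; skip it
                }
                string firstName = reader.ReadElementContentAsString();

                // The reader is now just past </firstName>; skip whitespace
                // and check that the next element is <lastName>.
                if (reader.MoveToContent() != XmlNodeType.Element || reader.Name != "lastName")
                {
                    continue;
                }
                string lastName = reader.ReadElementContentAsString();

                students.Add((firstName, lastName));
            }
        }

        return students;
    }
}
```

Calling StudentFragmentReader.ReadStudents(GetDebugInfo()) would then stream through the records without loading the whole text first; each pair could be passed to AddStudent instead of being collected in a list.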

Source: https://codeday.me/bug/20190620/1249066.html