c – 提升精神QI慢

2019-10-02 03:04:22 阅读：184 来源： 互联网

标签：boost-spirit-qi c csv parsing boost-spirit

我尝试使用Boost Spirit QI解析TPCH文件.
我的实现灵感来自Spirit QI(http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi/employee.cpp)的员工示例.
数据采用csv格式,令牌以“|”分隔.字符.

它工作但很慢(1秒时为20秒).

这是我对lineitem文件的qi语法：

struct lineitem {
    int l_orderkey;
    int l_partkey;
    int l_suppkey;
    int l_linenumber;
    std::string l_quantity;
    std::string l_extendedprice;
    std::string l_discount;
    std::string l_tax;
    std::string l_returnflag;
    std::string l_linestatus;
    std::string l_shipdate;
    std::string l_commitdate;
    std::string l_recepitdate;
    std::string l_shipinstruct;
    std::string l_shipmode;
    std::string l_comment;
};

BOOST_FUSION_ADAPT_STRUCT( lineitem,
    (int, l_orderkey)
    (int, l_partkey)
    (int, l_suppkey)
    (int, l_linenumber)
    (std::string, l_quantity)
    (std::string, l_extendedprice)
    (std::string, l_discount)
    (std::string, l_tax)
    (std::string, l_returnflag)
    (std::string, l_linestatus)
    (std::string, l_shipdate)
    (std::string, l_commitdate)
    (std::string, l_recepitdate)
    (std::string, l_shipinstruct)
    (std::string, l_shipmode)
    (std::string, l_comment)) 

vector<lineitem>* lineitems=new vector<lineitem>();

phrase_parse(state->dataPointer,
    state->dataEndPointer,
    (*(int_ >> "|" >>
    int_ >> "|" >> 
    int_ >> "|" >>
    int_ >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> "|" >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' >>
    +(char_ - '|') >> '|' 
    ) ), space, *lineitems
);

问题似乎是字符解析.它比其他转换慢得多.
有没有更好的方法将可变长度标记解析为字符串？

解决方法:

我找到了解决问题的方法.如本文Boost Spirit QI grammar slow for parsing delimited strings所述
性能瓶颈是Spirit qi的字符串处理.所有其他数据类型似乎都非常快.

我通过自己处理数据而不是使用Spirit qi处理来避免这个问题.

我的解决方案使用一个帮助程序类,它为csv文件的每个字段提供函数.函数将值存储到结构中.字符串存储在char []中.命中解析器一个换行符,它调用一个函数,将结构添加到结果向量中.
Boost解析器调用此函数,而不是将值自身存储到向量中.

这是TCPH Benchmark的region.tbl文件的代码：

struct region{
    int r_regionkey;
    char r_name[25];
    char r_comment[152];
};

class regionStorage{
public:
regionStorage(vector<region>* regions) :regions(regions), pos(0) {}
void storer_regionkey(int const&i){
    currentregion.r_regionkey = i;
}

void storer_name(char const&i){
    currentregion.r_name[pos] = i;
    pos++;
}

void storer_comment(char const&i){
    currentregion.r_comment[pos] = i;
    pos++;
}

void resetPos() {
    pos = 0;
}

void endOfLine() {
    pos = 0;
    regions->push_back(currentregion);
}

private:
vector<region>* regions;
region currentregion;
int pos;
};


void parseRegion(){

    vector<region> regions;
    regionStorage regionstorageObject(&regions);
    phrase_parse(dataPointer, /*< start iterator >*/    
     state->dataEndPointer, /*< end iterator >*/
     (*(lexeme[
     +(int_[boost::bind(&regionStorage::storer_regionkey, &regionstorageObject, _1)] - '|') >> '|' >>
     +(char_[boost::bind(&regionStorage::storer_name, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::resetPos, &regionstorageObject)] >>
     +(char_[boost::bind(&regionStorage::storer_comment, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::endOfLine, &regionstorageObject)]
    ])), space);

   cout << regions.size() << endl;
}

它不是一个漂亮的解决方案,但它可以工作,而且速度更快. (对于1 GB TCPH数据,2.2秒,多线程)

标签：boost-spirit-qi,c,csv,parsing,boost-spirit
来源： https://codeday.me/bug/20191002/1840817.html

本站声明： 1. iCode9 技术分享网（下文简称本站）提供的所有内容，仅供技术学习、探讨和分享；
2. 关于本站的所有留言、评论、转载及引用，纯属内容发起人的个人观点，与本站观点和立场无关；
3. 关于本站的所有言论和文字，纯属内容发起人的个人观点，与本站观点和立场无关；
4. 本站文章均是网友提供，不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属；如您发现该文章侵犯了您的权益，可联系我们第一时间进行删除；
5. 本站为非盈利性的个人网站，所有内容不会用来进行牟利，也不会利用任何形式的广告来间接获益，纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

ICode9

c – 提升精神QI慢