Study Notes (13): Training WikiSQL with decaNLP

2020-02-04 10:04:40


The goal is to convert natural language into SQL statements, so that reports can be queried conversationally.

References

Reference 1

https://mp.weixin.qq.com/s/i7WAFjQHK1NGVACR8x3v0A

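Semantic parsing. SQL query generation is related to semantic parsing. Models based on the WikiSQL dataset translate natural language questions into structured SQL queries, so that users can interact with a database in natural language. WikiSQL is evaluated by logical form exact match (lfEM), which ensures that a model does not get credit for a correct answer produced by an incorrectly generated query.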

Reference 2

http://decanlp.com/

Semantic Parsing
Semantic parsing requires models to translate unstructured information into structured formats so that users can interact with structured information (e.g. a database) in natural language. decaNLP includes the WikiSQL dataset, which maps natural language questions into structured SQL queries.

Reference 3

https://github.com/salesforce/WikiSQL

Installation

Create a Python virtual environment.
Download the source code:
git clone https://github.com/salesforce/WikiSQL
cd WikiSQL
pip install -r requirements.txt
tar xvjf data.tar.bz2

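Data

After extraction, the data directory holds .jsonl and .db files:

Each line of a .jsonl file is one JSON object.

A .db file is a SQLite3 database.

To inspect a .db file, a viewer can be downloaded from here: https://github.com/pawelsalawa/sqlitestudio/releases/tag/3.2.1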
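Since every line of a .jsonl file is an independent JSON object, the file can be read line by line. The following is a minimal sketch, not part of the WikiSQL repository; `parse_jsonl` is a hypothetical helper name, and the path `data/dev.jsonl` assumes the archive was extracted in the repository root.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines
            records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Assumed location: data/ created by `tar xvjf data.tar.bz2`.
    dev_path = Path("data/dev.jsonl")
    if dev_path.exists():
        with dev_path.open(encoding="utf-8") as f:
            examples = parse_jsonl(f)
        print(len(examples), examples[0]["question"])
```

Passing an open file works because iterating over a file object yields its lines.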

Questions, queries, and table IDs

File: /Users/huihui/git/WikiSQL/data/dev.jsonl

{
    "phase": 1,
    "table_id": "1-10015132-11",  
    "question": "What position does the player who played for butler cc (ks) play?", 
    "sql": {
        "sel": 3, 
        "conds": [
            [5, 0, "Butler CC (KS)"]
        ],
        "agg": 0
    }
}
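  • phase: the phase in which the example was collected; WikiSQL was collected in two phases.
  • table_id: the ID of the table the question refers to.
  • question: the natural language question written by a crowd worker.
  • sql: the SQL query corresponding to the question. It has the following subfields:
    • sel: index of the selected column.
    • agg: index of the aggregation operator. agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
    • conds: a list of triples:
      • column_index: index of the condition column
      • operator_index: index of the condition operator. cond_ops = ['=', '>', '<', 'OP']
      • condition: the value to compare against, a float or a string

The format therefore supports MAX, MIN, COUNT, SUM, and AVG aggregation, together with greater-than, less-than, and equality conditions.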
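To make the index encoding concrete, here is a small sketch that turns a `sql` entry into a readable SQL-like string. It is an illustration only, not the repository's own code: `render_sql` is a hypothetical helper, the `header` list is copied from the matching table entry, joining several conditions with AND is how WikiSQL conditions are understood here, and the output is meant for reading rather than for execution (real column names may contain spaces).

```python
AGG_OPS = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
COND_OPS = ['=', '>', '<', 'OP']


def render_sql(sql, header, table_name="table"):
    """Render a WikiSQL `sql` dict as a human-readable SQL-like string."""
    column = header[sql["sel"]]
    agg = AGG_OPS[sql["agg"]]
    select = f"{agg}({column})" if agg else column
    text = f"SELECT {select} FROM {table_name}"
    conds = []
    for col_idx, op_idx, value in sql["conds"]:
        # Quote strings, print numbers as they are.
        shown = repr(value) if isinstance(value, str) else str(value)
        conds.append(f"{header[col_idx]} {COND_OPS[op_idx]} {shown}")
    if conds:
        text += " WHERE " + " AND ".join(conds)
    return text


header = ["Player", "No.", "Nationality", "Position",
          "Years in Toronto", "School/Club Team"]
sql = {"sel": 3, "conds": [[5, 0, "Butler CC (KS)"]], "agg": 0}
print(render_sql(sql, header))
```

For the example above, sel = 3 resolves to the Position column and the single condition resolves to School/Club Team = 'Butler CC (KS)'.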

Table file

/Users/huihui/git/WikiSQL/data/dev.tables.jsonl

{
    "header": ["Player", "No.", "Nationality", "Position", "Years in Toronto", "School/Club Team"],
    "page_title": "Toronto Raptors all-time roster",
    "types": ["text", "text", "text", "text", "text", "text"],
    "id": "1-10015132-11",
    "section_title": "L",
    "caption": "L",
    "rows": [
        ["Antonio Lang", "21", "United States", "Guard-Forward", "1999-2000", "Duke"],
        ["Voshon Lenard", "2", "United States", "Guard", "2002-03", "Minnesota"],
        ["Martin Lewis", "32, 44", "United States", "Guard-Forward", "1996-97", "Butler CC (KS)"],
        ["Brad Lohaus", "33", "United States", "Forward-Center", "1996", "Iowa"],
        ["Art Long", "42", "United States", "Forward-Center", "2002-03", "Cincinnati"],
        ["John Long", "25", "United States", "Guard", "1996-97", "Detroit"],
        ["Kyle Lowry", "3", "United States", "Guard", "2012-Present", "Villanova"]
    ],
    "name": "table_10015132_11"
}
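The table entry is what gives the indices in a query their meaning. As a rough sketch, the example question can be answered directly from `rows` in plain Python. `select_rows` is a hypothetical helper; it ignores aggregation, compares values exactly (case-sensitive), and assumes the condition value has the same type as the cell it is compared with.

```python
import operator

# Only '=', '>' and '<' are handled; operator index 3 ('OP') is not supported here.
COMPARATORS = [operator.eq, operator.gt, operator.lt]


def select_rows(table, sql):
    """Return the selected column's value for every row meeting all conditions."""
    results = []
    for row in table["rows"]:
        if all(COMPARATORS[op](row[col], value)
               for col, op, value in sql["conds"]):
            results.append(row[sql["sel"]])
    return results


# A three-row excerpt of the table shown above.
table = {
    "header": ["Player", "No.", "Nationality", "Position",
               "Years in Toronto", "School/Club Team"],
    "rows": [
        ["Antonio Lang", "21", "United States", "Guard-Forward", "1999-2000", "Duke"],
        ["Voshon Lenard", "2", "United States", "Guard", "2002-03", "Minnesota"],
        ["Martin Lewis", "32, 44", "United States", "Guard-Forward", "1996-97", "Butler CC (KS)"],
    ],
}
sql = {"sel": 3, "conds": [[5, 0, "Butler CC (KS)"]], "agg": 0}
print(select_rows(table, sql))
```

Only the Martin Lewis row satisfies the condition, so the answer comes from his Position cell.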

Database .db file

In the database, column names are replaced with col0, col1, and so on, to save space.
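Because of this naming, a column index from a query maps directly onto a column name: index 3 becomes col3. The sketch below shows the idea on an in-memory SQLite database built from two rows of the example table. The table name `demo_table` is made up for the demo; it is not the name used inside the real .db files.

```python
import sqlite3

header = ["Player", "No.", "Nationality", "Position",
          "Years in Toronto", "School/Club Team"]
rows = [
    ("Antonio Lang", "21", "United States", "Guard-Forward", "1999-2000", "Duke"),
    ("Martin Lewis", "32, 44", "United States", "Guard-Forward", "1996-97", "Butler CC (KS)"),
]

# Hypothetical table name, used only for this in-memory demo.
table_name = "demo_table"

conn = sqlite3.connect(":memory:")
columns = ", ".join(f"col{i} TEXT" for i in range(len(header)))
conn.execute(f"CREATE TABLE {table_name} ({columns})")
placeholders = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO {table_name} VALUES ({placeholders})", rows)

# sel = 3 -> col3 (Position); condition column 5 -> col5 (School/Club Team)
result = conn.execute(
    f"SELECT col3 FROM {table_name} WHERE col5 = ?", ("Butler CC (KS)",)
).fetchall()
print(result)
conn.close()
```

To find the actual table names in one of the extracted .db files, connecting to it with `sqlite3.connect` and running `SELECT name FROM sqlite_master WHERE type='table'` should list them.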

Testing

A sample prediction can be found in the file test/example.pred.dev.jsonl:

{
    "query": {
        "sel": 3,
        "agg": 0,
        "conds": [
            [5, 0, "butler cc (ks)"]
        ]
    },
    "seq": {
        "words": ["symselect", "symagg", "symcol", "position", "symwhere", "symcol", "school\/club", "team", "symop", "=", "symcond", "butler", "cc", "-lrb-", "ks", "-rrb-"],
        "after": [" ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", "", "", " "],
        "num": [1, 12, 4, 28, 2, 4, 32, 33, 9, 20, 10, 40, 41, 42, 43, 44],
        "gloss": ["SYMSELECT", "SYMAGG", "SYMCOL", "Position", "SYMWHERE", "SYMCOL", "School\/Club", "Team", "SYMOP", "=", "SYMCOND", "butler", "cc", "(", "ks", ")"]
    },
    "error": ""
}
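The `query` field of a prediction has the same shape as the `sql` field of a gold example, which is what makes a logical form comparison possible. The following is a sketch of such a comparison, not the repository's evaluation script. It assumes that string values are compared case-insensitively and that the order of conditions does not matter; note that the prediction above has "butler cc (ks)" in lower case while the gold example has "Butler CC (KS)".

```python
def normalize(query):
    """Canonical form: sel, agg, and an order-independent set of conditions."""
    conds = set()
    for col, op, value in query["conds"]:
        if isinstance(value, str):
            value = value.lower()  # assumption: case-insensitive comparison
        conds.add((col, op, value))
    return query["sel"], query["agg"], conds


def logical_form_match(predicted, gold):
    """True if both queries select, aggregate, and filter in the same way."""
    return normalize(predicted) == normalize(gold)


gold = {"sel": 3, "conds": [[5, 0, "Butler CC (KS)"]], "agg": 0}
predicted = {"sel": 3, "agg": 0, "conds": [[5, 0, "butler cc (ks)"]]}
print(logical_form_match(predicted, gold))
```

A prediction that reaches the same answer through a different column or condition would fail this check, which is the point of evaluating the logical form rather than only the result.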

The repository ships a test file, test/example.pred.dev.jsonl.bz2. Decompress it with the command bunzip2 test/example.pred.dev.jsonl.bz2 -k (the -k flag keeps the compressed original).
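As an alternative to decompressing on disk, the compressed file can also be read directly with Python's standard `bz2` module. This is an optional sketch; `read_jsonl_bz2` is a hypothetical helper name.

```python
import bz2
import json


def read_jsonl_bz2(path):
    """Read a bzip2-compressed JSON Lines file without unpacking it first."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Example call, assuming the current directory is the repository root:
# predictions = read_jsonl_bz2("test/example.pred.dev.jsonl.bz2")
```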

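A Dockerfile is also provided. It packages the required dependencies so that the evaluation script can be run.

First, build the image from the repository root:
docker build -t wikisqltest -f test/Dockerfile .
Then run the image:
docker run --rm --name wikisqltest wikisqltest
If everything works, the output looks like this: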
{
  "ex_accuracy": 0.5380596128725804,
  "lf_accuracy": 0.35375846099038116
}

I ran the commands with sudo:
xuehp@haomeiya002:~/git/WikiSQL$ sudo docker build -t wikisqltest -f test/Dockerfile .
xuehp@haomeiya002:~/git/WikiSQL$ sudo docker run --rm --name wikisqltest wikisqltest

Source: https://www.cnblogs.com/xuehuiping/p/12258404.html