2023 | BIRD, a new cross-domain text-to-SQL benchmark in the era of LLMs. | Paper |
Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics in natural language into SQL queries and its practical application in building natural language interfaces to database systems.
The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. These challenges have been addressed to different extents by the recent advances. However, there is still a lack of comprehensive surveys for this task. To this end, we review recent progress on text-to-SQL for datasets, methods, and evaluation and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions. We hope this survey can serve as quick access to existing work and motivate future research.
If you would like to cite our work, please cite the following paper:
Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics (COLING).
@inproceedings{deng-etal-2022-recent-advances-in-text-to-sql,
title = "Recent Advances in Text-to-{SQL}: A Survey of What We Have and What We Expect",
author = "Deng, Naihao and
Chen, Yulong and
Zhang, Yue",
booktitle = "COLING",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/events/coling-2022/",
}
Topology for text-to-SQL. Format adapted from (Liu et al., 2021a).
Datasets | #Size | #D | #DB | #T/DB | Issues addressed | Sources for data |
---|---|---|---|---|---|---|
Spider (Yu et al., 2018c) | 10,181 | 138 | 200 | 5.1 | Domain generalization | College courses, DatabaseAnswers, WikiSQL |
Spider-DK (Gan et al., 2021b) | 535 | - | 10 | 4.8 | Domain knowledge | Spider dev set |
Spider-tran (Zeng et al., 2020) | 15,023 | 138 | 200 | 5.1 | Untranslatable questions | Spider + 5,330 untranslatable questions |
Spider-L (Lei et al., 2020) | 8,034 | - | 160 | 5.1 | Schema linking | Spider train/dev |
SpiderSL (Taniguchi et al., 2021) | 1,034 | - | 10 | 4.8 | Schema linking | Spider dev set |
Spider-Syn (Gan et al., 2021a) | 8,034 | - | 160 | 5.1 | Robustness | Spider train/dev |
WikiSQL (Zhong et al., 2017) | 80,654 | - | 26,521 | 1 | Data size | Wikipedia |
Squall (Shi et al., 2020b) | 11,468 | - | 1,679 | 1 | Lexicon-level supervision | WikiTableQuestions (Pasupat and Liang, 2015) |
KaggleDBQA (Lee et al., 2021) | 272 | 8 | 8 | 2.3 | Domain generalization | Real web databases |
ATIS (Price, 1990), (Dahl et al., 1994) | 5,280 | 1 | 1 | 32 | - | Flight-booking |
GeoQuery (Zelle and Mooney, 1996) | 877 | 1 | 1 | 6 | - | US geography |
Scholar (Iyer et al., 2017) | 817 | 1 | 1 | 7 | - | Academic publications |
Academic (Li and Jagadish, 2014) | 196 | 1 | 1 | 15 | - | Microsoft Academic Search (MAS) |
IMDB (Yaghmazadeh et al., 2017) | 131 | 1 | 1 | 16 | - | Internet Movie Database |
Yelp (Yaghmazadeh et al., 2017) | 128 | 1 | 1 | 7 | - | Yelp website |
Advising (Finegan-Dollak et al., 2018) | 3,898 | 1 | 1 | 10 | - | University of Michigan course |
Restaurants (Tang and Mooney, 2000) (Popescu et al., 2003) | 378 | 1 | 1 | 3 | - | Restaurant information |
MIMICSQL (Wang et al., 2020d) | 10,000 | 1 | 1 | 5 | - | Healthcare domain |
SEDE (Hazoom et al., 2021) | 12,023 | 1 | 1 | 29 | SQL template diversity | Stack Exchange |
- Chinese text-to-SQL:
- CSpider (Min et al., 2019a)
- TableQA (Sun et al., 2020)
- DuSQL (Wang et al., 2020c)
- ESQL (Chen et al., 2021a)
- Vietnamese text-to-SQL: ViText2SQL (Tuan Nguyen et al., 2020)
- Portuguese text-to-SQL: (José and Cozman, 2021)
- Multi-turn context-dependent text-to-SQL:
- ATIS (Price, 1990) (Dahl et al., 1994)
- SParC (Yu et al., 2019b)
- CoSQL (Yu et al., 2019a)
- Unanswerable questions: TriageSQL (Zhang et al., 2020)
- This GitHub repo holds data for many of the single-domain datasets, including Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, as well as cross-domain datasets including Spider and WikiSQL. It is the official repo for this paper, and its data processing methods are described in the paper as well.
- The official Spider leaderboard shows the performance of models on the Spider dev set and the hidden test set.
- The official WikiSQL GitHub repo hosts the leaderboard for models on the WikiSQL dataset.
- Typical effects:
- Handle complex or unseen questions (Zhong et al., 2020b), (Wang et al., 2021b)
- Achieve state-of-the-art with less supervised data (Guo et al., 2018)
- Attain robustness towards different types of questions (Radhakrishnan et al., 2020)
- Data generation methods:
- Quality control of the generated data:
- Diversify the generated data:
Methods | Adopted by | Applied datasets | Addressed challenges |
---|---|---|---|
Encode token type | TypeSQL (Yu et al., 2018a) | WikiSQL | Representing question meaning |
Graph-based | GNN (Bogin et al., 2019a) | Spider | (1) Representing question and DB schemas in a structured way (2) Schema linking |
Global-GCN (Bogin et al., 2019b) | Spider | ||
IGSQL (Cai and Wan, 2020) | SParC, CoSQL | ||
RAT-SQL (Wang et al., 2020a) | Spider | ||
LEGSQL (Cao et al., 2021) | Spider | ||
SADGA (Cai et al., 2021) | Spider | ||
ShadowGNN (Chen et al., 2021b) | Spider | ||
S2SQL (Hui et al., 2022) | Spider, Spider-Syn | ||
Self-attention | X-SQL (He et al., 2019) | WikiSQL | |
SQLova (Hwang et al., 2019) | WikiSQL | ||
RAT-SQL (Wang et al., 2020a) | Spider | ||
DuoRAT (Scholak et al., 2021a) | Spider | ||
UnifiedSKG (Xie et al., 2022) | WikiSQL, Spider | ||
Adapt PLM | X-SQL (He et al., 2019) | WikiSQL | Leveraging external data to represent question and DB schemas |
SQLova (Hwang et al., 2019) | WikiSQL | ||
(Guo and Gao, 2019) | WikiSQL | ||
HydraNet (Lyu et al., 2020) | WikiSQL | ||
(Liu et al., 2021b), etc | Spider-L, SQUALL | ||
Pre-training | TaBERT (Yin et al., 2020) | Spider | |
GraPPA (Yu et al., 2021) | Spider | ||
GAP (Shi et al., 2020a) | Spider |
Methods | Adopted by | Applied datasets | Addressed challenges | |
---|---|---|---|---|
Tree-based | Seq2Tree (Dong and Lapata, 2016) | - | Hierarchical decoding | |
Seq2AST (Yin and Neubig, 2017) | - | |||
SyntaxSQLNet (Yu et al., 2018b) | Spider | |||
Sketch-based | SQLNet (Xu et al., 2017) | WikiSQL | ||
(Dong and Lapata, 2018) | WikiSQL | |||
IRNet (Guo et al., 2019) | Spider | |||
RYANSQL (Choi et al., 2021) | Spider | |||
Bottom-up | SmBop (Rubin and Berant, 2021) | Spider | ||
Attention Mechanism | Attention | Seq2Tree (Dong and Lapata, 2016) | - | Synthesizing information for decoding |
Seq2SQL (Zhong et al., 2017) | WikiSQL | |||
Bi-attention | (Guo and Gao, 2018) | WikiSQL | ||
Structured attention | (Wang et al., 2019) | WikiSQL | ||
Relation-aware Self-attention | DuoRAT (Scholak et al., 2021a) | Spider | ||
Copy Mechanism | Seq2AST (Yin and Neubig, 2017) | - | ||
Seq2SQL (Zhong et al., 2017) | WikiSQL | |||
(Wang et al., 2018a) | WikiSQL | |||
SeqGenSQL (Li et al., 2020a) | WikiSQL | |||
Intermediate Representation | IncSQL (Shi et al., 2018) | WikiSQL | Bridging the gap between natural language and SQL query | |
IRNet (Guo et al., 2019) | Spider | |||
(Suhr et al., 2020) | Spider and others♠ | |||
(Herzig et al., 2021) | GeoQuery, ATIS, Scholar | |||
(Gan et al., 2021c) | Spider | |||
(Brunner and Stockinger, 2021) | Spider | |||
Others | Constrained decoding | UniSAr (Dou et al., 2022) | WikiSQL, Spider and others♡ | Fine-grained decoding |
PICARD (Scholak et al., 2021b) | Spider, CoSQL | |||
Execution-guided | SQLova (Hwang et al., 2019) | WikiSQL | ||
(Wang et al., 2018b) | WikiSQL | |||
Discriminative re-ranking | Global-GCN (Bogin et al., 2019b) | Spider | SQL Ranking | |
(Kelkar et al., 2020) | Spider | |||
Separate submodule | SQLNet (Xu et al., 2017) | WikiSQL | Easier decoding | |
(Guo and Gao, 2018) | WikiSQL | |||
(Lee, 2019) | Spider | |||
BPE | (Muller and Vlachos, 2019) | Advising, ATIS, GeoQuery | ||
Link gating | (Chen et al., 2020b) | Spider | Synthesizing information for decoding |
- Fully supervised
- Active learning (Ni et al., 2020)
- Interactive/Imitation learning (Yao et al., 2019) (Yao et al., 2020)
- Meta-learning (Huang et al., 2018) (Wang et al., 2021a) (Chen et al., 2021a)
- Multi-task learning (Chang et al., 2020) (Xuan et al., 2021) (Hui et al., 2021b) (Shi et al., 2021) (McCann et al., 2018) (Xie et al., 2022)
- Weakly supervised
- Reinforcement learning (Zhong et al., 2017) (Liang et al., 2018)
- Meta-learning and Bayesian optimization (Agarwal et al., 2019)
- (Min et al., 2019b)
- DB linking:
- Model-wise:
- SQL generation:
For context-dependent text-to-SQL:
- Turn-level encoder and copy mechanism (Suhr et al., 2018) (Zhang et al., 2019) (Wang et al., 2020b)
- Constrained decoding (Wang et al., 2020b)
- Dynamic memory decay mechanism (Hui et al., 2021a)
- (Zheng et al., 2022)
Metrics | Datasets | Errors |
---|---|---|
Naive Execution Accuracy | GeoQuery, IMDB, Yelp, WikiSQL, etc. | False positive |
Exact String Match | Advising, WikiSQL, etc | False negative |
Exact Set Match | Spider | False negative |
Test Suite Accuracy (execution accuracy with generated databases) | Spider, GeoQuery, etc | False positive |
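The two error types in the table can be made concrete with a small sqlite3 sketch (the toy table and queries below are invented for illustration): a prediction that differs semantically from the gold query can still return identical rows on one particular database, which naive execution accuracy counts as correct (false positive), while a semantically equivalent query written with different surface form fails exact string match (false negative).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE city (name TEXT, region TEXT)")
con.executemany("INSERT INTO city VALUES (?, ?)",
                [("Alameda", "Bay Area"), ("Alamo", "Bay Area")])

gold = "SELECT name FROM city WHERE region = 'Bay Area'"
# Semantically different, but returns the same rows on this instance:
# naive execution accuracy scores it correct (false positive).
pred_fp = "SELECT name FROM city"
# Semantically equivalent, but written differently: exact string match
# scores it wrong (false negative).
pred_fn = "select name from city where region = 'Bay Area'"

run = lambda q: con.execute(q).fetchall()
exec_fp = run(pred_fp) == run(gold)   # True on this database
string_fn = (pred_fn == gold)         # False despite equivalence
```

Test suite accuracy mitigates the false-positive problem by executing the queries on many generated databases instead of a single one.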
- Question split: split the question-SQL pairs randomly (Iyer et al., 2017).
- SQL query split: no SQL query is allowed to appear in more than one set among the train, dev, and test sets (Finegan-Dollak et al., 2018)
- Database split: databases in the test set do not appear at training time (Yu et al., 2018c)
- Others (Shaw et al., 2021) (Chang et al., 2020)
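The SQL query split can be sketched as follows. The `canonicalize` masking rules and the split procedure here are simplified assumptions, not the exact method of Finegan-Dollak et al. (2018); they only illustrate the idea of keeping every instance of a SQL template within a single split.

```python
import re
from collections import defaultdict

def canonicalize(sql: str) -> str:
    """Mask literal values so questions over the same SQL template
    collapse to one key (a simplification of the real procedure)."""
    sql = re.sub(r"'[^']*'", "'value'", sql)   # mask string literals
    sql = re.sub(r"\b\d+\b", "0", sql)         # mask numeric literals
    return " ".join(sql.split()).lower()

def query_split(pairs, test_frac=0.25):
    """Assign whole template groups to train/test, so no SQL template
    appears in both splits."""
    groups = defaultdict(list)
    for question, sql in pairs:
        groups[canonicalize(sql)].append((question, sql))
    templates = sorted(groups)                 # deterministic order
    n_test = max(1, int(len(templates) * test_frac))
    test_templates = set(templates[:n_test])
    train, test = [], []
    for tpl, items in groups.items():
        (test if tpl in test_templates else train).extend(items)
    return train, test

pairs = [
    ("restaurants in alameda", "SELECT name FROM restaurant WHERE city = 'alameda'"),
    ("restaurants in albany", "SELECT name FROM restaurant WHERE city = 'albany'"),
    ("how many restaurants", "SELECT COUNT(*) FROM restaurant"),
]
train, test = query_split(pairs)
```

Under a plain question split, the two "restaurants in ..." paraphrases could land in different splits, letting the model succeed by memorizing the template; the query split rules this out.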
- Cross-domain text-to-SQL: incorporate domain knowledge into models trained on the existing datasets, and deploy such models efficiently on different domains.
- Real-world use cases:
- Handle corrupted tables or scenarios where no table is provided.
- Handle user inputs different from the existing datasets.
- Help DB administrators manage DB schemas and update DB content.
- Multi-lingual text-to-SQL.
- DB interface for the disabled.
- Integrated into a larger scope of research:
- QA system for DB.
- Dialogue system with knowledge from DB.
- Explore the inter-relation between SQL and other logical forms.
- Generalized semantic parsing.
- Others:
- Apply prompt learning to text-to-SQL: how to make the system robust.
- Evaluation of the existing text-to-SQL systems.
The domain of the Restaurants dataset is restaurant information, where questions typically concern food type, restaurant location, etc.
Databases differ widely in how many tables they contain: the Restaurants database has 3 tables, while the ATIS database has 32 (Suhr et al., 2020).
Geography, one of the 3 tables in the Restaurants database. * denotes the primary key of this table. Only 3 rows are included for demonstration purposes.

CITY_NAME* | COUNTY | REGION |
---|---|---|
VARCHAR(255) | VARCHAR(255) | VARCHAR(255)
Alameda | Alameda County | Bay Area
Alamo | Contra Costa County | Bay Area
Albany | Alameda County | Bay Area
... | ... | ...
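For concreteness, the table above can be loaded into SQLite and queried (a sketch: the table name `geography` and the lowercase column names follow the caption, not the official dataset schema).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE geography (
        city_name VARCHAR(255) PRIMARY KEY,
        county    VARCHAR(255),
        region    VARCHAR(255)
    )
""")
con.executemany("INSERT INTO geography VALUES (?, ?, ?)", [
    ("Alameda", "Alameda County", "Bay Area"),
    ("Alamo",   "Contra Costa County", "Bay Area"),
    ("Albany",  "Alameda County", "Bay Area"),
])
# Which cities are in Alameda County?
rows = con.execute(
    "SELECT city_name FROM geography "
    "WHERE county = 'Alameda County' ORDER BY city_name"
).fetchall()
```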
Question:

Will undergrads be okay to take 581 ?

SQL query:

```sql
SELECT DISTINCT T1.ADVISORY_REQUIREMENT, T1.ENFORCED_REQUIREMENT, T1.NAME
FROM COURSE AS T1
WHERE T1.DEPARTMENT = "EECS" AND T1.NUMBER = 581 ;
```

In the Advising dataset, the department "EECS" is domain knowledge: "581" in the utterance refers to the course in the "EECS" department with course number "581".
Question:

Give me some restaurants in alameda ?

SQL query:

```sql
SELECT T1.HOUSE_NUMBER, T2.NAME
FROM LOCATION AS T1, RESTAURANT AS T2
WHERE T1.CITY_NAME = "alameda" AND T2.ID = T1.RESTAURANT_ID ;
```

In the Restaurants dataset, when the user queries "restaurants", the corresponding SQL query by dataset convention returns the columns "HOUSE_NUMBER" and "NAME".
An example of the NL-SQL template pair used by (Iyer et al., 2017) is as follows:

Question template:

```
Get all <ENT1>.<NAME> having <ENT2>.<COL1>.<NAME> as <ENT2>.<COL1>.<TYPE>
```

SQL query template:

```sql
SELECT <ENT1>.<DEF>
FROM JOIN_FROM(<ENT1>, <ENT2>)
WHERE JOIN_WHERE(<ENT1>, <ENT2>)
  AND <ENT2>.<COL1> = <ENT2>.<COL1>.<TYPE> ;
```
Generated question:

```
Get all author having dataset as DATASET_TYPE
```

Generated SQL query:

```sql
SELECT author.authorId
FROM author, writes, paper, paperDataset, dataset
WHERE author.authorId = writes.authorId
  AND writes.paperId = paper.paperId
  AND paper.paperId = paperDataset.paperId
  AND paperDataset.datasetId = dataset.datasetId
  AND dataset.datasetName = DATASET_TYPE ;
```

They populate the slots in the templates with table and column names from the database schema and join the corresponding tables accordingly.
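A minimal sketch of this slot-filling step (the `fill_template` helper, the simplified templates, and the slot values are illustrative, not from the original paper; the real pipeline also expands JOIN_FROM/JOIN_WHERE into join paths over the schema graph):

```python
def fill_template(question_tpl: str, sql_tpl: str, slots: dict):
    """Populate <SLOT> placeholders in an NL/SQL template pair."""
    for slot, value in slots.items():
        question_tpl = question_tpl.replace(slot, value)
        sql_tpl = sql_tpl.replace(slot, value)
    return question_tpl, sql_tpl

# Hypothetical, simplified template pair.
question_tpl = "Get all <ENT1> having <COL1> as <VAL>"
sql_tpl = "SELECT <ENT1>.id FROM <ENT1> WHERE <ENT1>.<COL1> = '<VAL>'"

q, s = fill_template(question_tpl, sql_tpl,
                     {"<ENT1>": "author", "<COL1>": "name",
                      "<VAL>": "AUTHOR_NAME"})
```

Filling every slot in both templates with the same value keeps the generated question and SQL query aligned by construction.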
(Iyer et al., 2017) also use PPDB (Ganitkevitch et al., 2013) to paraphrase the NL questions. An example of a PPDB paraphrase pair is "thrown into jail" and "imprisoned".
This GitHub repo also holds NL-SQL templates for Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, where the corresponding entities in the NL-SQL pairs are masked.
(Yu et al., 2018c) define SQL hardness by the number of SQL components: a query is considered harder when it contains more SQL keywords such as GROUP BY and more nested subqueries. Here are some examples from the original paper:
Easy:

```sql
SELECT COUNT(*)
FROM cars_data
WHERE cylinders > 4 ;
```

Medium:

```sql
SELECT T2.name, COUNT(*)
FROM concert AS T1
JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id
GROUP BY T1.stadium_id ;
```

Hard:

```sql
SELECT T1.country_name
FROM countries AS T1
JOIN continents AS T2 ON T1.continent = T2.cont_id
JOIN car_makers AS T3 ON T1.country_id = T3.country
WHERE T2.continent = 'Europe'
GROUP BY T1.country_name
HAVING COUNT(*) >= 3 ;
```

Extra Hard:

```sql
SELECT AVG(life_expectancy)
FROM country
WHERE name NOT IN (
    SELECT T1.name
    FROM country AS T1
    JOIN country_language AS T2 ON T1.code = T2.country_code
    WHERE T2.language = "English" AND T2.is_official = "T"
) ;
```
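The component-counting idea can be sketched as below. The keyword list and the bucket thresholds are simplified assumptions for illustration, not the official Spider hardness criteria.

```python
# Rough markers of SQL complexity; "(select" stands in for a nested subquery.
COMPONENTS = ("group by", "order by", "having", "join",
              "intersect", "except", "union", "like", "(select")

def hardness(sql: str) -> str:
    """Bucket a SQL query by a rough count of its components."""
    s = " ".join(sql.lower().split())
    n = sum(s.count(kw) for kw in COMPONENTS)
    if n == 0:
        return "easy"
    if n <= 2:
        return "medium"
    if n <= 4:
        return "hard"
    return "extra hard"

print(hardness("SELECT COUNT(*) FROM cars_data WHERE cylinders > 4"))  # easy
```

Under this toy scheme, the Medium example above counts one JOIN plus one GROUP BY, while the Hard example adds a second JOIN and a HAVING clause.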
There is no established measure of how hard the natural language question is. Intuitively, models' performance can decrease when faced with longer questions from users; however, the information conveyed in longer sentences can be more complete, while shorter sentences can be ambiguous. Besides, domain-specific phrases can confuse the model in both short and long utterances (Suhr et al., 2020). Thus, researchers need to consider various perspectives to determine the complexity of natural utterances.