Researchers have created an easy-to-use tool that lets people run complex statistical analyses on tabular data with minimal effort. By combining probabilistic AI models with the SQL programming language, their method delivers faster and more accurate results than other techniques.
A new tool lets database users perform complex statistical analyses of tabular data without needing to understand what is happening behind the scenes.
GenSQL, a generative AI system for databases, offers the capability to make predictions, identify anomalies, estimate missing values, rectify errors, or create synthetic data with just a few keystrokes.
For example, when applied to medical data from a patient with a history of high blood pressure, GenSQL might flag a blood pressure reading as low for that individual even though it would typically fall within the normal range.
GenSQL seamlessly merges a tabular dataset with a generative probabilistic AI model that can adjust decisions based on new data while considering uncertainties.
Furthermore, GenSQL can be utilized to generate and analyze synthetic data that closely resembles actual database information, which is particularly valuable in scenarios where sensitive data cannot be shared, such as in patient health records, or when authentic data is limited.
This innovative tool is built on the foundation of SQL, a programming language introduced in the late 1970s for the creation and management of databases, and commonly used by developers worldwide.
“Historically, SQL introduced the business world to the potential of computers. Instead of writing custom programs, users could simply ask database questions in a high-level language. As we transition from data querying to interrogating models and data, we need a language that guides users in asking meaningful questions to a computer armed with probabilistic data,” says Vikash Mansinghka ’05, MEng ’09, PhD ’09, senior author of the paper introducing GenSQL and a principal research scientist and leader of the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences.
When the researchers compared GenSQL to popular AI-based approaches to data analysis, they found that it was not only faster but also produced more accurate results. Notably, the probabilistic models GenSQL uses are explainable, so users can read and modify them.
“Analyzing data and identifying meaningful patterns based solely on simplistic statistical rules may overlook critical interactions. It’s essential to capture correlations and variable dependencies, which can be intricate, within a model. With GenSQL, our aim is to empower a broad user base to query their data and model without requiring exhaustive details,” adds lead author Mathieu Huot, a research scientist in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project.
The paper includes contributions from Matin Ghavami and Alexander Lew, MIT graduate students; Cameron Freer, a research scientist; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, an MIT professor in the Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad ’15, MEng ’16, PhD ’22, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.
Merging Models and Databases
SQL, which stands for structured query language, is a programming language for storing and manipulating data in a database. In SQL, users query data with keywords, for example to sum, filter, or group database records.
Querying a model, however, can provide deeper insights, because a model can capture what the data imply for a specific individual. For instance, a female developer who wonders about her salary might be more interested in what salary data mean for her specifically than in general trends across the database.
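To make the keyword-based style of querying concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It shows ordinary SQL only, not GenSQL; the table name, column names, and values are hypothetical and chosen purely for illustration.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE developers (city TEXT, language TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO developers VALUES (?, ?, ?)",
    [
        ("Seattle", "Rust", 150),
        ("Seattle", "Python", 130),
        ("Boston", "Rust", 140),
        ("Boston", "Python", 120),
        ("Boston", "Python", 110),
    ],
)

# Filter (WHERE), group (GROUP BY), and sum (SUM) in a single query.
rows = conn.execute(
    """
    SELECT city, SUM(salary) AS total_salary
    FROM developers
    WHERE language = 'Python'
    GROUP BY city
    ORDER BY city
    """
).fetchall()
print(rows)
```

A query like this summarizes what is already stored in the table; it says nothing about records that are missing or about how likely an unobserved value is.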
Researchers observed that SQL lacked an efficient mechanism to include probabilistic AI models, while approaches incorporating probabilistic models for inferences didn’t support complex database queries.
To address this gap, they developed GenSQL, allowing users to interrogate both a dataset and a probabilistic model using a straightforward yet powerful formal programming language.
A GenSQL user uploads their data and probabilistic model, which are seamlessly integrated by the system. Consequently, the user can execute queries on data that also incorporate input from the underlying probabilistic model. This not only enables complex queries but also yields more precise responses.
For instance, a GenSQL query may ask, “How probable is it that a Seattle-based developer is proficient in the programming language Rust?” Solely analyzing correlations between database columns might overlook subtle dependencies, whereas integrating a probabilistic model can capture intricate interactions.
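The question above is a conditional probability. The following Python sketch estimates one from a toy table using smoothed counts. It is a deliberately simple stand-in and not how GenSQL works: GenSQL relies on richer generative models, and the records, the function name, and the smoothing parameter here are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical records of (city, knows_rust); illustrative only.
records = [
    ("Seattle", True), ("Seattle", True), ("Seattle", False),
    ("Boston", False), ("Boston", True), ("Boston", False),
]

def conditional_probability(records, city, alpha=1):
    """Estimate P(knows_rust | city) with add-alpha smoothing.

    Smoothing keeps the estimate defined (and cautious) even for
    cities that never appear in the data.
    """
    counts = Counter(records)
    yes = counts[(city, True)]
    no = counts[(city, False)]
    return Fraction(yes + alpha, yes + no + 2 * alpha)

p_seattle = conditional_probability(records, "Seattle")
p_unseen = conditional_probability(records, "Austin")
print(p_seattle, p_unseen)
```

A count-based estimate like this looks at only two columns at a time, which is exactly the limitation the article describes; a joint probabilistic model can condition on many columns at once.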
GenSQL’s probabilistic models are also auditable, so users can see which data a model uses for decision-making. In addition, the models provide measures of calibrated uncertainty along with every answer.
For example, when a user queries the model for the predicted outcomes of different cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL would report how uncertain it is rather than overconfidently advocating for a treatment.
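One way to see why uncertainty should grow when a group is underrepresented is a simple Beta-Binomial calculation, sketched below in Python. This is an illustrative assumption rather than GenSQL's actual model; the function name, the uniform prior, and the outcome counts are all hypothetical.

```python
import math

def beta_posterior_summary(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return the mean and standard deviation of a Beta posterior.

    Starts from a Beta(prior_a, prior_b) prior over a treatment's
    success rate and updates it with observed outcome counts.
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

# Hypothetical counts: a well-represented group and a small one
# with the same observed success ratio (80 percent).
large_mean, large_std = beta_posterior_summary(400, 100)
small_mean, small_std = beta_posterior_summary(4, 1)

print(f"large group: mean={large_mean:.3f}, std={large_std:.3f}")
print(f"small group: mean={small_mean:.3f}, std={small_std:.3f}")
```

Both groups show the same raw success ratio, but the posterior for the small group is far wider, which is the kind of signal a system should surface instead of a single confident number.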
Enhanced Speed and Precision
To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these methods, executing most queries within milliseconds while providing more accurate results.
The team also applied GenSQL in two case studies: in one, the system identified mislabeled clinical trial data, and in the other it generated accurate synthetic data that captured complex relationships in genomics.
Next, the researchers want to apply GenSQL more broadly to conduct large-scale modeling of human populations. With GenSQL, they can generate synthetic data to draw inferences about things like health and salary while controlling what information is used in the analysis.
The researchers also aim to make GenSQL easier to use and more powerful by adding new optimizations and automation to the system. In the long run, they want to let users pose queries to GenSQL in natural language, with the goal of eventually developing a ChatGPT-like AI expert that can answer any question about a database and ground its answers in GenSQL queries.
This research is funded in part by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Foundation.