Have you ever struggled with extracting useful information from a large database? Maybe you wanted to find out how many customers bought a certain product last month, or what the total revenue was for a specific time period. It can be a daunting task to manually search through all the data and compile the results. Fortunately, with recent advancements in natural language processing (NLP), machines can now understand and respond to human language, making it easier than ever to query databases using natural language commands. This is where ChatGPT comes in. In this post, we will build a proof-of-concept application that converts a natural language query into a SQL query using OpenAI's GPT model.
What is Natural Language Processing (NLP)?
Natural Language Processing, or NLP, is a branch of artificial intelligence that focuses on enabling machines to understand and interact with human language. In simpler terms, NLP is the ability of machines to read, understand, and generate human language. Through a combination of algorithms, machine learning, and linguistics, NLP allows machines to process and analyze vast amounts of natural language data, such as text and speech, and convert it into structured data that can be used for analysis and decision-making. For example, a machine using NLP might analyze a text message and identify the sentiment behind it, such as whether the message is positive, negative, or neutral. Or it might identify key topics or entities mentioned in the message, such as people, places, or products.
How Does NLP Work?
NLP uses a combination of algorithms, statistical models, and machine learning to analyze and understand human language. Below are the basic steps involved in the NLP process:
- Tokenization: The first step in NLP is to tokenize the data, breaking the text or speech into individual units, or tokens, such as words, phrases, or sentences (see the short sketch after this list).
- Parsing: This process involves analyzing the grammatical structure of the text to identify the relationships between the tokens. This helps the machine understand the meaning of the text.
- Named entity recognition: NER is the process of identifying and classifying named entities in text, such as people, places, and organizations. This helps the machine understand the context of the text and the relationships between different entities.
- Sentiment analysis: Sentiment analysis involves determining the overall sentiment or emotional tone of a piece of text, such as whether it is positive, negative, or neutral. Many social media companies leverage this for monitoring, customer feedback analysis, and other applications.
- Machine learning: NLP algorithms are trained using machine learning techniques to improve their accuracy and performance over time. By analyzing large amounts of human language data, the machine can learn to recognize patterns and make predictions about new text it encounters.
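To make the first couple of steps concrete, here is a minimal, purely illustrative sketch of tokenization and a naive keyword-based sentiment check in plain Python. The word lists are invented for the example; production NLP systems rely on trained models and dedicated libraries rather than hand-written rules.
import re
#Toy tokenizer: split on runs of letters, digits, and apostrophes
def tokenize(text):
    return re.findall(r"[A-Za-z0-9']+", text.lower())
#Toy sentiment word lists, invented purely for illustration
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "hate", "awful"}
#Count positive and negative tokens and compare
def naive_sentiment(text):
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
print(tokenize("Love how fast and helpful the support team was."))
print(naive_sentiment("Love how fast and helpful the support team was."))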
What is ChatGPT?
ChatGPT is a conversational AI model from OpenAI, built on its GPT family of large language models, that can understand natural language requests and generate text, including code such as SQL. For example, a plain-English question like "How many orders did we place in March 2022?" can be translated into a query along these lines:
SELECT COUNT(*) FROM orders WHERE order_date >= '2022-03-01' AND order_date < '2022-04-01';
This SQL query would retrieve the rows (orders) whose order date falls within March 2022 and return the count of those rows. Executives who want these results have traditionally relied on skilled database administrators to craft the desired query. These DBAs then need to validate that the data meets the needs and requirements that were requested. This is a time-consuming process, as real requests can be far more complex than the example above.
Benefits of Leveraging ChatGPT
Using ChatGPT to extract insights from databases can provide numerous benefits to businesses. Here are some of the key advantages:
- Faster decision-making: By using ChatGPT to quickly and easily retrieve data from databases, businesses can make more informed decisions in less time. This improved velocity is especially valuable in fast-paced industries where decisions need to be made quickly.
- Increased efficiency: ChatGPT’s ability to extract data from databases means that employees can spend less time manually searching for and compiling data, and more time analyzing and acting on the insights generated from that data. This can lead to increased productivity and efficiency.
- Better insights: ChatGPT helps businesses uncover insights that may have been overlooked or difficult to find using traditional data analysis methods. By turning natural language questions into queries, it lets users explore data in new ways and surface patterns that might otherwise have stayed hidden.
- Improved collaboration: Because ChatGPT can be used by anyone in the organization, regardless of their technical expertise, it can help foster collaboration and communication across departments. This can help break down silos and promote a culture of data-driven decision-making throughout the organization.
- Easy-to-understand data: ChatGPT can help executives easily access and understand data in a way that is intuitive and natural. This enables the use of plain language to ask questions or give commands, and ChatGPT will generate SQL queries that extract the relevant data from the database. This means that executives can quickly access the information they need without having to rely on technical jargon or complex reports.
Building an NLP Query to SQL Query GPT Application
Before we get started, it is important to note that this is simply a proof-of-concept application. We will build a simple application that converts a natural language query into a SQL query to extract sales data from a SQL database. Because it is only a proof of concept, we will use an in-memory SQL database; in production, you would connect directly to the enterprise database.
This project can be found on my GitHub.
The first step for developing this application is to ensure you have an API key from OpenAI.
Obtaining an API Key from OpenAI
To get a developer API key from OpenAI, you need to sign up for an API account on the OpenAI website. Here’s a step-by-step guide to help you with that process:
- Visit the OpenAI website
- Click on the “Sign up” button in the top-right corner of the page to create an account. If you already have an account, click on “Log in” instead.
- Once you’ve signed up or logged in, visit the OpenAI API portal
- Fill in the required details and sign up for the API. If you’re already logged in, the signup process might be quicker.
- After signing up, you’ll get access to the OpenAI API dashboard. You may need to wait for an email confirmation or approval before you can use the API.
- Once you have access to the API dashboard, navigate to the “API Keys” tab
- Click on “Create new API key” to generate a new API key. You can also see any existing keys you have on this page.
IMPORTANT: Make sure you keep your API key secure, as it is a sensitive piece of information that can be used to access your account and make requests on your behalf. Don’t share it publicly or include it in your code directly. Store it in a separate file or use environment variables to keep it secure.
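As an example of the environment-variable approach, here is a minimal sketch. The variable name OPENAI_API_KEY is a common convention and an assumption here, not something this project requires; use whatever name you set in your shell or notebook environment.
#Alternative: load the API key from an environment variable (assumes you have set OPENAI_API_KEY)
import os
import openai
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this cell.")
openai.api_key = api_key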
Step 1: Development Environment
This project was created using Jupyter Notebook. You can install Jupyter locally as a standalone program on your device; to learn how to install Jupyter, visit their website here. Jupyter also comes bundled with Anaconda, and you can use the notebook there; to learn more about Anaconda, visit their documentation here. Lastly, you can use Google Colab to develop. Google Colab, short for Google Colaboratory, is a free, cloud-based Jupyter Notebook environment provided by Google. It allows users to write, execute, and share code in Python and other supported languages, all within a web browser. You can start using Google Colab by visiting here.
Note: You must have a Google account to use this service.
Step 2: Importing Your Libraries
For this project, the following Python libraries were used:
- OpenAI (see the documentation here)
- OS (see the documentation here)
- Pandas (see the documentation here)
- SQLAlchemy (see the documentation here)
#Import Libraries
import openai
import os
import pandas as pd
import sqlalchemy
#Import these libraries to set up a temporary in-memory database and push the pandas DataFrame into it
from sqlalchemy import create_engine
from sqlalchemy import text
Step 3: Connecting Your API Key to OpenAI
For this project, I have created a text file to pass my API key so I do not have to hard-code the key in my notebook. We could have set it as an environment variable instead, but then we would need to re-associate the key each time we begin a new session, which is not ideal. Note that the text file must be in the same directory as the notebook for this method to work.
#Pass api.txt file
with open('api.txt', 'r') as f:
    openai.api_key = f.read().strip()
Step 4: Evaluate the Data
Next, we will use the pandas library to evaluate the data. We start by creating a dataframe from the dataset and reviewing the first five rows.
#Read in data
df = pd.read_csv("sales_data_sample.csv")
#Review data
df.head()
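A couple of additional pandas calls can confirm the column names, data types, and basic statistics before we start writing queries against them. This is an optional sanity check, not part of the core flow.
#Optional: inspect column names, data types, and summary statistics
df.info()
print(df.columns.tolist())
print(df.describe())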
Step 5: Create the In-Memory SQLite Database
This code snippet creates a SQLAlchemy engine that connects to an in-memory SQLite database. Here’s a breakdown of each part:
- create_engine: This is a function from SQLAlchemy that creates an engine object, which establishes a connection to a specific database.
- 'sqlite:///:memory:': This is a connection string that specifies the database type (SQLite) and its location. The special name :memory: after the three forward slashes tells SQLite to create the database in memory rather than in a file.
- echo=True: This is an optional argument that, when set to True, enables logging of the generated SQL statements to the console. It can be helpful for debugging purposes.
#Create temp DB
temp_db = create_engine('sqlite:///:memory:', echo=True)
Step 6: Pushing the Dataframe to the Database Created Above
In this step, we will use the to_sql method from the pandas library to push the contents of a DataFrame (df) to a new SQL table in the connected database.
#Push the DF to be in SQL DB
data = df.to_sql(name="Sales", con=temp_db)  #Name the table "Sales" so it matches the queries used below
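To confirm that the table actually landed in the in-memory database, you can optionally ask SQLAlchemy's inspector for the table names:
#Optional: confirm the table exists in the in-memory database
from sqlalchemy import inspect
print(inspect(temp_db).get_table_names())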
Step 7: Connecting to the Database
This code snippet connects to the database using the SQLAlchemy engine (temp_db) and executes a SQL query to get the sum of the SALES column from the Sales table. We will also review the output. Here's a breakdown of the code:
- with temp_db.connect() as conn:: This creates a context manager that connects to the database using the temp_db engine. It assigns the connection to the variable conn. The connection is automatically closed when the with block ends.
- results = conn.execute(text("SELECT SUM(SALES) FROM Sales")): This line executes a SQL query using the conn.execute() method. The text() function wraps the raw SQL query string, "SELECT SUM(SALES) FROM Sales", which calculates the sum of the SALES column from the Sales table. The result of the query is stored in the results variable.
#Connect to SQL DB
with temp_db.connect() as conn:
    results = conn.execute(text("SELECT SUM(SALES) FROM Sales"))
    #Return results while the connection is still open
    print(results.all())
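As an optional alternative, the same query can be pulled straight into a pandas DataFrame. This sketch assumes a pandas version that accepts a SQLAlchemy engine as the con argument:
#Optional: run the same query through pandas and get a DataFrame back
sales_total = pd.read_sql("SELECT SUM(SALES) AS total_sales FROM Sales", con=temp_db)
print(sales_total)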
Step 8: Create the Handler Functions for GPT-3 to Understand the Table Structure
This code snippet defines a Python function called create_table_definition that takes a pandas DataFrame (df) as input and returns a string containing a formatted comment describing an SQLite SQL table named Sales and its columns.
The comment follows the table-definition style used in OpenAI's SQL prompt examples, which looks like this for a multi-table schema:
# SQLite SQL tables with their properties:
# -----------------------------------------
# Employee (ID, Name, Department_ID)
# Department (ID, Name, Address)
# Salary_Payments (ID, Employee_ID, Amount, Date)
# -----------------------------------------
#Create a function for table definitions
def create_table_definition(df):
    prompt = """### sqlite SQL table, with its properties:
#
# Sales({})
#
""".format(",".join(str(col) for col in df.columns))
    return prompt
To review the output:
#Review results
print(create_table_definition(df))
Step 9: Create the Prompt Function for NLP
#Prompt Function
def prompt_input():
    nlp_text = input("Enter desired information: ")
    return nlp_text

#Validate the function
prompt_input()
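When this cell runs, Jupyter displays an input box, and whatever you type becomes the natural language request. A hypothetical interaction might look like this (the request text is invented for illustration):
#Hypothetical example of the input prompt shown by the cell above:
#Enter desired information: total sales for each product line in 2003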
Step 10: Combining the Functions
This code defines a Python function called combined that takes a pandas DataFrame (df) and a string (query_prompt) as input and returns a combined string containing the formatted table-definition comment and a query prompt.
#Combine these functions into a single function
def combined(df, query_prompt):
    definition = create_table_definition(df)
    query_init_string = f"###A query to answer: {query_prompt}\nSELECT"
    return definition + query_init_string
Here, we grab the NLP input and insert the table definitions:
#Grabbing natural language
nlp_text = prompt_input()
#Inserting table definition (DF + query that does... + NLP)
prompt = combined(df, nlp_text)
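Printing the combined prompt is a quick, optional way to see exactly what will be sent to the model before spending an API call:
#Optional: review the full prompt that will be sent to the model
print(prompt)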
Step 11: Generating the Response from the GPT-3 Language Model
This code snippet uses the openai.Completion.create() method from the OpenAI API to generate a response from the GPT-3 language model. The specific model used here is 'text-davinci-002'. The prompt for the model is generated using the combined(df, nlp_text) function, which combines a comment describing the SQLite SQL table (based on the DataFrame df) with a comment describing the SQL query to be written. Here's a breakdown of the method parameters:
- model='text-davinci-002': Specifies the GPT-3 model to be used for generating the response.
- prompt=combined(df, nlp_text): The prompt for the model, generated by calling the combined() function with the DataFrame df and the string nlp_text as inputs.
- temperature=0: Controls the randomness of the model's output. A value of 0 makes the output deterministic, selecting the most likely token at each step.
- max_tokens=150: Limits the maximum number of tokens (words or word pieces) in the generated response to 150.
- top_p=1.0: Controls nucleus sampling, which restricts sampling to the smallest set of tokens whose cumulative probability exceeds the specified value. A value of 1.0 keeps all tokens in consideration, i.e. no nucleus truncation.
- frequency_penalty=0: Controls the penalty applied to tokens based on how often they have already appeared in the text. A value of 0 means no penalty is applied.
- presence_penalty=0: Controls the penalty applied to tokens that have already appeared in the text at all. A value of 0 means no penalty is applied.
- stop=["#", ";"]: Specifies a list of tokens that, if encountered by the model, will cause the generation to stop. In this case, generation stops when it encounters a "#" or ";".

The openai.Completion.create() method returns a response object, which is stored in the response variable. The generated text can be extracted from this object using response.choices[0].text.
#Generate GPT Response
response = openai.Completion.create(
    model='text-davinci-002',
    prompt=combined(df, nlp_text),
    temperature=0,
    max_tokens=150,
    top_p=1.0,
    frequency_penalty=0,
    presence_penalty=0,
    stop=["#", ";"]
)
Step 12: Format the Response
Finally, we write a function to format the response from the GPT application:
#Format response
def handle_response(response):
    query = response['choices'][0]['text']
    if query.startswith(" "):
        query = 'SELECT' + query
    return query
Running the following snippet will return the desired NLP query to SQL query input:
#Get response
handle_response(response)
Your output should now look something like this:
"SELECT * FROM Sales WHERE STATUS = 'Shipped' AND YEAR_ID = 2003 AND QTR_ID = 3\n