Have you ever struggled with extracting useful information from a large database? Maybe you wanted to find out how many customers bought a certain product last month, or what the total revenue was for a specific time period. It can be a daunting task to manually search through all the data and compile the results. Fortunately, with recent advancements in natural language processing (NLP), machines can now understand and respond to human language, making it easier than ever to query databases using natural language commands. This is where ChatGPT comes in. In this post, we will build a proof-of-concept application that converts a natural language query into a SQL query using OpenAI's GPT model.
What is Natural Language Processing (NLP)?
Natural Language Processing, or NLP, is a branch of artificial intelligence that focuses on enabling machines to understand and interact with human language. In simpler terms, NLP is the ability of machines to read, understand, and generate human language. Through a combination of algorithms, machine learning, and linguistics, NLP allows machines to process and analyze vast amounts of natural language data, such as text and speech, and convert it into structured data that can be used for analysis and decision-making. For example, a machine using NLP might analyze a text message and identify the sentiment behind it, such as whether the message is positive, negative, or neutral. Or it might identify key topics or entities mentioned in the message, such as people, places, or products.
How Does NLP Work?
NLP uses a combination of algorithms, statistical models, and machine learning to analyze and understand human language. Below are the basic steps involved in the NLP process:
- Tokenization: The first step in NLP is to tokenize the data, breaking the text or speech into individual units, or tokens, such as words, phrases, or sentences (see the short sketch after this list).
- Parsing: This process involves analyzing the grammatical structure of the text to identify the relationships between the tokens. This helps the machine understand the meaning of the text.
- Named entity recognition: NER is the process of identifying and classifying named entities in text, such as people, places, and organizations. This helps the machine understand the context of the text and the relationships between different entities.
- Sentiment analysis: Sentiment analysis involves determining the overall sentiment or emotional tone of a piece of text, such as whether it is positive, negative, or neutral. Many social media companies leverage this for monitoring, customer feedback analysis, and other applications.
- Machine learning: NLP algorithms are trained using machine learning techniques to improve their accuracy and performance over time. By analyzing large amounts of human language data, the machine can learn to recognize patterns and make predictions about new text it encounters.
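To make the first couple of steps concrete, here is a minimal, purely illustrative sketch of tokenization and a naive keyword-based sentiment check in plain Python. The word lists are invented for the example; production NLP systems rely on trained models and dedicated libraries rather than hand-written rules.
import re
#Toy tokenizer: split on runs of letters, digits, and apostrophes
def tokenize(text):
    return re.findall(r"[A-Za-z0-9']+", text.lower())
#Toy sentiment word lists, invented purely for illustration
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "hate", "awful"}
#Count positive and negative tokens and compare
def naive_sentiment(text):
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
print(tokenize("Love how fast and helpful the support team was."))
print(naive_sentiment("Love how fast and helpful the support team was."))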
What is ChatGPT?
ChatGPT is a conversational AI model from OpenAI, built on its GPT family of large language models, that can understand natural language requests and generate text, including code such as SQL. For example, a plain-English question like "How many orders did we place in March 2022?" can be translated into a query along these lines:
SELECT COUNT(*) FROM orders WHERE order_date >= '2022-03-01' AND order_date < '2022-04-01';
This SQL query would retrieve the rows (orders) whose order date falls within March 2022 and return the count of those rows. Executives who want these results have traditionally relied on skilled database administrators to craft the desired query. These DBAs then need to validate that the data meets the needs and requirements that were requested. This is a time-consuming process, as real requests can be far more complex than the example above.
Benefits of Leveraging ChatGPT
Using ChatGPT to extract insights from databases can provide numerous benefits to businesses. Here are some of the key advantages:
- Faster decision-making: By using ChatGPT to quickly and easily retrieve data from databases, businesses can make more informed decisions in less time. This improved velocity is especially valuable in fast-paced industries where decisions need to be made quickly.
- Increased efficiency: ChatGPT’s ability to extract data from databases means that employees can spend less time manually searching for and compiling data, and more time analyzing and acting on the insights generated from that data. This can lead to increased productivity and efficiency.
- Better insights: ChatGPT helps businesses uncover insights that may have been overlooked or difficult to find using traditional data analysis methods. By turning natural language questions into queries, it lets users explore data in new ways and surface patterns that might otherwise have stayed hidden.
- Improved collaboration: Because ChatGPT can be used by anyone in the organization, regardless of their technical expertise, it can help foster collaboration and communication across departments. This can help break down silos and promote a culture of data-driven decision-making throughout the organization.
- Easy-to-understand data: ChatGPT can help executives easily access and understand data in a way that is intuitive and natural. This enables the use of plain language to ask questions or give commands, and ChatGPT will generate SQL queries that extract the relevant data from the database. This means that executives can quickly access the information they need without having to rely on technical jargon or complex reports.
Building an NLP Query to SQL Query GPT Application
Before we get started, it is important to note that this is simply a proof-of-concept application. We will build a simple application that converts a natural language query into a SQL query to extract sales data from a SQL database. Because it is only a proof of concept, we will use an in-memory SQL database; in production, you would connect directly to the enterprise database.
This project can be found on my GitHub.
The first step for developing this application is to ensure you have an API key from OpenAI.
Obtaining an API Key from OpenAI
To get a developer API key from OpenAI, you need to sign up for an API account on the OpenAI website. Here’s a step-by-step guide to help you with that process:
- Visit the OpenAI website
- Click on the “Sign up” button in the top-right corner of the page to create an account. If you already have an account, click on “Log in” instead.
- Once you’ve signed up or logged in, visit the OpenAI API portal
- Fill in the required details and sign up for the API. If you’re already logged in, the signup process might be quicker.
- After signing up, you’ll get access to the OpenAI API dashboard. You may need to wait for an email confirmation or approval before you can use the API.
- Once you have access to the API dashboard, navigate to the “API Keys” tab
- Click on “Create new API key” to generate a new API key. You can also see any existing keys you have on this page.
IMPORTANT: Make sure you keep your API key secure, as it is a sensitive piece of information that can be used to access your account and make requests on your behalf. Don’t share it publicly or include it in your code directly. Store it in a separate file or use environment variables to keep it secure.
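As an example of the environment-variable approach, here is a minimal sketch. The variable name OPENAI_API_KEY is a common convention and an assumption here, not something this project requires; use whatever name you set in your shell or notebook environment.
#Alternative: load the API key from an environment variable (assumes you have set OPENAI_API_KEY)
import os
import openai
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this cell.")
openai.api_key = api_key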
Step 1: Development Environment
This project was created using Jupyter Notebook. You can install Jupyter locally as a standalone program on your device; to learn how to install Jupyter, visit their website here. Jupyter also comes bundled with Anaconda, and you can use the notebook there; to learn more about Anaconda, visit their documentation here. Lastly, you can use Google Colab to develop. Google Colab, short for Google Colaboratory, is a free, cloud-based Jupyter Notebook environment provided by Google. It allows users to write, execute, and share code in Python and other supported languages, all within a web browser. You can start using Google Colab by visiting here.
Note: You must have a Google account to use this service.
Step 2: Importing Your Libraries
For this project, the following Python libraries were used:
- OpenAI (see the documentation here)
- OS (see the documentation here)
- Pandas (see the documentation here)
- SQLAlchemy (see the documentation here)
#Import Libraries
import openai
import os
import pandas as pd
import sqlalchemy
#Import these libraries to set up a temporary in-memory database and push the pandas DataFrame into it
from sqlalchemy import create_engine
from sqlalchemy import text
Step 3: Connecting Your API Key to OpenAI
For this project, I have created a text file to pass my API key so I do not have to hard-code the key in my notebook. We could have set it as an environment variable instead, but then we would need to re-associate the key each time we begin a new session, which is not ideal. Note that the text file must be in the same directory as the notebook for this method to work.
#Pass api.txt file
with open('api.txt', 'r') as f:
    openai.api_key = f.read().strip()
Step 4: Evaluate the Data
Next, we will use the pandas library to evaluate the data. We start by creating a dataframe from the dataset and reviewing the first five rows.
#Read in data
df = pd.read_csv("sales_data_sample.csv")
#Review data
df.head()
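A couple of additional pandas calls can confirm the column names, data types, and basic statistics before we start writing queries against them. This is an optional sanity check, not part of the core flow.
#Optional: inspect column names, data types, and summary statistics
df.info()
print(df.columns.tolist())
print(df.describe())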
Step 5: Create the In-Memory SQLite Database
This code snippet creates a SQLAlchemy engine that connects to an in-memory SQLite database. Here’s a breakdown of each part:
- create_engine: This is a function from SQLAlchemy that creates an engine object, which establishes a connection to a specific database.
- 'sqlite:///:memory:': This is a connection string that specifies the database type (SQLite) and its location. The special name :memory: after the three forward slashes tells SQLite to create the database in memory rather than in a file.
- echo=True: This is an optional argument that, when set to True, enables logging of the generated SQL statements to the console. It can be helpful for debugging purposes.
#Create temp DB
temp_db = create_engine('sqlite:///:memory:', echo=True)
Step 6: Pushing the Dataframe to the Database Created Above
In this step, we will use the to_sql method from the pandas library to push the contents of a DataFrame (df) to a new SQL table in the connected database.
#Push the DF to be in SQL DB
data = df.to_sql(name="Sales", con=temp_db)  #Name the table "Sales" so it matches the queries used below
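To confirm that the table actually landed in the in-memory database, you can optionally ask SQLAlchemy's inspector for the table names:
#Optional: confirm the table exists in the in-memory database
from sqlalchemy import inspect
print(inspect(temp_db).get_table_names())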
Step 7: Connecting to the Database
This code snippet connects to the database using the SQLAlchemy engine (temp_db) and executes a SQL query to get the sum of the SALES column from the Sales table. We will also review the output. Here's a breakdown of the code:
- with temp_db.connect() as conn:: This creates a context manager that connects to the database using the temp_db engine. It assigns the connection to the variable conn. The connection is automatically closed when the with block ends.
- results = conn.execute(text("SELECT SUM(SALES) FROM Sales")): This line executes a SQL query using the conn.execute() method. The text() function wraps the raw SQL query string, "SELECT SUM(SALES) FROM Sales", which calculates the sum of the SALES column from the Sales table. The result of the query is stored in the results variable.
#Connect to SQL DB
with temp_db.connect() as conn:
    results = conn.execute(text("SELECT SUM(SALES) FROM Sales"))
    #Return results while the connection is still open
    print(results.all())
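As an optional alternative, the same query can be pulled straight into a pandas DataFrame. This sketch assumes a pandas version that accepts a SQLAlchemy engine as the con argument:
#Optional: run the same query through pandas and get a DataFrame back
sales_total = pd.read_sql("SELECT SUM(SALES) AS total_sales FROM Sales", con=temp_db)
print(sales_total)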
Step 8: Create the Handler Functions for GPT-3 to Understand the Table Structure
This code snippet defines a Python function called create_table_definition that takes a pandas DataFrame (df) as input and returns a string containing a formatted comment describing an SQLite SQL table named Sales and its columns.
The comment follows the table-definition style used in OpenAI's SQL prompt examples, which looks like this for a multi-table schema:
# SQLite SQL tables with their properties:
# -----------------------------------------
# Employee (ID, Name, Department_ID)
# Department (ID, Name, Address)
# Salary_Payments (ID, Employee_ID, Amount, Date)
# -----------------------------------------
#Create a function for table definitions
def create_table_definition(df):
    prompt = """### sqlite SQL table, with its properties:
#
# Sales({})
#
""".format(",".join(str(col) for col in df.columns))
    return prompt
To review the output:
#Review results
print(create_table_definition(df))
Step 9: Create the Prompt Function for NLP
#Prompt Function
def prompt_input():
    nlp_text = input("Enter desired information: ")
    return nlp_text

#Validate the function
prompt_input()
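When this cell runs, Jupyter displays an input box, and whatever you type becomes the natural language request. A hypothetical interaction might look like this (the request text is invented for illustration):
#Hypothetical example of the input prompt shown by the cell above:
#Enter desired information: total sales for each product line in 2003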
Step 10: Combining the Functions
This code defines a Python function called combined that takes a pandas DataFrame (df) and a string (query_prompt) as input and returns a combined string containing the formatted table-definition comment and a query prompt.
#Combine these functions into a single function
def combined(df, query_prompt):
    definition = create_table_definition(df)
    query_init_string = f"###A query to answer: {query_prompt}\nSELECT"
    return definition + query_init_string
Here, we grab the NLP input and insert the table definitions:
#Grabbing natural language
nlp_text = prompt_input()
#Inserting table definition (DF + query that does... + NLP)
prompt = combined(df, nlp_text)
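Printing the combined prompt is a quick, optional way to see exactly what will be sent to the model before spending an API call:
#Optional: review the full prompt that will be sent to the model
print(prompt)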
Step 11: Generating the Response from the GPT-3 Language Model
This code snippet uses the openai.Completion.create() method from the OpenAI API to generate a response from the GPT-3 language model. The specific model used here is 'text-davinci-002'. The prompt for the model is generated using the combined(df, nlp_text) function, which combines a comment describing the SQLite SQL table (based on the DataFrame df) with a comment describing the SQL query to be written. Here's a breakdown of the method parameters:
- model='text-davinci-002': Specifies the GPT-3 model to be used for generating the response.
- prompt=combined(df, nlp_text): The prompt for the model, generated by calling the combined() function with the DataFrame df and the string nlp_text as inputs.
- temperature=0: Controls the randomness of the model's output. A value of 0 makes the output deterministic, selecting the most likely token at each step.
- max_tokens=150: Limits the maximum number of tokens (words or word pieces) in the generated response to 150.
- top_p=1.0: Controls nucleus sampling, which restricts sampling to the smallest set of tokens whose cumulative probability exceeds the specified value. A value of 1.0 keeps all tokens in consideration, i.e. no nucleus truncation.
- frequency_penalty=0: Controls the penalty applied to tokens based on how often they have already appeared in the text. A value of 0 means no penalty is applied.
- presence_penalty=0: Controls the penalty applied to tokens that have already appeared in the text at all. A value of 0 means no penalty is applied.
- stop=["#", ";"]: Specifies a list of tokens that, if encountered by the model, will cause the generation to stop. In this case, generation stops when it encounters a "#" or ";".

The openai.Completion.create() method returns a response object, which is stored in the response variable. The generated text can be extracted from this object using response.choices[0].text.
#Generate GPT Response
response = openai.Completion.create(
    model='text-davinci-002',
    prompt=combined(df, nlp_text),
    temperature=0,
    max_tokens=150,
    top_p=1.0,
    frequency_penalty=0,
    presence_penalty=0,
    stop=["#", ";"]
)
Step 12: Format the Response
Finally, we write a function to format the response from the GPT application:
#Format response
def handle_response(response):
    query = response['choices'][0]['text']
    if query.startswith(" "):
        query = 'SELECT' + query
    return query
Running the following snippet will return the desired NLP query to SQL query input:
#Get response
handle_response(response)
Your output should now look something like this:
"SELECT * FROM Sales WHERE STATUS = 'Shipped' AND YEAR_ID = 2003 AND QTR_ID = 3\n