In this article I'll show you how to create and train a custom GPT specifically for your company using Python.
You might be paying for an enterprise subscription to a public GPT via OpenAI, and it's working well for you. In fact, it's utterly amazing! But what if I told you it could be even better? It could know your business, customers, employees, products, and more at a level that cannot be achieved by a public model.
And that, my friend, is the secret sauce of this technology: providing very specific information on how your business operates, what your standards (customer service, code, reputational, etc.) are, what your current risk appetite and security posture are, and what your core company values are!
Public models can't get down to that level...but you can!!!
Would that interest you?
Of course it would. What if I told you that you can do it for pretty much free with one Python developer and a data analyst in a few hours? The amount of time it takes is governed by how long it takes to collect your information (data, code, knowledge bases, standards, and guideline docs) into a common location to train the model.
In fact, the actual code only takes a few minutes to tweak and build the model: just copy the code in this article, point it at your documents directory, and start training!
We are going to create a simple Python CLI program that can train and run a new custom GPT model. This will let team members, even non-developers, train the GPT and chat with it.
The program is simple but powerful.
Let's start out by training a new small GPT on your laptop. You'll only need to move to a server with more horsepower when the number of artifacts grows to tens of gigs and up.
Let's create a new directory called gpt with two subdirectories named model and knowledge:
c:\2298-software\
c:\2298-software\gpt\model\
c:\2298-software\gpt\knowledge\
Now we need to think about the type of model we want to train so that we can determine the data we need to feed into it. I suggest that you create a code-based model that can generate code according to your company's standards, styles, etc. So let's copy all of our Python files into the knowledge directory. You could clone all the repos for your company and then simply run a find command on your workstation to find and copy all ".py" files into the knowledge directory.
You'll also want to copy supplemental documents like README.md (Markdown), HTML, and any other files that contain documentation and/or explanation of the code. Knowledge bases, Jira tickets, etc.: the more, the merrier.
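If you would rather do the collection step in Python than with a find command, it could look something like the sketch below. The function name collect_knowledge, the default list of suffixes, and the choice to flatten each relative path into the file name (so two files called main.py from different repos do not overwrite each other) are all assumptions made for illustration, not part of the original program.

```python
import shutil
from pathlib import Path

# Hypothetical defaults: adjust the suffixes to the artifacts you want indexed.
DEFAULT_SUFFIXES = (".py", ".md", ".html")


def collect_knowledge(repos_root, knowledge_dir, suffixes=DEFAULT_SUFFIXES):
    """Copy every file under repos_root with a matching suffix into knowledge_dir.

    Returns the list of destination paths that were written.
    """
    repos_root = Path(repos_root).resolve()
    knowledge_dir = Path(knowledge_dir).resolve()
    knowledge_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # Materialize the listing first so files copied during the loop are not revisited.
    for source in sorted(repos_root.rglob("*")):
        if not source.is_file() or source.suffix.lower() not in suffixes:
            continue
        # Skip files that already live in the knowledge directory.
        if knowledge_dir in source.parents:
            continue
        # Flatten "repo/src/main.py" into "repo__src__main.py" to avoid name clashes.
        relative = source.relative_to(repos_root)
        target = knowledge_dir / "__".join(relative.parts)
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

Calling collect_knowledge(r"c:\repos", r"c:\2298-software\gpt\knowledge") would then fill the knowledge directory in one pass, where c:\repos stands in for wherever you cloned your repositories.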
Now that we have loaded up our knowledge directory with knowledge, let's write the program.
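Let's start coding up the program. It consists of a driver, a class, and a JSON configuration file. The driver shown later imports the class with from custom_gpt import CustomGPT, so the class below needs to live in a file named custom_gpt.py; gpt.py can then serve as the driver.
The program provides a simple CLI menu where the user can choose to train a new GPT or have a conversation with an existing GPT.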
Import the required libraries and open the class definition:
import json
import logging
import os
from typing import List

# Note: this top-level import style comes from older llama_index releases
# (before 0.10); newer releases reorganized the package.
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, \
    ServiceContext, Document, StorageContext, load_index_from_storage


class CustomGPT(object):
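Here we create our instance-level variables and set some default values:
    def __init__(self):
        # basicConfig() defaults to the WARNING level, so request INFO explicitly;
        # otherwise none of the info messages in this class would be shown.
        logging.basicConfig(level=logging.INFO)
        self.log = logging.getLogger(self.__class__.__name__)
        self.log.info('Program is starting')
        self.model_path = None
        self.data_path = None
        self.training_documents = None
        self.model = None
        self.company_name = None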
Here we load the documents from the data path (the knowledge directory) and store them in the training_documents instance variable:
    def load_training_data(self):
        self.training_documents: List[Document] = SimpleDirectoryReader(self.data_path).load_data()
        self.log.info('training data has been loaded')
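This function calls load_training_data to load the documents from your knowledge directory, builds the index, and then calls save_model. Note that "training" here means building a searchable vector index over your documents; the weights of the underlying language model are not changed.
    def create_model(self):
        self.load_training_data()
        # The default service context uses OpenAI models for embeddings and answers,
        # so the OPENAI_API_KEY environment variable must be set.
        service_context = ServiceContext.from_defaults()
        self.model = GPTVectorStoreIndex.from_documents(documents=self.training_documents,
                                                        service_context=service_context,
                                                        show_progress=True)
        self.save_model()
        self.log.info('model has been created')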
This function saves the model to the "model" directory:
    def save_model(self):
        self.model.storage_context.persist(persist_dir=self.model_path)
        self.log.info('model has been saved')
This function attempts to load the model that has been stored in the "model" directory:
    def load_model(self):
        try:
            storage_context = StorageContext.from_defaults(persist_dir=self.model_path)
            self.model = load_index_from_storage(storage_context)
            self.log.info('model has been loaded')
        except FileNotFoundError:
            # No persisted model was found at model_path.
            self.log.error(
                'You are attempting to query a model but a model does not exist in '
                f'the path you provided: {self.model_path}')
This function loads the model as a query engine and then handles the prompt submission and response:
    def query_model(self):
        query_engine = self.model.as_query_engine()
        # Keep answering prompts until the program is interrupted.
        while True:
            prompt = input("Please provide a prompt/query/question for the GPT: ")
            response = query_engine.query(prompt)
            print(f'{response}\n')
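As written, the chat loop never returns to the menu. One possible variation is sketched below as a standalone function so it can be tried without a real model. The name chat_loop, the "back" keyword, and the read_input/write_output parameters are hypothetical; the only thing it assumes about the query engine is that it has a query(prompt) method, as used in query_model above.

```python
def chat_loop(query_engine, read_input=input, write_output=print):
    """Answer prompts until the user enters a blank line or 'back'."""
    while True:
        prompt = read_input(
            "Please provide a prompt/query/question for the GPT "
            "(blank line or 'back' to return to the menu): "
        ).strip()
        # Hypothetical convention: an empty prompt or 'back' ends the chat.
        if prompt == "" or prompt.lower() == "back":
            return
        response = query_engine.query(prompt)
        write_output(f"{response}\n")
```

Inside the class, query_model could then be reduced to a call like chat_loop(self.model.as_query_engine()).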
This is the entry point for the object. Once you initialize the CustomGPT class (see the driver), you call execute() to start the program. The listing calls a graceful_exit helper that is not shown anywhere, so a minimal version is assumed here: it prints a usage hint and stops the program.
    def graceful_exit(self):
        # Assumed helper: explain what is expected and stop with a non-zero exit code.
        print('Please provide the path to an existing JSON configuration file '
              'as the only argument.')
        raise SystemExit(1)

    def execute(self, args):
        # The only required parameter is a path to the config file.
        if len(args) != 2:
            self.graceful_exit()
        conf_file_path = args[1]
        self.log.info(f'Conf path is {conf_file_path}')
        if not os.path.exists(conf_file_path):
            self.graceful_exit()
        # Read the json file and set instance variables
        with open(conf_file_path) as f:
            conf = json.load(f)
        # Company name will be used in the CLI menu.
        self.company_name = conf['company_name']
        # This is where we will store and load models
        self.model_path = conf['model_path']
        # This is where our data artifacts will be stored to support training the GPT
        self.data_path = conf['data_path']
        # Infinite loop providing a cli interface to the end user. User can exit by typing exit in the menu.
        while True:
            mode = input(f"Welcome to {self.company_name}'s GPT library!\n"
                         "\n1: Train New or Refresh Existing GPT"
                         f"\n2: Chat with the {self.company_name} GPT"
                         "\nExit: Exit Program"
                         "\n\nPlease choose an option: ")
            # Take action based on the user selection.
            if mode == '1':
                # Train New or Refresh Existing GPT
                self.create_model()
            elif mode == '2':
                # Chat with the company GPT, but only if a model is available.
                self.load_model()
                if self.model is not None:
                    self.query_model()
            elif mode.strip().lower() == 'exit':
                # Exit Program
                print('Goodbye!')
                raise SystemExit(0)
            else:
                print('Please choose an option from the list.')
            print('\n\n')
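The configuration file holds the company name and the paths where the model and the training data are stored. Relative paths are resolved against the directory you run the program from.
{
  "company_name": "2298 Software",
  "model_path": "model",
  "data_path": "knowledge"
}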
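A typo in the configuration file would otherwise only surface as a bare KeyError in the middle of execute(). The sketch below shows one way to check the file up front. The helper name load_config and the REQUIRED_KEYS tuple are assumptions for illustration; the three key names themselves come from the configuration file above.

```python
import json
from pathlib import Path

# The keys that execute() reads from the configuration file.
REQUIRED_KEYS = ("company_name", "model_path", "data_path")


def load_config(conf_file_path):
    """Read the JSON config and fail early with a clear message if something is missing."""
    path = Path(conf_file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Config file not found: {path}")
    with path.open(encoding="utf-8") as f:
        conf = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in conf]
    if missing:
        raise KeyError(f"Config file is missing required keys: {', '.join(missing)}")
    return conf
```

The returned dictionary can be used exactly like the conf variable in execute().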
The driver creates a CustomGPT object and passes the command-line arguments through to it:
import sys

from custom_gpt import CustomGPT

if __name__ == '__main__':
    try:
        # Create a new CustomGPT object
        cgpt = CustomGPT()
        # Pass arguments through to the object
        cgpt.execute(sys.argv)
    except Exception as e:
        print(f'The following exception has occurred: {e}')
Now that we have created the GPT program, we simply need to run it as an end user would, passing the path to the configuration file as its only argument, and train the model.
Welcome to 2298 Software's GPT library!
1: Train New or Refresh Existing GPT
2: Chat with the 2298 Software GPT
Exit: Exit Program
Please choose an option:
If you run it now (Option 1) without any information in the knowledge directory it will create an empty model that will not be able to answer any questions. It will just tell you it doesn't know the answer for the question.
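To avoid building a model from nothing, you could check the knowledge directory before choosing Option 1. The helper below is a small sketch; the name count_knowledge_files is hypothetical, and it only counts files sitting directly in the directory, which is an assumption about how the documents are laid out.

```python
from pathlib import Path


def count_knowledge_files(data_path):
    """Return how many files sit directly inside data_path (0 if it does not exist)."""
    path = Path(data_path)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_file())
```

A guard such as "if count_knowledge_files(self.data_path) == 0" at the top of create_model could then print a reminder instead of starting a training run.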
Now you want to collect any and every document that contains information and put them into the knowledge directory. The best docs are internal training documentation and/or best practices. Grab all the application development standards, security requirements, architecture guidelines, source code, and Markdown files and dump them in there.
After all your data is loaded, input Option 1 and press Enter. You'll see the following:
Please choose an option: 1
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 32.26it/s]
Generating embeddings: 100%|██████████| 2/2 [00:01<00:00, 1.16it/s]
Mine executed quickly because I only had a few documents in my knowledge directory. It can take minutes or hours depending on how much data you have.
Now we can ask the model a question and it will answer based on the data provided. Input option 2 and you'll be asked to provide a prompt/query/question. In this example I ask "What are the best practices around partitioning in pyspark?"
Please choose an option: 2
Please provide a prompt/query/question for the GPT: What are the best practices around partitioning in pyspark?
Partitioning in PySpark is essential for query performance from an end-user and application perspective. It helps to improve the efficiency of data processing and retrieval. When partitioning data in PySpark, it is recommended to consider the following best practices:
1. Choose the right partitioning strategy: Select a partitioning strategy that aligns with your specific use case and query patterns. Common partitioning strategies include range partitioning, hash partitioning, and list partitioning.
2. Partition based on relevant columns: Identify the columns that are frequently used in queries and partition the data based on those columns. For example, if you often query data based on transaction date, consider partitioning the data by the transaction date column.
3. Use multi-level partitions when necessary: If your data is too large to be contained in a single-level partition, consider using multi-level partitions. This involves splitting the data based on multiple columns, such as asofdate, source system, transaction date, and transaction state.
4. Consider the data distribution: Analyze the distribution of data within each partition to ensure a balanced distribution. This can help prevent data skew and improve query performance.
5. Optimize partition size: Aim for an optimal partition size that balances the number of partitions and the size of each partition. This can vary depending on the size of your data and the available resources.
6. Regularly analyze and optimize partitions: Periodically review and optimize your partitioning strategy based on the changing requirements of your data and queries. This may involve repartitioning the data or adjusting the partitioning columns.
Remember, partitioning in PySpark is a trade-off between query performance and storage overhead. It is important to carefully consider your specific use case and workload characteristics when designing and implementing partitioning strategies.
Please provide a prompt/query/question for the GPT:
As you can see our new custom model provided excellent guidance on partitioning in PySpark.
That's it, you have built a custom model. Now just keep asking questions. If the model cannot answer a question or the answer falls short, find more documentation and guidelines, put them in the knowledge directory, use Option 1 to refresh your model, and ask the question again!
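Since the refresh is manual, it is easy to forget after dropping new files into the knowledge directory. One rough way to notice is to compare file modification times, as in the sketch below. The function names newest_mtime and needs_refresh are hypothetical, and the comparison is only a heuristic: it will not notice documents that were deleted.

```python
from pathlib import Path


def newest_mtime(directory):
    """Return the latest modification time of any file under directory, or None."""
    path = Path(directory)
    if not path.is_dir():
        return None
    times = [entry.stat().st_mtime for entry in path.rglob("*") if entry.is_file()]
    return max(times) if times else None


def needs_refresh(data_path, model_path):
    """Guess whether the knowledge directory changed after the model was last saved."""
    data_time = newest_mtime(data_path)
    model_time = newest_mtime(model_path)
    if data_time is None:
        return False  # nothing to index yet
    if model_time is None:
        return True   # documents exist but no model has been saved
    return data_time > model_time
```

A call like needs_refresh("knowledge", "model") before Option 2 could be used to suggest running Option 1 first.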
Simply migrate the code to a Lambda function, wrap an API around it, and throw up a quick web UI, and you have a pretty nice ChatGPT-style chatbot.
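To give a feel for what the Lambda step might involve, here is a minimal sketch of a handler. Everything in it is an assumption rather than part of the original program: the make_handler and _response names, the "prompt" and "answer" field names, and an event shaped like an API Gateway proxy request with a JSON string in "body". The query engine is passed in, so in a real deployment it would be built from the persisted model in the same way load_model and query_model do.

```python
import json


def _response(status_code, payload):
    """Build a proxy-style HTTP response with a JSON body."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def make_handler(query_engine):
    """Wrap any object exposing query(prompt) in a Lambda-style handler."""

    def handler(event, context):
        # The request body is expected to be a JSON object such as {"prompt": "..."}.
        try:
            body = json.loads(event.get("body") or "{}")
        except (TypeError, json.JSONDecodeError):
            return _response(400, {"error": "Request body must be valid JSON."})
        prompt = body.get("prompt", "") if isinstance(body, dict) else ""
        if not isinstance(prompt, str) or not prompt.strip():
            return _response(400, {"error": "Provide a non-empty 'prompt' field."})
        answer = query_engine.query(prompt.strip())
        return _response(200, {"answer": str(answer)})

    return handler
```

Building the query engine once, outside the handler, keeps the model from being reloaded on every request.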
Thanks for reading this article!