[TIL] Automating Dataset Generation for Sentence Learning NLP Model
06/12/23
![[TIL] Automating Dataset Generation for Sentence Learning NLP Model](/_next/image?url=https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1688369807963%2Fd57e7d4e-c325-455a-98bb-a4963726b1be.png&w=3840&q=75)
What I tried: Automating Dataset Generation for Sentence Learning Models
Training a machine learning model on [sentences and their corresponding labels] required a dataset of at least 1,000 examples. I had already set up the RNN model, so once I had an adequate training dataset I could train it to determine the severity level of an input sentence.
Writing thousands of examples by hand was not feasible, so I explored ways to automate the process.
1. PDF file crawling
First, I parsed a PDF file that lists severity levels and symptoms, and extracted keywords that could be used in symptom-report sentences.
```python
import re

import pdfplumber

adult_symptoms_info = []
child_symptoms_info = []

# For patients 15 years and older, pages 5-74
# For patients under 15 years old, pages 75-158
# Note: pdf.pages is zero-indexed and range() excludes its end value.
with pdfplumber.open('./emergency_level.pdf') as pdf:
    for i in range(5, 158):
        page = pdf.pages[i]
        # extract_text() can return None for a page without text
        page_text = page.extract_text() or ""
        # Split the entire text by newline character
        page_text_arr = page_text.split("\n")
        # Start at 1 to skip the first line of each page
        for j in range(1, len(page_text_arr)):
            # Extract sentences ending with a number
            symptoms_text = re.findall(r".*[0-9]$", page_text_arr[j])
            if not symptoms_text:
                continue
            print(symptoms_text)
            # Split by alphabet (code)
            target_symptoms = re.split("[A-Z]+", symptoms_text[0])
            target_symptoms_wo_spaces = [part.strip() for part in target_symptoms]
            # Skip lines without a code; indexing [1] would raise IndexError
            if len(target_symptoms_wo_spaces) < 2:
                continue
            pair = target_symptoms_wo_spaces[:2]
            # Symptoms keywords for patients 15 years and older
            if 5 <= i <= 74:
                adult_symptoms_info.append(pair)
            # Symptoms keywords for patients under 15 years old
            elif 75 <= i <= 158:
                child_symptoms_info.append(pair)

# Remove duplicate symptoms
unique_adult_data = set(tuple(x) for x in adult_symptoms_info)
unique_adult_list = [list(x) for x in unique_adult_data]
unique_child_data = set(tuple(x) for x in child_symptoms_info)
unique_child_list = [list(x) for x in unique_child_data]
```
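To make the line-parsing step easier to check without the PDF, here is a minimal sketch of the same logic as a standalone function. The names `parse_symptom_line` and `dedupe` and the sample row are hypothetical; the real PDF layout may differ. As in the code above, the trailing level digit stays attached to the second element.

```python
import re


def parse_symptom_line(line):
    """Return [first_keyword, second_keyword] for a symptom row, else None."""
    # Rows of interest end with a digit.
    if not re.search(r"[0-9]$", line):
        return None
    # The uppercase alphabetic code separates the two keyword columns.
    parts = [part.strip() for part in re.split(r"[A-Z]+", line)]
    if len(parts) < 2:
        return None
    return parts[:2]


def dedupe(pairs):
    """Drop duplicate pairs while keeping first-seen order."""
    return [list(p) for p in dict.fromkeys(tuple(p) for p in pairs)]


# Hypothetical row, made up for illustration only.
row = "chest pain ABCD severe chest pain with shock 1"
pair = parse_symptom_line(row)
unique = dedupe([pair, pair])
```

Using `dict.fromkeys` instead of `set` keeps the keywords in the order they appear in the PDF, which can make the generated sentences easier to trace back to their source page.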
2. Automatic sentence generation using the OpenAI API
Even with the keywords in hand, writing thousands of sentences myself was still impractical. So I used the ChatGPT API to iterate through the keyword list and generate sentences for each severity level.
```python
import os

import openai  # this snippet uses the legacy (pre-1.0) openai SDK interface
from dotenv import load_dotenv

load_dotenv()
openai.organization = os.environ.get("ORGANIZATION_KEY")
openai.api_key = os.environ.get("OPENAPI_KEY")

# Assumption: adult_symptoms is the de-duplicated keyword list from step 1,
# and i is the index of the keyword pair currently being processed.
adult_symptoms = unique_adult_list
i = 0

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": """
            You will play the role of an emergency responder in an ambulance, and the patient is currently in the ambulance.
            Level 1 is the most urgent, and level 5 is the least severe.
            Just provide sentences without any additional instructions.
            Write the sentences as if you were writing a report, separate them with double quotation marks, and only use line breaks.
            Please write detailed sentences describing the patient's symptom state, with a minimum length of 25 characters.
            """
        },
        {
            "role": "user",
            "content": f"""
            Provide three examples for each level (1, 2, 3, 4, 5) based on the symptoms of {adult_symptoms[i][0]}, {adult_symptoms[i][1]}.
            """
        }
    ]
)

result = completion.choices[0].message.content
# Append the generated sentences to the dataset file
with open("sentences-adult.txt", "a", encoding="utf-8") as file:
    file.write(result + "\n")
```
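The system prompt asks for double-quoted sentences of at least 25 characters, separated by line breaks. A small filter like the one below could check a response against those two rules before it is written to the dataset file. This is a sketch under assumptions: `extract_sentences` is a hypothetical helper, and the sample response is invented for illustration, not real API output.

```python
import re


def extract_sentences(response_text, min_length=25):
    """Return the double-quoted sentences that are at least min_length long."""
    # Capture text between double quotes on a single line.
    quoted = re.findall(r'"([^"\n]+)"', response_text)
    return [s.strip() for s in quoted if len(s.strip()) >= min_length]


# Hypothetical response text, made up for illustration only.
sample = (
    'Level 1:\n'
    '"Patient is unresponsive with no palpable pulse."\n'
    '"Too short."\n'
    'Level 2:\n'
    '"Patient reports crushing chest pain radiating to the left arm."\n'
)
sentences = extract_sentences(sample)
```

Anything the model adds outside the quotes, such as level headers or extra commentary, is dropped, and so are quoted fragments shorter than the requested minimum.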
3. Scheduling required due to API call limitations
On the free tier, the gpt-3.5 model accepted only 3 requests per minute. To stay within that limit, I scheduled one API call every 20 seconds, which works out to 3 calls per minute.
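```python
import time

import schedule

# Relies on openai and adult_symptoms being set up as in step 2.


def schedule_api_calls():
    i = 0

    def job():
        nonlocal i
        # Stop once every keyword pair has been processed
        if i >= len(adult_symptoms):
            return schedule.CancelJob
        print(f"{i}: {adult_symptoms[i]}")
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[]  # messages omitted here; see step 2
        )
        i += 1

    # One call every 20 seconds keeps the rate at 3 requests per minute
    schedule.every(20).seconds.do(job)


schedule_api_calls()

# Run until the job cancels itself
while schedule.get_jobs():
    schedule.run_pending()
    time.sleep(1)
```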
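The same spacing between calls can also be sketched without the `schedule` package, using a plain loop that waits until at least 20 seconds have passed since the previous call. This is an alternative I am swapping in for illustration, not the approach used above; `call_with_min_interval` is a hypothetical helper, and the `sleep` and `clock` parameters exist only so the timing can be tested without actually waiting.

```python
import time


def call_with_min_interval(func, items, min_interval=20.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Call func(item) for each item, keeping calls at least min_interval seconds apart."""
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            # Wait only for the part of the interval that has not yet elapsed.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(func(item))
    return results


# Demo with no delay; a real API call would replace the lambda,
# and min_interval=20.0 would give 3 calls per minute.
doubled = call_with_min_interval(lambda x: x * 2, [1, 2, 3], min_interval=0.0)
```

Because the wait is measured from the start of the previous call, time spent inside a slow API call counts toward the 20 seconds and is not added on top of it.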
=> I did get the automated GPT API calls working, but the free version of GPT allows only 3 requests per minute and the responses were inconsistent, so I abandoned this approach.