[TIL] Automating Dataset Generation for Sentence Learning NLP Model

📌 What I tried: Automating Dataset Generation for Sentence Learning Models

To train a machine learning model on [sentence, label] pairs, I needed a dataset of at least 1,000 examples. The RNN model was already set up, so once I had an adequate training dataset I could train it to determine the severity level of a sentence entered as input.

Writing thousands of examples by hand was not feasible, so I explored ways to automate the process.

1. PDF file crawling

First, I parsed a PDF file that lists severity levels and symptoms, and extracted the keywords that could be used in symptom-report sentences.

import pdfplumber
import re

adult_symptoms_info = []
child_symptoms_info = []

# For patients 15 years and older, pages 5-74
# For patients under 15 years old, pages 75-158
# Note: pdf.pages is zero-based and the range end is exclusive,
# so index 158 itself is not visited.
with pdfplumber.open('./emergency_level.pdf') as pdf:
    for i in range(5, 158):
        page = pdf.pages[i]
        # extract_text() can return None for a page without a text layer
        page_text = page.extract_text() or ""
        # Split the entire text by newline character
        page_text_arr = page_text.split("\n")
        # The first line of each page is skipped
        for line in page_text_arr[1:]:
            # Keep only lines ending with a number
            if not re.search(r"[0-9]$", line):
                continue
            # Split on runs of uppercase letters (code)
            parts = [part.strip() for part in re.split("[A-Z]+", line)]
            # Skip lines without a code; they have no second part
            if len(parts) < 2:
                continue
            # Symptoms keywords for patients 15 years and older
            if 5 <= i <= 74:
                adult_symptoms_info.append([parts[0], parts[1]])
            # Symptoms keywords for patients under 15 years old
            else:
                child_symptoms_info.append([parts[0], parts[1]])

# Remove duplicate symptoms
unique_adult_data = set(tuple(x) for x in adult_symptoms_info)
unique_adult_list = [list(x) for x in unique_adult_data]
unique_child_data = set(tuple(x) for x in child_symptoms_info)
unique_child_list = [list(x) for x in unique_child_data]
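
To make the line parsing easier to follow, here is a minimal sketch of the same two steps (the "ends with a digit" filter and the split on uppercase codes) applied to a single line. The function name parse_symptom_line and the sample line are made up for illustration; the real lines in the PDF may look different. The sample text is lowercase on purpose, because any uppercase letter inside the symptom text would also be treated as a split point.

```python
import re

def parse_symptom_line(line):
    """Return [text before the code, text after the code], or None if the line does not qualify."""
    # Keep only lines that end with a digit
    if not re.search(r"[0-9]$", line):
        return None
    # Split on runs of uppercase letters and trim the surrounding spaces
    parts = [part.strip() for part in re.split("[A-Z]+", line)]
    # A line without an uppercase code yields a single part
    if len(parts) < 2:
        return None
    return [parts[0], parts[1]]

# Hypothetical sample line, not taken from the actual PDF
sample_line = "chest pain ACAAA sudden onset 2"
print(parse_symptom_line(sample_line))
```

Note that the second element keeps the trailing number, since only the uppercase code is used as a separator.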

2. Automatic sentence generation using the OpenAI API

Even with the keywords in hand, writing thousands of sentences was still a challenge. So I used the OpenAI chat completion API (the ChatGPT API) to iterate through the keyword list and generate suitable sentences for each severity level.

import os
import openai  # uses the legacy ChatCompletion interface (openai < 1.0)

from dotenv import load_dotenv
load_dotenv()

openai.organization = os.environ.get("ORGANIZATION_KEY")
openai.api_key = os.environ.get("OPENAPI_KEY")

# Assumption: adult_symptoms is the de-duplicated keyword list from step 1
adult_symptoms = unique_adult_list
# Index of the keyword pair used for this request
i = 0

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content":
                """
                You will play the role of an emergency responder in an ambulance, and the patient is currently in the ambulance.
                Level 1 is the most urgent, and level 5 is the least severe.
                Just provide sentences without any additional instructions.
                Write the sentences as if you were writing a report, separate them with double quotation marks, and only use line breaks.
                Please write detailed sentences describing the patient's symptom state, with a minimum length of 25 characters.
                """
        },
        {
            "role": "user",
            "content":
                f"""
                Provide three examples for each level (1, 2, 3, 4, 5) based on the symptoms of {adult_symptoms[i][0]}, {adult_symptoms[i][1]}.
                """
        }
    ]
)

# Append the generated sentences to the output file
result = completion.choices[0].message.content
with open("sentences-adult.txt", "a", encoding="utf-8") as file:
    file.write(result + "\n")
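
The system prompt asks for sentences wrapped in double quotation marks and at least 25 characters long. As a rough sketch of how such a response could be turned into a clean list, the snippet below pulls out the quoted sentences and drops the ones that are too short. The helper name extract_quoted_sentences and the sample response are hypothetical, and the sketch assumes the model actually follows the quoting instruction, which it may not always do.

```python
import re

MIN_SENTENCE_LENGTH = 25  # mirrors the minimum length requested in the system prompt

def extract_quoted_sentences(response_text):
    """Return the double-quoted sentences found in a model response."""
    # Capture the text between each pair of double quotes on a single line
    candidates = re.findall(r'"([^"\n]+)"', response_text)
    sentences = [candidate.strip() for candidate in candidates]
    # Drop sentences shorter than the requested minimum
    return [s for s in sentences if len(s) >= MIN_SENTENCE_LENGTH]

# Hypothetical response text, written by hand for this example
sample_response = (
    'Level 1\n'
    '"The patient is unresponsive and is not breathing on arrival."\n'
    '"Too short."\n'
)
print(extract_quoted_sentences(sample_response))
```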

3. Scheduling required due to API call limitations

On the free tier, the gpt-3.5-turbo model is limited to 3 requests per minute. To stay within that limit, I scheduled one API call every 20 seconds.

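import time, schedule

def schedule_api_calls():
    i = 0

    def job():
        nonlocal i
        # Stop the schedule once every keyword pair has been used
        if i >= len(adult_symptoms):
            return schedule.CancelJob
        print(f"{i}: {adult_symptoms[i]}")
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[]  # same system and user messages as in step 2
        )
        i += 1

    # 60 seconds / 3 requests = one request every 20 seconds
    schedule.every(20).seconds.do(job)

schedule_api_calls()

# Run until the job has cancelled itself
while schedule.get_jobs():
    schedule.run_pending()
    time.sleep(1)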
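
For comparison, the same 3-requests-per-minute pacing can be sketched without the schedule library, using a plain loop and a sleep between calls. This is an alternative sketch, not the code I actually ran: the function name run_throttled and its parameters are made up for illustration, and the call argument stands in for the real API request.

```python
import time

def run_throttled(items, call, requests_per_minute=3, sleep=time.sleep):
    """Call `call(item)` for each item, pausing so the per-minute limit is respected."""
    interval = 60 / requests_per_minute  # 3 requests per minute -> 20 seconds apart
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(interval)  # wait before every call except the first
        results.append(call(item))
    return results

# Usage with a stand-in for the API call (no network access, no real waiting)
demo = run_throttled(["fever", "cough"], call=str.upper, sleep=lambda seconds: None)
print(demo)
```

Passing sleep as a parameter keeps the sketch easy to try out, since the 20-second wait can be replaced with a no-op.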

=> Although I wrote the code to call the GPT API automatically, I abandoned this approach: the free tier allows only 3 requests per minute, and the responses were inconsistent.
