[TIL] Automating Dataset Generation for Sentence Learning NLP Model

06/12/23


📌 What I tried: Automating Dataset Generation for Sentence Learning Models

To train a machine learning model on [sentence, label] pairs, I needed a dataset of at least 1,000 examples. I had already set up the RNN model, so once an adequate training dataset was in place I could train it to determine the severity level of a sentence given as input.

Since writing thousands of examples by hand was not feasible, I looked for ways to automate the process.

1. PDF file crawling

First, I extracted the text of a PDF file that lists severity levels and symptoms, and pulled out keywords that could be used in symptom-report sentences.

import pdfplumber
import re

adult_symptoms_info = []
child_symptoms_info = []

# For patients 15 years and older, pages 5-74
# For patients under 15 years old, pages 75-158
with pdfplumber.open('./emergency_level.pdf') as pdf:
    for i in range(5, 158):
        page = pdf.pages[i]
        # Fall back to an empty string if the page yields no text
        page_text = page.extract_text() or ""
        # Split the entire text by newline character
        page_text_arr = page_text.split("\n")
        for j in range(1, len(page_text_arr)):
            # Extract sentences ending with a number
            symptoms_text = re.findall(r".*[0-9]$", page_text_arr[j])
            if(symptoms_text):
                print(symptoms_text)
                # Split by alphabet (code)
                target_symptoms = re.split("[A-Z]+", symptoms_text[0])
                target_symptoms_wo_spaces = [part.strip() for part in target_symptoms]
                # Skip lines that did not split into at least two parts
                if len(target_symptoms_wo_spaces) < 2:
                    continue
                pair = [target_symptoms_wo_spaces[0], target_symptoms_wo_spaces[1]]
                # Symptoms keywords for patients 15 years and older
                if 5 <= i <= 74:
                    adult_symptoms_info.append(pair)
                # Symptoms keywords for patients under 15 years old
                else:
                    child_symptoms_info.append(pair)

# Remove duplicate symptoms
unique_adult_data = set(tuple(x) for x in adult_symptoms_info)
unique_adult_list = [list(x) for x in unique_adult_data]
unique_child_data = set(tuple(x) for x in child_symptoms_info)
unique_child_list = [list(x) for x in unique_child_data]
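
The line filter, the split on the code column, and the deduplication can be tried without the PDF. The sketch below is only an illustration: the sample lines are hypothetical (the real PDF layout may differ), and `parse_symptom_line` and `dedupe_keep_order` are names introduced here, not part of the original script. Unlike `set`, `dict.fromkeys` keeps the first-seen order of the rows.

```python
import re

def parse_symptom_line(line):
    """Return [category, symptom] for a line ending in a digit, else None."""
    # Same filter as above: keep only lines that end with a number
    if not re.search(r"[0-9]$", line):
        return None
    # Split on runs of uppercase letters (the code column) and trim spaces
    parts = [part.strip() for part in re.split("[A-Z]+", line)]
    # A line without a code yields a single part, so there is no pair to keep
    if len(parts) < 2:
        return None
    return parts[:2]

def dedupe_keep_order(rows):
    """Drop duplicate rows while keeping first-seen order."""
    return [list(row) for row in dict.fromkeys(tuple(row) for row in rows)]

# Hypothetical lines; the real PDF layout may differ
lines = [
    "cardiovascular CA chest pain 2",
    "cardiovascular CA chest pain 2",
    "page header without a number",
    "respiratory RB shortness of breath 3",
]
rows = [row for row in map(parse_symptom_line, lines) if row is not None]
unique_rows = dedupe_keep_order(rows)
print(unique_rows)
```

Keeping the parsing in a small function makes it easy to check a few lines by hand before running it over all 150-odd pages.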

2. Automatic sentence generation using the OpenAI API

Even with the keywords in hand, writing thousands of sentences was still a challenge. I therefore used the ChatGPT API to iterate over the keyword list and generate sentences for each severity level.

import os
import openai

from dotenv import load_dotenv
load_dotenv()

openai.organization = os.environ.get("ORGANIZATION_KEY")
openai.api_key = os.environ.get("OPENAPI_KEY")

completion = openai.ChatCompletion.create(
     model="gpt-3.5-turbo",
     messages=[
        {
             "role": "system", 
             "content": 
                   """
                      You will play the role of an emergency responder in an ambulance, and the patient is currently in the ambulance.
                      Level 1 is the most urgent, and level 5 is the least severe.
                      Just provide sentences without any additional instructions.
                      Write the sentences as if you were writing a report, separate them with double quotation marks, and only use line breaks.
                      Please write detailed sentences describing the patient's symptom state, with a minimum length of 25 characters.
                   """
        },
        {
             "role": "user",
             "content":
                   f"""
                      Provide three examples for each level (1, 2, 3, 4, 5) based on the symptoms of {adult_symptoms[i][0]}, {adult_symptoms[i][1]}.
                   """
        }
        ]
)

result = completion.choices[0].message.content
# Append the generated sentences to the output file
with open("sentences-adult.txt", "a", encoding="utf-8") as file:
    file.write(result + "\n")
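
Since the system prompt asks for sentences wrapped in double quotation marks with a minimum length of 25 characters, the saved text could be checked against those two rules before it is used for training. This is a minimal sketch under that assumption: `extract_sentences` is a hypothetical helper, and the sample text is made up rather than real API output.

```python
import re

MIN_LENGTH = 25  # matches the minimum length requested in the system prompt

def extract_sentences(response_text, min_length=MIN_LENGTH):
    """Return the double-quoted sentences that meet the minimum length."""
    quoted = re.findall(r'"([^"]+)"', response_text)
    return [s.strip() for s in quoted if len(s.strip()) >= min_length]

# Hypothetical response text, not real API output
sample = "\n".join([
    "Level 1",
    '"Patient is unresponsive with no palpable pulse on arrival."',
    '"Chest pain."',
    "Level 5",
    '"Patient reports mild chest discomfort that eased after rest."',
])
sentences = extract_sentences(sample)
print(sentences)
```

Anything the model adds outside the quotation marks (level headings, extra commentary) is dropped, and quoted sentences that are too short are filtered out.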

3. Scheduling required due to API call limitations

On the free tier, the gpt-3.5 model was limited to 3 requests per minute. To stay within that limit, I scheduled one API call every 20 seconds.

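import time
import schedule

def schedule_api_calls():
    i = 0

    def job():
        nonlocal i
        # Stop once every keyword pair has been processed
        if i >= len(adult_symptoms):
            return schedule.CancelJob
        print(f"{i}: {adult_symptoms[i]}")
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[]  # messages omitted here; see step 2
        )
        i += 1

    # One call every 20 seconds = 3 requests per minute
    schedule.every(20).seconds.do(job)

schedule_api_calls()

# Run until the job cancels itself
while schedule.get_jobs():
    schedule.run_pending()
    time.sleep(1)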
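
The 20-second interval follows from 60 seconds / 3 requests. For a one-off batch like this, a plain loop with a pause between calls would be a simpler alternative to the `schedule` library. The sketch below is an assumption, not the approach used in this post: `min_interval_seconds` and `run_throttled` are hypothetical names, and the handler is a stand-in for the real API call.

```python
import time

def min_interval_seconds(max_per_minute):
    """Smallest gap between calls that stays within the per-minute limit."""
    return 60 / max_per_minute

def run_throttled(items, handler, max_per_minute=3, sleep=time.sleep):
    """Call handler(item) for each item, pausing between calls."""
    interval = min_interval_seconds(max_per_minute)
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(interval)  # wait before every call except the first
        results.append(handler(item))
    return results

# Stand-in handler and a fake sleep so the example runs instantly
pauses = []
out = run_throttled(["a", "b", "c"], str.upper, sleep=pauses.append)
print(out, pauses)
```

Passing `sleep` as a parameter lets the pacing be tested without actually waiting. In real use the default `time.sleep` applies.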

=> Although I wrote the code to automate the GPT API calls, the free tier allows only 3 requests per minute and the responses were inconsistent, so I abandoned this approach.
