Processing the AIHub Dataset

Data Analysis

Download the TEXT IN THE WILD set from the AI HUB Korean typeface image dataset (한국어 글자체 이미지).

AI HUB 다운로드.png

import json
with open('./textinthewild_data_info.json', 'rt', encoding='UTF8') as f:
    file = json.load(f)  # `file` holds the parsed dataset dict used below
file.keys() #dict_keys(['info', 'images', 'annotations', 'licenses'])
file['info'] #{'name': 'Text in the wild Dataset', 'date_created': '2019-10-14 04:31:48'}
type(file['images']) #list

file['images'][0]['type'] == 'books' # True
goods = [f for f in file['images'] if f['type']=='product']
len(goods) #26340

annotation = [a for a in file['annotations'] if a['image_id'] == goods[0]['id'] and a['attributes']['class']=='word']
annotation

데이터 분석 결과 1.png
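
Before filtering on a single type, it can help to see how the images are distributed across types. The following is a minimal sketch, assuming each entry in `images` carries a `type` key as in the code above. The sample records are made up for illustration, and `count_image_types` is a hypothetical helper, not part of the dataset tooling.

```python
from collections import Counter

def count_image_types(images):
    """Count how many image records fall under each 'type' value."""
    return Counter(image['type'] for image in images)

# Made-up sample records that mimic the shape of file['images'].
sample_images = [
    {'id': 1, 'file_name': 'a.jpg', 'type': 'books'},
    {'id': 2, 'file_name': 'b.jpg', 'type': 'product'},
    {'id': 3, 'file_name': 'c.jpg', 'type': 'product'},
]
type_counts = count_image_types(sample_images)
print(type_counts.most_common())
```

On the real data, `count_image_types(file['images'])` would give the same kind of summary for every type at once.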

import matplotlib.pyplot as plt
img = plt.imread('/data/nengcipe/dataset/Goods/'+goods[0]['file_name'])
plt.imshow(img)

데이터 분석 결과 2.png

First-Pass Data Processing - Splitting the AI HUB Data

Split the data by image, keeping each image's annotations together with it.

import random
import os

ocr_good_files = os.listdir('/data/nengcipe/dataset/Goods/')
print(len(ocr_good_files)) # 26340; while processing the data, some incorrectly tagged entries were found

#random.shuffle(ocr_good_files)

n_train = int(len(ocr_good_files) * 0.7)
n_validation = int(len(ocr_good_files) * 0.2)
n_test = int(len(ocr_good_files) * 0.1)

print(n_train, n_validation, n_test) #18438 5268 2634
# Split the images in order at a 70:20:10 ratio and save each image's annotation information along with it

train_files = ocr_good_files[:n_train]
validation_files = ocr_good_files[n_train: n_train+n_validation]
test_files = ocr_good_files[-n_test:]

## Store the id values of the train/validation/test images
train_img_ids = {}
validation_img_ids = {}
test_img_ids = {}

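# Sets make each membership test O(1) instead of scanning a list every time
train_set, validation_set, test_set = set(train_files), set(validation_files), set(test_files)

for image in file['images']:
    if image['file_name'] in train_set:
        train_img_ids[image['file_name']] = image['id']
    elif image['file_name'] in validation_set:
        validation_img_ids[image['file_name']] = image['id']
    elif image['file_name'] in test_set:
        test_img_ids[image['file_name']] = image['id']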
    
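print(len(train_img_ids)) # 18440 noted at the time, taken as confirmation that the ids were stored (n_train above is 18438, so this figure needs rechecking)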

## Store the annotations that belong to the train/validation/test images
train_annotations = {f:[] for f in train_img_ids.keys()}
# Builds a dict that maps every file name in train_img_ids to an empty list,
# ready to collect that image's annotations
validation_annotations = {f:[] for f in validation_img_ids.keys()}
test_annotations = {f:[] for f in test_img_ids.keys()}

train_ids_img = {train_img_ids[id_]:id_ for id_ in train_img_ids}
# Inverts train_img_ids: {file name: image id} becomes {image id: file name},
# so an annotation's image_id can be looked up directly
validation_ids_img = {validation_img_ids[id_]:id_ for id_ in validation_img_ids}
test_ids_img = {test_img_ids[id_]:id_ for id_ in test_img_ids}

for idx, annotation in enumerate(file['annotations']):
    if idx % 5000 == 0:
        print(idx,'/',len(file['annotations']),'processed')
    if annotation['attributes']['class'] != 'word':
        continue
    if annotation['image_id'] in train_ids_img:
        train_annotations[train_ids_img[annotation['image_id']]].append(annotation)
    elif annotation['image_id'] in validation_ids_img:
        validation_annotations[validation_ids_img[annotation['image_id']]].append(annotation)
    elif annotation['image_id'] in test_ids_img:
        test_annotations[test_ids_img[annotation['image_id']]].append(annotation)

# Use `f` for the file handles so the dataset dict in `file` is not overwritten
with open('train_annotation.json', 'w') as f:
    json.dump(train_annotations, f)
with open('validation_annotation.json', 'w') as f:
    json.dump(validation_annotations, f)
with open('test_annotation.json', 'w') as f:
    json.dump(test_annotations, f)
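
The split above takes the files in the order `os.listdir` returns them, which is arbitrary, and the shuffle is commented out. As a hedged sketch of one alternative, the helper below sorts the names and then applies a seeded shuffle so the same split can be reproduced later. `split_files`, its default ratios, and the seed value are assumptions for illustration. Unlike the code above, the test set here is simply whatever remains after train and validation.

```python
import random

def split_files(file_names, ratios=(0.7, 0.2, 0.1), seed=42):
    """Return (train, validation, test) lists from a seeded shuffle."""
    names = sorted(file_names)          # fix the order that os.listdir leaves arbitrary
    random.Random(seed).shuffle(names)  # seeded so the split is repeatable
    n_train = int(len(names) * ratios[0])
    n_validation = int(len(names) * ratios[1])
    train = names[:n_train]
    validation = names[n_train:n_train + n_validation]
    test = names[n_train + n_validation:]  # remainder, so no file is dropped
    return train, validation, test

# Made-up file names standing in for the real directory listing.
demo_names = [f'img_{i:03d}.jpg' for i in range(10)]
train, validation, test = split_files(demo_names)

# The three subsets should be disjoint and together cover every file.
assert set(train).isdisjoint(validation)
assert set(train).isdisjoint(test)
assert set(validation).isdisjoint(test)
print(len(train), len(validation), len(test))
```

Calling `split_files(ocr_good_files)` would slot into the existing flow in place of the three slicing lines.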

[output]

As a result, the annotation files are created.

⇒ This is what the actual train_annotation file looks like. The JSON was written without any formatting, so a validator was used to pretty-print it.

The JSON Validator
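
The formatting can also be done at write time with the standard `json` module, which avoids the extra validator step. This is a minimal sketch, not the method used in these notes: the sample annotation is made up, and its `text` key is an assumption about the annotation layout.

```python
import json

# Made-up annotation in roughly the same shape as the entries saved above.
sample_annotations = {
    'sample.jpg': [
        {'image_id': 1, 'text': '한글', 'bbox': [10, 20, 30, 40],
         'attributes': {'class': 'word'}},
    ]
}

# indent=2 adds line breaks and nesting; ensure_ascii=False keeps Korean text
# readable instead of writing \uXXXX escapes.
pretty = json.dumps(sample_annotations, indent=2, ensure_ascii=False)
print(pretty)
```

Passing the same `indent` and `ensure_ascii` arguments to `json.dump` writes the formatted result straight to a file. In that case, open the file with `encoding='utf-8'`, since the output is no longer ASCII-only.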

Second-Pass Data Processing

Using the annotation values for each image, crop the word regions based on the 'bbox' position information.
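
The cropping step can be sketched as follows. This assumes that `bbox` is stored as `[x, y, width, height]` in pixels, which should be checked against the actual annotations. `bbox_to_slices` is a hypothetical helper that turns one box into row and column slices, clamped to the image bounds.

```python
def bbox_to_slices(bbox, img_height, img_width):
    """Convert an assumed [x, y, width, height] box into (row, column) slices,
    clamped to the image bounds. Returns None if nothing is left to crop."""
    x, y, w, h = (int(round(v)) for v in bbox)
    left, top = max(x, 0), max(y, 0)
    right, bottom = min(x + w, img_width), min(y + h, img_height)
    if right <= left or bottom <= top:
        return None  # box lies outside the image or has no area
    return slice(top, bottom), slice(left, right)

# Made-up box on a 100x100 image.
rows, cols = bbox_to_slices([10, 20, 30, 40], img_height=100, img_width=100)
print(rows, cols)
```

With an array loaded through `plt.imread`, `img.shape[:2]` gives the height and width, and `img[rows, cols]` is the cropped word region.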