Tensorflow Keras - 5 (NLP: Using the IMDB Dataset)
Machine Learning/Tensorflow · 2021. 3. 17. 16:42
Using a Keras dataset
This is the IMDB movie review dataset.
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, LSTM, Embedding, Flatten
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import ModelCheckpoint
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

vocab_size = 35000
maxlen = 100

# word -> index mapping shipped with the dataset
dictionary_word = imdb.get_word_index(path='imdb_word_index.json')
# invert it to index -> word for decoding later
dictionary_index = {value: key for key, value in dictionary_word.items()}

stopwords_ = stopwords.words('english')

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
Today I'll use the English stopword list that NLTK provides to remove stopwords before training.
Data load -> Stopword removal -> Word Tokenizing -> Dictionary Encoding -> pad sequences ->
Model create -> Model compile -> Model fit -> Model predict
Compared to that usual order, here the stopword removal happens after tokenizing.
The IMDB data that Keras provides is already tokenized, but the stopwords are untouched, which dragged accuracy down quite a bit.
The very first step is to fetch the word-index file that IMDB provides and swap its key:value pairs.
Stopwords Encoding
stopwords_idx = []
for word in stopwords_:
    try:
        stopwords_idx.append(dictionary_word[word])
    except KeyError:
        # some stopwords never appear in the IMDB dictionary
        continue
Each stopword is encoded by looking it up in the IMDB dictionary.
This is what makes preprocessing the IMDB data possible,
because the IMDB data is already integer-encoded, so the filtering has to happen on indices rather than words.
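As a quick sanity check (the exact index values depend on the imdb_word_index.json file):

print(stopwords_[:3])    # ['i', 'me', 'my'] - the first NLTK English stopwords
print(stopwords_idx[:3]) # their integer codes in the IMDB dictionary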
Stopword Removal
def word_preprocessing(stopwords_idx, x_train):
    # convert each review (a Python list) into a NumPy array
    x_train_pre = np.array([np.array(x) for x in x_train], dtype=object)
    # delete every occurrence of every stopword index
    for word in stopwords_idx:
        for idx, x in enumerate(x_train_pre):
            x_train_pre[idx] = np.delete(x, np.where(x == word))
    return x_train_pre
The IMDB data comes as a NumPy array of Python lists.
I converted everything to arrays, then compared each review against every stopword index and deleted the matches.
I tried to make the computation as fast as possible, but after 30 minutes of thinking about it...
with 171 stopwords and 25,000 reviews in x_train, that's 25,000 × 171 = 4,275,000 loop iterations.
If you know a faster way to do this preprocessing, please leave a comment (one vectorized alternative is sketched after the code below).
x_train = word_preprocessing(stopwords_idx, x_train)
x_test = word_preprocessing(stopwords_idx, x_test)
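As for a faster approach: one possibility (a sketch, not benchmarked here) is to drop all stopword indices from each review in a single vectorized pass with np.isin instead of looping over the 171 stopwords:

def word_preprocessing_fast(stopwords_idx, reviews):
    stop = np.asarray(stopwords_idx)
    # np.isin marks every stopword position in one pass; ~mask keeps the rest
    return np.array([np.asarray(x)[~np.isin(x, stop)] for x in reviews],
                    dtype=object)

This trades the 171 × 25,000 Python-level loop for one boolean mask per review.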
Padding
x_train = sequence.pad_sequences(x_train, maxlen=maxlen, padding='post')
x_test = sequence.pad_sequences(x_test, maxlen=maxlen, padding='post')
print(x_train[1])
[ 194 1153 194 8255 228 1463 4369 5012 715 1634 394 954
189 102 207 110 3103 188 7 249 93 114 2300 1523
647 116 8163 229 340 1322 4901 19 1002 952 37 455
1543 398 1649 6853 163 3215 10156 1153 194 775 7 8255
11596 349 2637 148 605 15358 8003 123 125 23141 6853 349
165 4362 228 1157 299 120 120 174 220 175 136 4373
228 8255 25249 656 245 2350 9837 152 491 7464 1212 371
625 64 1382 1690 1355 28 154 462 285 0 0 0
0    0    0    0]
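One thing to keep in mind with padding='post': the LSTM's final hidden state is then computed after it has run over the trailing zeros. A variation worth trying when the model is built below (not what this run used) is to have the Embedding layer emit a mask so the LSTM skips the padded timesteps:

# mask_zero=True makes downstream layers ignore timesteps where the input is 0
model.add(Embedding(input_dim=vocab_size, output_dim=128,
                    input_length=maxlen, mask_zero=True))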
Decode

for i in x_train[1]:
    if i == 0:  # stop at the trailing padding; 0 has no dictionary entry
        break
    print(i, ':', dictionary_index[i])
194 : thought
1153 : solid
194 : thought
8255 : senator
228 : making
1463 : spot
4369 : nomination
5012 : assumed
715 : jack
1634 : picked
394 : getting
954 : hands
189 : fact
102 : characters
207 : always
110 : life
3103 : thrillers
188 : can't
7 : br
249 : sure
93 : way
114 : little
2300 : strongly
1523 : random
647 : view
116 : love
8163 : principles
229 : guy
340 : used
1322 : producer
4901 : icon
19 : film
1002 : outside
952 : unique
37 : like
455 : direction
1543 : imagination
398 : keep
1649 : queen
6853 : diverse
163 : makes
3215 : stretch
10156 : stefan
1153 : solid
194 : thought
775 : begins
7 : br
8255 : senator
11596 : machinations
349 : budget
2637 : worthwhile
148 : though
605 : ok
15358 : brokedown
8003 : awaiting
123 : ever
125 : better
23141 : lugia
6853 : diverse
349 : budget
165 : look
4362 : kicked
228 : making
1157 : follows
299 : effects
120 : show
120 : show
174 : cast
220 : family
175 : us
136 : scenes
4373 : severe
228 : making
8255 : senator
25249 : levant's
656 : finds
245 : tv
2350 : tend
9837 : emerged
152 : thing
491 : wants
7464 : beckinsale
1212 : cult
371 : video
625 : david
64 : see
1382 : scenery
1690 : ship
1355 : wild
28 : one
154 : work
462 : dark
285 : dvd
The stopword removal worked better than I expected; I'm happy with it.
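One caveat, though: by default imdb.load_data reserves indices 0, 1, and 2 for padding, sequence-start, and out-of-vocabulary markers and shifts every real word index up by 3 (index_from=3), while dictionary_word from get_word_index is unshifted. So the words printed above are actually offset by three dictionary positions. A decode that accounts for the shift would look like this (a sketch based on the documented defaults):

# shift the dictionary by index_from=3 and add the reserved tokens
decode_index = {value + 3: key for key, value in dictionary_word.items()}
decode_index[0], decode_index[1], decode_index[2] = '<pad>', '<start>', '<unk>'
print(' '.join(decode_index.get(i, '?') for i in x_train[1]))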
Model create
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=maxlen))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

# save the weights whenever val_loss improves
che = 'keras_model.h5'
point = ModelCheckpoint(filepath=che, monitor='val_loss', verbose=1, save_best_only=True)
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 128) 4480000
_________________________________________________________________
lstm (LSTM) (None, 128) 131584
_________________________________________________________________
dense (Dense) (None, 1) 129
=================================================================
Total params: 4,611,713
Trainable params: 4,611,713
Non-trainable params: 0
_________________________________________________________________
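As a sanity check on those numbers: the Embedding layer holds 35,000 × 128 = 4,480,000 weights, the LSTM holds 4 × ((128 + 128) × 128 + 128) = 131,584 (four gates, each with an input kernel, a recurrent kernel, and a bias), and the Dense layer holds 128 × 1 + 1 = 129, which adds up to the 4,611,713 total above.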
Model fit

model.fit(x_train, y_train, batch_size=32, epochs=10,
          validation_data=(x_test, y_test), callbacks=[point])
Epoch 00010: val_loss did not improve from 0.37573
782/782 [==============================] - 137s 175ms/step - loss: 0.0219 - acc: 0.9951 - val_loss: 0.7232 - val_acc: 0.8311

The training accuracy came out higher than I expected... 99%!
It feels very likely that this is overfitting (one way to mitigate it is sketched below).
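One way to rein in the overfitting (a suggestion, not part of the run above) is to stop training as soon as val_loss stops improving, alongside the existing checkpoint callback:

from tensorflow.keras.callbacks import EarlyStopping

# stop after val_loss fails to improve for 2 epochs and roll back to the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.fit(x_train, y_train, batch_size=32, epochs=10,
          validation_data=(x_test, y_test), callbacks=[point, early_stop])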
Validation
pred = model.predict(x_train[1])
print(np.round(pred), y_train[1])
[[0.] [0.] [0.] [0.] [0.] [1.] [1.] [1.] [1.] [1.] [0.] [0.] [0.] [0.] [0.] [1.] [0.] [1.] [0.] [0.] [0.] [0.] [1.] [0.] [0.] [0.] [0.] [0.] [0.] [0.] [1.] [1.] [1.] [0.] [1.] [1.] [1.] [1.] [0.] [0.] [0.] [1.] [1.] [0.] [0.] [1.] [0.] [0.] [0.] [0.] [1.] [1.] [1.] [1.] [1.] [1.] [0.] [1.] [0.] [0.] [1.] [0.] [0.] [0.] [0.] [0.] [0.] [0.] [1.] [1.] [1.] [1.] [0.] [0.] [1.] [0.] [0.] [0.] [1.] [0.] [1.] [0.] [1.] [0.] [0.] [0.] [1.] [0.] [0.] [1.] [0.] [0.] [1.] [0.] [0.] [0.] [0.] [0.] [0.] [0.]] 0
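Why did predict return 100 values instead of one? x_train[1] has shape (100,), so Keras appears to treat it as a batch of 100 one-word sequences and scores each word separately, which is what makes the word-by-word view below possible. To score the whole review as a single sample, you would pass a batch of one (an aside, not what the run below does):

# shape (1, 100): one review as a single sample -> one sentiment score
pred_review = model.predict(x_train[1:2])
print(np.round(pred_review), y_train[1])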
pred1 = np.round(pred).reshape(-1).astype('int')
for i, p in zip(x_train[1], pred1):
    if i == 0:  # stop at the trailing padding
        break
    print(i, ':', dictionary_index[i], '/', 'predict : ', p)
194 : thought / predict : 0
1153 : solid / predict : 0
194 : thought / predict : 0
8255 : senator / predict : 0
228 : making / predict : 0
1463 : spot / predict : 1
4369 : nomination / predict : 1
5012 : assumed / predict : 1
715 : jack / predict : 1
1634 : picked / predict : 1
394 : getting / predict : 0
954 : hands / predict : 0
189 : fact / predict : 0
102 : characters / predict : 0
207 : always / predict : 0
110 : life / predict : 1
3103 : thrillers / predict : 0
188 : can't / predict : 1
7 : br / predict : 0
249 : sure / predict : 0
93 : way / predict : 0
114 : little / predict : 0
2300 : strongly / predict : 1
1523 : random / predict : 0
647 : view / predict : 0
116 : love / predict : 0
8163 : principles / predict : 0
229 : guy / predict : 0
340 : used / predict : 0
1322 : producer / predict : 0
4901 : icon / predict : 1
19 : film / predict : 1
1002 : outside / predict : 1
952 : unique / predict : 0
37 : like / predict : 1
455 : direction / predict : 1
1543 : imagination / predict : 1
398 : keep / predict : 1
1649 : queen / predict : 0
6853 : diverse / predict : 0
163 : makes / predict : 0
3215 : stretch / predict : 1
10156 : stefan / predict : 1
1153 : solid / predict : 0
194 : thought / predict : 0
775 : begins / predict : 1
7 : br / predict : 0
8255 : senator / predict : 0
11596 : machinations / predict : 0
349 : budget / predict : 0
2637 : worthwhile / predict : 1
148 : though / predict : 1
605 : ok / predict : 1
15358 : brokedown / predict : 1
8003 : awaiting / predict : 1
123 : ever / predict : 1
125 : better / predict : 0
23141 : lugia / predict : 1
6853 : diverse / predict : 0
349 : budget / predict : 0
165 : look / predict : 1
4362 : kicked / predict : 0
228 : making / predict : 0
1157 : follows / predict : 0
299 : effects / predict : 0
120 : show / predict : 0
120 : show / predict : 0
174 : cast / predict : 0
220 : family / predict : 1
175 : us / predict : 1
136 : scenes / predict : 1
4373 : severe / predict : 1
228 : making / predict : 0
8255 : senator / predict : 0
25249 : levant's / predict : 1
656 : finds / predict : 0
245 : tv / predict : 0
2350 : tend / predict : 0
9837 : emerged / predict : 1
152 : thing / predict : 0
491 : wants / predict : 1
7464 : beckinsale / predict : 0
1212 : cult / predict : 1
371 : video / predict : 0
625 : david / predict : 0
64 : see / predict : 0
1382 : scenery / predict : 1
1690 : ship / predict : 0
1355 : wild / predict : 0
28 : one / predict : 1
154 : work / predict : 0
462 : dark / predict : 0
285 : dvd / predict : 1

print('Test accuracy : %.4f' % (model.evaluate(x_test, y_test)[1]))
782/782 [==============================] - 10s 13ms/step - loss: 0.7232 - acc: 0.8311
Test accuracy : 0.8311

Model load
model = load_model('keras_model.h5')
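Because save_best_only=True, keras_model.h5 holds the weights from the epoch with the best val_loss (0.37573 above), not the final epoch's. A quick check (a sketch) that the restored model evaluates near that checkpoint rather than the final 0.7232:

# evaluate the restored best-checkpoint weights on the test set
loss, acc = model.evaluate(x_test, y_test)
print('Restored model - loss : %.4f, acc : %.4f' % (loss, acc))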
Github
github.com/Joonyeong97/Tensorflow-tutorial