Tensorflow Keras - 6 (Using a Self Attention Layer) | Machine Learning/Tensorflow | 2021. 3. 20. 12:59
This post is adapted from source code that was shared on Kaggle two years ago.
Source:
www.kaggle.com/arcisad/keras-bidirectional-lstm-self-attention?select=train.csv
Data set:
www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data?select=test.csv
The data is text, and each text is labeled according to whether its context is positive or negative.
Libraries
#!pip install keras-self-attention  # https://pypi.org/project/keras-self-attention/
import pandas as pd
import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Flatten, Dense
from tensorflow.keras.models import Sequential
from keras_self_attention import SeqSelfAttention
The self-attention layer has to be installed separately:
!pip install keras-self-attention
Data load & preprocessing
# load
df = pd.read_csv('./train.csv')

# preprocessing
training_sample = df.sample(100000, random_state=0)
X_train = training_sample['comment_text'].astype(str)
X_train = X_train.fillna('DUMMY')
y_train = training_sample['target']
y_train = y_train.apply(lambda x: 1 if x > 0.5 else 0)
X_train keeps only comment_text, and the target values in y_train are converted to 0 and 1, splitting the comments into positive and negative.
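As a quick sanity check (not in the original notebook), you can inspect how imbalanced the binarized labels are before training; in this dataset the toxic class (label 1) is typically a small minority, which is worth keeping in mind when reading the accuracy numbers later.

# Sanity check (not in the original post): class balance of the binarized labels.
print(y_train.value_counts(normalize=True))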
# Note: `tokenizer` and `max_length` are defined in the "Hyper parameters" cell
# below; run that cell first so the call at the bottom works.
def get_seqs(text):
    sequences = tokenizer.texts_to_sequences(text)
    padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
    return padded_sequences

X_train = get_seqs(X_train)
As explained in the previous post, a tokenizer is essential for natural language processing.
Hyper parameters
epochs = 2
max_num_words = 20000
max_length = 128

tokenizer = Tokenizer(num_words=max_num_words)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
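As a small illustration (the sentence below is made up, not taken from the dataset), the fitted tokenizer maps each word to an integer index, and get_seqs pads the result with zeros up to max_length:

# Illustration only: the example sentence is invented for this demo.
sample = ["this comment is perfectly friendly"]
print(tokenizer.texts_to_sequences(sample))  # list of word indices
print(get_seqs(sample).shape)                # (1, 128): padded to max_length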
Model creation
model = Sequential()
model.add(Embedding(max_num_words, 100, input_length=max_length))
model.add(Bidirectional(LSTM(units=128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Bidirectional(LSTM(units=64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
The Embedding layer first maps the padded integer sequences into a higher-dimensional (100-dimensional) vector space. A bidirectional LSTM then processes the sequence, and the self-attention layer assigns higher weights to the words that matter more, according to their attention scores; a toy sketch of this idea follows below. The same BiLSTM plus self-attention pattern is applied once more, followed by a sigmoid output layer.
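The sketch below is my own NumPy illustration of additive self-attention over a sequence of hidden states; it shows the general idea of how per-token scores re-weight the sequence, not the exact computation inside keras-self-attention's SeqSelfAttention layer.

import numpy as np

# Toy additive self-attention: every position attends to every other position.
T, D = 5, 8                                   # sequence length, hidden size
h = np.random.randn(T, D)                     # stand-in for BiLSTM outputs, one vector per token
Wq, Wk = np.random.randn(D, D), np.random.randn(D, D)
v = np.random.randn(D)

q, k = h @ Wq, h @ Wk                                    # project the hidden states
e = np.tanh(q[:, None, :] + k[None, :, :]) @ v           # (T, T) raw attention scores
a = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
context = a @ h                                          # weighted mix of all tokens per position

print(a.round(2))        # large entries mark the tokens each position "attends" to
print(context.shape)     # (5, 8): same shape as the input sequence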
model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_5 (Embedding) (None, 128, 100) 2000000
_________________________________________________________________
bidirectional_6 (Bidirection (None, 128, 256) 234496
_________________________________________________________________
seq_self_attention_4 (SeqSel (None, None, 256) 16449
_________________________________________________________________
bidirectional_7 (Bidirection (None, None, 128) 164352
_________________________________________________________________
seq_self_attention_5 (SeqSel (None, None, 128) 8257
_________________________________________________________________
dense_4 (Dense) (None, None, 1) 129
=================================================================
Total params: 2,423,683
Trainable params: 2,423,683
Non-trainable params: 0
_________________________________________________________________model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
The optimizer is still 'adam'. Since this is a binary classification problem, binary_crossentropy is used as the loss, and the metric is again accuracy.
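For reference, the string shortcuts above are equivalent to passing the objects explicitly, which also makes the learning rate easy to tune. The 1e-3 below is simply Adam's default, not a value from the original post.

# Equivalent compile call with explicit objects instead of string shortcuts.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # 1e-3 is Adam's default
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()],
)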
Model fit
model.fit(X_train, y_train, epochs=epochs)
Epoch 1/2
3125/3125 [==============================] - 2896s 927ms/step - loss: 0.1534 - accuracy: 0.9525
Epoch 2/2
3125/3125 [==============================] - 3001s 960ms/step - loss: 0.1028 - accuracy: 0.9638
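The post trains for a fixed two epochs. A variant not used here, sketched below, would hold out part of the training data and stop early once the validation loss stops improving.

# Optional variant (not in the original post): validation split + early stopping.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=1,
                                            restore_best_weights=True)
model.fit(X_train, y_train, epochs=5, validation_split=0.1,
          callbacks=[early_stop])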
Model test
validation_sample = df.sample(500, random_state=42)
X_val = validation_sample['comment_text'].astype(str)
X_val = X_val.fillna('DUMMY')
y_val = validation_sample['target']
y_val = y_val.apply(lambda x: 1 if x > 0.5 else 0)
loss, accuracy = model.evaluate(get_seqs(X_val), y_val)
print('Evaluation accuracy: {0}'.format(accuracy))
16/16 [==============================] - 2s 102ms/step - loss: 0.0967 - accuracy: 0.9660
Evaluation accuracy: 0.9660000205039978
Here we sample from the data that was already used for X_train and run predictions on it.
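Because the toxic class is a minority, accuracy alone can be flattering. An extra check not in the original code, using scikit-learn, could report precision and recall on the same sample; note the model emits one score per timestep (see the summary above), so the first timestep is taken here, mirroring the x[0] indexing used for the submission below.

# Extra check (not in the original post): precision/recall on the validation sample.
from sklearn.metrics import classification_report

val_probs = model.predict(get_seqs(X_val))
val_preds = (val_probs.reshape(len(y_val), -1)[:, 0] > 0.5).astype(int)
print(classification_report(y_val, val_preds))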
Test file load & predict
test = pd.read_csv('./test.csv')
X_test = test['comment_text'].astype(str)
X_test = X_test.fillna('DUMMY')
probs = model.predict(get_seqs(X_test), verbose=1)
probs = [x[0] for x in probs]
model.save("attention_md.h5")
submission = pd.DataFrame(test['id']).reset_index(drop=True)
submission['prediction'] = pd.Series(probs, name='prediction')
submission.to_csv('submission.csv', index=False)
In the same way, the test file is converted, predictions are made, and the submission file for Kaggle is generated.
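One note on the saved attention_md.h5 file: because the model contains a custom layer, reloading it later requires registering that layer as a custom object. A minimal sketch, using the standard Keras custom_objects mechanism:

# Reloading the saved model later: SeqSelfAttention must be passed as a custom
# object, otherwise load_model cannot deserialize the layer.
from tensorflow import keras
from keras_self_attention import SeqSelfAttention

restored = keras.models.load_model(
    "attention_md.h5",
    custom_objects={'SeqSelfAttention': SeqSelfAttention},
)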
--
Full source download:
github.com/Joonyeong97/Tensorflow-tutorial