Tensorflow Keras - 6 (Using the Self Attention Layer)
    Machine Learning/Tensorflow 2021. 3. 20. 12:59

    This post was written by modifying source code that was shared on Kaggle two years ago.

     

    Source:

    www.kaggle.com/arcisad/keras-bidirectional-lstm-self-attention?select=train.csv

     

    Keras Bidirectional LSTM + Self-Attention


    Data set:

    www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data?select=test.csv

     

    Jigsaw Unintended Bias in Toxicity Classification

    Detect toxicity across a diverse range of conversations


    The data is text data, labeled according to whether the context of each comment is positive or negative.
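    For reference, here is a quick look at the two columns this post uses (a hypothetical check, not part of the original notebook), assuming train.csv from the competition:

    import pandas as pd

    df = pd.read_csv('./train.csv')
    # 'target' is a continuous toxicity score in [0, 1]; 'comment_text' is the raw comment
    print(df[['comment_text', 'target']].head())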

     

    Libraries

    #!pip install keras-self-attention
    
    #https://pypi.org/project/keras-self-attention/
    import pandas as pd
    import numpy as np
    
    from tensorflow import keras
    from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Flatten, Dense
    
    from tensorflow.keras.models import Sequential
    from keras_self_attention import SeqSelfAttention

    The self-attention layer is a separate package that you need to install yourself.

     

    !pip install keras-self-attention

     

    Data load & preprocessing

    #load
    df = pd.read_csv('./train.csv')
    
    #preprocessing: sample 100,000 comments and binarize the target
    training_sample = df.sample(100000, random_state=0)
    X_train = training_sample['comment_text'].astype(str)
    X_train = X_train.fillna('DUMMY')
    y_train = training_sample['target']
    y_train = y_train.apply(lambda x: 1 if x > 0.5 else 0)

    X_train keeps only the comment_text column, and the target values in y_train are converted to 0 and 1.

    In other words, the comments are split into positive and negative classes.
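    As a quick sanity check on the binarized labels (not in the original post), you could look at the class balance:

    # fraction of comments labeled 1 vs. 0 after thresholding the target at 0.5
    print(y_train.value_counts(normalize=True))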

    def get_seqs(text):
        # convert raw text to integer sequences and pad them to a fixed length
        # note: `tokenizer` and `max_length` are defined in the Hyper parameters section below,
        # so the tokenizer must be fitted before this function is called
        sequences = tokenizer.texts_to_sequences(text)
        padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
        return padded_sequences

    As explained in the previous post, a tokenizer is essential for natural language processing. X_train is converted to padded sequences with get_seqs only after the tokenizer has been fitted in the next section.
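    To illustrate what get_seqs produces, here is a minimal sketch on a made-up toy corpus (the sentences and the short length of 8 are only for illustration):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    toy_texts = ['this comment is kind and helpful', 'this comment is rude']
    toy_tokenizer = Tokenizer(num_words=100)
    toy_tokenizer.fit_on_texts(toy_texts)

    toy_seqs = toy_tokenizer.texts_to_sequences(toy_texts)    # lists of word indices
    toy_padded = pad_sequences(toy_seqs, maxlen=8, padding='post')
    print(toy_padded)    # each row is one sentence, zero-padded at the end ('post') to length 8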

     

    Hyper parameters

    epochs = 2
    max_num_words = 20000  # vocabulary size kept by the tokenizer
    max_length = 128       # fixed sequence length after padding
    tokenizer = Tokenizer(num_words=max_num_words)
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    
    # now that the tokenizer is fitted, convert the training text to padded sequences
    X_train = get_seqs(X_train)

    Model creation

    model = Sequential()
    # map each word index to a 100-dimensional vector
    model.add(Embedding(max_num_words, 100, input_length=max_length))
    # bidirectional LSTM over the sequence (return_sequences=True keeps per-timestep outputs)
    model.add(Bidirectional(LSTM(units=128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
    # self-attention re-weights the timesteps by their attention scores
    model.add(SeqSelfAttention(attention_activation='sigmoid'))
    model.add(Bidirectional(LSTM(units=64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
    model.add(SeqSelfAttention(attention_activation='sigmoid'))
    # sigmoid output for the binary toxicity label
    model.add(Dense(1, activation='sigmoid'))

    The Embedding layer comes first and turns the padded word-index sequences into higher-dimensional vectors.

    Then a bidirectional LSTM is applied, and the self-attention layer that follows it gives higher weight to the more important words according to their attention scores.

    The same BiLSTM + self-attention pattern is repeated once more after that.
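    As a rough sketch of the idea behind this kind of additive self-attention (a simplified NumPy illustration, not the exact SeqSelfAttention implementation, which also uses bias terms and applies attention_activation to the scores):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    T, d = 5, 4                     # 5 timesteps, 4 features (stand-in for the BiLSTM outputs)
    h = rng.normal(size=(T, d))

    # additive self-attention: e[t, s] = v . tanh(Wq @ h[t] + Wk @ h[s])
    Wq = rng.normal(size=(d, d))
    Wk = rng.normal(size=(d, d))
    v = rng.normal(size=(d,))

    e = np.tanh((h @ Wq)[:, None, :] + (h @ Wk)[None, :, :]) @ v   # (T, T) raw scores
    a = softmax(e, axis=-1)         # attention weights over the source timesteps
    out = a @ h                     # (T, d): each timestep becomes a weighted mix of all timesteps
    print(out.shape)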

     

    model.summary()

    Model: "sequential_5"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding_5 (Embedding)      (None, 128, 100)          2000000   
    _________________________________________________________________
    bidirectional_6 (Bidirection (None, 128, 256)          234496    
    _________________________________________________________________
    seq_self_attention_4 (SeqSel (None, None, 256)         16449     
    _________________________________________________________________
    bidirectional_7 (Bidirection (None, None, 128)         164352    
    _________________________________________________________________
    seq_self_attention_5 (SeqSel (None, None, 128)         8257      
    _________________________________________________________________
    dense_4 (Dense)              (None, None, 1)           129       
    =================================================================
    Total params: 2,423,683
    Trainable params: 2,423,683
    Non-trainable params: 0
    _________________________________________________________________

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

     

    The optimizer is 'adam', as before. Since this is a binary classification problem, binary_crossentropy is used as the loss, and accuracy is again the metric.
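    For reference, binary cross-entropy for one example is -(y*log(p) + (1-y)*log(1-p)), averaged over the batch. A quick NumPy check (illustrative only):

    import numpy as np

    def binary_crossentropy(y_true, y_pred, eps=1e-7):
        # average of -(y*log(p) + (1-y)*log(1-p)); clip p to avoid log(0)
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    print(binary_crossentropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))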

     

    Model fit

    model.fit(X_train, y_train, epochs=epochs)

    Epoch 1/2
    3125/3125 [==============================] - 2896s 927ms/step - loss: 0.1534 - accuracy: 0.9525
    Epoch 2/2
    3125/3125 [==============================] - 3001s 960ms/step - loss: 0.1028 - accuracy: 0.9638

     

    Model test

    # sample 500 rows from the same DataFrame used for training (rows may overlap with the training sample)
    validation_sample = df.sample(500, random_state=42)
    X_val = validation_sample['comment_text'].astype(str)
    X_val = X_val.fillna('DUMMY')
    y_val = validation_sample['target']
    y_val = y_val.apply(lambda x: 1 if x > 0.5 else 0)
    loss, accuracy = model.evaluate(get_seqs(X_val), y_val)
    print('Evaluation accuracy: {0}'.format(accuracy))

    16/16 [==============================] - 2s 102ms/step - loss: 0.0967 - accuracy: 0.9660
    Evaluation accuracy: 0.9660000205039978

     

    Here we evaluate on a sample drawn from the same data that X_train came from, so some of these comments may have been seen during training and this is not a strict held-out test.

     

    Test file load & predict

    test = pd.read_csv('./test.csv')
    X_test = test['comment_text'].astype(str)
    X_test = X_test.fillna('DUMMY')
    # the model returns one score per timestep; keep the score for the first timestep of each comment
    probs = model.predict(get_seqs(X_test), verbose=1)
    probs = [x[0] for x in probs]
    # save the trained model and write the Kaggle submission file
    model.save("attention_md.h5")
    submission = pd.DataFrame(test['id']).reset_index(drop=True)
    submission['prediction'] = pd.Series(probs, name='prediction')
    submission.to_csv('submission.csv', index=False)

     

    In the same way, the test file is converted to padded sequences, the model makes predictions, and the file to submit to Kaggle is created.
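    If you need to reload the saved model later, the custom attention layer has to be registered as a custom object (a sketch, reusing the filename saved above):

    from tensorflow import keras
    from keras_self_attention import SeqSelfAttention

    # SeqSelfAttention is a custom layer, so Keras needs it in order to deserialize the .h5 file
    restored = keras.models.load_model(
        'attention_md.h5',
        custom_objects=SeqSelfAttention.get_custom_objects()
    )
    restored.summary()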

     

    --

    Download the full source code

    github.com/Joonyeong97/Tensorflow-tutorial

     


     
