Transformer in PyTorch (Part 0: Preparing the IMDb dataset)

Why I'm writing this

Time to do some NLP!
The goal is to decide whether a review is positive or negative.

References

Overview

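We use IMDb, a set of English-language movie reviews, to build a model that decides whether a text is positive or negative. This post covers preparing the data for that.
The data does not come in a form that can be used for training as-is, so we reshape it until it can.
For example, reviews contain line-break markup such as <br /> and symbols such as ! and ?, so we add a preprocessing step that removes them.
And so on.
By the way, sorry for suddenly switching to English data after explaining everything with Japanese data in the earlier posts. I could have brought in Japanese review data, but the reference book uses English data, so please bear with me.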

Code

github.com

Details

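Downloading and extracting the IMDb data produces a directory named aclImdb. Under it are train/pos and train/neg (and likewise test/pos and test/neg), each holding 12,500 text files, for 25,000 reviews per split.
Each text file is one movie review. pos holds highly rated reviews, 7 or above on a 1-10 scale, and neg holds low-rated ones, 4 or below.
We merge these into a single file, with each line shaped as review text \t 1 or 0 (1 for a high rating, 0 for a low one).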
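
As a minimal sketch of that line format (to_tsv_line is a hypothetical helper, not a function from the code in this post), assuming the label is derived from the directory name pos or neg:

```python
def to_tsv_line(review, dir_name):
    """Build one TSV line: review text, a tab, then the label."""
    # Assumed mapping: reviews under pos/ get '1', reviews under neg/ get '0'.
    label = {'pos': '1', 'neg': '0'}[dir_name]
    # A tab inside the review would break the two-column layout, so replace it first.
    review = review.replace('\t', ' ').strip()
    return review + '\t' + label + '\n'


line = to_tsv_line('A great\tmovie.', 'pos')
print(repr(line))  # 'A great movie.\t1\n'

# Reading the line back is a single split on the tab.
review, label = line.rstrip('\n').split('\t')
print(review, label)
```

Because tabs are replaced before the line is built, every line splits into exactly two fields when read back.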

For reference, a single text file looks like this (train/pos/12341_7.txt):

Though Lang's version is more famous,Borzage's work is not devoid of interest ,far from it:its "celestial" sequences are even better.
The metaphor of the train (perhaps borrowed from the ending of Abel Gance's "la roue" ) is eventually more convincing than the "up above" heavenly world.
<br /><br />Borzage's tenderness for his characters shows in Marie's character and love beyond the grave is one of his favorite subjects (the ending of "three comrades" ).
The amusement park seems to be everywhere: we see it even when we are in Marie's poor house.I do not think that the sets are that much cheesy,they are stylized to a fault.
The fair from a distance almost gives a sci-fi feel to the movie.
<br /><br />Borzage never forgets his social concerns: in the heavenly train going up,the Rich cannot stand to be mixed up with the riffraff but as "chief magistrate" tells :"here there's no more difference" .<br /><br />Not a major work for Borzage (neither is Lang's version),but to seek out if you are interested in the great director's career.

There are things scattered through it that we want to filter out, such as <br /> and the "" used for emphasis.

So we filter them out.

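make_tsv_file first replaces tabs in the review, then formats each line as review text \t 1 or 0 and writes it to a file.
tokenizer_with_preprocessing passes the text through preprocessing_text and then through tokunizer_punctuation.
preprocessing_text is a symbol-stripping function: it removes <br /> and every symbol in string.punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) except . and , which are kept and padded with spaces.
tokunizer_punctuation works the same way as the Japanese word segmentation in the earlier posts: it turns "I like cats" into a list like ["I", "like", "cats"].
Unlike Japanese, English separates words with spaces, so text.split() is enough to get the list. So easy.
With that, the review text and the preprocessing for it are in place, so next up is probably building the dataloader.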
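
To make the text.split() point concrete, and to show why . and , get padded with spaces before splitting, here is a small sketch. split_words is a hypothetical name used only for this illustration.

```python
def split_words(text):
    # str.split() with no argument splits on any run of whitespace.
    return text.strip().split()


print(split_words('I like cats'))     # ['I', 'like', 'cats']
print(split_words('I like cats.'))    # ['I', 'like', 'cats.']  (period stuck to the word)
print(split_words('I like cats . '))  # ['I', 'like', 'cats', '.']
```

Without the padding, "cats." and "cats" would end up as two different vocabulary entries.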

import glob
import os
import re
import string


def make_tsv_file(data_path, kind):
    # Merge aclImdb/<kind>/{pos,neg}/*.txt into one TSV: review \t label (1 = pos, 0 = neg).
    tsv_path = os.path.join(data_path, 'IMDb_' + kind + '.tsv')

    with open(tsv_path, 'w', encoding='utf-8') as outf:  # 'w' overwrites any existing file.
        for sentiment, label in (('pos', '1'), ('neg', '0')):
            target_files = glob.glob(os.path.join(data_path, kind, sentiment, '*.txt'))
            for text_file in target_files:
                with open(text_file, 'r', encoding='utf-8') as inf:
                    text = inf.readline().rstrip('\n')
                text = text.replace('\t', ' ')  # Tab is the TSV delimiter, so replace tabs with spaces first.
                outf.write(text + '\t' + label + '\n')


def preprocessing_text(text):
    text = re.sub('<br />', '', text)  # Remove HTML line-break tags.

    for symbol in string.punctuation:
        if symbol != '.' and symbol != ',':  # Remove every symbol except . and ,
            text = text.replace(symbol, '')

    text = text.replace('.', ' . ')  # Pad . with spaces so it is treated as its own token.
    text = text.replace(',', ' , ')  # Same for , (otherwise a word followed by . would count as a different word).
    return text


def tokunizer_punctuation(text):
    return text.strip().split()  # Punctuation is already space-padded, so strip both ends and split on whitespace.


def tokenizer_with_preprocessing(text):
    text = preprocessing_text(text)
    results = tokunizer_punctuation(text)
    return results


def main():
    data_path = os.path.join('/path', 'to', 'aclImdb')
    if not os.path.exists(os.path.join(data_path, 'IMDb_train.tsv')):
        make_tsv_file(data_path, 'train')
    if not os.path.exists(os.path.join(data_path, 'IMDb_test.tsv')):
        make_tsv_file(data_path, 'test')

    print(tokenizer_with_preprocessing('I like cats.'))



if __name__ == '__main__':
    main()
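
To see what this preprocessing does to real review text, here is a minimal self-contained sketch run on a fragment of the sample review. It is not the code above: preprocess and tokenize are hypothetical names, it uses str.translate instead of the replace loop, and it swaps <br /> for a space rather than deleting it so that words on either side of a break are not glued together. Those choices are my assumptions for illustration.

```python
import string

# Assumption: same rule as preprocessing_text, i.e. delete every
# punctuation character except '.' and ','.
_DELETE_TABLE = str.maketrans(
    '', '', ''.join(c for c in string.punctuation if c not in '.,'))


def preprocess(text):
    # Must run before punctuation removal, or '<br />' would be left as 'br '.
    text = text.replace('<br />', ' ')
    text = text.translate(_DELETE_TABLE)
    # Pad '.' and ',' so they become separate tokens after split().
    return text.replace('.', ' . ').replace(',', ' , ')


def tokenize(text):
    return preprocess(text).split()


sample = 'tells :"here there\'s no more difference" .<br /><br />Not a major work'
print(tokenize(sample))
# ['tells', 'here', 'theres', 'no', 'more', 'difference', '.', 'Not', 'a', 'major', 'work']
```

Note that apostrophes are dropped too, so "there's" becomes "theres". Whether that is acceptable depends on how the vocabulary is built later.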