Neural Collaborative Filtering in PyTorch (Part 1: Data Preparation)

Why I'm writing this

I want to combine recommendation with deep learning.
To start, I'll imitate a well-known approach.

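References

Overview

I want to do recommendation using deep learning.
I found code that implements the Neural Collaborative Filtering paper in PyTorch, so I'll rewrite it myself and check how it behaves.
The code covers [1] Generalized Matrix Factorization, [2] Multi-Layer Perceptron, and [3] Neural Matrix Factorization, which combines the two. I'll start with [1].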

Code

github.com

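Details

Ever since Matrix Factorization-based methods took the top spot in the Netflix contest, they have been the de facto standard for recommendation.
Matrix Factorization is a method that compresses the user * item matrix used in Collaborative Filtering into a lower-dimensional representation.

In Collaborative Filtering, suppose that, as in the figure below, whether user1 to user3 have seen item1 to item5 is encoded as 1/0. Given the same seen/not-seen data for user4,
we can measure the distance between user4 and user1 to user3 from that information. Recommending items that a user close to the target (here user1 is close to user4) has seen, but that the target has not seen yet, gives a recommendation that fits the target's tastes.
The matrix behind this distance calculation has num of user * num of item entries, so the amount of computation becomes enormous.
Matrix Factorization speeds this up by approximating those vectors in a lower-dimensional space, while aiming to keep the accuracy.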

[Figure: collaborative filtering]
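
The neighbour-based idea described above can be sketched in a few lines of plain Python. This is only a rough sketch under my own assumptions: the seen/not-seen values are a toy example I made up for illustration, and Jaccard similarity is my choice of measure, since the post does not name one. The names `seen`, `jaccard`, and `recommend` are hypothetical.

```python
# Toy implicit-feedback data: which items each user has seen.
# These values are illustrative assumptions, not read off the figure.
seen = {
    "user1": {"item1", "item2", "item3", "item5"},
    "user2": {"item2", "item3"},
    "user3": {"item2", "item3", "item4"},
    "user4": {"item1", "item3", "item4", "item5"},
}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, seen):
    """Return the nearest neighbour and the items it has seen that target has not."""
    others = [u for u in seen if u != target]
    nearest = max(others, key=lambda u: jaccard(seen[target], seen[u]))
    return nearest, sorted(seen[nearest] - seen[target])


# Similarity of user4 to every other user.
similarities = {u: jaccard(seen["user4"], seen[u]) for u in seen if u != "user4"}
nearest, items = recommend("user4", seen)
print(similarities)
print(nearest, items)
```

With this toy data, user4 ends up closest to user1, then user3, then user2, and the only candidate to recommend is the one item user1 has seen that user4 has not.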

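The right side of the figure above illustrates a limitation of Matrix Factorization: user4 is closest to user1, and closer to user3 than to user2,
but there are cases where the vector space simply cannot express that relationship.
Raising the dimensionality of the vector space avoids the problem, but keeping the number of dimensions high leads to overfitting.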

The above example shows the possible limitation of MF caused by the use of a simple and fixed inner product to estimate complex user–item interactions in the low-dimensional latent space. We note that one way to resolve the issue is to use a large number of latent factors K. However, it may adversely hurt the generalization of the model (e.g., overfitting the data), especially in sparse settings [26]. In this work, we address the limitation by learning the interaction function using DNNs from data.

To get around this, a Deep Neural Network is used.
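
To make the "simple and fixed inner product" in the quote a bit more concrete, here is a minimal NumPy sketch comparing a plain MF score with a GMF-style score, where the element-wise product of the user and item vectors is weighted and passed through an activation. Treat it as an illustration only: the sizes, the random latent vectors, and the names `mf_score` and `gmf_score` are my assumptions, and nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, latent_dim = 4, 5, 3  # hypothetical toy sizes

# Latent factors, randomly initialised here; in a real model they are learned.
P = rng.normal(size=(num_users, latent_dim))  # one row per user
Q = rng.normal(size=(num_items, latent_dim))  # one row per item


def mf_score(u, i):
    """Plain MF: fixed inner product of the user and item vectors."""
    return float(P[u] @ Q[i])


def gmf_score(u, i, h, activation=lambda x: x):
    """GMF-style score: weighted sum of the element-wise product, then an activation."""
    return float(activation(h @ (P[u] * Q[i])))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# With all-ones weights and an identity activation, the GMF-style score equals MF.
ones = np.ones(latent_dim)
print(mf_score(3, 0), gmf_score(3, 0, ones))

# With other weights and a sigmoid, the score is no longer a plain inner product.
h = rng.normal(size=latent_dim)
print(gmf_score(3, 0, h, sigmoid))
```

The point of the comparison is that MF is the special case where the weights are fixed to 1. Letting the weights be learned from data is what gives the model more room than a fixed inner product.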

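Code walkthrough

Data preparation

This time I use the MovieLens dataset: time-stamped data in which roughly 6,000 users rated close to 4,000 movies on a scale of 1 to 5.
Only ratings.dat is used.
ratings.dat stores user_id, movie_id, rating, and timestamp, separated by ::, as shown below.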

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

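Load this file, then renumber user_id and movie_id as consecutive integers starting from 0 so they are easier to handle internally.
The data returned by preprocess_dataset(ml1m_rating) has userId and itemId assigned as consecutive integers starting from 0, alongside rating and ts, as shown below.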

itemId	rating	ts
0	5	978300760
1	3	978302109
2	3	978301968
3	4	978300275
4	5	978824291

import os

import numpy as np
import pandas as pd


def preprocess_dataset(ml1m_rating):
    # Map each original uid to a consecutive 0-based userId.
    # [['uid']] keeps a DataFrame; ['uid'] alone would return a Series.
    user_id = ml1m_rating[['uid']].drop_duplicates()
    user_id['userId'] = np.arange(len(user_id))
    ml1m_rating = pd.merge(ml1m_rating, user_id, on=['uid'], how='left')
    # Do the same for mid -> itemId.
    item_id = ml1m_rating[['mid']].drop_duplicates()
    item_id['itemId'] = np.arange(len(item_id))
    ml1m_rating = pd.merge(ml1m_rating, item_id, on=['mid'], how='left')
    # Keep only the renumbered IDs plus rating and timestamp.
    ml1m_rating = ml1m_rating[['userId', 'itemId', 'rating', 'ts']]
    print('Range of userId is [{}, {}]'.format(ml1m_rating.userId.min(), ml1m_rating.userId.max()))
    print('Range of itemId is [{}, {}]'.format(ml1m_rating.itemId.min(), ml1m_rating.itemId.max()))

    return ml1m_rating


def main():
    # config setting
    data_root = os.path.join('.', 'data', 'ml-1m')
    rating_file = os.path.join(data_root, 'ratings.dat')

    # load rating file ('::' is a multi-character separator, so engine='python' is needed)
    ml1m_rating = pd.read_csv(rating_file, sep='::', header=None, names=['uid', 'mid', 'rating', 'ts'], engine='python')
    ml1m_rating = preprocess_dataset(ml1m_rating)


if __name__ == '__main__':
    main()
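
As a cross-check on the renumbering step, here is a small sketch that does the same thing with `pd.factorize` instead of the drop_duplicates + merge approach in preprocess_dataset. This is a swapped-in alternative, not what the referenced code does. It reads the three sample lines shown earlier from an in-memory string, plus one extra row that I made up so that a movie ID repeats.

```python
import io

import pandas as pd

# The three sample lines from ratings.dat, plus one hypothetical extra row
# (user 2 rating movie 1193) so that an ID repeats.
sample = (
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
    "2::1193::4::978300000\n"
)

ratings = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    header=None,
    names=["uid", "mid", "rating", "ts"],
    engine="python",  # multi-character separators need the Python parser
)

# pd.factorize assigns 0-based integer codes in order of first appearance.
ratings["userId"], _ = pd.factorize(ratings["uid"])
ratings["itemId"], _ = pd.factorize(ratings["mid"])

print(ratings[["userId", "itemId", "rating", "ts"]])
```

Because the codes follow the order of first appearance, the repeated movie 1193 gets the same itemId both times, which is the property the merge-based version relies on as well.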

I've actually worked through everything up to training, but that's it for today.
I'll continue tomorrow by adding to this post.