Matrix Factorization

Notice

Recent Posts

Recent Comments

Today

Total

250x250

관리 메뉴

ROKO

Matrix Factorization 본문

Machine Learning

Matrix Factorization

RO_KO 2023. 4. 18. 15:32

728x90

Overview

PCA = linear AE
PCA ~ MF
MF는 matrix completion 문제로 확장하여 볼 수 있다

PCA와 MF의 차이점은 무엇일까?

Matrix Factorization은 SVD(Singular Value Decompostion)를 활용하기 때문에 정확히 비교하자면 PCA vs SVD를 살펴보아야한다.

PCA -> \(X ≈ UZ\)

SVD -> \(X = U\Sigma V^T\)

\(X\)가 mXn matrix라고 했을때,

\(U\)는 mXm matrix인 orthonormal columns 이다.

\(V\)는 nXn matrix인 orthonormal rows이다.

\(\Sigma\)은 mXn diagonal matrix로써 min(m,n) 개수(rank)만큼의 non-negative singular values를 가지고 있다. 이는 PCA에서의 주성분을 나타내던 eigenvalue와 같은 역할을 한다.

(conventional하게 특이값은 내림차순으로 정렬되어 있다고 가정)

PCA와 SVD의 가장 큰 차이점은 PCA는 가역행렬에 대해서만 적용가능하나 SVD는 임의의 mXn 행렬에 적용가능하다는 것이다.

SVD는 PCA와 같이 \(rank(\Sigma)\)중에서 k개만큼 값을 뽑아 전체 행렬을 근사한다. (Rank-K approximation)

또한 모두 데이터의 차원을 줄임으로써 시각화에 용이하게 하고 다중공산성을 줄이며, 차원의 저주를 최소화한다. 하지만 "데이터의 표현력이 줄어드므로 안좋은게 아닐까" 하는 생각이 든다. 이 부분에 대해서 어떻게 생각할 수 있을까?

우선 차원의 저주는 심각한 문제다. 아마 PCA내용에서도 언급했겠지만 차원의 수가 조금만 커지더라도 현실세계에서 충분한 양을 모을 수 없다. 따라서 우리는 너무 많은 차원보다 어느정도 성능과 표현력을 보장하는 적당한 차원을 찾는 게 중요하다.

SVD는 보통 추천시스템에서 많이 쓰이는데 U와 V를 각각 user, item의 정보로 보아 X feature matrix를 근사하여 개인화(Personalization) 추천 알고리즘을 적용한다. 물론 모든 user와 item정보를 사용하면 더 좋은 추천이 가능할 것이다.

하지만 user가 10억명이라면? item이 10억개라면?

우리는 추천 한번을 일생동안 받지 못할지도 모른다. 추천알고리즘은 또한 가장 확률이 높은 하나만 추천해주기보다 다양한 추천을 보여주기에 단일 성능을 높일 필요가 없고 높은 추천성능을 유지하되 사용자가 불편함을 느끼지않는 빠른 시간복잡도를 가지고 있어야한다. 이러한 목적을 기준으로 볼때 MF는 좋은 알고리즘 기법이다.

그리고 대부분의 X는 sparse matrix로 구성되어있어 비효율적으로 많은 공간을 차지하는 문제도 해결할 수 있다.

Example) The Netflix Problem

각 유저는 모든 content가 아닌 일부와 관계를 가지고 있다. 우리는 내제된 정보를 추론하여 유저가 좋아할 만할 content를 추천해줘야 한다.

현재 matrix 상태를 아래와 같이 생각 할 수 있다.

"?" 부분을 채우는 matrix completion problem를 풀어야 한다.

Data : Users rate some movies (very sparse)
Task : Finding missing data for recommending new movies to users
Algorithms : Alternating Least Square method, Gradient Descent, Non-negative Matrix Factorization, low rank matrix Completion, etc

Latent factor model

attempt to explain the ratings by characterizing both movies and users on a number of factors K

Alternating least squares

K-dimenstional space \(u_n\) is a user representation
\(z_m\) is a movie representation
\(R_{nm} ≈ u_n^Tz_m\)
This is a MF problem!

Let \(O=\{(n, m):entry(n,m) matrix \; R \; is \; observed\}\)

\(\underset{U,Z}{min}\sum_{(n,m)\in O}(R_nm-u_n^Tz_m)^2\)

The objective is non-convex and in fact it's NP-had to optimize.

(Low- Rank Matrix Approximation with Weights or Missing Data is NP-hard by Gillis and Glineur, 2011)

As a function of either U or Z individually, the problem is convex and easy to optimize.

-> Alternating Least Squares (ALS) : fix Z and optimize U, followed by fix U and optimize Z and so on until convergence.

ALS for Matrix Completion algorithm

Initialize U and Z randomly
repeat
for n=1,...,N do
\(u_n=(\sum_{m:(n,m)\in O}z_mz_m^T)^{-1}\sum_{m:(n,m)\in O}R_{nm}z_m\)
for m=1,...,M do
\(z_m=(\sum_{n:(n,m)\in O}u_nu_n^T)\sum_{n:(n,m)\in O R_{nm}u_n}\)
until convergence -> \(\underset{U,Z}{min}\sum_{(n,m)\in O}(R_{nm}-u_n^Tz_m)^2\)

Equation

\(R_{nm}=u_n^Tz_m\)

\(\leftrightarrow R_{nm}z_m^T=u_n^Tz_mz_m^T\)

\(\leftrightarrow R_{nm}z_m^T(z_mz_m^T)^{-1}=u_n^T\)

\(\leftrightarrow (z_mz_m^T)^{-1}R_{nm}z_m=u_m\)

SVD는 결측치를 사용하기 때문에 sparsity가 커질수록 성능에 보장이 어렵다. 하지만 ALS는 결측치를 사용하지 않으므로 sparsity에 강인함을 가지고 있다. 하지만 X가 커질수록 연산량이 늘어 최적화 시간이 오래 걸리는 단점이 있다.

It's possible to view K-means as a MF

"Reconstruction" of the data given by UZ

K-means distortion function in matrix form : \(\sum_{n=1}^N\sum_{k=1}^Kr_k^{(n)}||m_k-x^{(n)}||^2=||X-UZ||_F^2\)

영화 장르를 cluster center로 보고 유저들을 각 장르에 clustering 시키는 모습으로 연관지을 수 있다.

Co-clustering

clusters both the rows and columns of a data matrix
represent as the indicator matrix for rows, times the matrix of means for each block, times the indicator matrix for columns

Sparse Coding

Efficient coding hypothesis: the structure of our visual system is adapted to represent the visual world in an efficient way
Olshausen and Field fit a sparse coding model to natural imgaes to try to determine what's the most efficient represenation
Similar to, but more gerneal than PCA
small patches들을 이용해 선형 결합으로 전체를 표현하는것을 말한다

Suppose that dictionary of.basis functions \(\{a_k\}_{k=1}^K\) which can be combined to model each patch

Each patch is approximated as a linear combination of a small number of basis functions:

\(x=\sum_{k=1}^Ks_ka_k=As\) s is a sparse vector

This is an overcomplete representation, in that typically K>D.

The requirement that s is sparse makes things interesting.

Inference in the sparse coding model: \(\underset{s}{min} ||x-As||^2+\beta ||s||_1\) \(\beta\) is a hyperparameter that trades off "reconstruction error" vs "sparsity"

Loss fuction

\(a_k\) 제약이 필요한 이유는 Regularization termdㅔ 의해 s는 작아지면서 reconstruction error를 줄이기 위해 \(a_k\)가 커지기 때문에 이를 막기 위해서 이다.

Sparse coding loss function을 최적화하기 위해 alternating minimization scheme 방식을 사용할 수 있다. (such as K-means, EM, low-rank matrix completion, etc)

728x90

'Machine Learning' 카테고리의 다른 글

Statistics & Probability (0)	2023.03.30
Linear algebra for ML (0)	2023.03.26
GMM (Gaussian Mixture Model) (0)	2023.03.18
K-Means (0)	2023.03.16

'Machine Learning' Related Articles

Comments

ROKO

Matrix Factorization 본문

Matrix Factorization

Overview

PCA와 MF의 차이점은 무엇일까?

Example) The Netflix Problem

Latent factor model

Alternating least squares

It's possible to view K-means as a MF

Co-clustering

Sparse Coding

Loss fuction

'Machine Learning' 카테고리의 다른 글

티스토리툴바