The classical collaborative filtering approach involves computing the cosine similarity between all pairs of users or items. In the book Practical Recommender Systems, the author proposes a simple way to reduce this computational burden: apply K-means clustering first and only compute the cosine similarity within each cluster. The rest of this article benchmarks this idea on the MovieLens dataset.

For the baseline implementation, we simply use the cosine_similarity function from scikit-learn. For example, given a matrix M:

from sklearn.metrics.pairwise import cosine_similarity

# All-pairs similarity; the result stays sparse when M is sparse.
cosine_similarity(M, M, dense_output=False)
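
To keep the examples below self-contained, here is a hypothetical M and a timing of the baseline. The shape and density are made up; the random sparse matrix simply stands in for the real user-item ratings matrix:

import time
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the user-item ratings matrix.
M = sparse_random(5000, 2000, density=0.01, format="csr", random_state=0)

start = time.perf_counter()
baseline = cosine_similarity(M, M, dense_output=False)
print(f"all-pairs similarity took {time.perf_counter() - start:.2f}s")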

The patched implementation works in two steps (full sketch below):

  1. compute K-means on the full dataset
  2. for each cluster, compute the cosine similarity between items within that cluster

With k roughly balanced clusters, the number of row pairs to compare drops from about n² to n²/k, which is where the savings come from.

from sklearn.cluster import KMeans

def cosine_similarity_kmeans(m, num_clusters=4):
    # Step 1: cluster the rows of m.
    kmeans = KMeans(n_clusters=num_clusters, n_init=3, max_iter=1000)
    kmeans.fit(m)
    labels = kmeans.labels_
    # Step 2: only compare rows that landed in the same cluster.
    sims = []
    for cluster_id in range(num_clusters):
        cluster_rows = m[labels == cluster_id]
        if cluster_rows.shape[0] < 10:
            print(f"WARNING: small cluster with only {cluster_rows.shape[0]} rows")
        sim = cosine_similarity(cluster_rows, cluster_rows, dense_output=False)
        sims.append(sim)
    return sims, kmeans
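
Note that the function returns one similarity matrix per cluster, so row indices in each matrix are positions within that cluster rather than within the original matrix. A minimal sketch of mapping the results back to original row indices, assuming the sparse M from above:

import numpy as np

sims, kmeans = cosine_similarity_kmeans(M, num_clusters=8)

# Original row indices of the items assigned to cluster 0.
cluster_members = np.where(kmeans.labels_ == 0)[0]

# Similarities of the first item in cluster 0 to its cluster-mates,
# keyed by original row index.
row = sims[0][0].toarray().ravel()
for original_idx, score in zip(cluster_members, row):
    print(original_idx, round(float(score), 3))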

For a varying number of clusters, I ran the algorithm on the first 40k users of the MovieLens dataset and measured the end-to-end time.

With an appropriate number of clusters, the processing time drops from 51 seconds to 17 seconds.
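
The exact numbers will vary with hardware and with the slice of the dataset used, but a simple loop along these lines is enough to reproduce the comparison (the cluster counts are illustrative):

import time

for k in (2, 4, 8, 16):
    start = time.perf_counter()
    cosine_similarity_kmeans(M, num_clusters=k)
    print(f"{k} clusters: {time.perf_counter() - start:.1f}s")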

To recap: we can leverage a clustering algorithm to split the dataset into smaller, denser clusters and only compute similarities between pairs of items within the same cluster. The trade-off is that similarities between items that land in different clusters are never computed, so the result is an approximation of the full similarity matrix. Applied to a subset of the MovieLens dataset, this technique cut the computation time by roughly 66%, from 51 seconds to 17.