Authors: YueWang, RuiqiXu, YulinChe, LeiChen, QiongLuo, RuiMao
Name of Conference: International Conference on Very Large Data Bases (VLDB 2021), Aug 16- 20, 2021, Copenhagen, Denmark
Date of Publication: Nov 11, 2020
Measuring similarities among different nodes is important in graph analysis. SimRank is one of the most popular similarity measures. Given a graph G(V, E) and a source node u, a single-source Sim-Rank query returns the similarities between u and each node v ∈ V .This type of query is often used in link prediction, personalized recommendation and spam detection. While dealing with a large graph is beyond the ability of a single machine due to its limited memory and computational power, it is necessary to process singlesource SimRank queries in a distributed environment, where the graph is partitioned and distributed across multiple machines. However, most current solutions are based on shared-memory model, where the whole graph is loaded into a shared memory and all processors can access the graph randomly. It is difficult to deploy such algorithms on shared-nothing model. In this paper, we present DISK, a distributed framework for processing single-source Sim-Rank queries. DISK follows the linearized formulation of SimRank,and consists of offline and online phases. In the offline phase, a tree-based method is used to estimate the diagonal correction matrix of SimRank accurately, and in the online phase, single-source similarities are computed iteratively. Under this framework, we propose different optimization techniques to boost the indexing and queries. DISK guarantees both accuracy and parallel scalability,which distinguishes itself from existing solutions. Its accuracy, efficiency, parallel scalability and scalability are also verified by extensive experimental studies. The experiments show that DISK scales up to graphs of billions of nodes and edges, and answers online queries within seconds, while ensuring the accuracy bounds.