Meta-learning enhanced next POI recommendation by leveraging check-ins from auxiliary cities

Most existing point-of-interest (POI) recommenders aim to capture user preference by employing city-level user historical check-ins, thus facilitating users' exploration of the city. However, the scarcity of city-level user check-ins poses a significant challenge to user preference learning. Although prior studies attempt to mitigate this challenge by exploiting various context information, e.g., spatio-temporal information, they neglect to transfer knowledge (i.e., common behavioral patterns) from other relevant cities (i.e., auxiliary cities). In this paper, we investigate the effect of knowledge distilled from auxiliary cities and propose a novel Meta-learning Enhanced next POI Recommendation framework (MERec). MERec incorporates the correlation of check-in behaviors among cities into the meta-learning paradigm to help infer user preference in the target city, following the principle of "paying more attention to more correlated knowledge". In particular, a city-level correlation strategy is devised to attentively capture common patterns among cities, so as to transfer more relevant knowledge from more correlated cities. Extensive experiments verify the superiority of the proposed MERec against state-of-the-art algorithms.


Introduction
Next POI recommendation, which aims to recommend the POIs that users are most likely to visit in the future, benefits both location-based social network services, e.g., Foursquare (foursquare.com), and individuals. As users' activities are typically confined to a single city, most existing studies exploit city-level user check-in records to develop next POI recommenders. Table 1 shows the statistics of user-POI interactions for four cities on Foursquare, which are widely explored in prior studies [20,21]. We can observe that CAL has a relatively high density of 1.06%, whereas NYC has an extremely low density of 0.05%. The sparsity of user-POI interactions in many cities thus severely hinders the capability of existing approaches to learn user preference accurately.
To ease this issue, various context information, e.g., spatial and temporal contexts, has been widely exploited in existing next POI recommenders. Specifically, most current research is devoted to capturing the spatio-temporal relations between users and POIs. These methods are built upon various techniques, ranging from matrix factorization [9,16] and Markov chain models [2] to advanced deep learning frameworks, e.g., recurrent neural networks [20] and graph neural networks [12]. However, their ability to learn user preference accurately remains restricted by insufficient training data, owing to the sparse user-POI interactions within a city.
Intuitively, users' check-in behaviors among different cities may share common patterns. This motivates us to conduct an in-depth analysis of the check-in records across different cities (i.e., auxiliary cities), and transfer useful knowledge from such cities for assisting user preference inference within the target city. However, non-overlapping visited POIs between different cities bring challenges in knowledge transfer, that is, blindly leveraging check-in behaviors from auxiliary cities to augment the target city may result in harmful knowledge transfer. We thus seek to investigate two fundamental problems when transferring knowledge from auxiliary cities to the target city as follows.
(1) What to transfer? In e-commerce, overlapping items can be found on shopping sites in different regions. In the city-level location recommendation scenario, by contrast, non-overlapping visited POIs across different cities make it challenging to transfer common behavioral knowledge. Fortunately, mining users' check-in behavioral knowledge over the categorical context (i.e., common category-level patterns) helps address this challenge. For example, the category transition Shop & Service → Food is common to all four cities, which indicates that users in different cities are likely to head to a restaurant after shopping. By contrast, the transition Travel & Transport → Shop is common only in SIN, owing to its developed public transportation.
(2) How to transfer? Although the common category-level patterns captured from auxiliary cities may enhance the recommendation quality for the target city, they inevitably introduce noise if we ignore the cultural diversity and geographical properties of such cities. Hence, determining to what extent we can transfer knowledge from the auxiliary cities to the target city is of great significance.
Accordingly, we propose a novel Meta-learning Enhanced next POI Recommendation (MERec) framework, which incorporates the correlation of category-level behavioral patterns among different cities into the meta-learning paradigm, that is, paying more attention to more correlated knowledge. Specifically, MERec mainly consists of two components: a two-channel encoder to capture the transition patterns of categories and POIs, whereby a city-correlation based strategy is devised to attentively capture common knowledge (i.e., patterns) from auxiliary cities via the meta-learning paradigm; and a city-specific decoder to aggregate the latent representations of the two channels to perform the next POI prediction on the target city.
Overall, our main contributions are threefold: (1) we are the first to study to what extent we can transfer knowledge from auxiliary cities to the target city by differentiating the correlation of category-level behavioral patterns; (2) we propose a novel meta-learning based framework, MERec, which exploits both the transferred knowledge and user behavioral contexts within the target city to alleviate the data sparsity issue; and (3) we conduct extensive experiments on four datasets to validate the superiority of MERec against state-of-the-art methods.

Related Work
Next POI Recommendation. It predicts future POI visits for users based on their historical successive check-in behaviors. Early studies generally employ Markov chains to model the sequential influence [2,18,5]. Recently, recurrent neural network (RNN) based methods have shown great capability in capturing long-term sequential dependencies. Existing studies based on RNN and its variants mainly exploit users' sequential check-ins by incorporating various context information, such as ST-RNN [11], SERM [17], MCARNN [10], ATST-LSTM [8], and iMTL [20]. Despite the great success of these methods, most of them suffer from insufficient user check-ins in many cities, which heavily limits their performance. In this sense, transferring knowledge from auxiliary cities to the target city makes it possible to further enhance user preference learning for next POI recommendation.
Meta-learning for Next POI Recommendation. Transfer learning (TL) aims to transfer knowledge from source domains to the target domain, and has shown strong capability in resolving the sparsity issue. The existing TL-based approach [4] focuses on the cross-city POI recommendation task, owing to the lack of a large amount of overlapping user-POI interactions across cities. Meta-learning (ML) is able to transfer the knowledge learned from multiple tasks to a new task and has recently been introduced in next POI recommendation. For example, Chen et al. [1] proposed CHAML by fusing hard sample mining and curriculum learning into a meta-learning framework. Sun et al. [13] devised MFNP to integrate user preference and region-dependent crowd preference tasks in a meta-learning paradigm. Cui et al. [3] designed Meta-SKR by using sequential, spatio-temporal, and social knowledge to recommend next POIs. Meanwhile, Tan et al. [15] developed METAODE, which models city-irrelevant and city-specific information separately to achieve city-wide next POI recommendation. However, the aforementioned ML-based next POI recommenders fail to account for the correlation of user behavioral patterns when transferring knowledge from auxiliary cities to the target city, i.e., they do not pay more attention to more correlated knowledge.

Data Analysis
It is necessary to analyze the correlation among different cities w.r.t. user check-in behaviors (see Table 1), so as to better guide the knowledge transfer from auxiliary cities to the target city. This is, however, non-trivial due to the non-overlapping visited POIs across cities. Fortunately, POIs in various cities share the same categories, which inspires us to study the POI distribution and user behavioral patterns at the category level to uncover the correlation among cities.
POI Distribution at Category Level. The number of POIs under each category varies considerably across cities due to differences in culture and geography. Hence, we first study the POI distributions of the four cities to help explore the correlation of user behavioral patterns. Specifically, all POIs are characterized by ten first-level categories [14].
Correlation of Cities w.r.t. POI Distribution. The POI distribution of each city enables us to further explore the correlation between cities, i.e., to measure the similarity of cities from the aspect of POI distribution. Specifically, given any two cities A and B, let $[poi^A_1, \dots, poi^A_{|C|}]$ and $[poi^B_1, \dots, poi^B_{|C|}]$ denote the POI distributions over the $|C|$ categories within city A and city B, respectively. We derive their similarity $\gamma_{A,B}$ via the Pearson correlation coefficient, and the results are shown in Fig. 2(a). We find that NYC shows the highest similarity with PHO and the lowest similarity with SIN, implying that cities in the same country (i.e., USA) may have a higher correlation due to their similar culture. Besides, CAL (in Canada) shows relatively high similarity with NYC and PHO, which means that geography is also an important factor when measuring the correlation of cities. Although the correlation of cities can be measured from the aspect of POI distribution, the user behavioral transition pattern is a significant factor in the next POI recommendation task. We thus further explore such correlation from the angle of user sequential behaviors, with the results shown in Fig. 2(b). Interestingly, we observe that the correlation of cities w.r.t. behavioral patterns is quite different from that w.r.t. POI distribution. Specifically, PHO and CAL still keep a high similarity, whereas NYC shows comparably lower similarity with PHO and CAL. To further examine how the four cities correlate and differ in their behavioral patterns, we compare the two most correlated cities (i.e., CAL and PHO) and the two least correlated cities (i.e., NYC and SIN).
For ease of presentation, we select the 10 most frequent category transitions for comparison, as shown in Fig. 2(c-d), where the x-axis denotes the category transitions, e.g., AE → CU (AE2CU), and the y-axis shows the proportion of such a transition within a city. We find that the more correlated cities possess consistent distributions over the frequent category transitions, and vice versa. The above observations reveal that correlations between cities vary, which inspires us to differentiate their influence when transferring knowledge from auxiliary cities to the target city.
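The city similarity analysis in this section can be illustrated with a minimal sketch. It computes the Pearson correlation between two cities' category-level POI distributions, and, as an assumption not spelled out in the text, applies the same coefficient to category-transition proportions to obtain a behavioral correlation. The function names and the toy data are hypothetical and are not taken from the Foursquare datasets.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def category_distribution(poi_categories, categories):
    """Share of POIs under each category, in a fixed category order."""
    counts = Counter(poi_categories)
    total = sum(counts[c] for c in categories)
    return [counts[c] / total for c in categories]


def transition_distribution(sequences, categories):
    """Share of each ordered category transition (a -> b) over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    pairs = [(a, b) for a in categories for b in categories]
    total = sum(counts[p] for p in pairs)
    return [counts[p] / total for p in pairs]


# Toy data (hypothetical): one category label per POI, and daily category sequences.
CATEGORIES = ["Food", "Shop", "Travel"]
city_a_pois = ["Food", "Food", "Shop", "Travel"]
city_b_pois = ["Food", "Food", "Food", "Shop", "Shop", "Travel"]
gamma_poi = pearson(
    category_distribution(city_a_pois, CATEGORIES),
    category_distribution(city_b_pois, CATEGORIES),
)

city_a_seqs = [["Shop", "Food"], ["Travel", "Shop", "Food"]]
city_b_seqs = [["Shop", "Food"], ["Food", "Travel"]]
gamma_behavior = pearson(
    transition_distribution(city_a_seqs, CATEGORIES),
    transition_distribution(city_b_seqs, CATEGORIES),
)
print(round(gamma_poi, 3), round(gamma_behavior, 3))
```

Using one fixed category (or category-pair) order for every city keeps the vectors aligned, which is what makes the coefficient comparable across city pairs.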

The Proposed MERec
This section presents the proposed MERec, which leverages the correlation of behavioral patterns when transferring knowledge from auxiliary cities to the target city, i.e., paying more attention to more correlated knowledge.
Problem Formulation. Each city has its own user set $U$ and POI set $P$, and cities share no common users or POIs. For user $u$, all of his check-in records $r = (p, c, g, t)$ are ordered by timestamp as in [22], where $p$, $c$, $g$, and $t$ denote the POI, its category, its coordinate (i.e., longitude and latitude), and the timestamp, respectively. We then split his historical records into sequences by day and obtain two types of sequences: 1) the $i$-th category sequence, denoted by a set of category tuples, i.e., $C^{u,i} = \{C^u_{t_1}, C^u_{t_2}, \dots, C^u_{t_n}\}$, where $C^u_{t_k} = (c^u_{t_k}, t^u_k)$; and 2) the $i$-th POI sequence, denoted by a set of POI tuples, i.e., $P^{u,i} = \{P^u_{t_1}, P^u_{t_2}, \dots, P^u_{t_n}\}$, where $P^u_{t_k} = (p^u_{t_k}, d^u_{t_k}, t^u_k)$ and $d^u_{t_k}$ is the distance between successive POIs calculated from their coordinates. Given $C^{u,i}$, $P^{u,i}$, the auxiliary cities $Y_A = \{y^{(m)}_{aux} \mid m \in \{1, 2, \dots, M\}\}$, and the target city $Y_T = \{y_{tar}\}$, our goal is to predict user $u$'s next POI $p_{t_{n+1}}$ at time $t_{n+1}$ by transferring knowledge from the auxiliary cities to the target city.
[Fig. 3: Overview of MERec. Recoverable labels: POI-Level Encoder, City-Specific Decoder, Target City, Auxiliary Cities, All Cities.]
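The sequence construction in the problem formulation can be sketched as follows. This is a minimal illustration under stated assumptions: timestamps are Unix seconds, days are delimited in UTC, the distance is the haversine (great-circle) distance in kilometres, and the first check-in of each day is given distance 0.0. None of these choices is specified in the text, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from math import asin, cos, radians, sin, sqrt


def haversine_km(g1, g2):
    """Great-circle distance in km between two (longitude, latitude) pairs."""
    lon1, lat1, lon2, lat2 = map(radians, (*g1, *g2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def build_daily_sequences(records):
    """Split one user's check-ins r = (p, c, g, t) into per-day sequences.

    Returns (category_sequences, poi_sequences). Category tuples are (c, t);
    POI tuples are (p, d, t), where d is the distance to the previous
    check-in of the same day (0.0 for the first one, an assumption).
    """
    by_day = defaultdict(list)
    for record in sorted(records, key=lambda r: r[3]):  # order by timestamp
        day = datetime.fromtimestamp(record[3], tz=timezone.utc).date()
        by_day[day].append(record)

    category_sequences, poi_sequences = [], []
    for day in sorted(by_day):
        cat_seq, poi_seq, prev_g = [], [], None
        for p, c, g, t in by_day[day]:
            d = 0.0 if prev_g is None else haversine_km(prev_g, g)
            cat_seq.append((c, t))
            poi_seq.append((p, d, t))
            prev_g = g
        category_sequences.append(cat_seq)
        poi_sequences.append(poi_seq)
    return category_sequences, poi_sequences


# Toy check-ins (hypothetical), deliberately out of order.
records = [
    ("p3", "Food", (103.80, 1.35), 90000),  # second day
    ("p1", "Shop", (0.0, 0.0), 0),          # first day
    ("p2", "Food", (0.0, 1.0), 3600),       # first day, one hour later
]
cat_seqs, poi_seqs = build_daily_sequences(records)
print(cat_seqs)
print(poi_seqs)
```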
Overview of MERec. The overview of MERec is outlined in Fig. 3. It is mainly composed of a two-channel encoder (i.e., category- and POI-level encoders) with the embedding layer and a city-specific decoder. In particular, the category-level encoder exploits meta-learning to capture the common user check-in transition patterns at the category level in each city, following the principle of "paying more attention to more correlated knowledge". The goal of the POI-level encoder is to learn accurate POI transition patterns in the target city. Lastly, the city-specific decoder performs the next POI prediction by concatenating the hidden states of the above two encoders.
Embedding Layer. It maps each check-in record into an embedding vector. Specifically, in the category-level encoder, the embedding of a category tuple $e^C \in \mathbb{R}^{2d}$ is the concatenation of the category embedding $e^c \in \mathbb{R}^d$ and the time embedding $e^t \in \mathbb{R}^d$; thus the embedding of a category sequence $C^{u,i}$ is formed as $E^C_{u,i} = [e^C_{t_1}, e^C_{t_2}, \dots, e^C_{t_n}]$. Analogously, in the POI-level encoder, the embedding of a POI sequence is denoted by $E^P_{u,i} = [e^P_{t_1}, e^P_{t_2}, \dots, e^P_{t_n}]$, where $e^P$ is the embedding of a POI tuple, represented by the concatenation of the POI embedding $e^p \in \mathbb{R}^d$, the distance embedding $e^{dist} \in \mathbb{R}^d$, and the time embedding $e^t \in \mathbb{R}^d$.
Category-level Encoder. To distil knowledge from auxiliary cities and exploit category-level user behavioral patterns, we extend model-agnostic meta-learning (MAML) [6], with LSTM as the base model, as the framework for the meta-learning update. In particular, we devise a correlation strategy that transfers knowledge based on the correlation of user behavioral patterns among cities. Meanwhile, freezing layers and model fine-tuning are exploited to obtain a generic model that better adapts to the data of the target city.
Meta-learning Setup. Following [1], the recommendation within each city, including the auxiliary and target cities, can be viewed as a single task (with its own dataset $D$) in a meta-learning paradigm. Thus, the check-in sequences of the auxiliary cities $Y_A$ are denoted as $D^{(aux)}_{train}$, and those of the target city $Y_T$ are divided into $D^{(tar)}_{train}$ and $D^{(tar)}_{test}$. We treat each city $y_m$ as a meta-learning task, where each task has a support set $D^{spt}_{y_m}$ for training and a query set $D^{qry}_{y_m}$ for testing. Finally, our goal is to leverage the data from both the auxiliary cities and the target city, i.e., $D_{train} = D^{(aux)}_{train} \cup D^{(tar)}_{train}$, to learn a meta-learner $F_w$, where $w$ denotes its parameters. Accordingly, given the support sets, $F_w$ predicts the parameters $\theta$ of the recommender $f_\theta$ so as to minimize the recommendation loss on the query sets across all cities:
$$w^* = \arg\min_w \sum_{y_m \in Y_A \cup Y_T} \mathcal{L}\big(f_\theta, D^{qry}_{y_m}\big), \quad \theta = F_w\big(D^{spt}_{y_m}\big). \tag{1}$$
Specifically, each iteration of MAML includes a local update and a global update on the sampled task batch: the first phase updates $\theta$ locally on $D^{spt}$ of each task, and the second phase globally updates $\theta$ by gradient descent to minimize the sum of losses on $D^{qry}$ of all tasks.
- Local update: we first sample a batch of cities, and then randomly sample $N$ category sequences as $D^{spt}_{y_m}$ and $D^{qry}_{y_m}$ for each sampled city. We then calculate the training loss on $D^{spt}_{y_m}$ and locally update $\theta$ by one step:
$$\theta'_{y_m} = \theta - \alpha \nabla_\theta \mathcal{L}\big(f_\theta, D^{spt}_{y_m}\big),$$
where $\mathcal{L}$ is the cross-entropy loss, $\alpha$ is the local learning rate, and $\theta'_{y_m}$ denotes the locally updated parameters of the recommender for each city.
- Global update: we calculate the testing loss on each $D^{qry}_{y_m}$ with the corresponding $\theta'_{y_m}$ and then update the initialization $\theta$ by one gradient step on the sum of testing losses across all cities:
$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{y_m} \mathcal{L}\big(f_{\theta'_{y_m}}, D^{qry}_{y_m}\big), \tag{2}$$
where $\beta$ is the global learning rate.
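The local and global updates can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: a linear softmax classifier stands in for the LSTM recommender, and the first-order approximation of MAML is used (the query gradient is taken at the adapted parameters and applied directly to the initialization, skipping second-order terms). The optional `task_weights` argument is a hypothetical hook showing where per-city scale factors could enter the global update; all names and the toy data are assumptions.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(theta, batch):
    """Mean cross-entropy of a linear softmax model and its gradient.

    theta: one weight row per class; batch: list of (features, label).
    """
    grad = [[0.0] * len(row) for row in theta]
    loss = 0.0
    for x, label in batch:
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in theta])
        loss -= math.log(probs[label])
        for k, row_grad in enumerate(grad):
            err = probs[k] - (1.0 if k == label else 0.0)
            for j, v in enumerate(x):
                row_grad[j] += err * v
    n = len(batch)
    return loss / n, [[g / n for g in row] for row in grad]


def sgd_step(theta, grad, lr):
    return [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(theta, grad)]


def meta_iteration(theta, tasks, alpha, beta, task_weights=None):
    """One first-order MAML iteration; tasks is a list of (support, query)."""
    weights = [1.0] * len(tasks) if task_weights is None else task_weights
    meta_grad = [[0.0] * len(row) for row in theta]
    for (support, query), w in zip(tasks, weights):
        # Local update: one gradient step on the support set.
        _, g_spt = loss_and_grad(theta, support)
        theta_local = sgd_step(theta, g_spt, alpha)
        # Query gradient at the adapted parameters (first-order approximation).
        _, g_qry = loss_and_grad(theta_local, query)
        for acc_row, g_row in zip(meta_grad, g_qry):
            for j, g in enumerate(g_row):
                acc_row[j] += w * g
    # Global update: one step on the (weighted) sum of query gradients.
    return sgd_step(theta, meta_grad, beta)


def mean_query_loss(theta, tasks):
    return sum(loss_and_grad(theta, query)[0] for _, query in tasks) / len(tasks)


# Two toy tasks ("cities") with 2-D features and 2 classes (hypothetical data).
tasks = [
    ([([1.0, 0.2], 1), ([-1.0, 0.1], 0)], [([0.8, -0.3], 1), ([-0.9, 0.4], 0)]),
    ([([1.2, -0.1], 1), ([-0.7, -0.2], 0)], [([0.6, 0.3], 1), ([-1.1, 0.0], 0)]),
]
theta = [[0.0, 0.0], [0.0, 0.0]]
initial_loss = mean_query_loss(theta, tasks)
for _ in range(50):
    theta = meta_iteration(theta, tasks, alpha=0.1, beta=0.1)
final_loss = mean_query_loss(theta, tasks)
print(initial_loss, final_loss)
```

The full second-order MAML gradient would differentiate through the local step as well; automatic-differentiation frameworks make that practical, whereas the first-order variant keeps this sketch short.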
Correlation Strategy. From the data analysis in Section 3, we observe that cities are correlated to varying degrees w.r.t. different aspects. Directly transferring user check-in behaviors from auxiliary cities to the target city may introduce noise and thus hurt the recommendation performance. Following the principle of "paying more attention to more correlated knowledge", we further consider the correlation of category-level behavioral patterns in different cities when conducting the global update. To be specific, we obtain the city-level correlation (e.g., $\gamma_{cor}$) based on behavioral patterns, and then attentively adapt the gradient across cities by employing their correlations. In other words, if an auxiliary city is more correlated with the target city, we adapt the gradient so that it updates faster in that direction. Therefore, the global update in Eq.(2) is reformulated as:
$$\theta \leftarrow \theta - \beta \sum_{y_m} \gamma^{(y_m)}_{cor} \nabla_\theta \mathcal{L}\big(f_{\theta'_{y_m}}, D^{qry}_{y_m}\big),$$
where $\gamma^{(y_m)}_{cor}$ is the correlation between city $y_m$ and the target city.
Freezing Layers and Model Fine-Tuning. As shown in [19], a network with freezing layers and fine-tuning generalizes better than one trained directly on the target dataset. Therefore, after obtaining the well-trained category-level encoder for the target city (i.e., $LSTM^{tar}_{cat}$) by the meta-learning paradigm, we further fine-tune it. In doing so, we deliver a network that not only accommodates knowledge distilled from the auxiliary cities but also better adapts to the target city. Specifically, assuming $LSTM^{tar}_{cat}$ contains $L$ layers, we freeze its first $l$ ($1 \le l \le L$) layers and add $n$ layers after these $l$ layers. The newly constructed model is denoted by $\widetilde{LSTM}^{tar}_{cat}$, which is further fine-tuned via category sequences from the target city, i.e., $D^{(tar)}_{train}$. As such, the frozen layers help generate a network that better balances parameters between the auxiliary cities and the target city after fine-tuning. Accordingly, the hidden state $\tilde{h}^u_{t_k}$ of the category at $t_k$ is given by
$$\tilde{h}^u_{t_k} = \widetilde{LSTM}^{tar}_{cat}\big(e^C_{t_k}, \tilde{h}^u_{t_{k-1}}\big).$$
POI-level Encoder. It aims to model users' sequential check-in behaviors and the spatio-temporal context in the target city by using the LSTM model. As described in the Embedding Layer, the embedding of a POI sequence is represented by $E^P_{u,i} = [e^P_{t_1}, e^P_{t_2}, \dots, e^P_{t_n}]$, where each embedding $e^P_{t_k}$ is fed into $LSTM^{tar}_{poi}$ to infer the hidden state $h^u_{t_k}$ of the POI check-in at $t_k$, given by
$$h^u_{t_k} = LSTM^{tar}_{poi}\big(e^P_{t_k}, h^u_{t_{k-1}}\big).$$
City-specific Decoder. The city-specific decoder aims to perform the next POI prediction based on the last hidden states learned from the two-channel encoder (i.e., $\tilde{h}^u_{t_n}$ and $h^u_{t_n}$). Accordingly, the probability distribution over all candidate POIs is calculated by the softmax function:
$$\hat{y} = \mathrm{softmax}\big(f([\tilde{h}^u_{t_n}; h^u_{t_n}])\big),$$
where $f$ is a fully connected layer that transforms $[\tilde{h}^u_{t_n}; h^u_{t_n}]$ into a $|P|$-dimensional vector, and $|P|$ is the number of POIs in the target city. Hence, the objective function for the next POI recommendation is defined by
$$\mathcal{L}_{rec} = -\sum_{i=1}^{|P|} y_i \log \hat{y}_i,$$
where $y$ is a one-hot embedding of the ground-truth POI. Algo. 1 shows the training process of MERec, consisting of meta training (lines 3-9), freezing layers and model fine-tuning (lines 10-12), as well as next POI prediction (lines 13-14).
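The city-specific decoder step can be sketched in a few lines: concatenate the two last hidden states, apply a fully connected layer, take a softmax over candidate POIs, and score the result with cross-entropy against the ground-truth POI. The function names, the toy hidden states, and the weight matrix below are hypothetical; in practice the hidden states would come from the two LSTM encoders and the layer would be learned.

```python
import math


def decode_next_poi(h_cat, h_poi, weight, bias):
    """Return a probability for every candidate POI.

    h_cat, h_poi: last hidden states of the category- and POI-level encoders.
    weight, bias: parameters of the fully connected layer (one row per POI).
    """
    h = list(h_cat) + list(h_poi)  # concatenation of the two channels
    logits = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weight, bias)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target, i.e. -log p(ground truth)."""
    return -math.log(probs[target_index])


# Toy example with 3 candidate POIs and 2-dimensional hidden states (hypothetical).
h_cat = [0.2, -0.1]
h_poi = [0.5, 0.3]
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.0, 0.0]
probs = decode_next_poi(h_cat, h_poi, weight, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted, cross_entropy(probs, predicted))
```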

Experiments and Results
We conduct experiments to answer three research questions: (RQ1) does MERec outperform state-of-the-art baselines? (RQ2) how do different components of MERec affect its performance? (RQ3) how do essential hyper-parameters affect MERec? The code is available at https://github.com/oli-wang/MERec.

Algorithm 1: The training process of MERec
Input: $D_{train}$, $Y_A$, $Y_T$, $\alpha$, $\beta$, Iter, $N$, $l$, $n$
Output: A list of recommended next POIs
1 Randomly initialize parameters $\theta$;
2 Calculate the correlation of behavioral patterns at category level;
3 for (iter = 1; iter ≤ Iter; iter++) do

Datasets and Evaluation Metrics. The four datasets shown in Table 1 are used in our experiments, where we take one of the cities as the target city and the rest as auxiliary cities each time. Following [8], we chronologically divide the dataset of the target city into training, validation, and test sets with a ratio of 8:1:1. Note that we remove users and POIs with fewer than five and three check-ins, respectively. Two commonly used metrics, i.e., HR@K and NDCG@K, are adopted following [1], where the former measures whether the ground-truth POI can be found in the top-K recommendation list, and the latter measures the ranking quality of the ground-truth POI in the recommendation list.
Compared Baselines. We compare MERec with seven state-of-the-art approaches. (1) MostPop recommends the next POI based on the popularity of POIs; (2) BPRMF is a matrix factorization method optimized via Bayesian personalized ranking; (3) NeuMF [7] generalizes matrix factorization by employing a multi-layer perceptron to model the user-item interactions; (4) ATST-LSTM [8] is an attention-based LSTM method that considers spatio-temporal contextual information; (5) iMTL [20] is a multi-task learning framework for next POI recommendation, which consists of a two-channel encoder and a task-specific decoder; (6) MAML [6] is a model-agnostic meta-learning method for few-shot learning tasks; (7) CHAML [1] is a meta-learning based framework for next POI recommendation, which considers both city- and user-level hardness during meta training.
Hyper-parameter Settings. The optimal hyper-parameter settings for all methods are found empirically based on the performance on the validation set. Specifically, the embedding size is searched from {32, 64, 128, 256}. For baselines (2)-(5), the learning rate is selected from {0.1, 0.05, 0.01, 0.005, 0.001, 0.0001}, and the batch size is set to 256. For the meta-learning based baselines (6)-(7) and MERec, the learning rates $\alpha$ and $\beta$ are searched from {0.5, 0.1, 0.01, 0.001, 0.0001}, and the batch size is set to 256 for a fair comparison. For MERec, the number of freezing layers $l$ is searched in the range [1, 4] with a step of one, where the best setting is 3 for all cities; and Iter = 500, $N$ = 32, $n$ = 2 across all cities.
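The two evaluation metrics can be written compactly for the next POI setting, where each test case has a single ground-truth POI. The sketch below uses the common single-ground-truth form, in which the ideal DCG equals 1 and NDCG@K reduces to 1 / log2(rank + 1); whether the evaluation in [1] uses exactly this form is an assumption, and the function names and toy rankings are hypothetical.

```python
import math


def hit_rate_at_k(ranked_pois, ground_truth, k):
    """1.0 if the ground-truth POI appears in the top-k list, else 0.0."""
    return 1.0 if ground_truth in ranked_pois[:k] else 0.0


def ndcg_at_k(ranked_pois, ground_truth, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1), or 0.0 on a miss."""
    top_k = ranked_pois[:k]
    if ground_truth not in top_k:
        return 0.0
    rank = top_k.index(ground_truth) + 1  # 1-based position in the list
    return 1.0 / math.log2(rank + 1)


def evaluate(predictions, k):
    """Average HR@k and NDCG@k over (ranked_pois, ground_truth) pairs."""
    n = len(predictions)
    hr = sum(hit_rate_at_k(r, g, k) for r, g in predictions) / n
    ndcg = sum(ndcg_at_k(r, g, k) for r, g in predictions) / n
    return hr, ndcg


# Toy predictions (hypothetical): ranked candidate lists and the true next POI.
predictions = [
    (["a", "b", "c"], "b"),  # hit at rank 2
    (["c", "a", "b"], "b"),  # miss for k = 2
    (["a", "c", "d"], "b"),  # miss
]
hr, ndcg = evaluate(predictions, k=2)
print(hr, ndcg)
```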
Performance Comparison (RQ1). The results are presented in Table 2. Across the four datasets, the traditional methods (MostPop, BPRMF) generally perform worse than the deep learning methods (NeuMF, ATST-LSTM, iMTL), demonstrating the efficacy of neural networks for more accurate recommendation. RNN based methods (ATST-LSTM, iMTL) outperform NeuMF, which indicates the capability of RNNs in modeling sequential dependency. iMTL outperforms ATST-LSTM, as it leverages a multi-task learning (MTL) framework to jointly learn user preference on both categories and POIs, exhibiting the benefit of MTL for next POI recommendation. Meta-learning based methods (MAML, CHAML, MERec) bring further improvement over the other methods, showcasing the efficacy of knowledge transfer in alleviating the data sparsity issue. Overall, MERec consistently achieves the best performance across all the datasets, with an average lift of 6.3% and 6.53% w.r.t. HR and NDCG, respectively. This helps confirm the benefits of (1) leveraging check-ins of auxiliary cities to augment the target city, and (2) paying more attention to more correlated knowledge when transferring knowledge from auxiliary cities.
Ablation Study (RQ2). We compare MERec with four variants: MERec w/o cor (without the correlation strategy), MERec w/o frz (without freezing layers and fine-tuning), MERec w/o cor-frz (without both), and MERec w/o cat, which removes the category-level encoder and only retains the POI-level encoder. The results are shown in Fig. 4. We note that MERec w/o cor-frz performs worse than both MERec w/o cor and MERec w/o frz, suggesting that the correlation strategy, the freezing layers, and the fine-tuning operation all improve the recommendation performance. Generally, the performance decrease of MERec w/o frz far exceeds that of MERec w/o cor, implying that the freezing layers and fine-tuning operation play a more important role than the correlation strategy. Besides, MERec w/o cat underperforms MERec, which helps verify the advantages of both the meta-learning paradigm with auxiliary check-ins and the correlation strategy.
Parameter Sensitivity Analysis (RQ3). We study the influence of two essential hyper-parameters, i.e., the number of local-update steps in Eq.(2) and the number of freezing layers. Fig. 5 only reports the results on the CAL dataset; similar trends can be observed on the other three datasets. Figs. 5(a-b) depict the model performance w.r.t. the number of local-update steps. We find empirically that updating for only one step is sufficient to obtain better recommendation accuracy, which also improves the model efficiency. Figs. 5(c-d) display the influence of the number of frozen layers on the model performance. As observed, as the number of frozen layers increases, the performance first goes up and then drops slightly. The best setting for the number of freezing layers is 3 on the four datasets.

Conclusion
In this paper, we propose a Meta-learning Enhanced next POI Recommendation (MERec) framework, which leverages check-ins from auxiliary cities to augment the target city, following the principle of "paying more attention to more correlated knowledge". In particular, we devise a two-channel encoder to capture the transition patterns of categories and POIs, whereby a city-correlation based strategy is devised to attentively capture common knowledge (i.e., patterns) from auxiliary cities via the meta-learning paradigm. The city-specific decoder then concatenates the latent representations of the two-channel encoder to perform the next POI prediction for the target city. Extensive experiments on four real-world datasets demonstrate the superiority of our proposed MERec.