Expressive Latent Feature Modelling for Explainable Matrix Factorisation-based Recommender Systems

The traditional matrix factorisation (MF)-based recommender system methods, despite their success in making the recommendation, lack explainable recommendations as the produced latent features are meaningless and cannot explain the recommendation. This article introduces an MF-based explainable recommender system framework that utilises the user-item rating data and the available item information to model meaningful user and item latent features. These features are exploited to enhance the rating prediction accuracy and the recommendation explainability. Our proposed feature-based explainable recommender system framework utilises these meaningful user and item latent features to explain the recommendation without relying on private or outer data. The recommendations are explained to the user using text message and bar chart. Our proposed model has been evaluated in terms of the rating prediction accuracy and the reasonableness of the explanation using six real-world benchmark datasets for movies, books, video games, and fashion recommendation systems. The results show that the proposed model can produce accurate explainable recommendations.

Expressive Latent Feature Modelling 20:3 CF method that is able to provide accurate and explainable recommendation utilising only local data that are available in the same domain without incorporating any outer data or private data (e.g., user personal data). The raw data of RS usually comprises user-item rating data, with a possibility of providing some item side information in some datasets. We consider these data, if they are available, to be local data, as they exist in the same domain. To this end, we propose an MF-based explainable recommender system model that utilises only the available local data for providing accurate and explainable recommendations. Different to the existing ER methods, our proposed method is able to provide accurate and explainable recommendations without relying on outer or private data. To the best of our knowledge, the existing ER methods fails to provide explanations when the only available data are user-item rating data. In contrast, our proposed method can handle datasets with only user-item rating data and is able to provide accurate and explainable recommendations even in the case of the absence of item side information. We introduce WSLER: a latent feature modelling method that provides meaningful latent features for users and items, which enhances the accuracy of the MF model and produces reasonable explanations of the recommendations. In this research, we extend our previous work [2] to incorporate the users and items features for explainable recommendations.
In previous work [2], we introduced WAFE: a Weighted Average-based Feature Extraction method that extracts useful users and items features, and SVD-LM (SVD with Local Mean), which incorporates the local means of the users' and items' rating in the prediction model. We also introduced WSL, which utilises WAFE for latent features modelling and SVD-LM for solving the rating prediction problem. In this article, we propose an MF-based explainable recommendation model that extends WSL to utilise user and item features for explaining the recommendation. The main contributions of this article are to: • propose a feature-based explainable matrix factorisation-based recommender system model that provides accurate and reasonable recommendation explanation, • introduce an explanation representation method that adopts both text and bar chart to represent the recommendation explanation to the user, • evaluate our proposed method using six real-world benchmark datasets, and • compare the performance of our proposed methods with five related works using the relevant dataset.
The remaining of this article is organised as follows: Section 2 presents the literature review, and Section 3 presents the SVD foundation. Our proposed WSLER framework is presented in Section 4 followed by its three components that are explained in Sections 5, 6, and 7. The evaluation of our method is explained in Section 8, and we discuss the results in Section 9. The work is concluded in Section 10.

LITERATURE REVIEW
The recognition of the importance of recommendation explanation started in the late '90s. For instance, in 1999, Schafer et al. [28] pointed out that the use of recommender systems should include an explanation of what the user has been recommended, which would increase the user's satisfaction. In addition, in the work of Herlocker et al. [10], the authors emphasised that the chance of users to accept the recommendation produced by collaborative filtering would be increased with the explanation of the recommendation. In their study, they concluded that the most convincing component is to explain to the user why they have been recommended with the suggested recommendation. However, they only examine the effectiveness of the recommendation explanation in terms of the promotion aspect. Similarly, in 2002, Sinha and Swearingen [30] suggested that one of the main aspects of increasing the recommender system reliability to the user is to explain why the recommendation is performed. Besides, Bilgic and Mooney [3] investigated the effectiveness of the recommendation explanation in the users' opinion about the recommended item. Both References [3,30] considered the users satisfaction aspect to evaluate the recommendation explanation. However, the formal introduction to the term explainable recommendation has been introduced recently [1,12,34].
The two main approaches of recommender systems, collaborative filtering (CF) and contentbased (CB), both have received remarkable attention by researchers in the early recommender systems methods [25]. While CB approaches, which mainly rely on the item's content information, can straightforwardly produce explanation to justify the recommendation to the user, it would be more time-consuming to collect such information especially if they do not exist in the same domain (outer data) [33]. However, models that adopt the approach of CF [6] seem to overcome this limitation, as they rely on analysing users and items interaction history without involving outer data. However, the ability to justify the recommendation to the user seems to be more difficult in CF-based methods compared to CB-based methods [33].
CF-based recommendation methods have shown remarkable success, especially when integrating with latent factor models [13], which increase the success of collaborative filtering methods even more. Latent factor models [27] have been widely adopted in the recommender system application and research. For example, methods such as NMF [17], SVD, SVD++ [13], and PMF [21] have received a great attention in the literature. Nevertheless, latent factors produced by these models are meaningless and would not be useful to explain why a recommendation is performed. The researchers of recommender systems, as a result, have become more interested in Explainable Recommendation Systems (ERS). In ERS methods, the main goal is not only to provide an accurate recommendation to the users but also to explain why an item is recommended to the user. For instance, in the work of Reference [34], to provide an explainable recommendation, they propose an Explicit Factor Model (EFM) that joins the latent dimensions with explicit features.
Vig et al. [32] adopted movie tags as features to generate recommendations and explanations. To explain the recommended movie, the system displays the movie features and tells the user why each feature is relevant to her. One limitation of this method is that as the features extracted based on tags that are produced by users themselves, they may not always describe the item correctly (tag quality). However, in our model, we extract items' features from the exact domain knowledge, which typically describe them.
Zhao et al. [36] represented products and users in the same demographic feature space and used the weights of the features learned by a ranking function to explain the results. The limitation of this method is that the user's features are extracted from their profiles at social networks. Item's features are extracted from users' reviews and profiles in the social networks (privacy and crossdomain issues). In our method, we adopt only information in the same domain without adopting any user's personal information. Reference [35] further explored demographic information in the social media environment for product recommendation with feature-based explanations.
Abdollahi and Nasraoui [1] adopted the k-nearest neighbour approach for producing an explainability matrix representing the weight (score) of each rating to be explainable. Entities in the explainability matrix are involved in the rating prediction task using the MF concept. Since the explainability matrix is calculated based on the concept of k-NN, its produced values are limited to the available ratings in the nearest users/items neighbours. In our method, we calculate the explainability score based on the user's rating history without relying on their neighbours.
Hou et al. [12] used radar charts to explain why an item is recommended and why others are not. In their model, they extract aspects from user's text reviews, adopt them as features, and then link all items and users to these aspects. However, features are extracted from text reviews, which have less availability than ratings, and requires more computational cost for text mining. In our method, we extract features only form user-item rating data and provide meaningful factors that can be used for explaining the recommendation.
In summary, the existing proposed explainable recommended system approaches mainly adopt information such as users' text reviews, users' demographic information, and items' tags. However, this information's adoption suffers from three limitations: (i) user privacy, (ii) cross-domain access difficulty, and (iii) high analysis cost. In our method, we adopt only data that are not private, exist in the same domain, and are easy to collect and analyse.

SINGULAR VALUE DECOMPOSITION (SVD) FOUNDATION
Singular Value Decomposition (SVD) is one of the most successful MF methods used in the literature for recommender systems. In MF recommender system models, where the user-item rating data are stored in the rating matrix R ∈ R n×m , where n and m represent the number of users and items, respectively, joint latent factors of size d are defined, so all users and items are assigned to d latent features. Assuming that U is a set of n users and I is a set of m items, users' latent features are stored in P ∈ R n×d and all items' latent features are stored in Q ∈ R m×d . This is practically performed by factorising the high-rank rating matrix R into two lower rank matrices P and Q, in which R ≈R = PQ T , whereR ∈ R n×m is the estimated values of R.
Traditionally, the SVD method relies on defining the dimension of latent factors d and initialising the entries of P and Q using random values. These initial values are updated throughout a learning process until achieving their optimal values that can estimate the entries ofR using Equation (1): wherer ui is the estimated rating given by user u to the item i, p u is the vector of user u latent features, q i is the vector of item i latent features, μ is the overall global average rating, b u and b i are the biases of user u and item i, respectively. This biased SVD model was proposed by Reference [13]. Typically, SVD method aims to minimise the rating prediction error by solving an optimisation problem in Equation (2): where κ is the set of the (u, i) pairs for all known r ui and λ is a constant that controls the extent of the regularisation.
Despite that SVD has received much attention in the literature showing a reasonable success, less attention has been given to the starting point of defining the latent factors and initialising user and item feature values. The traditional method randomly defines the number of latent factors d and initialises the values of user features matrix P and item features matrix Q with random values. However, this results in three limitations: (1) The random definition of d requires more effort to search its optimal value, as we need to try different values until achieving the best accuracy of the model. (2) Initialising P and Q with random values may lead to get the search stuck in a local minima.
(3) These latent factors are meaningless, so they cannot be used for explaining the resulting recommendation.
Thus, our proposed WSLER framework resolves these limitations by extending the SVD model to exploit raw data to improve the performance of the SVD model in terms of prediction accuracy and recommendation explainability.

WSLER FRAMEWORK
We introduce a new framework called WSLER, an SVD-based recommendation framework that extends the SVD to provide more accurate and explainable recommendations. It improves the mechanism of defining the latent factors, initialising the user and item features values, and improving the rating prediction model. It also introduces explanation criteria for explaining the recommendation.
Our proposed WSLER framework comprises three components, as follows: (1) WAFE component: WAFE is a feature engineering method that utilises raw data to produce meaningful user and item features. The engineered features are used to define the latent factors and items' and users' feature values (Figure 1(a)). The inputs of this component are the raw data, and it produces item and user features as outputs. This component is explained in Section 5. (2) WSL component: WSL extends the SVD method to meaningfully define latent factors and initialise the values of user and item features and improve the rating prediction model (Figure 1(b)). This component takes the item and user features produced by WAFE as inputs and produces the estimated rating matrix as outputs. We will explain this component in Section 6. (3) ER component: ER is an explainable recommendation model that utilises WSL for producing explainable recommendation (Figure 1(c)). This component takes the outputs of the WAFE component (i.e., the item and user features) and the output of the WSL component (i.e., the estimated rating matrix) as inputs. These inputs are utilised to process the explainable recommendations as outputs. We will explain this component in Section 7.
These three components are explained in the following text.

THE WEIGHTED AVERAGE-BASED FEATURE ENGINEERING (WAFE) COMPONENT
In this method, we extract information about users that can reflect their interest in items (user's preferences) by analysing their behaviours in their rating history. We aim to link each user directly with the features of items that they have already rated to be able to predict how they will behave with new items that have similar features values. The mechanism of WAFE is based on two stages: (1) Item feature engineering: Manually extract an item feature using information available in the raw data. We assume that this step can be done differently, in which all the resulted features values are encoded into categorical values. (2) User feature Extraction: Extracting user features from each item feature. These extracted features will represent the user interest of each item feature (user preferences).

Item Features Engineering
In some datasets, the items are provided with some side information, such as item name, year of release, and type or category. These data, if they are available, will enrich the item features. However, in some domains, this kind of information is not provided. Nevertheless, in our model, we introduce some useful item information that are extracted from the user-item rating data, which are available in any recommender system dataset. In the following text, we describe how these two kinds of item features are modelled. Features from side information: We extract this kind of item feature straightforwardly from the raw data representing some feature concepts. Each feature concept is represented using some feature values. For example, in a movie recommender system, the year of release is one feature concept. This feature concept has a defined range of values, i.e., the range of movie release years. The values of this range are grouped into periods (e.g., each group represents a period of 10 years). Recall that a movie can be assigned to only one year; this feature concept can be dealt with as a categorical variable, in which it is encoded using representation methods such as one-hot-encoding [15]. In this kind of encoding, if there are д categories for a feature concept, then д features are used to represent this feature concept, in which one feature is assigned to 1 representing that the item is corresponding to that feature and the rest will be zeros.
Another feature concept of a movie recommender system is the genre of the movie. Assuming that there are three possible genres that a movie can be categorised with, three values can be used to represent the feature concept of the movie's genre, i.e., дenre 1 , дenre 2 , and дenre 3 . Though, different from the feature concept "year of release, " the movie can be categorised with more than one genre. Hence, this is a special case of the categorical variable with a possibility of a movie to be corresponding to more than one genre. To reflect the degree to which a particular movie belongs to a certain genre, we introduced weights, which were spread evenly across all relevant genres. For example, if movie 1 corresponds to дenre 1 and дenre 3 , the values of its genre features дenre 1 , дenre 2 , and дenre 3 will be 0.5, 0, 0.5, respectively. Meanwhile, if movie 2 corresponds to only дenre 1 , then the values of its genre features дenre 1 , дenre 2 , and дenre 3 will be 1, 0, 0, respectively. This information can be extracted directly from the item's description if they are available. However, in some domains, where the side information is not provided, some other information can be derived from the user-item rating data, such as the item number of ratings and the item ratings average. The former can represent the item's popularity, while the latter can represent the item's reputation. This kind of information is represented as continuance values. However, they can be transformed into ordinal variables, e.g., converting them into levels (1 to 10) in which 1 refers to the low level and 10 refers to the high level. Hence, this kind of information can be easily represented using one-hot-encoding, as explained above, resulting in representing each feature using 10 columns (i.e., popularity1, popularity2, . . . ,popularity10 and reputation1, reputation2, . . . , reputation10). More information about feature engineering can be found in References [15,37].
Following this methodology for all available item information, we will end up with a set of item features V of size K. For each feature v j ∈ V , the value v i j , which is the value of feature v j for the item i is represented in a value between 0 and 1. This value represents the extent to which the item i belongs to the feature v j , i.e., 1 means that the item i fully corresponds to the feature v j and 0 means that the item is not corresponding to the feature.

User Features Extraction
The extracted features (as described in Section 5.1) can be considered to be the latent (joint) factors. Thus, we define d = K latent factors by extracting all possible item features following the same methodology. Each item feature v j ∈ V is considered to be a latent factor, and the set of item features V are the set of latent factors. As the items have already been linked with these factors (by their features values), similarly, user features need to be extracted so they can be linked to the defined factors. The following text is to explain how we extract the user features: For each defined latent factor v j ∈ V , we calculate the weighted average of the user u's ratings for all items that belong to the factor v j (i.e., items with v i j > 0 ). Thus, the value of the user latent feature v u j can be calculated using Equation (3): where R u is the set of all ratings assigned by user u, R v j u is the set of ratings assigned by user u to all items with v j > 0, and w u is the weight of the user's rating, which can be calculated using Equation (4) as follows: The resulting user feature values will represent the extent to which a user is interested in each feature. As the user's feature value is calculated based on their weighted average ratings, these values will be in the rating range depending on the domain. Therefore, all users will be linked to each feature with their values, in which a user feature of high value means that the user is interested in the feature and vice versa.
Applying this methodology for all users and items in the R, we can end up with a reasonably defined number of latent factors d along with meaningful values of item features and user features.

THE WAFE-BASED SVD-WITH LOCAL MEAN (WSL) COMPONENT
We propose WSL, a method that extends the SVD method in two aspects as follows: (1) WAFE-SVD: Instead of the random setting of the number of latent factors and the random initialisation of the user and item feature values (as it is adopted in SVD), WAFE-SVD uses WAFE for defining meaningful latent factors and assigning user and item features with reasonable initial values. This is explained in more detail in Section 6.1. (2) SVD-LM: This extension improves the prediction model of the SVD by introducing two new predictors. This introduction incorporates the local mean of user and item ratings for formalising the bias in addition to the global (overall) rating mean. This is explained in more detail in Section 6.2.
WSL combines these two SVD extension methods to leverage their advantages and enhance the prediction accuracy. The combination mechanism is explained in Section 6.3.

WAFE-based SVD (WAFE-SVD)
WAFE-SVD extends the SVD model to utilise WAFE to define meaningful latent factors and initialise user and item feature values reasonably. In contrast to the SVD model, instead of randomly setting the number of latent factors d, in WAFE-SVD, d is assigned to the number of the item/user features produced by WAFE. Besides, WAFE-SVD assigns the initial values of user feature matrix P and item features matrix Q with the values of user and item features that extracted by WAFE instead of initialising them randomly. This step improves the rating prediction accuracy and decreases the required number of iterations to achieve the optimal values of P and Q.
Users and items latent factors are defined to represent the joint intersection between items and users. For example, in a movie recommendation dataset, suppose that one latent factor represents action feature. A user and an item are linked with this factor by a feature value (i.e., a real number) representing the extent to which this movie belongs to action movies and how much the user likes action movies. Based on the same concept, WAFE is utilised for modelling latent (joint) features for items and users.
As explained in Section 5, WAFE produces the same number of features for both items and users, which is K. The values of the user and item features represent the extent to which each user and item is linked to each feature. These features are considered to be the latent factors, in which the number of the extracted features K is assigned to be the number of latent factors d and the initial values of P and Q are assigned with the user and item feature values. As a result, the latent factors here are assigned with meaningful information that describes the factor focus feature (e.g., action). Applying this methodology, we will end up with meaningfully defined latent factors and reasonably initialised values of item features Q and user features P. This initialisation reduces the initial prediction error, which will enhance the performance of the model.

SVD with Local Mean (SVD-LM)
The prediction model in Equation (1) can achieve a reasonable prediction accuracy; however, it only adopts the overall global rating mean and ignores the local rating mean for each user and item. The user rating means can reflect the user rating behaviour, while the mean of the ratings received by each item can reflect the item reputation. Thus, we incorporate these two values in our prediction model as it is shown in Equation (5): where μ u and μ i are the local rating means of user u and item i, respectively. The factor ( 1 3 ) is applied to average the resulting value of adding μ + b u + b i + p T u q i and μ u and μ i . The estimated value produced by the SVD prediction model (i.e., μ + b u + b i + p T u q i ) is supposed to be a value in the rating range (i.e., a value between the maximum and the minimum rating). However, an out-of-range value may sometimes be generated, especially in the early iterations. However, the values of μ u and μ i are always in the rating range. Thus, we adopt μ u and μ i as a kind of bounding the estimated rating value to be in the rating range by averaging these three values (the estimated value produced by the SVD prediction model, μ u and μ i ) by dividing them by the constant 3. As a result, if the estimated value produced by the SVD prediction model is higher or lower than the rating range, the adoption of μ u and μ i will decrease or increase this value to be in the rating range.
These newly introduced terms (i.e., μ u and μ i ) differ from the two biases terms (i.e., b u and b i ), as they refer to the actual rating average values of the user and the item, while b u and b i are initialised randomly and updated during the learning process.
The SVD-LM model gets the initial values of P and Q as inputs and applies the standard SVD methodology to learn the optimal values of P and Q. However, the main difference here is that SVD-LM incorporates the local mean of the users' and items' ratings to enhance the rating prediction. The user ratings local mean μ u and the item ratings local mean μ i can be extracted directly from the user-item rating matrix R.

The WSL Algorithm
WSL algorithm combines WAFE-SVD and SVD-LM, which extend the SVD model differently, to leverage their advantages and enhance the prediction accuracy. The main advantage of WAFE-SVD is to define meaningful latent factors and initialise users and items features with meaningful values representing the extent to which items and users are involved in each factor. Besides, SVD-LM has the advantage of improving the prediction model by introducing two new predictors (i.e., the local mean of user ratings and the local mean of item ratings). Thus, these two extension methods (i.e., WAFE-SVD and SVD-LM) are combined reasonably in WSL to exploit their advantages. Algorithm 1 describes how WSL fits the two components WAFE-SVD and SVD-LM together.

ALGORITHM 1: WSL
Input: Rating matrix R, item features, learning rate γ , regularisation constant λ, and number of iteration epochs. Output: The estimated rating matrixR 1: Extract item and user features using WAFE 2: d ← K 3: P ← user features 4: Q ← item features 5: initialise Bu, Bi 6: μ ← overall rating average 7: Mu ← users' rating averages vector 8: Mi ← items' rating averages vector 9: for all known r ui do 10: calculater ui using Equation (5) 11: end for 12: calculate RMSE, using Equation (7) 13: RMSE min ← RMSE 14: for epochs do 15: for all known r ui do 16: calculater ui using Equation (5) 17: end for 23: calculate RMSE, using Equation (7) 24: RMSE cur r ent ← RMSE 25: if RMSE cur r ent < RMSE min then 26: RMSE min = RMSE cur r ent 27: Q opt , P opt , Bu opt , Bi opt ← P, Q, Bu, Bi 28: end if 29: end for 30: calculateR for Q opt , P opt , Bu opt , Bi opt WSL starts with utilising WAFE to extract item and user features and assign P and Q to user and item features values. However, users' biases (Bu) and items' biases (Bi) are initialised with random values. Then, the local mean of users (Mu) and items (Mi) ratings are calculated along with the global (overall) rating average (mu). Then, using all known ratings in R, the estimated ratings are calculated using Equation (5) to get the initial RMSE. Then, the estimation step is repeated until we get the optimal RMSE taking into consideration the predefined number of iteration (epochs). Then, P, Q, Bu, and Bi are assigned their optimal values at the iteration with the lowest RMSE. Finally, all entities inR are estimated using Equation (5), taking into consideration the optimal values of P, Q, Bu, and Bi.
In the next section, we describe our proposed explainable recommendation model.

THE EXPLAINABLE RECOMMENDATION (ER) COMPONENT
In this model, the explanation of each user's provided recommendations is generated based on the relationship between the user and the recommended item. More precisely, ER relies on analysing the correlation between the user and each feature of the recommended item's features. Then, a recommendation is justified by extracting the extent to which the user is interested in the item's features. In other words, ER explains the recommendation based on the intersection of the item's features and the user interesting features. This will convince the user of the recommended item.
Recall from Section 6, WSL produces meaningful users and items features that are used to define the latent factors and represent user and item features values P and Q. These meaningful factors and features values are helpful for making the recommendation results interpretable. This motivates us to use these meaningful features to explain the recommendation. The produced feature values by WSL represent the extent to which the item and the user are involved in each latent factor. For example, in a movie recommender system, in which one of the defined latent factors is action, the user u and the item i are assigned with a value that represents the extent to which u is interested in action movies and the extent to which i is belonging to be an action movie. Suppose that i is recommended to u. If the value of u's action feature and the value of i's action feature are high (i.e., closer to the maximum feature value than the minimum feature value), then this recommendation is justified as follows: "This movie has been recommended to you because we found that you are interested in action movies, which it is one of this movie's features. " In addition, ER presents the recommendation justification using a bar chart to show the intersection between the user and the item features. The following text is to explain in detail how this explanation is performed: First, we introduce the following thresholds: • ts r ec : the recommendation threshold for assigning whether an item is recommended to a user or not. • ts int : user's interest threshold, for assigning whether a user is interested in a feature or not.
• ts sim : the similarity threshold for assigning whether a user and an item are similar or not.
Then, we calculate the similarity sim u,i between u and i. To do so, the following notation is defined: • v pos i : the set of features that the item i is assigned with a positive value. We refer to this as item positive features.
• v int u : the set of features that user u is assigned with a value ≥ ts int . We refer to this as user interesting features.
• v u,i : the intersection features between item i positive features and user u interesting features all features (i.e., v u, The similarity sim u,i between user u and item i is calculated using Equation (6).
We applied the factor 1 2 to average the resulting value of adding r max will result in a value in the range (0,1), asr u,i ≤ r max . Therefore, since we are aiming to calculate the degree of how the user and the item are similar, we applied the factor 1 2 to average these two values for representing the similarity score. Considering the introduced similarity threshold ts sim , user u and item i are considered to be similar if their similarity score sim u,i ≥ ts sim . This reflects how strong the relationship between u and i is.
Therefore, for each recommendation case inR that is assigned withr ui ≥ ts r ec , this recommendation is justified based on the values of sim u,i and v u,i . If the user u and the item i are similar (i.e., sim u,i ≥ ts sim ), then we examine whether v u,i > 0?, which reflects whether there are intersection features between u and i or not. If so, then the recommendation is assigned to be explainable and it is explained by representing the intersection features between u and i.

EVALUATION
The evaluation of recommender systems can be performed using either an online or an offline experiment. In the online evaluation approach [32,34], the participants are provided with a list of recommendations and/or explanations; then, they give some feedback on what they have been recommended with (e.g., by answering some questions or taking a reaction on the recommendation). Then, their feedback is analysed to evaluate the model. However, the offline evaluation approach [1,12] is done on the basis of rating predictions. Online evaluation requires more resources than offline evaluation, e.g., accessing existing e-commerce websites and interacting with some of its active users. It also may require a long time for achieving a convincing result depending on the activity levels of those websites. In contrast, in the offline evaluation approach, the available data is split into training and testing sets. The evaluation is performed based on the results of running the trained model to predict the results of the test data. In this study, we use the offline method on six benchmark real-world datasets. The evaluation criteria are described in the following text.

Dataset
Six real-world datasets have been used as follows: three datasets form GroupLens 1 [7,38]  movie recommendations, in which the ratings are given in a 5-star rating scale (with half-star increments for ML10M). Demographic user information is also provided in Ml1M, such as age, gender, occupation, postcode, in addition to movies' side information, such as title, genre, and year of release. In ML10M, side information about movies is provided; however, no user information is given apart from user ID. Book-Crossing is a dataset for book recommendations, in which the ratings are given on a 10-star scale rating. Book-Crossing provides some item side information such as year of release. Amazon datasets, i.e., AVG, AIV, and AF, provide user ratings and reviews on products. The rating is given on a 5-star scale rating, and no side information is provided in Amazon datasets. As the datasets ML1M, ML10M, and BC provide some item side information, these are local data, as they are available in the same domain. Thus, we utilised these data for modelling item features along with some useful features that are extracted from the user-item rating data. However, as the datasets AVG, AIV, and AF do not provide any side information, we only adopt the features that are extracted from user-item rating data for modelling item features. The statistics of all datasets are shown in Table 1.

Baselines
We compare our method with a number of related methods in terms of rating prediction accuracy as follows: • SVD [22]: Singular Value Decomposition is a standard matrix factorisation method that adopts only user-item rating data. • SVD++ [13]: As an extension of SVD, SVD++ incorporates implicit feedback of users and items. • PMF [21]: Probabilistic Matrix Factorisation is a matrix factorisation method that improves the performance of the traditional matrix factorisation using only user-item rating data. • NMF [17]: Non-negative Matrix Factorisation is a matrix factorisation method that factorises the rating matrix into two non-negative matrices. • HFT [19]: Hidden Factors as Topics is a method that models reviews and ratings jointly.
It uses an exponential transformation function to perform a combination between latent factors vectors and topic description. • NeuMF [9]: Neural Matrix factorisation method that utilises multiple perception neural network architecture into matrix factorisation model to learn latent factors. • PrefCRBM [23]: Preference Relation-based Conditional Restricted Boltzmann Machine is a CF method that integrates the preference relationship and side information of items to produce recommendations using RBM. • EFM [34]: Explicit Factor Model is a method that adopts phrase-level sentiment analysis to model latent factors by integrating features of explicit and implicit feedback.
• AMF [12]: Aspect-based Matrix Factorisation is a method that analyses user's text reviews to derive aspects as features and link users and items through these aspects.

Evaluation Metrics
Two groups of evaluation metrics have been used in this study to evaluate our model in terms of (i) rating prediction and (ii) explanation performance.

Rating Prediction.
For rating prediction evaluation, we adopted root mean squared error (RMSE) and mean absolute error (MAE), which represent the prediction error of user ratings on unrated items. RMSE and MAE are calculated using Equations (7) and (8): where T is the number of examples in the test set,r t is the predicted rating of the tth test example, and r t is the actual rating of the tth test example.

Top-n Recommendation.
A further evaluation method for recommender systems is to perform a top-n recommendation. In this evaluation method, the system produces a list of recommendations for each user and selects the top n of them to be presented to the user. The evaluation of this method is performed using some relevant metrics. Mean Average Precision (MAP) is a widely adopted metric in this manner [23]. In our evaluation experiment, we adopted MAP@n, which is calculated using Equation (9) as follows: where, |T user | is the number of users in the test data, l u is the number of relevant items to the user u in the test data, and pre u @c is the precision of the user u at the position c, which is calculated using Equation (10) as follows: where tp@c is the number of true positive cases at position c, i.e., the number of relevant items appear in the top-c recommendation list. We conducted the experiment of this evaluation metric and compared the results to those reported by Reference [23].

Explanation Performance.
For explanation performance analysis, the evaluation is performed from two aspects, i.e., (i) overall satisfaction and (ii) user case studies, as follows: (a) Overall Satisfaction: for evaluating the overall user satisfaction, we adopt some classification evaluation metrics including accuracy, error rate, precision, recall, and F-measure to evaluate the performance of our explanation model as follows: In the test data, each test case is assigned whether to be a recommended case or not, based on the predicted rating considering the recommendation threshold (ts r ec ). The decision to be recommended is based on the predicted rating, which would be already defined in the previous step (rating prediction). Hence, we could convert each case in the test data to be either 1 for recommended case or 0 for unrecommended case. Besides, in our explanation method, the user and the item for each test case are assigned to be either similar or not, as explained in Section 7. This assignment is The ratio of the correctly predicted positive cases over the total positive cases.

Recall (rec) rec = tp tp+tn
The ratio of the correctly predicted positive cases over the positive predicted cases. The harmonic mean value between the values of precision and recall.
performed using our similarity calculation method, which can be used to examine whether the recommendation can be reasonably explained or not, i.e., 1 for similar user and item case and 0 otherwise. Hence, we can assign each case in the test data to one of the following: The confusion matrix of the explainable recommendation cases conducted by our method are illustrated in Table 2.
Thus, metrics such as, accuracy, error rate, precision, recall, and F-measure can be used to evaluate the method. The evaluation metrics, accuracy (acc), error rate (err ) precision (pre), recall (rec), F-measure (f me) are illustrated in Table 3.
We also define another evaluation metric called Positive Satisfaction Ratio (PSR) [12] as an evaluation metric for explanation evaluation. For each user u in the test data, we randomly select two items i 1 , i 2 in which one has high predicted rating, i.e., greater than or equal to the recommendation threshold (r u,i 1 ≤ ts r ec ) and the other has low predicted rating, i.e., less than the recommendation threshold (r u,i 2 < ts r ec ). Then, we calculate the similarity score for each case using Equation (6). Then, we examine whether or not the case with high rating prediction was assigned a higher similarity score than the case with low rating prediction. If so, then the case is assigned to be a positive satisfaction case. Applying this for all users in the test data, we then calculate the ratio of positive satisfaction cases over all test cases using the formula in Equation (11): where #positive_cases and #total_cases are the number of positive satisfaction cases and the number of users in the test data, respectively. (b) User Case Studies: Similar to Reference [12], we introduce user case studies to evaluate another aspect of the explanation performance for each dataset as follows: For each dataset, we randomly select a user u from the test data and study their relationship with the two items (i.e., i 1 and i 2 ) that have been selected in the previous step (i.e., PSR calculation step). We aim to investigate whether the relationship between the user u and the item i 1 (the item with high predicted rating) is stronger than the relationship between the user u and the item i 2 (the item with low predicted rating). To do so, we introduce a user-item intersection score (intSec u,i ), which is calculated using Equation (12): where an intersection score of 1 means that the user is interested in all of the item's features, and 0 means that the user is not interested in any of the item's features. We calculate the intersection score for u and i 1 (intSec u,i 1 ), and for u and i 2 (intSec u,i 2 ). Then, we compare the values of intSec u,i 1 and intSec u,i 2 to examine whether the u is interested in i 1 's features more than i 2 's features. If so, then we conclude that our recommendation for this case study is reasonably explained. We use a bar chart for showing the features in common between each user and item. We also present a summary table of the intersection scores for each user and item in each case study. This will clearly illustrate how the user is related to each item in each case study.

Parameters Setting
The parameters of this experiment include d, the number of latent factors; λ, the regularisation extent; γ , the step size (learning rate) and the number of epochs (iterations).
In an MF-based model, a well-known technique for defining the best value of d is to examine the method's performance when verifying the value of d. However, in WSL, since the users' and items' features will be already defined, the number of latent factors d is assigned the number of the extracted features. The number of the extracted features for all datasets are presented in Table 4. The evaluation was done for different values of the remaining parameters as follows: λ = [0.01, 0.02, . . . , 0.15] and γ = [0.001, 0.002, . . . , 0.01], and assigned them to their optimal values as illustrated in Table 4. The number of epochs is assigned to 100 with early stopping; we picked the best RMSE and MAE when achieved in an earlier iteration.

RESULTS AND ANALYSIS
We evaluate our proposed model in two aspects, i.e., (1) the rating prediction performance and (2) the explanation performance. For the former, we evaluate the performance of the recommendation by comparing the accuracy of the rating prediction of our method with the baselines. For the latter, similar to Reference [12], we evaluate our explanation performance by calculating the ratio of positive satisfaction for all users in the test data. Then, we compare the performance of our explanation model with the work of Reference [12]. More details are given in Section 9.3.

Rating Prediction Performance
The evaluation of the rating prediction has been performed in this study to examine the effectiveness of the components of our proposed method. The evaluation of performance of rating prediction is done in two aspects: (1) The effusiveness of WSL on the SVD method and (2) Comparison with baseline methods, using RMSE and MAE as evaluation metrics.

The Effusiveness of WSL on the SVD Method.
WSL extends the SVD method in two main aspects: (1) adopting WAFE for defining latent factors and initialising user and item feature values and (2) introducing two new predictors to be incorporated in the prediction model (user rating local mean μ u and item rating local mean μ i ), we run an ablation study to show the effectiveness of these two extensions by splitting WSL into the following sub-methods: (1) SU: This sub-method extends the SVD by adding only user rating local mean μ u to the prediction model. In SU, the prediction model is defined as in Equation (13): (2) SI: This sub-method extends the SVD by adding only item rating local mean μ i to the prediction model. In SI, the prediction model is defined as in Equation (14): (3) SVD-LM: This sub-method extends the SVD by adding both user rating local mean μ u and item rating local mean μ i to the prediction model. In SVD-LM, the prediction model is defined as in Equation (5).  These sub-methods have been implemented in addition to our main method WSL, which combines SVD-LM and WAFE-SVD. WSL extends the SVD by incorporating both user rating local mean and item rating local mean, and utilising WAFE for defining latent factors and initialising user and item feature values. WSL utilises the prediction model defined in Equation (5). The simulation for this evaluation is done using two datasets (i.e., ML-1M and ML-10M) running each method three times and taking the average of the achieved RMSE and MAE and the number of epochs needed for achieving the best RMSE and MAE. The results are shown in Table 5.
The results in Table 5 present the performance of the all defined sub-methods along with WSL and SVD by reporting the best RMSE and MAE achieved by each method and the number of epochs needed for achieving them. These results illustrate the effectiveness of our proposed SVD extension as follows: (1) The effectiveness of incorporating the local mean of user and item rating: Recall that our three methods, SU, SI, and SVD-LM, examine the effectiveness of incorporating the local mean of user and item rating; it is clear from the results that this incorporation has improved the rating prediction accuracy and reduced the required number of epochs for achieving the best accuracy. For the ML1M dataset, the RMSE achieved by SVD-LM (which combines SU and SI) is 0.8569 comparing to 0.8669 for SVD. Similarly, for the same dataset, the MAE achieved by SVD-LM is 0.6726 comparing to 0.677 for SVD, while the number of epochs needed for achieving the best RMSE was reduced to 55 for SVD-LM, comparing to 85 for SVD. In addition, for the ML10M dataset, SVD-LM achieved RMSE and MAE of 0.7929 and 0.6092, respectively, compared to 0.8063 and 0.6198 for the SVD method. Besides, the required epochs was reduced to 55 for achieving the best RMSE comparing to 81 for the SVD, and 62 for achieving the best MAE comparing to 87 for the SVD.  WSL also reduced the required epochs for achieving the best RMSE and MAE to 24 and 25, respectively, comparing to more that 80 for both metrics in SVD.

Comparison with Baseline Methods.
We compared the prediction accuracy performance achieved by our proposed method WSL with the baseline methods using two MovieLens datasets (ML1M and ML10M) and Amazon Video Games (AVG) dataset. The two datasets ML1M and ML10M were adopted for the comparison with SVD, SVD++, and PMF baseline methods, while the AVG dataset was used for the comparison with the methods NMF, HFT, EFM, and AMF. The results and the analysis of this experiment are presented in the following two bullet points.
(1) MovieLens datasets (ML1M and ML10M): Using these two datasets, we run the simulation for our proposed method WSL in addition to three baseline methods SVD, SVD++, and PMF. We evaluated the performance of our method (WSL) by comparing its best RMSE and MAE with those achieved by SVD, SVD++, and PMF. In addition, we compared the number of epochs needed for achieving the best accuracy for all methods. Table 6 presents the comparison of the RMSE and MAE achieved by all methods, and Figure 2 illustrates the number of epochs needed to achieve the best RMSE and MAE by all methods. The results in Table 6 show that our method WSL outperformed all the compared methods for both datasets (ML1M and ML10M). The results also show that WSL has significantly reduced the required epoch for achieving the best accuracy. WSL achieved the best RMSE and MAE in the first 25 epochs for both datasets, while all other methods needed at least 40 epochs to accomplish the best accuracy. For example, for the ML1M dataset, SVD needed 85 epochs to accomplish the best RMSE and 100 epochs to achieve the best MAE. Figure 2 illustrates a comparison of the best RMSE and MAE obtained by each method considering the number of epochs needed when running the experiment for ML1M and ML10M datasets.
(2) Amazon Video Games dataset: For the AVG dataset, we evaluate the performance of WSL by running the simulation using different data split criteria to compare its prediction accuracy with four baseline methods from the literature that are reported in Reference [12], namely, NMF, HFT, EFM, and AMF. Table 7 shows the results for the RMSE and MAE for the four baseline methods against the WSL method when verifying the size of the test data for the AVG dataset. It is clear from the results in Table 7 that our method WSL outperforms all baseline methods in all cases. Figure 3 shows a comparison of the RMSE and MAE for all compared methods when splitting the data into 80% for training and 20% for testing. It shows that our method WSL has achieved the lowest RMSE and MAE compared with the baseline methods.
In addition, we conducted a significance test using paired t-test for our method WSL and each of the baseline methods (i.e., NMF, HFT, EFM, and AMF) resulting in four paired t-test cases; each t-test for WSL and one of the baseline methods. In this experiment, we consider a null hypothesis and an alternative hypothesis as follows: • Null hypothesis: The performance of WSL and the considered baseline method are at the same level. • Alternative hypothesis: The performance of WSL is better than the considered baseline method.
We examined these two hypotheses by calculating the p-value for each t-test. The resulting p-values for each t-test are reported in Table 8.
The results of the paired t-test conducted for WSL and each of the baseline methods show that the achieved p-value for each case is within the range of rejecting the null hypothesis (i.e., p-value < 0.05). Thus, this evidence leads us to accept the alternative hypothesis and conclude that the improvement achieved by WSL is statistically significant.

Top-n Recommendation
We conducted this experiment to evaluate the WSL performance in terms of top-n recommendation. As explained in Section 8.3.2, we followed the experimental setting of Reference [23] in this experiment for the purpose of comparison. We adopted the evaluation metric MAP@n and split the data into 70% for training and 30% for testing. We conducted the experiment for the ML-1M dataset. The results are reported in Table 9. Similar to Reference [23], the n is assigned to 10. The results in Table 9 represent the performance of all compared methods in terms of MAP@10, i.e., the mean average precision at the top 10 recommendation list generated for all users. Table 9 reports the MAP@10 in different data sparsity as we verify the percentage of the data taken for training. The results show that our method, WSL, outperformed all baseline methods in all cases. Despite the fact that increasing the test size can enlarge the size of the related items in the test data for each user, it decreases the rating prediction accuracy. Recall that the items in the top-n recommendation are listed in descending order based on the predicted ratings for them with the user. Thus, the more accurate rating prediction the method achieves, the more relative items can be listed in the top-n recommendations. It is clear from the results that the prediction accuracy has a significant impact on the MAP@n measure for top-n recommendation accuracy.
In addition, we examined the effectiveness of the number of items in the top recommendation list, i.e., the value of n. In this examination, we conducted our simulation for calculating the MAP@n metric when verifying the value of n in the range (10, 20, . . . , 50). This experiment was performed for the SVD and WSL methods using the ML1M dataset, splitting the data into 70% for training and 30% for testing. The results are presented in Table 10.
The results in Table 10 show that the MAP@n score improves with higher value of n. This improvement is illustrated as expanding the size of the list of the top recommendations is more likely to increase the chance of including more relative items to the user. At the same time, the rating prediction accuracy remains at the same level, as we do not decrease the size of the training data. It should also be noted that the improvement of the MAP@n score decreases after the point of n = 20; as the difference between the achieved score and the previous score becomes smaller starting from the point of n = 30. This result is reasonable, as the number of uncovered relative items in the top-n list for each user is becoming smaller while increasing the value of n.

Explainable Recommendation
As explained in Section 8.3.3, we evaluated the explanation of our recommendations model in two ways, i.e., overall satisfaction and user case study.

Overall Satisfaction.
The overall explanation performance is evaluated in terms of the accuracy, errorrate, precision, recall, and F -measure metrics. Table 11 shows the results of these metrics running our experiment for the six datasets when splitting the data into 80% for training and 20% for testing and setting the parameters as specified in Table 4.
As shown in Table 11, the proposed method WSLER performs very well for the explainable recommendation for all datasets. The results show that the heights accuracy and the lowest error rate were achieved by WSLER for the BC dataset with 0.94 accuracy and 0.06 error rate. Besides, the highest F-measure score, which somehow concludes the precision and recall, was achieved by WSLER for the BC dataset with 0.96. It is also clear that the lowest accuracy achieved is 0.81, and the highest scored error rate is 0.19 for the AIV dataset. In addition, the lowest F-measure score is 0.88, which was accomplished for the AIV dataset.
In addition, we evaluated our explainable recommendation model WSLER by calculating the positive satisfaction ratio (PSR) for all users in the test data, as explained in Section 8.3.3. We applied this evaluation criteria for all datasets, and the results showed a good performance with a minimum PSR of 94% for the AIV dataset and 98% for the AF dataset as the highest PSR. Results are shown in Figure 4.   5. A comparison the Positive Satisfaction Ratio (PSR) achieved by our method WSLER and AMF [12] for AVG dataset (when assigning ts r ec = 4 similar to Reference [12]).
Moreover, we compared the performance of our explainable recommendation model WSLER with the AMF model proposed by Hou et al. [12] in terms of PSR. For the purpose of the comparison, we selected AVG dataset, as it is the only dataset adopted by Reference [12] among all datasets that we adopt in our study. Therefore, we set the recommendation threshold (ts r ec ) to 4 instead of 3, as similar to Reference [12]. Figure 5 shows a comparison of the PSR achieved by our model WSLER with the AMF model [12].
The results show that our proposed WSLER model significantly outperformed AMF model. WSLER achieved 92.95% PSR compared to 84.31% achieved by AMF method, as reported in their paper [12]. In this section, we present the result of the user case studies analysis for each dataset. As we adopted six datasets in this study, i.e., ML1M, ML10M, BC, AVG, AIV, and AF, we performed six case studies: one for each dataset. For each case study, we randomly selected one user with two items of their rated items, in which one is assigned a high predicted rating (recommended) and the other is given a low predicted rating (unrecommended). Then, we examine whether the user is more interested in the recommended item's features than the unrecommended item's features. We calculated the intersection score intSec u,i for the user and the recommended item and for the user and the unrecommended item. Figure 6 presents the features in common between each user and item for each case study, and Table 12 shows the comparison of the intersection score for the user and both items in each case study. The results are discussed in the following text.
• Case study 1 (ML1M): user 22, item383, and item570 were randomly selected, in which item383 is assigned a high predicted rating, while item570 is assigned a low predicted rating by the user user 22. We found that user 22 is interested in four out of five features of the recommended item item383, while user user 22 is interested in only one feature out of four of the features of the unrecommended item item570 (see Figure 6(a)). We also calculated the intersection score for user 22 with both item383 and item570, and we found that the intersection score for user 22 and item383 is 0.8 compared to only 0.25 for user 22 and item570 (see Table 12). The recommended item item383 corresponds to five features, i.e., drama, thriller, reputation9, popularity4, and year4. Out of these five features, the user is interested in the features drama, thriller, reputation9, and year4. This means that the user is interested in most of the item's features, so we can confidently recommend it. This recommendation can be simply justified as "We recommend this item to you as it has the features (drama, thriller, repu-tation9, and year4) which all are some of your interest. " In contrast, the unrecommended item item570 corresponds to four features, i.e., comedy, reputation5, popularity2, and year8. Out of these four features, the user user 22 is interested in only one feature, i.e., comedy. • Case study 2 (ML10M): user 22 was randomly selected along with item184 and item19, in which item184 has been assigned with a higher predicted rating than item19. The recommended item item184 is categorised with six features, i.e., action, crime, thriller, reputation6, popularity5, and year8. We found that the user user 22 is interested in all item184's features. This means that the recommendation can be reasonably explained as the intersection between the user's interesting features and the item's features is very high. An explanation of this recommendation could be "We found this movie which has the features (action, crime, thriller, reputation6, popularity5, and year8) in which we think that you are interested in movies with these features. " However, user 22 is interested in only two features out of four features of the item item19. Figure 6(b) illustrates the values of user 22's features against the features of item184 and item19.
• Case study 3 (BC): user 305 was selected randomly and item180 was assigned a high predicted rating while item22 was assigned a low one. In this case study, the recommended item item180 has three features, i.e., reputation7, popularity2, and year9. We found that the user user 305 is interested in all these three features. This reflects a strong relationship between the user and the item, as their intersection features are three out of three item's features. This recommendation could be explained as "This item is recommended to you as we found that you are interested in all its features, which are (reputation7, popularity2, and year) . " However, we found that the user is interested in only one of the features of the unrecommended item item22. Thus, the relationship between the user and the recommended item is stronger than the relationship between the user and the unrecommended item. This can be seen clearly from the intersection features between the user and the two items. Figure 6(c) shows the intersections between user 305's interesting features with the features of item180 and item22. • Case study 4 (AVG): user 71 was randomly selected along with item10056 and item6871, and item10056 was assigned a higher predicted rating than item6871. In AVG dataset, user and item features are categorised into two feature concepts, which means that the maximum possible number of positive item features is two. We found that the user user 71 is interested in both features of the recommended item item10056, which are reputation9 and popularity3. However, the user is not interested in any of the features of the unrecommended item item6871. Therefore, it is reasonable to recommend item10056 to the user user 71, which can be justified as follows: "This item is recommended to you as we found that you are interested in items with features (reputation9 and popularity3) which are the features of this item. " A summary of features intersections between the user and the two items are presented in Figure 6(d). • Case study 5 (AIV): user 308 was randomly selected along with item12379 and item6229, and item12379 was assigned with a higher predicted rating than item6229. All features of the users and items in AIV dataset represent two feature concepts, in which the maximum possible number of positive item features is two. The results show that user 308 is interested in all features of the recommended item item12379, while he/she is not interested in only one of the features of the unrecommended item6229. Thus, it is reasonable to recommend item12379 to the user user 308, in which the recommendation can be justified as: "This item is recommended to you as we found that you like the items with features (reputation8 and popularity1) which both are the features of this item. " The values of the intersection features between user 308 and both item12379 and item6229 are illustrated in Figure 6(e). • Case study 6 (AF): user 833 was randomly selected along with item179203, which was assigned a high predicted rating, and item92798, which was assigned a low predicted rating. Similar to AVG and AIV, all features of the users and items in AF dataset represent two feature concepts, which means that the maximum possible number of positive item features is two. In this case study, we found that the user is interested in both features of the recommended item item179203, which are (reputation10 and popularity1). However, only one feature of the features of the unrecommended item item92798 is of the user's interest. This can explain the reason of recommending item179203 to user 833 as "You are interested in items with features (reputation10 and popularity1), and we found this item which has both features. " The intersection features between the user user 833 and both items item179203 and item92798 are shown in Figure 6(f).
In all case studies, the results show that WSLER is able to provide a reasonable explanation for the recommendations suggested to the users. The relationship between the user and the recommended item (i.e., the item with a high predicted rating) is stronger than the relationship between the user and the item with the low predicted rating in all case studies. In fact, the preformed case studies have examined how much the user is interested in the item features taking into consideration the number of features that describe the item. For example, in case study 1, the user is interested in four out of five features of the recommended item features compared to only one out of four of the unrecommended item's features. Similarly, in case study 2, the user is interested in six features out of six features that describe the item compared to only two out of four of the unrecommended item features. Moreover, case study 3 shows that all features of the recommended item are interesting features to the user, while only one feature out of three of the unrecommended item's features is interesting to the user. In case studies 4, 5, and 6, in which the items are described using two features, the results show that all recommended item features are interesting features to the users in all three cases.
It is also clear from the results in Table 12 that for all case studies, the intersection score between the user and the recommended item is 0.8 or higher, compared to 0.5 for the unrecommended item.
Thus, this means that the users in all cases are interested in all or most of the recommended item features. At the same time, they are interested in only a few or none of the unrecommended item features. These results can explain why such recommendations are performed, i.e., an item can be recommended to a user if there is a strong relationship between the user and the item. This is because our method, WSLER, different from other MF-based methods, adopts useful knowledge for modelling the user and item latent factors, making these factors interpretable. More precisely, the defined meaningful latent factors and the initial values of user and item features can explain how much each user or item is related to each latent factor. We found that these relationships are useful for explaining the recommendations provided by WSLER. In fact, the relationships between user and item that are derived from the initial values of their latent features are compatible with the predicted ratings performed by WSLER, which is the base of the recommendation.

CONCLUSION AND FUTURE WORK
Explainable recommender systems provide users with reasonable justification of what they have been recommended. This explanation plays a vital role in increasing the system reliability. This explanation can be performed using different approaches, such as feature-based explainable recommendations. In this article, we introduced WSLER, an explainable recommender system framework. Inside WSLER, we proposed WSL: an MF-based recommender system model, and ER: an explainable recommendation model. WSL extends the well-performing MF model (i.e., SVD) to produce meaningful latent factors and useful features values and improve the prediction accuracy. At the same time, ER utilises the useful features produced by WSL for explaining the recommendation. In contrast to the existing SVD methods, WSLER exploits only local data without relying on private or outer data for providing accurate and explainable recommendations. We evaluated our proposed method using six real-world benchmark datasets in terms of rating prediction and explainability performance. Our experimental simulation results show that our method outperformed very well in both evaluation aspects.
One limitation of this study is that the experimental evaluation was performed using offline criteria. However, it would be worthy to perform the evaluation using an online evaluation method to investigate the effectiveness of the proposed method on the users' purchasing behaviours. Consequently, the evaluation criteria for the top-n recommendations can be performed to genuinely reflect the users' reaction beyond the top recommendation list. In addition, this study mainly focused on the rating prediction evaluation. However, one of the most important aspects of recommender system research is to investigate how the recommendations are presented to the user. This topic needs further research. It is also essential to address the question of how to best present recommendations to the users.