Selected outcomes

Introduction

In this work, we looked into the problem of learning from relational data, as studied in the fields of statistical relational learning (SRL) and statistical relational artificial intelligence (StarAI). In particular, we dug into the problem of learning from what we call relational marginals. The text below is a light overview of our recent work (Kuzelka, Davis, and Schockaert, 2017).

Distributions over Possible Worlds (A Brief Intro to SRL)

In SRL, knowledge about objects and relations among them is typically represented using first-order logic formulas. For instance, C = friends(alice,bob) ∧ friends(bob,alice) ∧ smokes(alice) ∧ smokes(bob) is a conjunction denoting that there are two objects alice and bob which are in the friendship relation (which in this case is mutual as there is both friends(alice, bob) and friends(bob, alice)). The conjunction C consists of four atomic formulas (atoms) friends(alice, bob), friends(bob, alice), smokes(alice), and smokes(bob). Similarly, D = smokes(alice) ∨ smokes(bob) is a disjunction, consisting of two atoms, and it denotes that at least one of alice and bob smokes.

Note: We will use the convention from logic programming that names of objects, i.e. constants, start with a lower-case letter and names of variables, e.g. X, Y, with an upper-case letter.

Given a set of objects, e.g. {alice, bob, cecilia, dave, eve}, a possible world is the set of all atoms that are true (if an atom is not present in this set, then we assume it is false). One such possible world is the empty set W1 = {}, which in the example of friends and smokers would mean that no one is friends with anyone else and that no one smokes. A possible world W2 = {smokes(alice), smokes(bob), smokes(cecilia), smokes(dave), smokes(eve)} means that no one is friends with anyone else but everyone smokes. Another example, W3 = {friends(alice,bob), friends(bob,alice)}, means that only alice and bob are friends and no one smokes. The formula F1 = smokes(alice) ∧ smokes(bob) is true only in the possible world W2 and false in the other two worlds. The formula F2 = ∀ X,Y: friends(X,Y) ⇒ friends(Y,X), intuitively meaning that friendship is a symmetric relation, is true in all three of these possible worlds. However, it would be false e.g. in the possible world W4 = {friends(alice,bob)}.
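
To make the semantics concrete, here is a minimal Python sketch (ours, not from the paper; the atom strings and helper names are made up for illustration) that represents the three possible worlds above as sets of atoms and checks the formulas F1 and F2 against them.

```python
# A minimal sketch: possible worlds as Python sets of atom strings.
# An atom is true in a world iff it is a member of the set.

W1 = set()                                                # nobody smokes, no friendships
W2 = {"smokes(alice)", "smokes(bob)", "smokes(cecilia)",
      "smokes(dave)", "smokes(eve)"}                      # everyone smokes, no friendships
W3 = {"friends(alice,bob)", "friends(bob,alice)"}         # only alice and bob are friends

PEOPLE = ["alice", "bob", "cecilia", "dave", "eve"]

def F1(world):
    # smokes(alice) AND smokes(bob)
    return "smokes(alice)" in world and "smokes(bob)" in world

def F2(world):
    # FORALL X,Y: friends(X,Y) => friends(Y,X)   (friendship is symmetric)
    return all(f"friends({y},{x})" in world
               for x in PEOPLE for y in PEOPLE
               if f"friends({x},{y})" in world)

print([F1(w) for w in (W1, W2, W3)])   # [False, True, False]
print([F2(w) for w in (W1, W2, W3)])   # [True, True, True]
```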

Statistical relational learning aims at modelling probability distributions over possible worlds. Various SRL frameworks differ in how they represent these probability distributions. For example, Markov logic networks (Richardson and Domingos, 2006) are models based on exponential-family distributions. A Markov logic network is given by a set of weighted first-order logic formulas. The weights of the formulas are learned by optimizing the likelihood of a given possible world (which plays the role of training data in SRL).
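
As a toy illustration of the exponential-family form of Markov logic networks (this is our own sketch, not how MLN systems actually implement learning or inference; the single formula and its weight are made up), the following snippet enumerates all possible worlds over a two-person domain and computes P(W) ∝ exp(w · n(W)), where n(W) is the number of true groundings of the formula smokes(X) ∧ friends(X,Y) ⇒ smokes(Y) in W.

```python
import itertools, math

# Toy illustration of the exponential-family form of a Markov logic network:
# P(W) is proportional to exp(sum_i w_i * n_i(W)), where n_i(W) is the number
# of true groundings of the i-th weighted formula in the possible world W.

PEOPLE = ["alice", "bob"]
ATOMS = ([f"smokes({x})" for x in PEOPLE] +
         [f"friends({x},{y})" for x in PEOPLE for y in PEOPLE])

def n_friend_smoke(world):
    # number of true groundings of: smokes(X) AND friends(X,Y) => smokes(Y)
    return sum(1 for x in PEOPLE for y in PEOPLE
               if not (f"smokes({x})" in world and f"friends({x},{y})" in world)
               or f"smokes({y})" in world)

WEIGHT = 1.5  # hypothetical weight of the single formula

# enumerate all possible worlds over ATOMS and normalize the scores
worlds = [frozenset(a for a, keep in zip(ATOMS, bits) if keep)
          for bits in itertools.product([False, True], repeat=len(ATOMS))]
scores = {W: math.exp(WEIGHT * n_friend_smoke(W)) for W in worlds}
Z = sum(scores.values())
probs = {W: s / Z for W, s in scores.items()}
print(len(worlds), sum(probs.values()))   # 64 worlds, probabilities sum to ~1.0
```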

Note: What we have described above for Markov logic networks is known as generative learning. There is also discriminative learning with which we do not deal here.

Relational Marginal Distributions (What Are They?)

To start, let us imagine that we have a sample of m nodes from some large social network, together with all the friendship relations among the sampled nodes (represented by friends(a,b) atoms). What can we do with such data that makes sense statistically?

First, we need to have some assumptions about how we got the data (and, of course, the assumptions had better be satisfied; otherwise we would be making wrong inferences). Let us assume that the nodes that we have were sampled uniformly without replacement from the whole social network. Let us further assume that we are also given all the relations among the nodes in the sample (e.g. friendship) and also all attributes of all the nodes in the sample.

Second, we need to define the statistics of the data that we want our model to be based on. Ideally, these statistics should also be easy for humans to interpret. In (Kuzelka, Davis, and Schockaert, 2017) we introduced statistics that we call relational marginals, which are easy to understand and interpret. In what follows, we will base the description on these statistics.

To define relational marginals, we consider the following sampling process: We sample uniformly a set B of k nodes from the network sample (in logic terms, the nodes are constants, i.e. names of the objects), where the pre-fixed integer k, called the width of the relational marginal, is assumed to be much smaller than m, the number of nodes in the sample. Then we take all atoms from the network sample that only mention nodes in B (you can imagine this as taking the subnetwork induced by the k sampled nodes) and put them into a set A. We can then think of the set A as defining a possible world on the domain B. However, the possible worlds produced by this process are defined on different domains (e.g. {alice, bob}, {alice, cecilia}, ...), so we cannot really use them directly to construct any meaningful distribution. Therefore, assuming that the nodes of the network are indistinguishable (for what we want to model), we standardize them. We take the set of integers {1, 2, ..., k} as a uniform representation of the nodes in the sampled sets (recall that the nodes are represented as logic constants and relations as logic atoms). We construct the set of all possible worlds on the domain {1, 2, ..., k} (using the relations and attributes from the domain, e.g. friends(., .) and smokes(.)) that are isomorphic to the sampled possible world given by A. Each of these possible worlds is an equally good representative of A, so we draw one of them uniformly at random (in accordance with the maximum entropy principle). The distribution of the possible worlds standardized in this way is what we call the relational marginal distribution.
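
The sampling process can be summarized in a few lines of Python. The sketch below is ours (the toy network and helper functions are made up for illustration): it draws k nodes uniformly without replacement, restricts the data to atoms that only mention those nodes, and renames the nodes to {1, ..., k} with a uniformly random bijection, which yields a uniformly random member of the corresponding isomorphism class.

```python
import random, re

# A minimal sketch of the sampling process behind relational marginals:
# draw k constants, restrict the big possible world to atoms that only mention
# them, and rename the constants to 1..k with a uniformly random bijection.

NETWORK = {"friends(alice,bob)", "friends(bob,alice)", "friends(bob,cecilia)",
           "smokes(alice)", "smokes(bob)", "smokes(eve)"}
DOMAIN = ["alice", "bob", "cecilia", "dave", "eve"]

def args_of(atom):
    return re.match(r"\w+\((.*)\)", atom).group(1).split(",")

def sample_marginal_world(k):
    B = random.sample(DOMAIN, k)                              # uniform, without replacement
    A = {a for a in NETWORK if set(args_of(a)) <= set(B)}     # induced sub-world
    rename = dict(zip(B, random.sample(range(1, k + 1), k)))  # random bijection B -> {1..k}
    standardize = lambda atom: re.sub(r"\w+(?=[,)])",
                                      lambda m: str(rename.get(m.group(0), m.group(0))),
                                      atom)
    return frozenset(standardize(a) for a in A)

random.seed(0)
print(sample_marginal_world(2))   # a standardized possible world on the domain {1, 2}
```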

A nice thing about relational marginal distributions is that they are distributions on possible worlds. This may seem to be no big deal but it actually allows us to model relational marginal distributions using standard (non-relational) probabilistic techniques without any reference to the actual domain of the large training data (e.g. the social network). For instance, we can exploit ideas from our earlier work (Kuzelka, Davis, and Schockaert, 2016) to obtain a model based on possibilistic logic, which we did in this work. One might ask at this point why the ability to encode the distributions in possibilistic logic should be useful for anything. Before we answer that, let us briefly describe inference in possibilistic logic.

Possibilistic Logic (What Is It?)

Like Markov logic networks, possibilistic logic theories (Lang, Dubois, and Prade, 1991) consist of weighted logic formulas. The main difference thus lies in the inference: how one uses the weighted formulas to derive conclusions from some given evidence. Let us consider a possibilistic logic theory consisting of the three weighted formulas: A = (∀X: bird(X) ⇒ flies(X), 0.5), B = (∀X: penguin(X) ⇒ ¬ flies(X), 1), and C = (∀X: penguin(X) ⇒ bird(X), 1). The set of these three rules is a consistent theory (when we ignore their weights). A possibilistic logic theory which is consistent behaves exactly like a classical logic theory. So we can for instance derive from these rules that ∀ X: ¬ penguin(X). Similarly, if we add to the theory bird(tweety), which we will call evidence, then we can also derive flies(tweety) and also ¬ penguin(tweety).

Where inference in possibilistic logic diverges from inference in classical logic is when the rules in the theory are not consistent. For instance, when we add the evidence penguin(tweety) to the theory consisting of the rules A, B, C from the previous paragraph, the theory becomes inconsistent (i.e. the rules and the evidence cannot be simultaneously satisfied). In such cases, inference in possibilistic logic proceeds as follows. We find the lowest weight w* such that the theory (plus the evidence) becomes consistent when we remove all rules with weight w* or lower. In our example, we have w* = 0.5. The consistent remainder of the possibilistic theory is then treated again as a classical logic theory. That is, in our example, we are left with ∀X: penguin(X) ⇒ ¬ flies(X) and ∀X: penguin(X) ⇒ bird(X), together with the evidence penguin(tweety). From that we can derive e.g. bird(tweety) and ¬ flies(tweety). This last example also illustrates that possibilistic logic is non-monotonic, a property that classical logic does not have.
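
A minimal propositional sketch of this inference procedure (ours; formulas are encoded as Python predicates over truth assignments and consistency is checked by brute force, which is enough for the tiny bird/penguin example) looks as follows.

```python
import itertools

# A minimal propositional sketch of plausible inference in possibilistic logic.
# Formulas are predicates over a truth assignment (a dict atom -> bool);
# consistency is checked by brute force over all assignments.

ATOMS = ["bird", "penguin", "flies"]

def satisfiable(formulas):
    return any(all(f(dict(zip(ATOMS, vals))) for f in formulas)
               for vals in itertools.product([False, True], repeat=len(ATOMS)))

RULES = [
    (lambda m: (not m["bird"]) or m["flies"], 0.5),          # bird => flies
    (lambda m: (not m["penguin"]) or not m["flies"], 1.0),   # penguin => not flies
    (lambda m: (not m["penguin"]) or m["bird"], 1.0),        # penguin => bird
]

def plausible_cut(evidence):
    """Find the lowest weight w* to drop so that the rest is consistent with the evidence."""
    weights = sorted({w for _, w in RULES})
    for cut in [0.0] + weights:                    # 0.0 means "drop nothing"
        kept = [f for f, w in RULES if w > cut]
        if satisfiable(kept + evidence):
            return cut, kept
    return None, []                                # the evidence alone is inconsistent

cut, kept = plausible_cut([lambda m: m["penguin"]])
print(cut)    # 0.5: the bird => flies rule is drowned
print(satisfiable(kept + [lambda m: m["penguin"], lambda m: not m["bird"]]))
# False: bird(tweety) follows from the remaining rules and the evidence penguin(tweety)
```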

Note: The inference described above is only one type of inference that can be performed in possibilistic logic (arguably one of the most useful); it is called plausible inference.

Representing Relational Marginal Distributions as Possibilistic Logic Theories (The How's and Why's)

It is actually quite easy to encode relational marginal distributions in possibilistic logic. To do that, we can use a syntactic variant of the so-called probability-possibility transformation (see e.g. Dubois, Foulloy, Mauris, and Prade, 2003). The syntactic probability-possibility transformation that we used was introduced in our earlier work (Kuzelka, Davis, and Schockaert, 2016), where we observed a connection between density estimation trees (Ram and Gray, 2011) and possibilistic logic. In its most basic form, the transformation takes every possible world W on the domain {1, 2, ..., k} and creates a rule ¬∧W, the negation of the conjunction ∧W that consists of all atoms that are true in W together with the negations of all atoms that are false in W. For example, when k=2 and W = {friends(1, 2), smokes(1)}, then ¬∧W = ¬ friends(1, 2) ∨ ¬ smokes(1) ∨ friends(2, 1) ∨ friends(1, 1) ∨ friends(2, 2) ∨ smokes(2). The transformation then adds to the resulting possibilistic logic theory the weighted rule (¬∧W, 1-pW), where pW is the probability of W (given by the relational marginal distribution).
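
In code, the basic transformation is a one-liner per possible world. The sketch below is ours (the '+'/'-' encoding of literals is an arbitrary choice for illustration): it builds the clause ¬∧W for a given world on the domain {1, 2} and attaches the weight 1-pW.

```python
# A minimal sketch of the basic probability-possibility transformation:
# for each possible world W on {1, ..., k} with probability p_W, emit the
# weighted clause (negation of the full conjunction describing W, 1 - p_W).

K = 2
ALL_ATOMS = ([f"smokes({i})" for i in range(1, K + 1)] +
             [f"friends({i},{j})" for i in range(1, K + 1) for j in range(1, K + 1)])

def negated_conjunction(world):
    """Clause: negative literal for every true atom, positive literal for every false one."""
    return frozenset(("-" if a in world else "+") + a for a in ALL_ATOMS)

def transform(marginal):
    """marginal: dict mapping possible worlds (frozensets of atoms) to their probability p_W."""
    return [(negated_conjunction(W), 1.0 - p) for W, p in marginal.items()]

W = frozenset({"friends(1,2)", "smokes(1)"})
print(sorted(negated_conjunction(W)))
# ['+friends(1,1)', '+friends(2,1)', '+friends(2,2)', '+smokes(2)',
#  '-friends(1,2)', '-smokes(1)']   -- the clause from the example in the text
```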

The result of the transformation described in the previous paragraph is a possibilistic logic theory. A natural question to ask now is in what sense it is a model of the relational marginal distribution. Let us have a possible world W whose probability we want to determine. If we add the evidence ∧W to the theory, the lowest weight of rules that we need to remove to make the theory classically consistent with ∧W is 1-pW, i.e. one minus the probability of the possible world W; equivalently, the possibility degree of W under the standard possibilistic logic semantics, which is based on the principle of least specificity, is exactly pW. It follows that whenever we add some evidence to the possibilistic logic theory (note that the evidence does not have to be a complete description of a possible world), anything that we can derive from the theory together with the evidence must also be true in all most probable worlds that satisfy the evidence. In other words, plausible possibilistic logic inference coincides with the well-known maximum a posteriori (MAP) inference. The advantage of possibilistic logic over other formalisms is that it does so in a way that is still very close to classical logic, as anything derived from it comes with a classical logic proof. This means that inference in the possibilistic logic models is explainable to humans (at least to those who are familiar with classical logic reasoning).
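
For readers who want the one-line computation behind this claim (this is our reading of the construction, using the standard least-specificity semantics): the only rule of the transformed theory that a world W violates is the one built from W itself, so

\[
\pi(W) \;=\; 1 - \max\{\, 1 - p_{W'} \;:\; W \not\models \neg\!\wedge W' \,\} \;=\; 1 - (1 - p_W) \;=\; p_W ,
\]

and hence the worlds with maximal possibility are exactly the most probable ones, which is why plausible inference ranks worlds in the same way as MAP inference.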

The transformation described so far is not ideal, as it merely represents a list of possible worlds ranked by their probabilities. It is easy to observe that possible worlds that are isomorphic to each other have the same probability under the relational marginal distribution. Therefore we can replace the set of individual weighted formulas corresponding to one isomorphism class of possible worlds by a single universally quantified disjunction. For instance, to represent the set of possible worlds isomorphic to W = {friends(1, 2), smokes(1)}, we can use the existentially quantified conjunction ∃ X, Y: ¬ (X = Y) ∧ friends(X, Y) ∧ smokes(X) ∧ ¬ friends(Y, X) ∧ ¬ smokes(Y) ∧ ¬ friends(X, X) ∧ ¬ friends(Y, Y). Its negation (which is the formula that is actually used in the possibilistic logic theory) is the universally quantified disjunction ∀ X, Y: (X = Y) ∨ ¬ friends(X, Y) ∨ ¬ smokes(X) ∨ friends(Y, X) ∨ smokes(Y) ∨ friends(X, X) ∨ friends(Y, Y). This way we can encode exact representations of relational marginal distributions more compactly. In fact, this representation is in some sense worst-case optimal, because a relational marginal distribution has N-1 degrees of freedom, where N is the number of isomorphism classes of possible worlds on the domain {1, 2, ..., k}.
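
The lifting step is easy to mechanize. The following sketch (ours; the string-based formula representation is just for illustration) turns a canonical possible world on {1, ..., k} into the corresponding universally quantified disjunction; for the world {friends(1, 2), smokes(1)} it reproduces the formula from the text, up to the order of the disjuncts.

```python
# A minimal sketch of lifting the world-level clause to one universally quantified
# disjunction per isomorphism class: constants 1..k become variables, and the
# disjuncts X = Y cover the requirement that the variables denote distinct objects.

K = 2
VARS = ["X", "Y", "Z"][:K]
ALL_ATOMS = ([("smokes", (i,)) for i in range(1, K + 1)] +
             [("friends", (i, j)) for i in range(1, K + 1) for j in range(1, K + 1)])

def lift(world):
    """world: set of (predicate, args) pairs on the domain {1, ..., K}."""
    var = {i + 1: v for i, v in enumerate(VARS)}
    disjuncts = [f"{var[a]} = {var[b]}"
                 for a in range(1, K + 1) for b in range(a + 1, K + 1)]
    for pred, args in ALL_ATOMS:
        atom = f"{pred}({', '.join(var[a] for a in args)})"
        disjuncts.append(("¬ " if (pred, args) in world else "") + atom)
    return "∀ " + ", ".join(VARS) + ": " + " ∨ ".join(disjuncts)

print(lift({("friends", (1, 2)), ("smokes", (1,))}))
# ∀ X, Y: X = Y ∨ ¬ smokes(X) ∨ smokes(Y) ∨ friends(X, X) ∨ ¬ friends(X, Y)
#         ∨ friends(Y, X) ∨ friends(Y, Y)
```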

The possibilistic logic representations can be further simplified by pruning redundant rules and by simplifying those that remain; see (Kuzelka, Davis, and Schockaert, 2016) for details. To illustrate how the theories can be simplified, let us assume that every possible world in which someone is friends with himself has probability zero. Then the resulting possibilistic logic encoding of the relational marginal distribution will have to entail the rule ∀ X : ¬ friends(X, X). This then allows us to replace the formula ∀ X, Y: (X = Y) ∨ ¬ friends(X, Y) ∨ ¬ smokes(X) ∨ friends(Y, X) ∨ smokes(Y) ∨ friends(X, X) ∨ friends(Y, Y) by the simpler formula ∀ X, Y: (X = Y) ∨ ¬ friends(X, Y) ∨ ¬ smokes(X) ∨ friends(Y, X) ∨ smokes(Y).
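
For the two-element domain {1, 2}, the soundness of this particular simplification can be checked by brute force. The snippet below (ours, not the Sat4J-based procedure used in the actual implementation) verifies that, on worlds where nobody is friends with himself, the long and the simplified disjunctions agree on every grounding with X ≠ Y.

```python
import itertools

# A small brute-force check that the simplification step is sound on the domain {1, 2}:
# given the higher-ranked rule "nobody is friends with himself", the long and the
# simplified disjunctions are satisfied by exactly the same possible worlds.

ATOMS = ["smokes(1)", "smokes(2)", "friends(1,1)", "friends(1,2)",
         "friends(2,1)", "friends(2,2)"]
PAIRS = [(1, 2), (2, 1)]   # the groundings of (X, Y) with X != Y

def no_self_friendship(w):
    return "friends(1,1)" not in w and "friends(2,2)" not in w

def long_clause(w, x, y):
    return (f"friends({x},{y})" not in w or f"smokes({x})" not in w
            or f"friends({y},{x})" in w or f"smokes({y})" in w
            or f"friends({x},{x})" in w or f"friends({y},{y})" in w)

def short_clause(w, x, y):
    return (f"friends({x},{y})" not in w or f"smokes({x})" not in w
            or f"friends({y},{x})" in w or f"smokes({y})" in w)

worlds = [frozenset(a for a, keep in zip(ATOMS, bits) if keep)
          for bits in itertools.product([False, True], repeat=len(ATOMS))]
print(all(long_clause(w, x, y) == short_clause(w, x, y)
          for w in worlds if no_self_friendship(w) for (x, y) in PAIRS))   # True
```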

Note: The simplification of the theories is performed using efficient SAT solvers (we used Sat4J in our implementation).

Even with the simplifications described in the previous paragraphs, the exact representation of a relational marginal distribution might be intractably large and also difficult to estimate accurately from data. Therefore we also designed an algorithm that directly learns possibilistic-logic models of relational marginal distributions from data. The learning method consists of a maximum-likelihood estimator for determining the weights of the formulas and a component, based on techniques from the field of inductive logic programming, for learning the actual rules.

Once the models for relational marginal distributions are learned, we can use them for prediction. For instance, we may have a piece of evidence about some objects in the domain, e.g. {friends(alice, bob), smokes(bob), friends(monica, neil), friends(neil, owen)}. Supposing that the width of the learned relational marginal distribution was 3, this evidence mentions more objects than the relational marginal distribution directly covers, but that does not prevent us from applying the learned possibilistic model: inference is well defined for the possibilistic logic theory with evidence like this, so we can carry it out as usual. The only thing to keep in mind is that what we derive is then not necessarily a result of MAP inference in the relational marginal; rather, it is, in a certain sense, an extrapolation of relational marginal MAP inference based on the least-specificity principle. Interestingly, when we evaluated this type of inference on relational datasets, the possibilistic-logic method was able to outperform Markov logic networks in terms of the accuracy of its predictions for evidence sets ranging from very small to medium-sized. It is also worth noting that inference in possibilistic logic is typically orders of magnitude faster than in Markov logic networks (for details, see Kuzelka, Davis, and Schockaert, 2017).

Conclusions

Explainable statistical models are important if we want to keep humans in the decision-making loop. The possibilistic-logic encoding of relational marginal distributions that we introduced provides such models for relational domains. And in a world where everything is connected, well..., when we want to model it, we need relational models.