Understanding the event and evidence parameters in bnlearns cpquery-CodePudding

I followed the example for cpquery(fitted, event, evidence, cluster, method = "ls", ..., debug = FALSE) given in the documentation:

#rm(list=ls())
library(bnlearn)

data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)
# the result should be around 0.025.
cpquery(fitted, (B == "b"), (A == "a"))
# out: 0.02692665

It worked as advertised, but I have some confusion about the syntax. For reference, head(learning.test) looks like this:

  A B C D E F
1 b c b a b b
2 b a c a b b
3 a a a a a a
4 a a a a b b
5 a a b c a a
6 c c a c c a

and the total number of rows is 5000.

Now to my confusion: The second argument to cpquery, (B == "b"), is passed to the event parameter and the third argument (A == "a") is passed to the evidence parameter. The documentation says about these two parameters that

The event and evidence arguments must be two expressions describing the event of interest and the conditioning evidence in a format such that, if we denote with data the data set the network was learned from, data[evidence, ] and data[event, ] return the correct observations.

I didn't quite understand how the syntax (B == "b") worked, since I thought B is not defined in the current scope. So given what was said in the documentation above, I took data to be learning.test in my case and then I tried to do learning.test[(B == "b"),], which indeed resulted in the error

Error in `[.data.frame`(learning.test, (B == "b"), ) : 
  object 'B' not found

Given the documentation I thought this line of code should've worked.

I then tried to do learning.test[(learning.test$B=="b"),] which does seem to return a data frame with all rows where the B-column has the value "b". I then tried to use this syntax in the cpquery function:

cpquery(fitted, (learning.test$B == "b"), (learning.test$A == "a"))

but to my surprise this resulted in the error

Error in sampling(fitted = fitted, event = event, evidence = evidence,  : 
  logical vector for evidence is of length 5000 instead of 10000.

I don't understand why I get this error (other than that something goes wrong in the logical sampling used by cpquery), since as far as I can see I am conforming to the quoted passage from the documentation.

So in conclusion I wonder why does the example from the documentation work when it seems to not conform to the quoted passage from the documentation, and why does my modification not work even though it seems to conform to the quoted passage in the documentation? (I am not very familiary with R so maybe this is very basic).

CodePudding user response：

I think what that section is just trying to relate what the evidence and event expressions represent in the original data and how they relate to logic sampling. e.g. if you subset the original data with the event (evidence), then it will subset the data to only contains rows that agree with the event (evidence).

But you are correct that the notation data[evidence, ] won't work. You can uselearning.test[eval(expression(B == "b"), learning.test), ], which may help you understand the intent. This is similar to what used to appear in older help files.

For logic sampling (the default inference method) a bunch of samples are generated from the BN, the default is N=10000 samples (*). So 1) you have estimated a parameterised BN, 2) draw N samples from it, 3) subset this sample according to event / evidence vectors to calculate some conditional probability (hopefully correctness is not lost in my brevity).

cpquery(fitted, (learning.test$B == "b"), (learning.test$A == "a")) doesn't work as the default number of samples in the sampled BN (step 2. above) is 10000 but learning.test has 5000 rows. You can force your approach to work, (by work I mean produce an answer), by setting the n parameter , cpquery(fitted, (learning.test$B == "b"), (learning.test$A == "a"), n=5000) but I am not sure this will always do what you expect (and I think that it should probably throw an error). Best to stick with the documented approaches.

(*) From the help I'd expect the default number of samples to be 5000 * log10(nparams(fitted))) =~ 8000, but perhaps batch = 10000 takes precedence.