"Spotting" probability density functions of distributions programmatically (Symbolic Toolb-CodePudding

I have a joint probability density f(x,y,z) and I wish to find the conditional distribution X|Y=y,Z=z, which is equivalent to treating x as data and y and z as parameters (constants).

For example, if I have X|Y=y,Z=z being the pdf of a N(1-2y,3z^2 2), the function would be:

syms x y z
f(y,z) = 1/sqrt(2*pi*(3*z^2 2)) * exp(-1/(2*(3*z^2 2)) * (x-(1-2*y))^2);

I would like to compare it to the following:

syms mu s L a b
Normal(mu,s) = (1/sqrt(2*pi*s^2)) * exp(-1/(2*s^2) * (x-mu)^2);
Exponential(L) = L * exp(-L*x);
Gamma(a,b) = (b^a / gamma(a)) * x^(a-1)*exp(-b*x);
Beta(a,b) = (1/beta(a,b)) * x^(a-1)*(1-x)^(b-1);

Question

How do I make a program whichDistribution that would be able to print which of these four, f is equivalent to (up to proportionality) with respect to the variable x, and what are the parameters? E.g. f and x as above, the distribution is Normal, mu=1-2*y, s=3*z^2 2.

NB: there would not always be a unique solution, since some distributions are are equivalent (e.g. Gamma(1,L)==Exponential(L))

Desired outputs

syms x y z
f = 1/sqrt(2*pi*(3*z^2 2)) * exp(-1/(2*(3*z^2 2)) * (x-(1-2*y))^2)
whichDistribution(f,x) %Conditional X|Y,Z
% Normal(1-2*y,3*z^2 2)

syms x y
f = y^(1/2)*exp(-(x^2)/2 - y/2 * (1 (4-x)^2 (6-x)^2)) % this is not a pdf because it is missing a constant of proportionality, but it should still work

whichDistribution(f,x)  %Conditional X|Y
% Normal(10*y/(2*y 1), 1/(2*y 1))

whichDistribution(f,y)  %Conditional Y|X
% Gamma(3/2, x^2 - 10*x   53/2)

f = exp(-x) %also missing a constant of proportionality
whichDistribution(f,x)
% Exponential(1)

f = 1/(2*pi)*exp(-(x^2)/2 - (y^2)/2)
whichDistribution(f,x)
% Normal(0,1)
whichDistribution(f,y)
% Normal(0,1)

What I have tried so far:

Using solve():

q = solve(f(y,z) == Normal(mu,s), mu, s)

Which gives wrong results, since parameters can't depend on x:

>> q.mu
ans =
(z1^2*(log((2^(1/2)*exp(x^2/(2*z1^2) - (x   2*y - 1)^2/(6*z^2   4)))/(2*pi^(1/2)*(3*z^2   2)^(1/2)))   pi*k*2i))/x
>> q.s
ans =
z1

Attempting to simplify f(y,z) up to proportionality (in x variable) using a propto() function that I wrote:

>> propto(f(y,z),x)
ans =
exp(-(x*(x   4*y - 2))/(2*(3*z^2   2)))

>> propto(Normal(mu,s),x)
ans =
exp((x*(2*mu - x))/(2*s^2))

This is almost on the money, since it is easy to spot that s^2=3*z^2 2 and 2*mu=-(4*y - 2), but I don't know how to deduce this programmatically.

In case it is useful: propto(f,x) attempts to simplify f by dividing f by children of f which don't involve x, and then output whichever form has the least number of children. Here is the routine:

function out = propto(f,x)
oldf = f;
newf = propto2(f,x);
while (~strcmp(char(oldf),char(newf))) % if the form of f changed, do propto2 again. When propto2(f) == f, stop
    oldf = newf;
    newf = propto2(oldf,x);
end
out = newf;
end

function out = propto2(f,x)
t1 = children(expand(f)); % expanded f
i1 = ~has([t1{:}],x);
out1 = simplify(f/prod([t1{i1}])); % divides expanded f by terms that do not involve x

t2 = children(f); % unexpanded f
i2 = ~has([t2{:}],x);
out2 = simplify(f/prod([t2{i2}])); % divides f by terms that do not involve x

A = [f, symlength(f); out1, symlength(out1); out2, symlength(out2)];
A = sortrows(A,2); % outputs whichever form has the fewest number of children
out = A(1,1);
end

function L = symlength(f)
% counts the number of children of f by repeatingly applying children() to itself
t = children(f);
t = [t{:}];
L = length(t);
if (L == 1)
    return
end
oldt = f;
while(~strcmp(char(oldt),char(t)))
    oldt = t;
    t = children(t);
    t = [t{:}];
    t = [t{:}];
end
L = length(t);
end

edit: added desired outputs

edit2: clarified the desired function

CodePudding user response：

I came to a possible solution in the case of multiplicative separability: G(x,y,z) = g(x)*h(y,z) In which case, we divide the function by the various PDF's appropriately scaled using the moments of the input distribution. Note: Here, x is the "special" variable. In the case that you're looking for f(Y|x,z) instead of f(X|y,z): well.. for the moment, you'll have to rename the variables. The variable that we will consider is "x".

I use sympy and Python because I don't have access to matlab (you said in a comment that it would still be OK). As far as I remember, it should be pretty straightforward to implement it using the Matlab's Symbolic Math Toolbox.

Notes:

It would probably be possible to extend the code for additive separability: (G,x,y,z) = g(x) h(y,z)
It can only work if the supplied function can be fully integrated on all variables on the specified domain (which is the case for a valid PDF) because we use the moments of the distribution.
The method relies on our ability to express the parameters of the PDF's in terms of the moments of the distribution (mu1, mu2): mu1=E(x*f(x,y,z)) and mu2=E((x-mu1)**2*f(x,y,z))

For the Normal(x; mu, sigma): it's trivial ==> mu=mu1, sigma=mu2

For the Exponential(x; lambda): also trivial ==> lambda = mu1

For the Gamma(x; alpha, lambda): less trivial but ==> alpha = mu1**2/mu2, lambda= mu1/mu2

For the Beta_typeI(x; a,b): much less trivial... I show how I did it in the appendix section in the code.

The main trick is that, once the parameters of the PDF's have been expressed with the moments of the input functions, they should match the input function (up to a multiplicative term). The Normal(x; mu,sigma) case is the most trivial illustration of that idea.

Now, what happens for a PDF like the generalized gamma distribution?

GeneralizedGamma(x; alpha, lambda, tau) = tau*(x/lambda)**(alpha*tau-1)*exp(-(x/lambda)**tau)/lambda/gamma(alpha)

Well... it has 3 parameters. So, we would need 3 moments (mu1, mu2, mu3). Then express these 3 parameters in terms of the 3 moments and this... in a CLOSED FORM! I have no idea if this possible or not. My first guess would be that it is unlikely. I don't know. I haven't tried. By the way, this is why the program below only deals with 1 or 2 parameters PDF.

Also, for each PDF, we need to do a bit of work. A work which could probably be "mechanized" but it would require quiet a lot of programming.

It's better (easier) to use the program in a Notebook and to put each section in a different cell.
The program has not been extensively tested. If you find a bug, please, don't yell at me. I'd be glad to hear of it and try to correct my answer (if I'm able to...) :-)

Running the examples contained in the code:

1) f = 1/sqrt(2*pi*5/2) * exp(-(3*z**2)/2 - x**2/2-(2*y)**2)
Result:
f(X|y,z) is a Normal(x;mu,sigma) distribution with parameters: 0 and 1.00000000000000

2) f=27*exp(-x/3 1)*exp(-(y**2 z**2))
(Note: I defined the exp distrib. as 1/lambda*exp(-x/lambda))
Result:
f(X|y,z) is a Exp(x;lambda) distribution with parameters: 3.00000000000000
f(X|y,z) is a Gamma(x; alpha, lambda) distribution with parameters: 1.00000000000000 and 0.333333333333333
which is correct since Gamma(x;alpha=1, lambda) is indeed the Exp(x; lambda) distribution

3) f=11/13*x*exp(-x/2)*exp(-(y**2 z**2)) #(alpha=2, Lambda=1/2)
Result:
f(X|y,z) is a Gamma(x; alpha, lambda) distribution with parameters: 2.00000000000000 and 0.500000000000000

4) f=22/13*(x)**(3/2)*(1-x)**(4/3)*exp(-(y**2 z**2))
Result:
f(X|y,z) is a Beta(x; a,b) distribution with parameters: 2.50000000000000 and 2.33333333333333

5) f=3.7*exp(-x/2-y**2) 
Result:
f(X|y,z) is a Gamma(x; alpha, lambda) distribution with parameters: 1.00000000000000 and 0.500000000000000
# Notes: 
- Since here we don't have z, I should print "f(X|y)" instead of "f(X|y,z)". 
- This time it finds the Gamma with alpha=1 but not the exponential. The answer is nevertheless correct but... Why? No idea.

In short, there is room for improvement even if I'm not fully convinced that it is the best way to solve your problem. If you have questions, I'll be glad to try to answer them.

The code:

# SECTION: IMPORTS AND SYMBOLS

import sympy
import re
from sympy import oo, integrate, simplify, symbols, Rational,                   Abs, expand, solve, beta, exp, sqrt, pi,                   gamma
# Symbols used:
x,y,z, Xmin,Ymin,Zmin, Xmax, Ymax, Zmax= symbols('x y z Xmin Ymin Zmin Xmax Ymax Zmax',real=True)
mu1 = symbols('mu1', real=True) # First moment of the distribution
mu2 = symbols('mu2', real=True, positive=True)  # Second moment of the distrib. always >0
C=symbols('C', real=True, positive = True)

epsilon = 1e-10 # tolerance for sympy.nsimplify()

# SECTION: DEFINITIONS OF PDF:

# Distributions with parameters expressed in terms of moments mu1 and mu2:
# See APPENDIX for an example with the Beta_typeI function
# Note: Distributions which have more than 2 parameters will need more moments
# and I'm not sure if the parameters can always be expressed in terms of mu1, mu2, ... 
# in a closed form. Probably not...

Normal = 1/sqrt(2*pi*mu2) * exp(-(x-mu1)**2/2/mu2);
Exponential = 1/mu1*exp(-x/mu1); #  # note the definition which is different from the L*exp(-L*x) 
Gamma=(mu1/mu2)**(mu1**2/mu2)/gamma(mu1**2/mu2)*x**(mu1**2/mu2-1)*exp(-mu1/mu2*x) #  mu1=alpha/lambda, mu2=alpha/lambda**2
Beta_typeI=x**(mu1*(-mu1**2   mu1 - mu2)/mu2 - 1)*(1 - x)**(-1   (1 - mu1)*(-mu1**2   mu1 - mu2)/mu2)/beta(mu1*(-mu1**2   mu1 - mu2)/mu2, (1 - mu1)*(-mu1**2   mu1 - mu2)/mu2)

#         [Function, Name , param(1), param(2)] (expressed in terms of mu1, mu2 )

pdf_list=[[Normal,'Normal(x;mu,sigma)', mu1, mu2], \
          [Exponential, 'Exp(x;lambda)', mu1,mu2], \
          [Gamma, 'Gamma(x; alpha, lambda)', mu1**2/mu2, mu1/mu2],\
          [Beta_typeI, 'Beta(x; a,b)', mu1*(-mu1**2 mu1-mu2)/mu2,-(mu1 - 1)*(-mu1**2   mu1 - mu2)/mu2 ]]

# SECTION: FUNCTIONS
    
def fullIntegration(f, Domain):
    # Detect the number of variables and perform full integration on the domain
    if f.has(x) and f.has(y) and f.has(z): # f has 3 variables x,y,z
        Xmin, Xmax, Ymin, Ymax, Zmin, Zmax = Domain
        return sympy.re(integrate(integrate(integrate(f, (x,Xmin,Xmax)),(y,Ymin,Ymax)),(z,Zmin,Zmax)))

    elif f.has(x) and f.has(y):
        Xmin, Xmax, Ymin, Ymax = Domain[0:4] # f has 2 variables x,y
        return sympy.re(integrate(integrate(f, (x,Xmin,Xmax)),(y,Ymin,Ymax)))

    elif f.has(x): 
        Xmin, Xmax= Domain[0:2] # f has 1 variable x
        return sympy.re(integrate(f, (x,Xmin,Xmax)))
    else:
        print("ERROR: f must at least have x as an argument")
        return None

def check_if_pdf(f,Domain):
    # ------------------------------------------------------------------------------
    # We need to check if f is indeed a pdf. Otherwise, we can't compute moments.
    # ------------------------------------------------------------------------------
    # Note: we only check that full integration of f over the variables is finite.
    # In principle, we should check that f is positive everywhere but it is strangely
    # difficult (sympy can't find critical points in some cases). So we leave it
    # to the user and assume that the first check is enough.
    
    Integral=fullIntegration(f, Domain)
    if Abs(Integral) == oo:
        print("This does not represent a valid prob. density. func. :")
        print("Int over domain is infinite.")
        print("Sorry, we stop here.")
        return [False, f, -1, -1]
    else:
        # We normalize f
        C = 1/Integral
        f = simplify(C*f)   
        # Compute first and second moments:
        Mu1=fullIntegration(x*f, Domain)
        Mu2=fullIntegration((x-Mu1)**2*f,Domain)
        return [True, f, Mu1, Mu2]

def find_pdf(expr, Domain, pdf_list):
    # First we check if expr is a pdf
    isPDF, expr, Mu1, Mu2 =check_if_pdf(expr, Domain)
    if isPDF:
        for w in pdf_list:
            pdf=w[0].subs(mu1,Mu1).subs(mu2,Mu2)
            pdf_name=w[1]
            
            has_more_than_one_argument=pdf_name.find(',')
            # A bit sketchy... To solve the case of single parameter PDF (like exp()
            if has_more_than_one_argument<0:
                has_more_than_one_argument=False
            else:
                has_more_than_one_argument=True
                
            arg1=w[2].subs(mu1,Mu1).subs(mu2,Mu2)
            if has_more_than_one_argument:
                arg2=w[3].subs(mu1,Mu1).subs(mu2,Mu2)
            else:
                arg2=arg1 # dummy arg2
            
            if has_more_than_one_argument:
                if not expand(sympy.nsimplify(simplify(expr)/simplify(pdf),tolerance=epsilon)).has(x):
                    if arg2.evalf()!=0: # No valid pdf has ẑero variance 
                        print("f(X|y,z) is a " pdf_name " distribution with parameters: " str(arg1.evalf()) " and " str(arg2.evalf()))
            else:
                if not expand(sympy.nsimplify(simplify(expr)/simplify(pdf),tolerance=epsilon)).has(x):
                    if arg2.evalf()!=0: # No valid pdf has ẑero variance 
                        print("f(X|y,z) is a " pdf_name " distribution with parameters: " str(arg1.evalf()))


# SECTION: RUNNING THE EXAMPLES

# Choose among the five examples:

# Ex. 1) 
f = 1/sqrt(2*pi*5/2) * exp(-(3*z**2)/2 - x**2/2-(2*y)**2)
Domain=[-oo, oo,-oo, oo, -oo, oo] # Domain for f (xmin, xmax, ...,zmin,zmax)

# Ex. 2)
#f=27*exp(-x/3 1)*exp(-(y**2 z**2))
#Domain=[0, oo,-oo, oo,-oo, oo] 

# Ex. 3) (using gamma_alpha_lambda with lambda=.5, alpha=3)  # We don't care about normalization
# Note; Gamma = Lambda**alpha/gamma(alpha)*x**(alpha-1)*exp(-Lambda*x)
#f=11/13*x*exp(-x/2)*exp(-(y**2 z**2)) #(alpha=2, Lambda=1/2)
#Domain=[0, oo,-oo, oo,-oo, oo]

# Ex. 4) using Beta_typeI distribution with a=5/2, b=7/3
# Note: Beta_typeI = (1/beta(a,b)) * x**(a-1)*(1-x)**(b-1);
#f=22/13*(x)**(3/2)*(1-x)**(4/3)*exp(-(y**2 z**2))
#Domain=[0,1,-oo, oo,-oo, oo] # domain for x,y,z

# Ex. 5) # To show that we don't need x,y,z 
#f=3.7*exp(-x/2-y**2)
#Domain=[0,  oo,-oo, oo] # We only specify domain for x ,y


find_pdf(f,Domain, pdf_list)
print()
print("END")
print("-----------------------------------------------------------------")

# END of program
# -----------------------------------------------------------------------


# SECTION: APPENDIX

print("-----------------------------------------------------------------")
print("APPENDIX (comments in the code)")
print("-----------------------------------------------------------------")
# APPENDIX:
# Goal: To express Beta_typeI(x; a,b) as a functions of (x; mu1, mu2)
# For the function: Beta_typeI = 1/beta(a,b) * x**(a-1)*(1-x)**(b-1) # with a,b>0 and 0<x<1
# Here things become more complicated...
# Again, we need to express Beta_typeI as a function of x, mu1 and mu2:
# For E(X) and Var(X) see page 40 in the Handbook on probability distributions:
# https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/inst/doc/probdistr-main.pdf?revision=24&root=distributions&pathrev=24
# E(X) = mu1 = a/(a b) and Var(X) = mu2 = a*b/(a b)^2/(a b 1)
# We solve for a,b in terms of mu1, mu2
a,b = symbols('a b', real=True, positive=True)
mu1 = symbols('mu1', real=True)
mu2 = symbols('mu2', real=True, positive=True)
print(solve(mu1-a/(a b),b))
# Gives: b = -a*(mu1-1)/mu1
# substitute b in expression for mu2:
print(simplify((a*b/(a b)**2/(a b 1)).subs(b, -a*(mu1-1)/mu1)))
# Gives: mu2=mu1**2*(1-mu1)/(a mu1)
# Solve for a:
print("a = ",solve(mu2-mu1**2*(1-mu1)/(a mu1),a)[0])
# Gives: a=mu1*(-mu1**2 mu1-mu2)/mu2
# Substitute a in expression for b:
print("b = ",(-a*(mu1-1)/mu1).subs(a, mu1*(-mu1**2 mu1-mu2)/mu2))
# Gives: b= -(mu1 - 1)*(-mu1**2   mu1 - mu2)/mu2

# Definition of the Beta_typeI function:
Beta_typeI = 1/beta(a,b) * x**(a-1)*(1-x)**(b-1) # with a,b>0 and 0<x<1
# Substituting a and b in Beta_typeI gives:

Beta_typeI=Beta_typeI.subs(a, mu1*(-mu1**2 mu1-mu2)/mu2).subs(b,-(mu1 - 1)*(-mu1**2   mu1 - mu2)/mu2) 
#print(Beta_typeI)
# We substitute in the above expression:
# Gives: x**(mu1*(-mu1**2   mu1 - mu2)/mu2 - 1)*(1 - x)**(-1   (1 - mu1)*(-mu1**2   mu1 - mu2)/mu2)/beta(mu1*(-mu1**2   mu1 - mu2)/mu2, (1 - mu1)*(-mu1**2   mu1 - mu2)/mu2)
print("Beta_typeI = ",Beta_typeI)

CodePudding user response：

I have managed to solve my own problem using solve() from Symbolic Toolbox. There were two issues with my original approach: I needed to set up n simultaneous equations for n parameters, and the solve() doesn't cope well with exponentials:

solve(f(3) == g(3), f(4) == g(4), mu,s)

yields no solutions, but

logf(x) = feval(symengine,'simplify',log(f),'IgnoreAnalyticConstraints');
logg(x) = feval(symengine,'simplify',log(g),'IgnoreAnalyticConstraints');
solve(logf(3) == logg(3), logf(4) == logg(4), mu,s)

yields good solutions.

Solution

Given f(x), for each PDF g(x) we attempt to solve simultaneously

log(f(r1)) == log(g(r1)) and log(f(r2)) == log(g(r2))

for some simple non-equal numbers r1, r2. Then output the solution with lowest complexity.

The code is:

function whichDist(f,x)

syms mu s L a b x1 x2

f = propto(f,x); % simplify up to proportionality
logf(x) = feval(symengine,'simplify',log(f),'IgnoreAnalyticConstraints');
Normal(mu,s,x) = propto((1/sqrt(2*pi*s)) * exp(-1/(2*s) * (x-mu)^2),x);
Exponential(L,x) = propto(L * exp(-L*x), x);
Gamma(a,b,x) = propto((b^a / gamma(a)) * x^(a-1)*exp(-b*x), x);
Beta(a,b,x) = propto((1/beta(a,b)) * x^(a-1)*(1-x)^(b-1), x);

best_sol = {'none', inf};

%% check Exponential:
logexp = feval(symengine,'simplify',log(Exponential(L,x)),'IgnoreAnalyticConstraints');
logf = logf(x);
if (propto(logf,x) == x) % pdf ~ exp(K*x), can read off Lambda directly
    fprintf('\nExponential: rate L = %s\n\n', -logf/x);
    return
end
logf(x) = logf;

%% check Normal:
logn(x) = feval(symengine,'simplify',log(Normal(mu,s,x)),'IgnoreAnalyticConstraints');
r1 = randi(10); r2 = randi(10);
while (r1 == r2) r1 = randi(10); r2 = randi(10); end
try % attempt to solve the equation
    % solve simultaneously for two random non-equal integer values r1,r2
    qn = solve(logf(r1) == logn(r1), logf(r2) == logn(r2), mu, s);
catch error
end
if (exist('qn','var')) % if solve() managed to run
    if (~isempty(qn.mu) && ~isempty(qn.s)) % if solution exists
        complexity = symlength(qn.mu)   symlength(qn.s);
        if complexity < best_sol{2} % store best solution so far
            best_sol{1} = sprintf('Normal: mu = %s, s^2 = %s', qn.mu, qn.s);
            best_sol{2} = complexity;
        end
    end
end

%% check Gamma:
logg(x) = feval(symengine,'simplify',log(Gamma(a,b,x)),'IgnoreAnalyticConstraints');
try
    qg = solve(logf(r1) == logg(r1), logf(r2) == logg(r2), a, b);
catch error
end
if (exist('qg','var'))
    if (~isempty(qg.a) && ~isempty(qg.b))
        complexity = symlength(qg.a)   symlength(qg.b);
        if complexity < best_sol{2}
            best_sol{1} = sprintf('Gamma: a = %s, b = %s', qg.a, qg.b);
            best_sol{2} = complexity;
        end
    end
end

%% check Beta:
logb(x) = feval(symengine,'simplify',log(Beta(a,b,x)),'IgnoreAnalyticConstraints');
r1 = randi(9)/10; r2 = randi(9)/10; % 0<r<1 if r is from Beta(a,b)
try
    qb = solve(logf(r1) == logb(r1), logf(r2) == logb(r2), a, b);
catch error
end
if (exist('qb','var'))
    if (~isempty(qb.a) && ~isempty(qb.b))
        complexity = symlength(qb.a)   symlength(qb.b);
        if complexity < best_sol{2}
            best_sol{1} = sprintf('Beta: a = %s, b = %s', qb.a, qb.b);
            best_sol{2} = complexity;
        end
    end
end

%% Print solution that has lowest complexity
fprintf('\n%s\n\n', best_sol{1});

end

Tests:

>> syms x y z
>> f = y^(1/2)*exp(-(x^2)/2 - y/2 * (1 (4-x)^2 (6-x)^2))
>> whichDist(f,x)
Normal: mu = (10*y)/(2*y   1), s^2 = 1/(2*y   1)
>> whichDist(f,y)
Gamma: a = 3/2, b = x^2 - 10*x   53/2
>> Beta(a,b,x) = propto((1/beta(a,b)) * x^(a-1)*(1-x)^(b-1), x);
>> f = Beta(1/z   7*y/(1-sqrt(z)), z/y   1/(1-z), x)
Beta: a = -(7*y*z - z^(1/2)   1)/(z*(z^(1/2) - 1)), b = -(y   z - z^2)/(y*(z - 1))

All correct.

Sometimes bogus answers if the parameters are numeric:

whichDist(Beta(3,4,x),x)
Beta: a = -(pi*log(2)*1i   pi*log(3/10)*1i - log(2)*log(3/10)   log(2)*log(7/10) - log(3/10)*log(32)   log(2)*log(1323/100000))/(log(2)*(log(3/10) - log(7/10))), b = (pi*log(2)*1i   pi*log(7/10)*1i   log(2)*log(3/10) - log(2)*log(7/10) - log(7/10)*log(32)   log(2)*log(1323/100000))/(log(2)*(log(3/10) - log(7/10)))

So there is room for improvement and I will still award bounty to a better solution than this.