Marginal Density Probability using np-CodePudding

    P = np.array(
    [
        [0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
         0.03776221, 0.00131325, 0.03760817, 0.01770659],
        [0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
         0.04778769, 0.01021053, 0.00324185, 0.02475319],
        [0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
         0.02187814, 0.01925662, 0.0196836 , 0.01996279],
        [0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
         0.02352593, 0.00300314, 0.00103487, 0.04071951],
        [0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
         0.02032679, 0.02536328, 0.03552956, 0.01107725]
    ]
)

Here's the dataset where X corresponds to rows and Y corresponds to columns. I'm trying to figure out how to calculate the Covariance and Marginal Density Probability (MDP) for Y (columns). for the Covariance I have the following.

np.cov(P)
array([[ 2.247e-04,  6.999e-05,  2.571e-05, -2.822e-05,  1.061e-04],
       [ 6.999e-05,  2.261e-04,  9.535e-07,  8.165e-05, -2.013e-05],
       [ 2.571e-05,  9.535e-07,  7.924e-05,  1.357e-05, -8.118e-05],
       [-2.822e-05,  8.165e-05,  1.357e-05,  2.039e-04, -1.267e-04],
       [ 1.061e-04, -2.013e-05, -8.118e-05, -1.267e-04,  2.372e-04]])

How do I get the MDP? Also, is there a way to use numpy to select just the X and Y vals and assign them to variables where X= P's rows and Y=P's columns?

CodePudding user response：

The data stored in P are a little ambiguous. In statistics, the X and Y have a very specific meaning. Usually, each row refers to one observation (i.e. datapoint) of some statistical object while the column represents a feature that is measured for each statistical object. In your case, there would be 9 observations with 5 features. This is referred to as a

2. Nonparametric density estimation

Alternatively, you can use a histogram as a non-parametric estimator of the unknown probability density functions (of each column/feature). Note, however, that you still have to choose a bandwidth h that determines the width of the bins. Additionally, non-parametric tools require a larger sample size to provide accurate estimates. Your sample size 9 is likely insufficient.

import numpy as np
from matplotlib import pyplot as plt

# endpoints on all bins: implies bandwith h=0.00229
bins = np.linspace(0.0, 0.055, 25) 
h = np.diff(bins)[0]

# histograms
for i in range(5):
    plt.hist(P[:,i], bins, alpha=0.5, label='Col. {}'.format(i 1))

plt.suptitle('Marginal Distribution Estimates', fontsize=21, y=1.025)
plt.title('nonparametric via histograms (h={})'.format(round(h, 4)), fontsize=14)
plt.legend(loc='upper right')
plt.show()