Given a NumPy array of the following format:
'aa', '1'
'ab', '1'
'ab', '1'
'ba', '2'
'ba', '2'
I can use numpy.unique to histogram the elements of each column individually. Histogramming the first column gives me the count of each unique element in that column:
'aa' = 1; 'ab' = 2; 'ba' = 2.
Histogramming the second column gives me
'1' = 3; '2' = 2.
I want to normalize the output of the first-column histogram by the output of the second-column histogram (each letter's count divided by the count of the number it appears with), giving:
'aa' = 1/3; 'ab' = 2/3; 'ba' = 2/2;
Is there a nice way to achieve this?
CodePudding user response:
Start by getting the unique values and counts of the second column, and turn them into a lookup dict mapping each value to its reciprocal count:
import numpy as np

unique_vals, counts = np.unique(arr[:,1], return_counts=True)
freq = 1 / counts  # reciprocal counts: '1' -> 1/3, '2' -> 1/2
lookup = dict(zip(unique_vals, freq))
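This first step can be checked in isolation with the question's sample data (the array literal below is just that data written out):

```python
import numpy as np

# the sample data from the question
arr = np.array([['aa', '1'],
                ['ab', '1'],
                ['ab', '1'],
                ['ba', '2'],
                ['ba', '2']])

unique_vals, counts = np.unique(arr[:, 1], return_counts=True)
freq = 1 / counts                      # reciprocal of each value's count
lookup = dict(zip(unique_vals, freq))  # {'1': 1/3, '2': 1/2}
```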
Then build a numpy array of the corresponding frequency for each row, and for each unique letter sum the frequencies of the rows where it appears:
freqs = np.vectorize(lookup.get)(arr[:,1])
unique_letters = np.unique(arr[:,0])
{ letter: np.sum(freqs[letter==arr[:,0]]) for letter in unique_letters}
which returns
{'aa': 0.3333333333333333, 'ab': 0.6666666666666666, 'ba': 1.0}
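As an alternative sketch that stays entirely inside NumPy (np.vectorize is a Python-level loop under the hood), the same result can be computed with return_inverse and np.bincount; the name result is just a hypothetical variable for illustration:

```python
import numpy as np

arr = np.array([['aa', '1'],
                ['ab', '1'],
                ['ab', '1'],
                ['ba', '2'],
                ['ba', '2']])

# count second-column values and map each row back to its value's count
vals, inv, cnts = np.unique(arr[:, 1], return_inverse=True, return_counts=True)
row_freq = 1 / cnts[inv]  # per-row weight: 1 / (count of that row's number)

# sum the per-row weights for each unique letter
letters, linv = np.unique(arr[:, 0], return_inverse=True)
totals = np.bincount(linv, weights=row_freq)
result = dict(zip(letters, totals))
```

This gives the same dict as above: {'aa': 1/3, 'ab': 2/3, 'ba': 1.0}.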