How to process a dictionary of strings in Python

I have a dictionary whose keys follow the string patterns informationGain_$index$ and threshold_$index$. My goal is to retrieve the maximum informationGain_$index$ together with the threshold_$index$ that shares its index.

An example dictionary looks like so:

{'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0, 'informationGain_1': 0.9757921620455572, 'threshold_1': 12.5, 'informationGain_2': 0.7272727272727273, 'threshold_2': 11.5, 'informationGain_3': 0.5509775004326937, 'threshold_3': 8.6, 'informationGain_4': 0.9838614413637048, 'threshold_4': 7.0, 'informationGain_5': 0.9512050593046015, 'threshold_5': 6.0, 'informationGain_6': 0.8013772106338303, 'threshold_6': 5.9, 'informationGain_7': 0.9182958340544896, 'threshold_7': 1.5, 'informationGain_8': 0.0, 'threshold_8': 9.0, 'informationGain_9': 0.6887218755408672, 'threshold_9': 7.8, 'informationGain_10': 0.9182958340544896, 'threshold_10': 2.1, 'informationGain_11': 0.0, 'threshold_11': 13.5}

I wrote the following code to generate the dictionary.

def entropy_discretization(s):

    I = {}
    i = 0
    while(uniqueValue(s)):
        # Step 1: pick a threshold
        threshold = s['A'].iloc[0]

        # Step 2: Partition the data set into two partitions
        s1 = s[s['A'] < threshold]
        print("s1 after splitting")
        print(s1)
        print("******************")
        s2 = s[s['A'] >= threshold]
        print("s2 after splitting")
        print(s2)
        print("******************")
            
        # Step 3: calculate the information gain.
        informationGain = information_gain(s1,s2,s)
        I.update({f'informationGain_{i}':informationGain,f'threshold_{i}': threshold})
        print(f'added informationGain_{i}: {informationGain}, threshold_{i}: {threshold}')
        s = s[s['A'] != threshold]
        i += 1

    print(I)

In the example dictionary, the maximum information gain is informationGain_0, which is associated with threshold_0. I would like a general way of identifying these key-value pairs in the dictionary. Is there a way to search the dictionary so that I get back informationGain_* and threshold_* where informationGain_* is the maximum?

CodePudding user response：

Here is a solution using a custom key with max. It works even if the dictionary is not sorted. This is assuming the input dictionary is named d.

M = max((k for k in d if k.startswith('i')),
        key=lambda x: d[x])
T = f'threshold_{M.rsplit("_")[-1]}'
out = {M: d[M], T: d[T]}

Output:

{'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0}

NB. I used a simple test on the dictionary keys, keeping those that start with i, to identify the informationGain_X keys. If you have a more complex real-life dictionary, you might want to update this to use a full match or any other way of making the identification of the keys unambiguous.
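As a rough sketch of the "full match" idea mentioned in the note, the keys can be filtered with a regular expression instead of startswith. The sample dictionary reuses the first two pairs from the question; the extra 'iterations' key is hypothetical and only there to show a key that a plain startswith('i') test would wrongly pick up.

```python
import re

# First two pairs from the question, plus a hypothetical unrelated key
# that also starts with "i" and would fool a startswith('i') test.
d = {
    'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0,
    'informationGain_1': 0.9757921620455572, 'threshold_1': 12.5,
    'iterations': 12,
}

# Only keys of the exact form informationGain_<digits> are candidates.
pattern = re.compile(r'informationGain_(\d+)')

gains = {}
for key, value in d.items():
    m = pattern.fullmatch(key)
    if m:
        gains[m.group(1)] = value  # map index (as a string) -> gain

best = max(gains, key=gains.get)  # index of the largest information gain
out = {
    f'informationGain_{best}': gains[best],
    f'threshold_{best}': d[f'threshold_{best}'],
}
print(out)  # {'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0}
```

Because the index is captured by the pattern, the matching threshold key can be rebuilt directly without splitting the key name.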

CodePudding user response：

I've also found a way of doing this. It just took a few tries.

    n = len(I) // 2  # number of (informationGain, threshold) pairs; range(0, n) then covers every index
    print("Calculating maximum threshold")
    print("*****************************")
    maxInformationGain = 0
    maxThreshold       = 0 
    for i in range(0, n):
        if(I[f'informationGain_{i}'] > maxInformationGain):
            maxInformationGain = I[f'informationGain_{i}']
            maxThreshold       = I[f'threshold_{i}']

    print(f'maxThreshold: {maxThreshold}, maxInformationGain: {maxInformationGain}')

CodePudding user response：

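One way to do this, assuming your dictionary is named d:

values = list(d.values())
# Pair each information gain with the threshold stored right after it, then take the pair with the largest gain
informationGain_max, threshold_max = max(zip(values[::2], values[1::2]), key=lambda pair: pair[0])

Here threshold_max is the threshold paired with the largest information gain; taking max() over the thresholds on their own would return the largest threshold, which is not necessarily the one that belongs to that gain. This only works if the dict preserves insertion order (an implementation detail of CPython 3.6, guaranteed by the language since Python 3.7) and if the keys were inserted as alternating informationGain/threshold pairs.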

CodePudding user response：

Let's make a list in which each member is a tuple (or list) of two elements: first the information gain, then the threshold. We can sort this list with either the list's .sort() method or the sorted() function. Because tuples are compared element by element, the last tuple of the sorted list will contain the values you seek. If you are also interested in the index of these values, add it as a third element of each tuple.
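A minimal sketch of that idea, assuming the dictionary is named I as in the question and that the indices run from 0 to len(I) // 2 - 1 without gaps. The sample holds only the first three pairs from the question to keep it short.

```python
# First three pairs from the question's dictionary.
I = {
    'informationGain_0': 0.9949486404805016, 'threshold_0': 5.0,
    'informationGain_1': 0.9757921620455572, 'threshold_1': 12.5,
    'informationGain_2': 0.7272727272727273, 'threshold_2': 11.5,
}

n = len(I) // 2  # number of (informationGain, threshold) pairs

# One tuple per index: (gain, threshold, index).
triples = [(I[f'informationGain_{i}'], I[f'threshold_{i}'], i) for i in range(n)]

# Tuples sort by their first element (the gain) first, so the largest
# gain ends up last; ties fall back to threshold, then index.
triples.sort()

best_gain, best_threshold, best_index = triples[-1]
print(best_gain, best_threshold, best_index)  # 0.9949486404805016 5.0 0
```

If only the best entry is needed, max(triples) returns the same tuple without sorting the whole list.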