Python pandas: mapping column names with Index.map and a dict, and handling missing values

I would like to apply the map function to the columns of a dataframe as follows:

import pandas as pd

d = {'one': [1, 2], 'two': [3, 4], 'three': [3, 3]}
df = pd.DataFrame(data=d)

recodes = {'one':'A', 'two':'B'}
c = df.columns.map(recodes)
c
#result: Index(['A', 'B', nan], dtype='object')

So far, so good.

Now I would like to apply another kind of dict, with values being tuples:

recodes2 = {'one':('A','H'), 'two':('AB','HB')}
c = df.columns.map(recodes2)
c

This does not work; it raises the following error:

TypeError: Expected tuple, got float

EXPECTED OUTPUT:

MultiIndex([(       'A',        'H'),
            (      'AB',       'HB'),
            ('_unknown', '_unknown')],
           )

The question is threefold:

First, why is this so? Why don't I get a (nan, nan) automatic default value?
Second, how can I avoid this problem?
Third, how can I include a kind of default value, for instance a tuple ("_unknown", "_unknown"), for values that are not part of the dictionary?

I am looking for a more Pythonic answer than collecting the column names and modifying the dict to include default values for all keys not originally present in it.

A POSSIBLE SOLUTION IS:

d = {'one': [1, 2], 'two': [3, 4],'three':[6,7],'four':[8,9]}
df = pd.DataFrame(data=d)

# original recodes dict
recodes3 = {'one':('one','A','H'), 'two':('two','AB','HB')}

# complete the recodes dict: find the columns without an entry
missing_values = [col for col in df.columns if col not in recodes3]
print(missing_values)
recodes_missing = {k: (k, '_unknown', '_unknown') for k in missing_values}
# merge the original entries with the default entries
recodes4 = {**recodes3, **recodes_missing}
print(recodes4)
c = df.columns.map(recodes4)
c

But there should be a way to handle missing values directly in the pandas map function (I guess).

CodePudding user response：

First question

Why don't I get a (nan,nan) automatic default value.

The function is Index.map(mapper, na_action=None). For the default behaviour when a value is not present as a key in the dict, compare the documentation for pd.Series.map:

When arg [i.e. == mapper inside Index.map] is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN.

So this is just what the function does: it converts the missing key to NaN. It doesn't "care" whether the other keys are mapped to tuples or to any other type. The error occurs further down the road: since the mapped values are tuples, pandas tries to build a MultiIndex by calling MultiIndex.from_tuples(new_values, names=names), where new_values looks like [('A','H'), ('AB','HB'), np.nan]. The last element is a float (NaN) rather than a tuple, hence "TypeError: Expected tuple, got float".
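
To make the intermediate values visible, here is a minimal sketch that maps the same labels through Series.map, which does not attempt to build a MultiIndex afterwards. The names cols and mapped are only illustrative, and the exact error text may differ between pandas versions.

```python
import pandas as pd

cols = pd.Index(["one", "two", "three"])
recodes2 = {"one": ("A", "H"), "two": ("AB", "HB")}

# Series.map leaves the result as a plain object Series, so the
# mixture of tuples and NaN is visible instead of raising an error.
mapped = pd.Series(cols).map(recodes2)
for label, value in zip(cols, mapped):
    print(label, "->", repr(value), type(value).__name__)

# Index.map goes one step further and tries to turn these values
# into a MultiIndex, which is where the NaN entry gets rejected.
try:
    cols.map(recodes2)
except TypeError as exc:
    print("TypeError:", exc)
```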

Second and third question

[H]ow to avoid this problem

[H]ow to include a kind of default value

Let's take these two together, because the way to avoid the problem is precisely to supply a default value (of sorts). Here are three options; the third one is an alternative to map.

Option 1

Instead of using df.columns.map(recodes2), pass a lambda function that calls dict.get, which accepts a default value to return if the key is missing from the dict:

c = df.columns.map(lambda x: recodes2.get(x,("_unknown","_unknown")))
c

MultiIndex([(       'A',        'H'),
            (      'AB',       'HB'),
            ('_unknown', '_unknown')],
           )
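
If the same pattern is needed in several places, Option 1 could be wrapped in a small function. The sketch below is only one way to do that; map_index_with_default is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def map_index_with_default(index, mapping, default):
    """Hypothetical helper: map labels via `mapping`, falling back to `default`."""
    # dict.get returns `default` for any label that is not a key of `mapping`
    return index.map(lambda label: mapping.get(label, default))


df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [3, 3]})
recodes2 = {"one": ("A", "H"), "two": ("AB", "HB")}

new_columns = map_index_with_default(
    df.columns, recodes2, ("_unknown", "_unknown")
)
print(new_columns)
```

Unlike the defaultdict approach, this leaves recodes2 itself untouched.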

Option 2

Instead of using a regular dict, use a defaultdict. When a missing key is looked up, this dictionary-like object adds the key to the dict with a default value, and that value then gets used inside map. E.g. we can do as follows:

from collections import defaultdict
recodes2 = {'one':('A','H'), 'two':('AB','HB')}

def def_value():
    return ("_unknown","_unknown")

# create a `defaultdict` and fill it with the entries of `recodes2` via `update`
my_def_dict = defaultdict(def_value)
my_def_dict.update(recodes2)

print(my_def_dict)

defaultdict(<function def_value at 0x000001F1050F0CA0>, 
            {'one': ('A', 'H'), 
             'two': ('AB', 'HB')})

c = df.columns.map(my_def_dict)
print(c)

MultiIndex([(       'A',        'H'),
            (      'AB',       'HB'),
            ('_unknown', '_unknown')],
           )

# note that the missing key has now been added to the dict as well; this may or may not be desirable
print(my_def_dict)

defaultdict(<function def_value at 0x000001F105124280>, 
            {'one': ('A', 'H'), 
             'two': ('AB', 'HB'), 
             'three': ('_unknown', '_unknown')})
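
If storing the missing keys is not wanted, a variation on Option 2 is a dict subclass that defines `__missing__` and returns the default without inserting anything. The pandas documentation for Series.map states that a dict subclass defining `__missing__` has its default used rather than NaN; this sketch assumes Index.map treats such a mapper the same way. The class name DefaultingDict is made up for illustration.

```python
import pandas as pd


class DefaultingDict(dict):
    """Illustrative dict subclass: returns a fixed default for missing keys."""

    def __init__(self, *args, default, **kwargs):
        super().__init__(*args, **kwargs)
        self._default = default

    def __missing__(self, key):
        # Unlike defaultdict, the key is NOT stored in the dict.
        return self._default


df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [3, 3]})
recodes = DefaultingDict(
    {"one": ("A", "H"), "two": ("AB", "HB")},
    default=("_unknown", "_unknown"),
)

c = df.columns.map(recodes)
print(c)
print(dict(recodes))  # still only the two original keys
```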

Option 3

Instead of relying on map, we can also just use pd.MultiIndex.from_tuples on a list of tuples, created with a list comprehension. The benefit here is that we can increment the quasi-default values (or: stub names; e.g. f'_unknown_{int}'), so that you end up with unique column names also for the values that are not present in the dict. E.g.:

import numpy as np

# let's add another value, `four`, to `df.columns`
d = {'one': [1, 2], 'two': [3, 4], 'three':[3,3], 'four':[4,4]}
df = pd.DataFrame(data=d)

# create a list from which to `pop` the first value consecutively
ints = list(np.repeat([*range(1, len(df.columns) + 1)], 2))
# [1, 1, 2, 2, 3, 3, 4, 4]

# use list comprehension inside `MultiIndex.from_tuples`
c = pd.MultiIndex.from_tuples([recodes2[col] 
                               if col in recodes2 
                               else (f'_unknown_{ints.pop(0)}',
                                     f'_unknown_{ints.pop(0)}') 
                               for col in df.columns])
c

MultiIndex([(         'A',          'H'),
            (        'AB',         'HB'),
            ('_unknown_1', '_unknown_1'),
            ('_unknown_2', '_unknown_2')],
           )
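
As a variation on Option 3, the running number can also come from itertools.count, which avoids building and popping from a helper list. This is only a sketch of the same idea with a plain loop; the names counter and tuples are illustrative.

```python
from itertools import count

import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [3, 3], "four": [4, 4]})
recodes2 = {"one": ("A", "H"), "two": ("AB", "HB")}

counter = count(1)  # yields 1, 2, 3, ... on demand
tuples = []
for col in df.columns:
    if col in recodes2:
        tuples.append(recodes2[col])
    else:
        # one fresh number per unknown column, used for both levels
        n = next(counter)
        tuples.append((f"_unknown_{n}", f"_unknown_{n}"))

c = pd.MultiIndex.from_tuples(tuples)
print(c)
```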