Given a pandas DataFrame and a dictionary mapping its values, the values can be replaced using the dictionary like so:
In: df
Out:
  Col1 Col2
0    a    c
1    b    c
2    b    c

In: key
Out: {'a': 1, 'b': 2, 'c': 3}

In: df.replace(key)
Out:
   Col1  Col2
0     1     3
1     2     3
2     2     3
How can a similar transformation be accomplished when the mapping dictionary has lists as values? For example:
In: key
Out: {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}

In: df.replace(key)
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 1 output values where the mask is true
In this example, the end goal would be to have a new data frame that has 3 rows and 6 columns:
1 0 0 0 0 1
0 1 0 0 0 1
0 1 0 0 0 1
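For reference, a minimal sketch reproducing the setup above (values taken from the frames shown):

import pandas as pd

# the frame and the list-valued mapping from the question
df = pd.DataFrame({'Col1': ['a', 'b', 'b'], 'Col2': ['c', 'c', 'c']})
key = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}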
CodePudding user response:
IIUC, you can applymap, then explode, then reshape:
# map each cell to its list, then explode the lists into one row per element
df2 = df.applymap(key.get).explode(list(df.columns))

# number the exploded rows within each original row,
# then pivot that number into the column level
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
      )
output:

  Col1       Col2
     0  1  2    0  1  2
0    1  0  0    0  0  1
1    0  1  0    0  0  1
2    0  1  0    0  0  1
NB. to reset the columns: df2.columns = range(df2.shape[1])

   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
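NB2. assuming pandas >= 2.1, where applymap is deprecated in favor of DataFrame.map, the same approach would read:

df2 = df.map(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
      )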
CodePudding user response:
You can use a combination of DataFrame.apply and Series.map to perform this substitution. From there, you can use DataFrame.sum to concatenate the lists, and then cast your data back into a new DataFrame:
# map each column through `key`, concatenate the lists row-wise with sum,
# and rebuild a frame from the resulting list of lists
out = pd.DataFrame(
    df.apply(lambda s: s.map(key)).sum(axis=1).tolist()
)

print(out)
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
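The reason .sum concatenates here is that the object-dtype row is reduced with +, and for Python lists + is concatenation. A tiny illustration:

print([1, 0, 0] + [0, 0, 1])  # [1, 0, 0, 0, 0, 1]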
Semi-related: timing .sum vs chain.from_iterable:
In [22]: %timeit tmp_df.sum(axis=1)
77.6 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [23]: %timeit tmp_df.apply(lambda row: list(chain.from_iterable(row)), axis=1)
197 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [24]: tmp_df
Out[24]:
        Col1       Col2
0  [1, 0, 0]  [0, 0, 1]
1  [0, 1, 0]  [0, 0, 1]
2  [0, 1, 0]  [0, 0, 1]
While I won't say that .sum is the best method for concatenating lists in a Series, .apply with chain.from_iterable doesn't seem to fare much better, at least on a very small sample like this.
CodePudding user response:
Hmm, this is tricky.
One solution I came up with is to convert the lists to their string representation before replacing with them, because pandas treats lists specially. Then you can use itertools.chain.from_iterable on each row to combine all the lists into one big list, and create a dataframe out of that:
import ast
from itertools import chain

# replace with string representations of the lists, then parse them back
n = df.replace({k: str(v) for k, v in key.items()}).applymap(ast.literal_eval)

# flatten each row's lists into one long row and rebuild the frame
df = pd.DataFrame(n.apply(lambda x: list(chain.from_iterable(x)), axis=1).tolist())
Output:

>>> df
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
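A quick illustration of the round-trip this relies on (str to serialize, ast.literal_eval to safely parse back):

import ast

s = str([1, 0, 0])          # '[1, 0, 0]'
print(ast.literal_eval(s))  # [1, 0, 0]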
CodePudding user response:
Here's a method of replacing the items with lists without looping or stringifying:
# build a lookup Series from the dict, index it with every cell value,
# and write the result back in the original shape
df[:] = pd.Series(key)[df.to_numpy().flatten()].to_numpy().reshape(df.shape)
Output:

>>> df
        Col1       Col2
0  [1, 0, 0]  [0, 0, 1]
1  [0, 1, 0]  [0, 0, 1]
2  [0, 1, 0]  [0, 0, 1]
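To see the mechanics, here is the lookup step in isolation, run against the original df from the question (i.e. before the in-place assignment above). pd.Series(key) is indexed by 'a', 'b' and 'c', so indexing it with the flattened cell values performs the replacement:

lookup = pd.Series(key)
print(lookup[df.to_numpy().flatten()].tolist())
# [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]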
Or, you can use explode and reshape to convert the data directly to a numpy array:
# 6 = length of one of the lists in `key` * number of columns in df
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, 6)
Output:

>>> arr
array([[1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1]], dtype=object)

>>> pd.DataFrame(arr)
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
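If you'd rather not hard-code the 6, the width can be derived from the inputs (a sketch, assuming every list in key has the same length):

# derive the target width from the mapping and the frame
width = df.shape[1] * len(next(iter(key.values())))
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, width)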