Given a pandas DataFrame and a dictionary mapping its values, the values can be replaced using the dictionary like so:
In: df
Out:
  Col1 Col2
0    a    c
1    b    c
2    b    c

In: key
Out: {'a': 1, 'b': 2, 'c': 3}

In: df.replace(key)
Out:
   Col1  Col2
0     1     3
1     2     3
2     2     3
How can a similar transformation be accomplished when the mapping dictionary has lists as values? For example:
In: key
Out: {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}

In: df.replace(key)
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 1 output values where the mask is true
In this example, the end goal would be to have a new data frame that has 3 rows and 6 columns:
1 0 0 0 0 1
0 1 0 0 0 1
0 1 0 0 0 1
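For reference, a minimal sketch reproducing the setup above (values taken from the frames shown):

import pandas as pd

# the frame and the list-valued mapping from the question
df = pd.DataFrame({'Col1': ['a', 'b', 'b'], 'Col2': ['c', 'c', 'c']})
key = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}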
CodePudding user response:
IIUC, you can applymap, then explode, then reshape:
# map each cell to its list, then explode the lists into one row per element
df2 = df.applymap(key.get).explode(list(df.columns))

# number the exploded rows within each original row,
# then pivot that number into the column level
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
      )
output:

  Col1       Col2
     0  1  2    0  1  2
0    1  0  0    0  0  1
1    0  1  0    0  0  1
2    0  1  0    0  0  1
NB. to reset the columns: df2.columns = range(df2.shape[1])

   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
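NB2. assuming pandas >= 2.1, where applymap is deprecated in favor of DataFrame.map, the same approach would read:

df2 = df.map(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
      )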
CodePudding user response:
You can use a combination of DataFrame.apply and Series.map to perform this substitution. From there, you can use DataFrame.sum to concatenate the lists, and then cast your data back into a new DataFrame:
# map each column through `key`, concatenate the lists row-wise with sum,
# and rebuild a frame from the resulting list of lists
out = pd.DataFrame(
    df.apply(lambda s: s.map(key)).sum(axis=1).tolist()
)

print(out)
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
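The reason .sum concatenates here is that the object-dtype row is reduced with +, and for Python lists + is concatenation. A tiny illustration:

print([1, 0, 0] + [0, 0, 1])  # [1, 0, 0, 0, 0, 1]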
Semi-related: timing .sum vs chain.from_iterable:
In [22]: %timeit tmp_df.sum(axis=1)
77.6 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [23]: %timeit tmp_df.apply(lambda row: list(chain.from_iterable(row)), axis=1)
197 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [24]: tmp_df
Out[24]:
        Col1       Col2
0  [1, 0, 0]  [0, 0, 1]
1  [0, 1, 0]  [0, 0, 1]
2  [0, 1, 0]  [0, 0, 1]
While I won't say that .sum is the best method for concatenating lists in a Series, .apply with chain.from_iterable doesn't seem to fare much better, at least on a very small sample like this.
CodePudding user response:
Hmm, this is tricky.
One solution I came up with is to convert the lists to their string representation before replacing with them, because pandas treats lists specially. Then you can use itertools.chain.from_iterable on each row to combine all the lists into one big list, and create a dataframe out of that:
import ast
from itertools import chain

# replace with string representations of the lists, then parse them back
n = df.replace({k: str(v) for k, v in key.items()}).applymap(ast.literal_eval)

# flatten each row's lists into one long row and rebuild the frame
df = pd.DataFrame(n.apply(lambda x: list(chain.from_iterable(x)), axis=1).tolist())
Output:

>>> df
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
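A quick illustration of the round-trip this relies on (str to serialize, ast.literal_eval to safely parse back):

import ast

s = str([1, 0, 0])          # '[1, 0, 0]'
print(ast.literal_eval(s))  # [1, 0, 0]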
CodePudding user response:
Here's a method of replacing the items with lists without looping or stringifying:
# build a lookup Series from the dict, index it with every cell value,
# and write the result back in the original shape
df[:] = pd.Series(key)[df.to_numpy().flatten()].to_numpy().reshape(df.shape)
Output:

>>> df
        Col1       Col2
0  [1, 0, 0]  [0, 0, 1]
1  [0, 1, 0]  [0, 0, 1]
2  [0, 1, 0]  [0, 0, 1]
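To see the mechanics, here is the lookup step in isolation, run against the original df from the question (i.e. before the in-place assignment above). pd.Series(key) is indexed by 'a', 'b' and 'c', so indexing it with the flattened cell values performs the replacement:

lookup = pd.Series(key)
print(lookup[df.to_numpy().flatten()].tolist())
# [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]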
Or, you can use explode and reshape to convert the data directly to a numpy array:
# 6 = length of one of the lists in `key` * number of columns in df
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, 6)
Output:

>>> arr
array([[1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1]], dtype=object)

>>> pd.DataFrame(arr)
   0  1  2  3  4  5
0  1  0  0  0  0  1
1  0  1  0  0  0  1
2  0  1  0  0  0  1
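If you'd rather not hard-code the 6, the width can be derived from the inputs (a sketch, assuming every list in key has the same length):

# derive the target width from the mapping and the frame
width = df.shape[1] * len(next(iter(key.values())))
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, width)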