This post features a dataframe with an asterisk '*'
in some cells that would serve as a placeholder (or wildcard) for all values in the column. For example, the dataframe below,
unique_values = list('ABC')
df = pd.DataFrame({0: list('AB*C*'),
1: np.arange(5) 1})
# df
0 1
0 A 1
1 B 2
2 * 3
3 C 4
4 * 5
would be equivalent to:
# df2
0 1
0 A 1
1 B 2
2 A 3
2 B 3
2 C 3
3 C 4
4 A 5
4 B 5
4 C 5
For further processing, it makes sense to make the placeholder explicit by replacing it with all the values it stands for.
My best attempt is to replace *
with unique_values
, then explode
the column.
As neither .replace('*', unique_values)
nor .map({'*': unique_values})
would work (because the replacement is a list
), I came up with:
df2 = df.applymap(lambda x: unique_values if x == '*' else x).explode(0)
This is good enough for a single column, but it doesn't quite generalize to multiple columns; e.g., to turn
df = pd.DataFrame({0: list('AB*C*'),
1: np.arange(5) 1,
2: list('*X*YZ')})
# df
0 1 2
0 A 1 *
1 B 2 X
2 * 3 *
3 C 4 Y
4 * 5 Z
into
0 1 2
0 A 1 X
0 A 1 Y
0 A 1 Z
1 B 2 X
2 A 3 X
2 A 3 Y
2 A 3 Z
2 B 3 X
2 B 3 Y
2 B 3 Z
2 C 3 X
2 C 3 Y
2 C 3 Z
3 C 4 Y
4 A 5 Z
4 B 5 Z
4 C 5 Z
it's possible to replace the asterisk in each column with the respective unique_values
and then explode, again, each column in turn,
def expand_wildcard(col):
unique_values = [x for x in col.unique() if x != '*']
return col.apply(lambda x: unique_values if x == '*' else x)
df.apply(expand_wildcard).explode(0).explode(2)
which would become increasingly unwieldy. Is there a more pandaesque/vectorized way?
CodePudding user response:
A simple method could be to replace with a set
and use functools.reduce
:
df2 = df.replace(pd.Series({c: {'*': set(df[c]).difference(['*'])} for c in df}))
from functools import reduce
df3 = reduce(lambda a,b: a.join(b), (df2[[c]].explode(c) for c in df2))
output:
0 1 2
0 A 1 Y
0 A 1 Z
0 A 1 X
1 B 2 X
2 A 3 Y
2 A 3 Z
2 A 3 X
2 C 3 Y
2 C 3 Z
2 C 3 X
2 B 3 Y
2 B 3 Z
2 B 3 X
3 C 4 Y
4 A 5 Z
4 C 5 Z
4 B 5 Z