Home > other >  How to expand a placeholder '*' with all possible values in a dataframe?
How to expand a placeholder '*' with all possible values in a dataframe?

Time:05-11

This post features a dataframe with an asterisk '*' in some cells that would serve as a placeholder (or wildcard) for all values in the column. For example, the dataframe below,

unique_values = list('ABC')
df = pd.DataFrame({0: list('AB*C*'), 
                   1: np.arange(5) 1})

# df
   0  1
0  A  1
1  B  2
2  *  3
3  C  4
4  *  5

would be equivalent to:

# df2
   0  1
0  A  1
1  B  2
2  A  3
2  B  3
2  C  3
3  C  4
4  A  5
4  B  5
4  C  5

For further processing, it makes sense to make the placeholder explicit by replacing it with all the values it stands for.

My best attempt is to replace * with unique_values, then explode the column. As neither .replace('*', unique_values) nor .map({'*': unique_values}) would work (because the replacement is a list), I came up with:

df2 = df.applymap(lambda x: unique_values if x == '*' else x).explode(0)

This is good enough for a single column, but it doesn't quite generalize to multiple columns; e.g., to turn

df = pd.DataFrame({0: list('AB*C*'), 
                   1: np.arange(5) 1, 
                   2: list('*X*YZ')})

# df
   0  1  2
0  A  1  *
1  B  2  X
2  *  3  *
3  C  4  Y
4  *  5  Z

into


   0  1  2
0  A  1  X
0  A  1  Y
0  A  1  Z
1  B  2  X
2  A  3  X
2  A  3  Y
2  A  3  Z
2  B  3  X
2  B  3  Y
2  B  3  Z
2  C  3  X
2  C  3  Y
2  C  3  Z
3  C  4  Y
4  A  5  Z
4  B  5  Z
4  C  5  Z

it's possible to replace the asterisk in each column with the respective unique_values and then explode, again, each column in turn,

def expand_wildcard(col):
    unique_values = [x for x in col.unique() if x != '*']
    return col.apply(lambda x: unique_values if x == '*' else x)

df.apply(expand_wildcard).explode(0).explode(2)

which would become increasingly unwieldy. Is there a more pandaesque/vectorized way?

CodePudding user response:

A simple method could be to replace with a set and use functools.reduce:

df2 = df.replace(pd.Series({c: {'*': set(df[c]).difference(['*'])} for c in df}))
from functools import reduce
df3 = reduce(lambda a,b: a.join(b), (df2[[c]].explode(c) for c in df2))

output:

   0  1  2
0  A  1  Y
0  A  1  Z
0  A  1  X
1  B  2  X
2  A  3  Y
2  A  3  Z
2  A  3  X
2  C  3  Y
2  C  3  Z
2  C  3  X
2  B  3  Y
2  B  3  Z
2  B  3  X
3  C  4  Y
4  A  5  Z
4  C  5  Z
4  B  5  Z
  • Related