Home > Back-end >  Efficiently replace column value from a column of dict from pandas dict
Efficiently replace column value from a column of dict from pandas dict

Time:12-18

I would like help vectorizing my current code, any help or comments are appreiciated I have a df with a weird column that is derived from an availability checker function like this:

original_df = pd.DataFrame({
    'a':['a1', 'a2', 'a3', 'a4'],
    'b':['b1', 'b20', 'b98', 'b4'],
    'c':[{'a':'not_available', 'b': 'b1'}, {}, {'a':'a3', 'b': 'b98'}, {'a':'not_available', 'b': 'not_available'}],
})
a b c
a1 b1 {'a': 'not_available', 'b': 'b11'}
a2 b20 {}
a3 b98 {'a': 'a3', 'b': 'b98'}
a4 b4 {'a': 'not_available', 'b': 'not_available'}

I would like to transform the columns a and b based on the dictionary of column c So, the resulting DF looks something like this:

desired_df = pd.DataFrame({
    'a':['not_available', 'a2', 'a3', 'not_available'],
    'b':['b1', 'b20', 'b98', 'not_available']})
a b
not_available b1
a2 b20
a3 b98
not_available not_available

Some things to note: if the dict in column c is empty, leave the values as they are in other columns. The values in dict of c can only be the current value in the other colum or not_available.

for idx, row in original_df.iterrows():
    for key, value in row.c.items():
        original_df.loc[idx, key] = value

This is a downsampled scenario, the dict contains 8 columns and the df usually has 20-60 rows.

This is my current code and it works but it is very slow. This code is used in an API and my profiler tells me that this function consumes the highest cumulative time. Which makes sense since I'm iterating over everything and I was hoping to get some help!

Shubams answer has made this function go from 20secs to 0.208 seconds. Thank you!

CodePudding user response:

Create an intermediate dataframe from the dicts in column c then mask the values not equal to not_available and use fillna to fill the null values from original df

out = pd.DataFrame([*original_df['c']], original_df.index)
out = out[out == 'not_available'].fillna(original_df)

print(out)
               a              b
0  not_available             b1
1             a2            b20
2             a3            b98
3  not_available  not_available

CodePudding user response:

dict_series = original_df[['a', 'b']].apply(lambda x: dict(x), axis=1)
s = original_df['c']
out = s.where(s.astype(bool), dict_series).apply(lambda x: pd.Series(x))

out

    a               b
0   not_available   b1
1   a2              b20
2   a3              b98
3   not_available   not_available
  • Related