I would like help vectorizing my current code, any help or comments are appreiciated I have a df with a weird column that is derived from an availability checker function like this:
original_df = pd.DataFrame({
'a':['a1', 'a2', 'a3', 'a4'],
'b':['b1', 'b20', 'b98', 'b4'],
'c':[{'a':'not_available', 'b': 'b1'}, {}, {'a':'a3', 'b': 'b98'}, {'a':'not_available', 'b': 'not_available'}],
})
a | b | c |
---|---|---|
a1 | b1 | {'a': 'not_available', 'b': 'b11'} |
a2 | b20 | {} |
a3 | b98 | {'a': 'a3', 'b': 'b98'} |
a4 | b4 | {'a': 'not_available', 'b': 'not_available'} |
I would like to transform the columns a
and b
based on the dictionary of column c
So, the resulting DF looks something like this:
desired_df = pd.DataFrame({
'a':['not_available', 'a2', 'a3', 'not_available'],
'b':['b1', 'b20', 'b98', 'not_available']})
a | b |
---|---|
not_available | b1 |
a2 | b20 |
a3 | b98 |
not_available | not_available |
Some things to note: if the dict in column c
is empty, leave the values as they are in other columns.
The values in dict of c
can only be the current value in the other colum or not_available
.
for idx, row in original_df.iterrows():
for key, value in row.c.items():
original_df.loc[idx, key] = value
This is a downsampled scenario, the dict contains 8 columns and the df usually has 20-60 rows.
This is my current code and it works but it is very slow. This code is used in an API and my profiler tells me that this function consumes the highest cumulative time. Which makes sense since I'm iterating over everything and I was hoping to get some help!
Shubams answer has made this function go from 20secs to 0.208 seconds. Thank you!
CodePudding user response:
Create an intermediate dataframe from the dicts in column c
then mask the values not equal to not_available
and use fillna to fill the null values from original df
out = pd.DataFrame([*original_df['c']], original_df.index)
out = out[out == 'not_available'].fillna(original_df)
print(out)
a b
0 not_available b1
1 a2 b20
2 a3 b98
3 not_available not_available
CodePudding user response:
dict_series = original_df[['a', 'b']].apply(lambda x: dict(x), axis=1)
s = original_df['c']
out = s.where(s.astype(bool), dict_series).apply(lambda x: pd.Series(x))
out
a b
0 not_available b1
1 a2 b20
2 a3 b98
3 not_available not_available