I am trying to create a new data frame that compresses pre-existing columns from another data frame.
I am looking to turn something like this:
id | x1 | x2 | x3 | x4
-------------------------- ...
a | x1a | x2a | x3a | x4a
b | x1b | x2b | x3b | x4b
c | x1c | x2c | x3c | x4c
Into this:
id | z1 | z2
-------------------------------- ...
a | f1(x1a, x2a) | f2(x3a, x4a)
b | f1(x1b, x2b) | f2(x3b, x4b)
c | f1(x1c, x2c) | f2(x3c, x4c)
My current approach is to append to the new data frame row by row, like so:
for row in rows:
    new_row_map = get_new_row_map(df_in, row)
    df_out = df_out.append(new_row_map, ignore_index=True)
return df_out
I have been running this code for a couple hours now and it seems to be very inefficient. I was wondering if anyone had a quicker/more efficient approach here. Thanks!
CodePudding user response:
You're right, appending row by row to a data frame is very inefficient: every call to append returns a new data frame, so all the existing rows are copied again on each iteration. This is why pandas and numpy rely on vectorized operations to alter and access their data. Numpy and pandas store values in typed arrays with less per-element metadata than base Python objects carry, and vectorized operations run the calculation over every element in compiled code rather than iterating through the rows one at a time in Python. See Chapter 4 of Python for Data Analysis for a more thorough explanation (it's free online).
Rather than appending row by row, you need to apply a vectorized function to the whole data frame (meaning it alters the entire data frame at once instead of iterating over the rows). For instance:
import numpy as np
# examples of what f1 and f2 could be (the arithmetic is arbitrary)
def f1(df):
    result = (df["x1"] * df["x2"] + 4) + np.cos(df["x2"])
    return result
def f2(df):
    return df["x3"] - df["x4"] * 9.8
# each call operates on whole columns at once
df["z1"] = f1(df)
df["z2"] = f2(df)
# you could cut out the original columns like so (keeping id)
df = df[["id", "z1", "z2"]]
See this post about vectorizing a function, and this article
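To make the idea above concrete, here is a minimal self-contained sketch. The sample values, the name df_in, and the bodies of f1 and f2 are all assumptions made up for illustration; the real functions would replace them. It builds the output frame in one step from whole columns rather than row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric sample data standing in for the real input frame.
df_in = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x1": [1.0, 2.0, 3.0],
    "x2": [0.0, 1.0, 2.0],
    "x3": [10.0, 20.0, 30.0],
    "x4": [1.0, 2.0, 3.0],
})

def f1(x1, x2):
    # Placeholder arithmetic; each argument is a whole column (Series).
    return x1 * x2 + np.cos(x2)

def f2(x3, x4):
    # Placeholder arithmetic; no Python-level loop over rows.
    return x3 - x4 * 9.8

# Build the output frame once instead of appending row by row.
df_out = pd.DataFrame({
    "id": df_in["id"],
    "z1": f1(df_in["x1"], df_in["x2"]),
    "z2": f2(df_in["x3"], df_in["x4"]),
})
print(df_out)
```

Passing the columns in as arguments, rather than the whole frame, is just one possible design; it keeps f1 and f2 independent of the column names.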
CodePudding user response:
You can use:
def f1(row):
    # do stuff here, just return a string for demo
    return f"f({', '.join(row)})"
def f2(row):
    # do stuff here, just return a string for demo
    return f"f({', '.join(row)})"
df['z1'] = df[['x1', 'x2']].apply(f1, axis=1)
df['z2'] = df[['x3', 'x4']].apply(f2, axis=1)
Output:
  id   x1   x2   x3   x4           z1           z2
0  a  x1a  x2a  x3a  x4a  f(x1a, x2a)  f(x3a, x4a)
1  b  x1b  x2b  x3b  x4b  f(x1b, x2b)  f(x3b, x4b)
2  c  x1c  x2c  x3c  x4c  f(x1c, x2c)  f(x3c, x4c)
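Note that apply with axis=1 still calls the function once per row in Python, so it is tidier than repeated append but not vectorized. Also, DataFrame.append was removed in pandas 2.0, so the original loop will not run on current versions. If the per-row logic really cannot be written as column arithmetic, one option is to collect the new rows in a plain list and build the data frame once at the end. The sketch below assumes a hypothetical helper named combine in place of the real per-row logic, and made-up sample data.

```python
import pandas as pd

# Hypothetical sample data matching the layout in the question.
df_in = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x1": ["x1a", "x1b", "x1c"],
    "x2": ["x2a", "x2b", "x2c"],
    "x3": ["x3a", "x3b", "x3c"],
    "x4": ["x4a", "x4b", "x4c"],
})

def combine(left, right):
    # Hypothetical stand-in for per-row logic that cannot be vectorized.
    return f"f({left}, {right})"

# Collect plain dicts in a list; appending to a list is cheap.
rows = []
for row in df_in.itertuples(index=False):
    rows.append({
        "id": row.id,
        "z1": combine(row.x1, row.x2),
        "z2": combine(row.x3, row.x4),
    })

# Construct the data frame a single time from the collected rows.
df_out = pd.DataFrame(rows, columns=["id", "z1", "z2"])
print(df_out)
```

This still loops in Python, but the existing rows are not copied again on every iteration.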