Modifying columns from existing data frame into new data frame

I am trying to create a new data frame that compresses pre-existing columns from another data frame.

I am looking to turn something like this:

id | x1  | x2  | x3  | x4
-------------------------- ...
a  | x1a | x2a | x3a | x4a
b  | x1b | x2b | x3b | x4b
c  | x1c | x2c | x3c | x4c

Into this:

id |     z1       |      z2
-------------------------------- ...
a  | f1(x1a, x2a) | f2(x3a, x4a) 
b  | f1(x1b, x2b) | f2(x3b, x4b) 
c  | f1(x1c, x2c) | f2(x3c, x4c)

My current approach has been to continuously just append row by row to the new data frame. Like so:

for row in rows:
   new_row_map = get_new_row_map(df_in, row)
   df_out = df_out.append(new_row_map, ignore_index=True) 
return df_out

I have been running this code for a couple hours now and it seems to be very inefficient. I was wondering if anyone had a quicker/more efficient approach here. Thanks!

CodePudding user response：

You're right, appending row by row to a data frame is very inefficient: every append call returns a new data frame, so the existing rows are copied again on each iteration. This is why pandas and NumPy rely on vectorized operations to access and alter their data. NumPy and pandas store values in typed arrays with far less per-element metadata than base Python objects carry, and a vectorized operation runs the per-element loop in compiled code instead of stepping through each row in the Python interpreter. See Chapter 4 of Python for Data Analysis for a more thorough explanation (it's free online).
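To make the difference concrete, here is a minimal sketch, assuming a small numeric frame with made-up values (the data and the names `row_by_row` and `vectorized` are illustrative, not from the question). It computes the same column twice, once with a Python-level loop over rows and once with a single expression over whole columns, and checks that the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; the question's real values are not shown.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

# Row by row: one Python-level iteration per row.
row_by_row = [row.x1 * row.x2 + 4 for row in df.itertuples(index=False)]

# Vectorized: one expression over whole columns; the loop runs in compiled code.
vectorized = df["x1"] * df["x2"] + 4

# Both approaches produce the same values.
print(np.allclose(row_by_row, vectorized))
```

On three rows the speed difference is invisible; it only becomes noticeable as the row count grows.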

Rather than appending row by row, apply a vectorized function to the data frame, meaning a function that takes whole columns and computes the result for every row in one call instead of iterating over the rows. For instance:

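import numpy as np

# examples of what f1 and f2 could be
# (arbitrary arithmetic for illustration; substitute your own formulas)
def f1(df):
    return df["x1"] * df["x2"] + 4 + np.cos(df["x2"])

def f2(df):
    return df["x3"] - df["x4"] * 9.8

# each call computes a whole column at once
df["z1"] = f1(df)
df["z2"] = f2(df)

# you could cut out the original columns like so (keeping id)
df = df[["id", "z1", "z2"]]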

See this post about vectorizing a function, and this article
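If the goal is a separate output frame rather than new columns on the input, a sketch along the following lines may help. It assumes numeric columns and uses placeholder formulas for `f1` and `f2`; the sample values and the names `df_in` and `df_out` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric input shaped like the table in the question.
df_in = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x1": [1.0, 2.0, 3.0],
    "x2": [0.0, 1.0, 2.0],
    "x3": [10.0, 20.0, 30.0],
    "x4": [1.0, 2.0, 3.0],
})

def f1(df):
    # Placeholder formula; substitute the real computation.
    return df["x1"] * df["x2"] + 4 + np.cos(df["x2"])

def f2(df):
    # Placeholder formula; substitute the real computation.
    return df["x3"] - df["x4"] * 9.8

# Build the output frame in one step; df_in itself is not modified.
df_out = pd.DataFrame({"id": df_in["id"], "z1": f1(df_in), "z2": f2(df_in)})
print(df_out)
```

Building `df_out` from a dict of columns avoids both the row-by-row append and the need to drop the original columns afterwards.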

CodePudding user response：

You can use:

def f1(row):
    # do stuff here, just return a string for demo
    return f"f({', '.join(row)})"
    
def f2(row):
    # do stuff here, just return a string for demo
    return f"f({', '.join(row)})"

df['z1'] = df[['x1', 'x2']].apply(f1, axis=1)
df['z2'] = df[['x3', 'x4']].apply(f2, axis=1)

Output:

  id   x1   x2   x3   x4           z1           z2
0  a  x1a  x2a  x3a  x4a  f(x1a, x2a)  f(x3a, x4a)
1  b  x1b  x2b  x3b  x4b  f(x1b, x2b)  f(x3b, x4b)
2  c  x1c  x2c  x3c  x4c  f(x1c, x2c)  f(x3c, x4c)
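Note that `apply` with `axis=1` still calls the function once per row in Python, so it is convenient but not vectorized. As a rough sketch, assuming the columns hold strings as in the demo above, the same result can be built column-wise with plain string concatenation (the names `via_apply` and `via_columns` are illustrative):

```python
import pandas as pd

# Same string data as the demo output above (x3 and x4 omitted for brevity).
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x1": ["x1a", "x1b", "x1c"],
    "x2": ["x2a", "x2b", "x2c"],
})

def f1(row):
    # Demo function from the answer: joins the row's values into one string.
    return f"f({', '.join(row)})"

# Row-wise: f1 is called once for every row.
via_apply = df[["x1", "x2"]].apply(f1, axis=1)

# Column-wise: concatenation is applied to whole columns at once.
via_columns = "f(" + df["x1"] + ", " + df["x2"] + ")"

# The two approaches give the same strings.
print(via_apply.tolist() == via_columns.tolist())
```

Whether a column-wise form exists depends on what the real `f1` and `f2` do; when it does not, `apply` is still far better than appending rows one at a time.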