How do I add a column to a dataframe based on values from other columns?-CodePudding

I have a dataframe and I would like to add a column based on the values of the other columns.

If the problem were only that, I think this answer would be a good solution. However, my problem is a bit more complicated.

Say I have

import pandas as pd
a= pd.DataFrame([[5,6],[1,2],[3,6],[4,1]],columns=['a','b'])
print(a)

which prints

   a  b
0  5  6
1  1  2
2  3  6
3  4  1

Now I want to add a column called 'result' where each of the values would be the result of applying this function

def process(a,b,c,d):
    return {"notthisone":2*a,
            "thisone":(a*b + c*d),
            }

to each row together with the row that follows it in the dataframe.

This function is part of a library. It outputs two values, but we are only interested in the value under the key thisone. Also, if possible, we should not decompose the operations of the function; we have to call it on the values.

For example, in the first row a=5, b=6, c=1, d=2 (c and d being the a and b of the next row), and we want the value of "thisone", so 5*6 + 1*2 = 32.

In the end I will have

   a  b  result
0  5  6   32
1  1  2   20
2  3  6   22
3  4  1   22 --> This is a special case: there is no next row, so just repeating the previous result would be fine

How can I do this?

I am thinking of traversing the dataframe with a loop but there must be a better and faster way...

EDIT:

I have done this so far

def p4(a,b):
    return {"notthisone":2*a,
            "thisone":(a*b),
            }

print(a.apply(lambda row: p4(row.a,row.b)["thisone"], axis=1))

and the result is

0    30
1     2
2    18
3     4
dtype: int64

So now I have to think of a way to incorporate next row values too

CodePudding user response：

If you only need the values of the very next row, I think it would be best to shift these values back into the current row (with different column names). Then they can all be accessed by row-wise apply(fn, axis=1).

import pandas as pd

# library function
def process(a, b, c, d):
    return {
        "notthisone": 2 * a,
        "thisone": (a * b + c * d),
    }

# toy data
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], columns=["a", "b"])
# shift some data back one row
df[["c", "d"]] = df[["a", "b"]].shift(-1)
# apply your function row-wise
df["result"] = df.apply(
    lambda x: process(x["a"], x["b"], x["c"], x["d"])["thisone"], axis=1
)

Result:

    a   b   c   d   result
0   1.0 2.0 3.0 4.0 14.0
1   3.0 4.0 5.0 6.0 42.0
2   5.0 6.0 7.0 8.0 86.0
3   7.0 8.0 NaN NaN NaN
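
To match the expected output in the question, where the last row repeats the previous result, the missing value that the shift leaves in the last row can be forward-filled. The following is a minimal sketch on the question's data; the `process` defined here is only a stand-in for the library function, and the name `nxt` is made up for illustration.

```python
import pandas as pd

# stand-in for the library function from the question (assumed behaviour)
def process(a, b, c, d):
    return {"notthisone": 2 * a, "thisone": a * b + c * d}

df = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=["a", "b"])

# next-row values aligned with the current row (NaN in the last row)
nxt = df[["a", "b"]].shift(-1).rename(columns={"a": "c", "b": "d"})

# call the function row-wise on the current and next-row values
result = pd.concat([df, nxt], axis=1).apply(
    lambda r: process(r["a"], r["b"], r["c"], r["d"])["thisone"], axis=1
)

# the last row has no successor, so repeat the previous result
df["result"] = result.ffill().astype(int)
print(df)
```

Building the shifted columns in a temporary frame keeps the helper columns c and d out of the final dataframe.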

CodePudding user response：

Use the iloc accessor to select each row, turn it into a NumPy array, and add its product to the product of the next row. I used a list comprehension in this case. The last row will be null, so fill the resulting column (here with ffill). We could fill at the df level, but that could affect other columns if the df is large and has nulls. Code below.

import numpy as np

a = a.assign(x=pd.Series([np.prod(a.iloc[x].to_numpy()) + np.prod(a.iloc[x + 1].to_numpy()) for x in np.arange(len(a)) if x != len(a) - 1]))
a = a.assign(x=a['x'].ffill())



    a   b   x
0   5   6   32.0
1   1   2   20.0
2   3   6   22.0
3   4   1   22.0
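
If decomposing the formula is acceptable, as in the code above, the same product-and-sum can probably be written without a Python-level loop. This is only a sketch on the question's data: it assumes the result is exactly a*b of the current row plus a*b of the next row, so it does not call the library function at all. The name `row_prod` is made up for illustration.

```python
import pandas as pd

a = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=["a", "b"])

# product of the two values in each row
row_prod = a["a"] * a["b"]

# add the next row's product; the last row becomes NaN and is forward-filled
a["x"] = (row_prod + row_prod.shift(-1)).ffill()
print(a)
```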

CodePudding user response：

Push the results to an array, then write the array to a column of the CSV/XLSX file, etc.
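
One way to read this suggestion is as a plain loop that collects the results in a list and assigns the list as a new column. Here is a minimal sketch on the question's data; `process` is a stand-in for the library function, and the sketch assumes the dataframe has at least two rows.

```python
import pandas as pd

# stand-in for the library function from the question (assumed behaviour)
def process(a, b, c, d):
    return {"notthisone": 2 * a, "thisone": a * b + c * d}

df = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=["a", "b"])

# plain (a, b) tuples, one per row
rows = list(df[["a", "b"]].itertuples(index=False, name=None))

results = []
# pair each row with the row that follows it
for (a, b), (c, d) in zip(rows, rows[1:]):
    results.append(process(a, b, c, d)["thisone"])

# the last row has no successor, so repeat the previous result
results.append(results[-1])

df["result"] = results
print(df)
# df.to_csv("result.csv", index=False)  # optional export; the file name is hypothetical
```

This calls the function once per row pair, the same as the apply-based answer, so it is unlikely to be faster; it mainly keeps the special case for the last row explicit.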