I have a dataframe and I would like to add a column based on the values of the other columns.
If the problem were only that, I think a good solution would be this answer. However, my problem is a bit more complicated.
Say I have
import pandas as pd
a= pd.DataFrame([[5,6],[1,2],[3,6],[4,1]],columns=['a','b'])
print(a)
which prints
   a  b
0  5  6
1  1  2
2  3  6
3  4  1
Now I want to add a column called 'result' where each of the values would be the result of applying this function
def process(a,b,c,d):
    return {"notthisone":2*a,
            "thisone":(a*b + c*d),
            }
to each row together with the row that follows it.
This function is part of a library. It outputs two values, but we are only interested in the value of the key thisone.
Also, we cannot decompose the operations of the function; we have to call it as is on the values.
For example in the first row
a=5,b=6,c=1,d=2
(c and d being the a and b of the next row) and we want to add the value "thisone", so 5*6 + 1*2 = 32
In the end I will have
   a  b  result
0  5  6      32
1  1  2      20
2  3  6      22
3  4  1      22  --> This is a special case: there is no next row, so repeating the previous value would be fine
How can I do this?
I am thinking of traversing the dataframe with a loop but there must be a better and faster way...
EDIT:
I have done this so far
def p4(a,b):
    return {"notthisone":2*a,
            "thisone":(a*b),
            }
print(a.apply(lambda row: p4(row.a,row.b)["thisone"], axis=1))
and the result is
0 30
1 2
2 18
3 4
dtype: int64
So now I have to think of a way to incorporate next row values too
CodePudding user response:
If you only need the values of the very next row, I think it would be best to shift these values back into the current row (with different column names). Then they can all be accessed by a row-wise apply(fn, axis=1).
import pandas as pd

# library function
def process(a, b, c, d):
    return {
        "notthisone": 2 * a,
        "thisone": (a * b + c * d),
    }

# toy data
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], columns=["a", "b"])
# shift the next row's a and b back one row, into new columns c and d
df[["c", "d"]] = df[["a", "b"]].shift(-1)
# apply your function row-wise and keep only the "thisone" value
df["result"] = df.apply(
    lambda x: process(x["a"], x["b"], x["c"], x["d"])["thisone"], axis=1
)
Result:
     a    b    c    d  result
0  1.0  2.0  3.0  4.0    14.0
1  3.0  4.0  5.0  6.0    42.0
2  5.0  6.0  7.0  8.0    86.0
3  7.0  8.0  NaN  NaN     NaN
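The last row has no next row, so its result is NaN. Here is a minimal sketch of the same approach on the data from the question, with that gap filled by repeating the previous result. The forward fill, the cast back to integers, and dropping the helper columns c and d are my assumptions about the desired output, not something the library requires.

```python
import pandas as pd

# library function from the question, called as a black box
def process(a, b, c, d):
    return {
        "notthisone": 2 * a,
        "thisone": (a * b + c * d),
    }

df = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=["a", "b"])

# bring the next row's a and b into the current row as c and d
df[["c", "d"]] = df[["a", "b"]].shift(-1)

# row-wise apply; the last row yields NaN because c and d are NaN there
df["result"] = df.apply(
    lambda x: process(x["a"], x["b"], x["c"], x["d"])["thisone"], axis=1
)

# assumption: repeat the previous result for the last row, then tidy up
df["result"] = df["result"].ffill().astype(int)
df = df.drop(columns=["c", "d"])
print(df)
```

If more than one trailing row could be missing a neighbour, the same ffill call still covers it, since it carries the last valid value forward.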
CodePudding user response:
Use the iloc accessor to select each row and the one after it, convert them to NumPy arrays, and add their products. I used a list comprehension in this case. The last row will be null, so forward-fill the resulting column. We could fill at the DataFrame level, but that could affect other columns if the DataFrame is large and has nulls. Code below.
import numpy as np
a = a.assign(x=pd.Series([np.prod(a.iloc[x].to_numpy()) + np.prod(a.iloc[x + 1].to_numpy()) for x in range(len(a) - 1)]))
a = a.assign(x=a['x'].ffill())
   a  b     x
0  5  6  32.0
1  1  2  20.0
2  3  6  22.0
3  4  1  22.0
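Note that the code above reimplements a*b + c*d with np.prod instead of calling the library function. If the function really has to be called as is, one possible variant (a sketch, not tested beyond this toy data) is to zip the columns with their shifted copies and call process once per row. The name nxt is just a helper I introduced.

```python
import pandas as pd

# library function from the question, left untouched
def process(a, b, c, d):
    return {
        "notthisone": 2 * a,
        "thisone": (a * b + c * d),
    }

a = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=["a", "b"])

# nxt holds the next row's values; its last row is NaN
nxt = a[["a", "b"]].shift(-1)

# call the library function once per row, keeping only "thisone"
a["result"] = [
    process(*args)["thisone"]
    for args in zip(a["a"], a["b"], nxt["a"], nxt["b"])
]

# the last entry is NaN (no next row); repeat the previous value
a["result"] = a["result"].ffill().astype(int)
print(a)
```

This avoids building a Series for every row the way apply(axis=1) does, though it is still a Python-level loop rather than a vectorised operation.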
CodePudding user response:
Push each answer to an array, then write the array to a column of the dataframe, which you can then save as CSV, XLSX, etc.
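A rough sketch of what that could look like, assuming the data and the process function from the question. The handling of the last row (repeating the previous result) and the names results, rows and csv_text are my own choices for illustration.

```python
import pandas as pd

# library function from the question
def process(a, b, c, d):
    return {"notthisone": 2 * a, "thisone": (a * b + c * d)}

df = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=["a", "b"])

results = []
rows = df[["a", "b"]].to_numpy().tolist()
for i, (a, b) in enumerate(rows):
    if i + 1 < len(rows):
        # c and d are the a and b of the next row
        c, d = rows[i + 1]
        results.append(process(a, b, c, d)["thisone"])
    else:
        # no next row: repeat the previous result (assumes at least two rows)
        results.append(results[-1])

# write the collected values to a new column
df["result"] = results

# to_csv without a path returns the CSV text; pass a filename to write a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

This is the plain loop the question hoped to avoid, so it is likely the slowest option on large dataframes, but it is the easiest one to adapt if the special cases get more complicated.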