I've recently run into a problem. I have data that looks like this:
Value 1 | Value 2 | Target |
---|---|---|
1345 | 4590 | 2.45 |
1278 | 3567 | 2.48 |
1378 | 4890 | 2.46 |
1589 | 4987 | 2.50 |
... | ... | ... |
The data goes on for a few thousand lines.
I need to find two values (A and B) that minimize the error when the data is plugged in like so:
Value 1 * A + Value 2 * B = Target
I've looked into scipy.optimize.curve_fit, but I can't seem to understand how it would work, because the function changes at every iteration of the data (since Value 1 and Value 2 are not the same over every row).
Any help is greatly appreciated, thanks in advance !
CodePudding user response:
The function `curve_fit` takes 3 arguments:
- a function `f` that takes an input argument, let's call it `X`, and parameters `params` (as many as you want)
- the input `X_data` you have from your dataset
- the output `Y_data` you have from your dataset

The point of this function is to give you the best `params` to input in `f(X_data, params)` to get `Y_data`.
Intuitively, `X` in your function `f` is a simple 1D NumPy array, but it can actually take whatever form you want. Here, your input is a tuple of two 1D arrays (or a 2D array, if you want to implement it that way).
Here's a code example:

    import numpy as np
    from scipy.optimize import curve_fit

    # Inputs: a tuple of two 1D arrays (Value 1, Value 2)
    X_data = (np.array([1345, 1278, 1378, 1589]),
              np.array([4590, 3567, 4890, 4987]))
    Y_data = np.array([2.45, 2.48, 2.46, 2.50])

    def my_func(X, A, B):
        x1, x2 = X
        return A*x1 + B*x2

    # curve_fit returns the best-fit parameters and their covariance matrix
    (A, B), _ = curve_fit(my_func, X_data, Y_data)
    interpolated_results = my_func(X_data, A, B)
    relative_error_in_percent = abs((Y_data - interpolated_results)/Y_data)*100
    print(relative_error_in_percent)
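The 2D-array variant mentioned above could look like the following minimal sketch. It assumes the two input columns are stacked into an array of shape (2, n), so that `X[0]` is Value 1 and `X[1]` is Value 2; the name `model_2d` is only illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# The four sample rows from the question, stacked as a 2D array of shape (2, n_rows):
# row 0 holds "Value 1", row 1 holds "Value 2".
X_data = np.array([[1345, 1278, 1378, 1589],
                   [4590, 3567, 4890, 4987]], dtype=float)
Y_data = np.array([2.45, 2.48, 2.46, 2.50])

def model_2d(X, A, B):
    # X[0] and X[1] are 1D arrays, so the result has one value per data row.
    return A * X[0] + B * X[1]

params_2d, _ = curve_fit(model_2d, X_data, Y_data)
print(params_2d)
```

Both forms should give the same fit; which one to use is mostly a matter of how your data is already stored.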
CodePudding user response:
Unfortunately you have not provided any test data, so I have come up with my own:

    import pandas as pd
    import numpy as np
    from scipy.optimize import minimize
    import matplotlib.pyplot as plt

    def f(V1, V2, A, B):  # Target function
        return V1*A + V2*B

    # Generate test data
    def generateData(A, B):
        np.random.seed(0)
        V1 = np.random.uniform(low=1000, high=1500, size=(100,))
        V2 = np.random.uniform(low=3500, high=5000, size=(100,))
        Target = f(V1, V2, A, B) + np.random.normal(0, 1, 100)
        return V1, V2, Target

    data = generateData(2, 3)  # Important: true parameters are A=2, B=3
    data = {"Value 1": data[0], "Value 2": data[1], "Target": data[2]}
    df = pd.DataFrame(data)  # Similar structure as given in the table
    df.head()

It looks like this:

                  Value 1             Value 2              Target
    0  1292.0525763109854   3662.162080896163  13570.276523473405
    1  1155.0421489258965   4907.133274663096  17033.392287295104
    2  1430.7172112685223   4844.422515098364  17395.412651006143
    3  1396.0480757043242  4076.5845114488666  15022.720636830541
    4  1346.2120476329646  3570.9567326419674  13406.565815022896

Your question is answered in the following:

    ## Plot data to check whether a linear function is useful
    df.head()
    fig = plt.figure()
    ax1 = fig.add_subplot(211)
    ax2 = fig.add_subplot(212)
    ax1.scatter(df["Value 1"], df["Target"])
    ax2.scatter(df["Value 2"], df["Target"])

    def fmin(x, df):  # Returns the error at the given parameters
        def RMSE(y, y_target):  # Definition of the error term
            return np.sqrt(np.mean((y-y_target)**2))
        A, B = x
        V1, V2, y_target = df["Value 1"], df["Value 2"], df["Target"]
        y = f(V1, V2, A, B)  # Calculate target value with given parameter set
        return RMSE(y, y_target)

    res = minimize(fmin, x0=[1, 1], args=(df,), options={"disp": True})
    print(res.x)

I prefer `scipy.optimize.minimize()` over `curve_fit`, since you can define the error function yourself. See the SciPy documentation for `scipy.optimize.minimize` for details.
You need:
- a function `fun` that returns the error for a given set of parameters `x` (here `fmin` with RMSE)
- an initial guess `x0` (here `[1,1]`); if your guess is totally off, you will probably not find a solution, or (with more complex problems) only a local one
- additional arguments `args` provided to `fun`, here the data `df`, but also helpful for fixed parameters
- `options={"disp":True}` is for printing additional information
- your parameters can be found, alongside further information, in the returned variable `res`
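
The remark about `args` being helpful for fixed parameters can be illustrated with a small sketch. It is not part of the original answer: the data is synthetic (generated with A=2, B=3, as an assumption mirroring the answer's setup), and the names `rmse_fixed_b` and `b_fixed` are hypothetical. Here B is held at a known value and only A is optimized.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical synthetic data, generated with A=2 and B=3 plus unit-variance noise.
rng = np.random.default_rng(0)
v1 = rng.uniform(1000, 1500, size=100)
v2 = rng.uniform(3500, 5000, size=100)
target = 2 * v1 + 3 * v2 + rng.normal(0, 1, size=100)

def rmse_fixed_b(x, v1, v2, target, b_fixed):
    # x holds only the free parameter A; B arrives through args and is never optimized.
    a = x[0]
    prediction = a * v1 + b_fixed * v2
    return np.sqrt(np.mean((prediction - target) ** 2))

# Everything after x is forwarded from args, in order.
res_fixed = minimize(rmse_fixed_b, x0=[1.0], args=(v1, v2, target, 3.0))
print(res_fixed.x)
```

The tuple passed as `args` is appended to the parameter vector on every call, so the same error function can be reused with a different fixed value without being rewritten.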
For this case the result is:
[1.9987209 3.0004212]
This is close to the parameters used to generate the data (A=2, B=3).
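
Because the model is linear in A and B, the result can also be cross-checked without an iterative optimizer or an initial guess. The following is a minimal sketch, not part of the original answer, that rebuilds the synthetic data with the same recipe as the `generateData(2, 3)` call above and solves the least-squares problem directly with `np.linalg.lstsq`; the names `design` and `coeffs` are illustrative.

```python
import numpy as np

# Same recipe as generateData(2, 3): uniform inputs plus unit-variance noise.
np.random.seed(0)
V1 = np.random.uniform(low=1000, high=1500, size=100)
V2 = np.random.uniform(low=3500, high=5000, size=100)
Target = 2 * V1 + 3 * V2 + np.random.normal(0, 1, 100)

# One row per observation, one column per unknown coefficient (A, B).
design = np.column_stack([V1, V2])

# Least-squares solution of design @ [A, B] = Target in the minimum-error sense.
coeffs, _, rank, _ = np.linalg.lstsq(design, Target, rcond=None)
print(coeffs)
```

Minimizing the RMSE and minimizing the sum of squared errors lead to the same parameters, so this should land close to the `minimize` output and to the values 2 and 3 used to generate the data.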