I have two dataframes df1 and df2 each of them have column containing product code and product price, I wanted to check the difference between prices in the 2 dataframes and store the result of this function I created in a new dataframe "df3" containing the product code and the final price, Here is my attempt :
Function to calculate the difference in the way I want:
def range_calc(z, y):
final_price = pd.DataFrame(columns = ["Market_price"])
res = z-y
abs_res = abs(res)
if abs_res == 0:
return (z)
if z>y:
final_price = (abs_res / z ) * 100
else:
final_price = (abs_res / y ) * 100
return(final_price)
For loop I created to check the two df and use the function:
Last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
for i in df1["product_ID"]:
for x in df2["product_code"]:
if i == x:
Last_df["Product_number"] = i
Last_df["Market_Price"] = range_calc(df1["full_price"],df2["tot_price"])
The problem is that I am getting this error every time:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
CodePudding user response:
sometimes when you check a condition on series or dataframes, your output is a series such as ( , False). In this case you must use any, all, item,... use print function for your condition to see the series.
Also I must tell your code is very very slow and has O(n**2). You can first calculate df3 as joining df1 and df2 then using apply method for fast calculating.
CodePudding user response:
Why you got the error message "The truth value of a Series is ambiguous
"
You got the error message The truth value of a Series is ambiguous
because you tried input a pandas.Series
into an if
-clause
nums = pd.Series([1.11, 2.22, 3.33])
if nums == 0:
print("nums == zero (nums is equal to zero)")
else:
print("nums != zero (nums is not equal to zero)")
# AN EXCEPTION IS RAISED!
The Error Message is something like the following:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
Somehow, you got a Series
into the inside of the if
-clause.
Actually, I know how it happened, but it will take me a moment to explain:
Well, suppose that you want the value out of row 3 and column 4 of a pandas dataframe.
If you attempt to extract a single value out of a pandas table in a specific row and column, then that value is sometimes a Series
object, not a number.
Consider the following example:
# DATA:
# Name Age Location
# 0 Nik 31 Toronto
# 1 Kate 30 London
# 2 Evan 40 Kingston
# 3 Kyra 33 Hamilton
To create the dataframe above, we can write:
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
Now, let us try to get a specific row of data:
evans_row = df.loc[df['Name'] == 'Evan']
and we try to get a specific value out of that row of data:
evans_age = evans_row['Age']
You might think that evans_age
is the integer 40
, but you would be wrong.
Let us see what evans_age
really is:
print(80*"*", "EVAN\'s AGE", type(Evans_age), sep="\n")
print(Evans_age)
We have:
EVAN's AGE
<class 'pandas.core.series.Series'>
2 40
Name: Age, dtype: int64
Evan's Age is not a number.
evans_age
is an instance of the class stored as pandas.Series
After extracting a single cell out of a pandas dataframe you can write .tolist()[0]
to extract the number out of that cell.
evans_real_age = evans_age.tolist()[0]
print(80*"*", "EVAN\'s REAL AGE", type(evans_real_age), sep="\n")
print(evans_real_age)
EVAN's REAL AGE
<class 'numpy.int64'>
40
The exception in your original code was probably thrown by if abs_res == 0
.
If abs_res
is a pandas.Series
then abs_res == 0
returns another Series.
There is no way to compare if an entire list of numbers is equal to zero.
Normally people just enter one input to an if-clause.
if (912):
print("912 is True")
else:
print("912 is False")
When an if-statement receives more than one value, then the python interpreter does not know what to do.
For example, what should the following do?
import pandas as pd
data = pd.Series([1, 565, 120, 12, 901])
if data:
print("data is true")
else:
print("data is false")
You should only input one value into an if-condition. Instead, you entered a pandas.Series
object as input to the if-clause.
In your case, the pandas.Series
only had one number in it. However, in general, pandas.Series
contain many values.
The authors of the python pandas
library assume that a series contains many numbers, even if it only has one.
The computer thought that you tired to put many different numbers inside of one single if
-clause.
The difference between a "function definition" and a "function call"
Your original question was,
"I want to make a function if the common key is found"
Your use of the phrase "make a function" is incorrect. You probably meant, "I want to call a function if a common key is found."
The following are all examples of function "calls":
import pandas as pd
import numpy as np
z = foo(1, 91)
result = funky_function(811, 22, "green eggs and ham")
output = do_stuff()
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=dates, columns=list("ABCD"))
Suppose that you have two containers.
If you truly want to "make" a function if a common key is found, then you would have code like the following:
dict1 = {'age': 26, 'phone':"303-873-9811"}
dict2 = {'name': "Bob", 'phone':"303-873-9811"}
def foo(dict1, dict2):
union = set(dict2.keys()).intersection(set(dict1.keys()))
# if there is a shared key...
if len(union) > 0:
# make (create) a new function
def bar(*args, **kwargs):
pass
return bar
new_function = foo(dict1, dict2)
print(new_function)
If you are not using the def
keyword, that is known as a function call
In python, you "make" a function ("define" a function) with the def
keyword.
I think that your question should be re-titled.
You could write, "How do I call a function if two pandas dataframes have a common key?"
A second good question be something like,
"What went wrong if we see the error message,
ValueError: The truth value of a Series is ambiguous.
?"
Your question was worded strangely, but I think I can answer it.
Generating Test Data
Your question did not include test data. If you ask a question on stack overflow again, please provide a small example of some test data.
The following is an example of data we can use:
product_ID full_price
0 prod_id 1-1-1-1 11.11
1 prod_id 2-2-2-2 22.22
2 prod_id 3-3-3-3 33.33
3 prod_id 4-4-4-4 44.44
4 prod_id 5-5-5-5 55.55
5 prod_id 6-6-6-6 66.66
6 prod_id 7-7-7-7 77.77
------------------------------------------------------------
product_code tot_price
0 prod_id 3-3-3-3 34.08
1 prod_id 4-4-4-4 45.19
2 prod_id 5-5-5-5 56.30
3 prod_id 6-6-6-6 67.41
4 prod_id 7-7-7-7 78.52
5 prod_id 8-8-8-8 89.63
6 prod_id 9-9-9-9 100.74
- Products 1 and 2 are unique to data-frame 1
- Products 8 and 9 are unique to data-frame 2
- Both data-frames contain data for products 3, 4, 5, ..., 7.
The prices are slightly different between data-frames.
The test data above is generated by the following code:
import pandas as pd
from copy import copy
raw_data = [
[
"prod_id {}-{}-{}-{}".format(k, k, k, k),
int("{}{}{}{}".format(k, k, k, k))/100
] for k in range(1, 10)
]
raw_data = [row for row in raw_data]
df1 = pd.DataFrame(data=copy(raw_data[:-2]), columns=["product_ID", "full_price"])
df2 = pd.DataFrame(data=copy(raw_data[2:]), columns=["product_code", "tot_price"])
for rowid in range(0, len(df2.index)):
df2.at[rowid, "tot_price"] = 0.75
print(df1)
print(60*"-")
print(df2)
Add some error checking
It is considered to be "best-practice" to make sure that your function inputs are in the correct format.
You wrote a function named range_calc(z, y)
. I reccomend making sure that z
and y
are integers, and not something else (such as a pandas Series
object).
def range_calc(z, y):
try:
z = float(z)
y = float(y)
except ValueError:
function_name = inspect.stack()[0][3]
with io.StringIO() as string_stream:
print(
"Error: In " function_name "(). Inputs should be
like decimal numbers.",
"Instead, we have: " str(type(y)) " \'"
repr(str(y))[1:-1] "\'",
file=string_stream,
sep="\n"
)
err_msg = string_stream.getvalue()
raise ValueError(err_msg)
# DO STUFF
return
Now we get error messages:
import pandas as pd
data = pd.Series([1, 565, 120, 12, 901])
range_calc("I am supposed to be an integer", data)
# ValueError: Error in function range_calc(). Inputs should be like
decimal numbers.
# Instead, we have: <class 'str'> "I am supposed to be an integer"
Code which Accomplishes what you Wanted.
The following is some rather ugly code which computes what you wanted:
# You can continue to use your original `range_calc()` function unmodified
# Use the test data I provided earlier in this answer.
def foo(df1, df2):
last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
df1_ids = set(df1["product_ID"].tolist())
df2_ids = set(df2["product_code"].tolist())
pids = df1_ids.intersection(df2_ids) # common_product_ids
for pid in pids:
row1 = df1.loc[df1['product_ID'] == pid]
row2 = df2.loc[df2["product_code"] == pid]
price1 = row1["full_price"].tolist()[0]
price2 = row2["tot_price"].tolist()[0]
price3 = range_calc(price1, price2)
row3 = pd.DataFrame([[pid, price3]], columns=["Product_number", "Market_Price"])
last_df = pd.concat([last_df, row3])
return last_df
# ---------------------------------------
last_df = foo(df1, df2)
The result is:
Product_number Market_Price
0 prod_id 6-6-6-6 1.112595
0 prod_id 7-7-7-7 0.955171
0 prod_id 4-4-4-4 1.659659
0 prod_id 5-5-5-5 1.332149
0 prod_id 3-3-3-3 2.200704
Note that one of many reasons that my solution is ugly is in the following line of code:
last_df = pd.concat([last_df, row3])
if last_df
is large (thousands of rows), then the code will run very slowly.
This is because instead of inserting a new row of data, we:
- copy the original dataframe
- append a new row of data to the copy.
- delete/destroy the original data-frame.
It is really silly to copy 10,000 rows of data only to add one new value, and then delete the old 10,000 rows.
However, my solution has fewer bugs than your original code, relatively speaking.