Home > Net >  I want to make a function if the common key is found in both dataframes
I want to make a function if the common key is found in both dataframes

Time:07-19

I have two dataframes df1 and df2 each of them have column containing product code and product price, I wanted to check the difference between prices in the 2 dataframes and store the result of this function I created in a new dataframe "df3" containing the product code and the final price, Here is my attempt :

Function to calculate the difference in the way I want:

def range_calc(z, y):
  final_price = pd.DataFrame(columns = ["Market_price"])
  res = z-y
  abs_res = abs(res)
  if abs_res == 0:
    return (z)
  if z>y:
    final_price = (abs_res / z ) * 100
  else:
    final_price = (abs_res / y ) * 100
  return(final_price)

For loop I created to check the two df and use the function:

Last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
for i in df1["product_ID"]:
  for x in df2["product_code"]:
    if i == x:
      Last_df["Product_number"] = i
      Last_df["Market_Price"] = range_calc(df1["full_price"],df2["tot_price"])

The problem is that I am getting this error every time:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

CodePudding user response:

sometimes when you check a condition on series or dataframes, your output is a series such as ( , False). In this case you must use any, all, item,... use print function for your condition to see the series.

Also I must tell your code is very very slow and has O(n**2). You can first calculate df3 as joining df1 and df2 then using apply method for fast calculating.

CodePudding user response:

Why you got the error message "The truth value of a Series is ambiguous"

You got the error message The truth value of a Series is ambiguous because you tried input a pandas.Series into an if-clause

nums = pd.Series([1.11, 2.22, 3.33])
if nums == 0:
    print("nums  ==  zero (nums is equal to zero)") 
else:
    print("nums  !=  zero (nums is not equal to zero)")

# AN EXCEPTION IS RAISED!   

The Error Message is something like the following:

ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().

Somehow, you got a Series into the inside of the if-clause.

Actually, I know how it happened, but it will take me a moment to explain:

Well, suppose that you want the value out of row 3 and column 4 of a pandas dataframe.

If you attempt to extract a single value out of a pandas table in a specific row and column, then that value is sometimes a Series object, not a number.

Consider the following example:

# DATA:
#    Name  Age  Location
# 0   Nik   31   Toronto
# 1  Kate   30    London
# 2  Evan   40  Kingston
# 3  Kyra   33  Hamilton

To create the dataframe above, we can write:

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
    'Age': [31, 30, 40, 33],
    'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})

Now, let us try to get a specific row of data:

evans_row = df.loc[df['Name'] == 'Evan']

and we try to get a specific value out of that row of data:

evans_age = evans_row['Age']

You might think that evans_age is the integer 40, but you would be wrong.

Let us see what evans_age really is:

print(80*"*", "EVAN\'s AGE", type(Evans_age), sep="\n")
print(Evans_age)

We have:

EVAN's AGE
<class 'pandas.core.series.Series'>
2    40
Name: Age, dtype: int64

Evan's Age is not a number. evans_age is an instance of the class stored as pandas.Series

After extracting a single cell out of a pandas dataframe you can write .tolist()[0] to extract the number out of that cell.

evans_real_age = evans_age.tolist()[0]
print(80*"*", "EVAN\'s REAL AGE", type(evans_real_age), sep="\n")
print(evans_real_age)
EVAN's REAL AGE
<class 'numpy.int64'>
40

The exception in your original code was probably thrown by if abs_res == 0.

If abs_res is a pandas.Series then abs_res == 0 returns another Series.

There is no way to compare if an entire list of numbers is equal to zero.

Normally people just enter one input to an if-clause.

if (912):
    print("912 is True")
else:
    print("912 is False")

When an if-statement receives more than one value, then the python interpreter does not know what to do.

For example, what should the following do?

import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])
if data:
   print("data is true")
else:
   print("data is false")

You should only input one value into an if-condition. Instead, you entered a pandas.Series object as input to the if-clause.

In your case, the pandas.Series only had one number in it. However, in general, pandas.Series contain many values.

The authors of the python pandas library assume that a series contains many numbers, even if it only has one.

The computer thought that you tired to put many different numbers inside of one single if-clause.

The difference between a "function definition" and a "function call"

Your original question was,

"I want to make a function if the common key is found"

Your use of the phrase "make a function" is incorrect. You probably meant, "I want to call a function if a common key is found."

The following are all examples of function "calls":

import pandas as pd
import numpy as np

z = foo(1, 91)

result = funky_function(811, 22, "green eggs and ham")

output = do_stuff()

data = np.random.randn(6, 4)

df = pd.DataFrame(data, index=dates, columns=list("ABCD"))

Suppose that you have two containers.

If you truly want to "make" a function if a common key is found, then you would have code like the following:


dict1 = {'age': 26, 'phone':"303-873-9811"}
dict2 = {'name': "Bob", 'phone':"303-873-9811"}

def foo(dict1, dict2):
    union = set(dict2.keys()).intersection(set(dict1.keys()))
    # if there is a shared key...
    if len(union) > 0:
        # make (create) a new function
        def bar(*args, **kwargs):
            pass
    return bar


new_function = foo(dict1, dict2)

print(new_function)

If you are not using the def keyword, that is known as a function call

In python, you "make" a function ("define" a function) with the def keyword.

I think that your question should be re-titled.

You could write, "How do I call a function if two pandas dataframes have a common key?"

A second good question be something like,

"What went wrong if we see the error message, ValueError: The truth value of a Series is ambiguous.?"

Your question was worded strangely, but I think I can answer it.


Generating Test Data

Your question did not include test data. If you ask a question on stack overflow again, please provide a small example of some test data.

The following is an example of data we can use:

        product_ID  full_price
0  prod_id 1-1-1-1       11.11
1  prod_id 2-2-2-2       22.22
2  prod_id 3-3-3-3       33.33
3  prod_id 4-4-4-4       44.44
4  prod_id 5-5-5-5       55.55
5  prod_id 6-6-6-6       66.66
6  prod_id 7-7-7-7       77.77
------------------------------------------------------------
      product_code  tot_price
0  prod_id 3-3-3-3      34.08
1  prod_id 4-4-4-4      45.19
2  prod_id 5-5-5-5      56.30
3  prod_id 6-6-6-6      67.41
4  prod_id 7-7-7-7      78.52
5  prod_id 8-8-8-8      89.63
6  prod_id 9-9-9-9     100.74
  • Products 1 and 2 are unique to data-frame 1
  • Products 8 and 9 are unique to data-frame 2
  • Both data-frames contain data for products 3, 4, 5, ..., 7.

The prices are slightly different between data-frames.

The test data above is generated by the following code:

import pandas as pd
from copy import copy

raw_data = [
    [
        "prod_id {}-{}-{}-{}".format(k, k, k, k),
        int("{}{}{}{}".format(k, k, k, k))/100
    ] for k in range(1, 10)
]

raw_data = [row for row in raw_data]
df1 = pd.DataFrame(data=copy(raw_data[:-2]), columns=["product_ID", "full_price"])
df2 = pd.DataFrame(data=copy(raw_data[2:]), columns=["product_code", "tot_price"])


for rowid in range(0, len(df2.index)):
    df2.at[rowid, "tot_price"]  = 0.75

print(df1)
print(60*"-")
print(df2)

Add some error checking

It is considered to be "best-practice" to make sure that your function inputs are in the correct format.

You wrote a function named range_calc(z, y). I reccomend making sure that z and y are integers, and not something else (such as a pandas Series object).

def range_calc(z, y):
    try:
       z = float(z)
       y = float(y)
    except ValueError:
        function_name = inspect.stack()[0][3]
        with io.StringIO() as string_stream:
            print(
                  "Error: In "   function_name   "(). Inputs should be
like decimal numbers.",
                  "Instead, we have: "   str(type(y))   " \'"  
repr(str(y))[1:-1]   "\'",
                  file=string_stream,
                  sep="\n"
            )
            err_msg = string_stream.getvalue()
        raise ValueError(err_msg)
    # DO STUFF
    return

Now we get error messages:

import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])

range_calc("I am supposed to be an integer", data)

# ValueError: Error in function range_calc(). Inputs should be like
decimal numbers.
# Instead, we have: <class 'str'> "I am supposed to be an integer"

Code which Accomplishes what you Wanted.

The following is some rather ugly code which computes what you wanted:

# You can continue to use your original `range_calc()` function unmodified

# Use the test data I provided earlier in this answer.  

def foo(df1, df2):
    last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
    df1_ids = set(df1["product_ID"].tolist())
    df2_ids = set(df2["product_code"].tolist())
    pids = df1_ids.intersection(df2_ids) # common_product_ids

    for pid in pids:
        row1 = df1.loc[df1['product_ID'] == pid]
        row2 = df2.loc[df2["product_code"] == pid]
        price1 = row1["full_price"].tolist()[0]
        price2 = row2["tot_price"].tolist()[0]
        price3 = range_calc(price1, price2)
        row3  = pd.DataFrame([[pid, price3]], columns=["Product_number", "Market_Price"])
        last_df = pd.concat([last_df, row3])
    return last_df

# ---------------------------------------

last_df = foo(df1, df2)

The result is:

    Product_number  Market_Price
0  prod_id 6-6-6-6      1.112595
0  prod_id 7-7-7-7      0.955171
0  prod_id 4-4-4-4      1.659659
0  prod_id 5-5-5-5      1.332149
0  prod_id 3-3-3-3      2.200704

Note that one of many reasons that my solution is ugly is in the following line of code:

last_df = pd.concat([last_df, row3])

if last_df is large (thousands of rows), then the code will run very slowly.

This is because instead of inserting a new row of data, we:

  1. copy the original dataframe
  2. append a new row of data to the copy.
  3. delete/destroy the original data-frame.

It is really silly to copy 10,000 rows of data only to add one new value, and then delete the old 10,000 rows.

However, my solution has fewer bugs than your original code, relatively speaking.

  • Related