Cannot match two values in two different CSVs

I am parsing two separate CSV files with the goal of finding matching customer IDs and dates so that I can update the balances.

In my for loop there should be a match at some point, because I intentionally put duplicate IDs and dates in my CSV. However, when I parse the data and try to match it, the matches are never found even though the values are the same.

main.py:

transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
    columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])

for index, row in transactions.iterrows():
    customer_id = row['customerID']
    date = formatter.convert_date(row['date'])

    minBalance = 0
    maxBalance = 0
    endingBalance = 0

    dict = {
        "customerID": customer_id,
        "MM/YYYY": date,
        "minBalance": minBalance,
        "maxBalance": maxBalance,
        "endingBalance": endingBalance
    }

    print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
    # Returns False

    if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
        # This section never runs
        print("hello")

    else:
        print("world")
        accounts.loc[index] = dict
        accounts.to_csv(OUTPUT_PATH, index=False)

Transactions CSV:

customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700

Accounts CSV

customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0

Expected Accounts CSV

customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0

CodePudding user response：

It is not clear from the question what the formatter.convert_date function does, but judging from the example CSVs you added it should do something like this:

def convert_date(mmddyy):
    # Split "MM/DD/YYYY" and keep only the month and year
    (mm, dd, yy) = mmddyy.split('/')
    return mm + '/' + yy

In addition, make sure the data types are equal on both sides of the comparison (both date fields should be strings, and the customer IDs should share one type as well).
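To illustrate the data type point, here is a minimal sketch. The inline CSV text is made up for the example (it only mirrors the column layout from the question), and the `dtype` argument of `pd.read_csv` is used to keep the IDs as strings.

```python
import io
import pandas as pd

# Made-up sample with the same columns as the transactions CSV
csv_text = "customerID,date,amount\n1,12/21/2022,500\n2,01/01/2023,300\n"

# Default parsing: customerID is inferred as an integer column
default = pd.read_csv(io.StringIO(csv_text))
# Forcing the column to str keeps the ID as text
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"customerID": str})

int_id = int(default.loc[0, "customerID"])
str_id = as_text.loc[0, "customerID"]

# An integer ID and a string ID never compare equal, even if they look alike
ids_match = int_id == str_id

print(default["customerID"].dtype, type(str_id).__name__, ids_match)
```

If one table stores IDs as integers and the other as strings, every comparison fails silently, so it can help to print the dtypes of both columns before looking for matches.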

CodePudding user response：

Where does the problem come from

Your problem comes from the way you compare against a pandas Series. To keep it simple, when you write:

customer_id in accounts['customerID']

You're checking whether customer_id is one of the index labels of the Series accounts['customerID'], whereas you want to check whether it is among the values of the Series.
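A small sketch of the difference, using a made-up Series (the values 10, 20, 30 are not from the question):

```python
import pandas as pd

# Values 10, 20, 30 stored under the default index labels 0, 1, 2
ids = pd.Series([10, 20, 30])

label_hit = 1 in ids             # 1 is an index label, so this is True
value_miss = 10 in ids           # 10 is a value, not a label, so this is False
value_hit = 10 in ids.values     # membership test on the underlying values

print(label_hit, value_miss, value_hit)
```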

In your if statement, you're using the pd.Series.equals method. Here is what the documentation says about it:

This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

So equals compares a whole Series or DataFrame against another one. Given a single value such as customer_id it simply returns False, which is why that branch never runs.
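As a quick illustration with made-up data, compare what equals returns for a scalar, for another Series, and what the element-wise == operator gives instead:

```python
import pandas as pd

customer_ids = pd.Series([1, 1, 2])

# A scalar is not a Series, so equals reports False
scalar_result = customer_ids.equals(1)

# Same shape and same elements, so equals reports True
series_result = customer_ids.equals(pd.Series([1, 1, 2]))

# == compares element by element and yields one boolean per row
elementwise = (customer_ids == 1).tolist()

print(scalar_result, series_result, elementwise)
```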

One of many solutions

There are multiple ways to achieve what you're trying to do. The easiest is simply to get the values from the Series before doing the comparison:

customer_id in accounts['customerID'].values

Note that accounts['customerID'].values returns a NumPy array of the values of your Series.

So your comparison should look something like this:

print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)

And use the same check in your if statement:

if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
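One caveat worth keeping in mind: the two membership tests are independent, so they can both pass even when the ID and the date sit on different rows. The sketch below uses a made-up accounts table to show this, and a row-wise alternative built from a boolean mask. The mask approach is a suggestion on top of the answer above, not part of it.

```python
import pandas as pd

# Made-up table: customer 1 has 01/2023, customer 2 only has 12/2022
accounts = pd.DataFrame({
    "customerID": [1, 2],
    "MM/YYYY": ["01/2023", "12/2022"],
})

customer_id, date = 2, "01/2023"

# Independent checks: each value exists somewhere, so both tests pass
independent = (customer_id in accounts["customerID"].values
               and date in accounts["MM/YYYY"].values)

# Row-wise check: both conditions must hold on the same row
same_row = ((accounts["customerID"] == customer_id)
            & (accounts["MM/YYYY"] == date)).any()

print(independent, same_row)
```

With the sample data from the question, the independent version would treat customer 2 with 01/2023 as already present because customer 1 has that month, so the row-wise form is probably closer to the expected accounts CSV.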

Alternative solutions

You can also use pandas.Series.isin, which takes a list-like of values (a bare scalar such as a string is rejected, so wrap a single value in a list) and returns a boolean Series showing whether each element of the Series is among them. You then only need to check whether that boolean Series contains at least one True value, for example with .any().

Documentation of isin : https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
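A minimal sketch of the isin variant, again with a made-up accounts table. Combining the two masks with & before calling .any() is an assumption about what you want, namely that the ID and the date must match on the same row.

```python
import pandas as pd

# Made-up table for illustration
accounts = pd.DataFrame({
    "customerID": [1, 1, 2],
    "MM/YYYY": ["12/2022", "01/2023", "12/2022"],
})

def pair_exists(frame, customer_id, date):
    """Return True if one row holds both the given ID and the given date."""
    # isin needs a list-like, so single values are wrapped in lists
    id_mask = frame["customerID"].isin([customer_id])
    date_mask = frame["MM/YYYY"].isin([date])
    return bool((id_mask & date_mask).any())

found = pair_exists(accounts, 1, "01/2023")      # pair is on the second row
not_found = pair_exists(accounts, 2, "01/2023")  # ID and date exist, but not together

print(found, not_found)
```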