removing certain data from a pandas column-CodePudding

I have a pandas dataFrame column where I am looking to remove certain shared information from another pandas Dataframe and produce a dataset that doesn't contain either of the shared data. I have two databases and they each share the same data in some columns. I tried to take the data from one column and say if it doesn't match the data in the other pandas dataFrame, then produce a new dataset without the shared information. For example:

               letsee = db7[db7['ISBN'] != 'stop']

DB7 contains data that looks like this:

        ISBN         retail   discount
      9780060084608   16.99      .4
      9780060520762   23.99      .6

and stop contains data that looks like this

       ISBN           retail     discount
     9780060084608       15.99     .6
     9780060548780      21.99       .3

db7 and stop are two pandas DataFrames. I thought that if the ISBN in db7 doesn't match the database in stop, then it would produce a dataset "letsee" that doesn't contain either of the shared values that are present in both dataFrames. The shared information is still showing up in the letsee dataFrame. How would I make a dataFrame that doesn't contain the shared information present in both DataFrames.

Basically, since the retail and/or discount price has changed from the old file (db7) and the new file (stop) has the new price/discount, I need to remove the information from the original database and have a separate dataset for the information which has changed. The ISBN is used as the identifier that I merged on and combine the two dataFrames. However, I want to have a Dataframe that doesn't contain the shared ISBN. I have tried to update the ISBN information in the original file by replacing the old price with the new price, but I have not been able to (don't know how. So, this step was just to remove the shared ISBN identifer and make a separate file for it.

Both datasets have different ISBN and some shared; I want to identify the shared ISBN and delete them from the entire dataset.

Or another solution would be this: notice that the first row of both dataFrames have the same ISBN. How could I take the stop ISBN information (price and discount) and substitute it and replace it in db7's price and discount? Either that or I have to strip the shared ISBN from the dataFrame and make a separate file for it (which reflects the new retail and discount price).

CodePudding user response：

Try this:

letsee = db7[db7['ISBN'].isin(stop['ISBN'].values)]

CodePudding user response：

You can concatenate the two dataframes and then drop duplicates along the ISBN column.

import pandas as pd
db7 = pd.DataFrame({"ISBN":[9780060084608, 9780060520762], "retail":[16.99, 23.99], "discount": [.4,.6]})
stop = pd.DataFrame({"ISBN":[9780060084608, 9780060548780], "retail":[15.99, 21.99], "discount": [.6,.3]})
print(db7)
print(stop)

Output:

         ISBN     retail  discount
0  9780060084608   16.99       0.4
1  9780060520762   23.99       0.6
            ISBN  retail  discount
0  9780060084608   15.99       0.6
1  9780060548780   21.99       0.3

Now concatenate using pd.concat and drop duplicates along the ISBN column. Make sure to reset the index.

letsee = pd.concat([db7,stop]).drop_duplicates(subset="ISBN", keep=False).reset_index(drop=True)
print(letsee)

Output:

    ISBN            retail  discount
0   9780060520762   23.99   0.6
1   9780060548780   21.99   0.3

CodePudding user response：

You can do this using ~isin ("not is in"). If you have two DataFrames, like this

import pandas as pd

db7 = pd.DataFrame({"ISBN":[123,456,789],"discount":[0.6,0.3,0.4]})
stop = pd.DataFrame({"ISBN":[123,555],"discount":[0.5,0.75]})

print(db7)
print(stop)

   ISBN  discount
0   123       0.6
1   456       0.3
2   789       0.4
   ISBN  discount
0   123      0.50
1   555      0.75

and you want to create a new DataFrame with only ISBN entries from the first one that aren't in the second one, you can do

letsee = db7[~db7["ISBN"].isin(stop["ISBN"])]
print(letsee)

which gives the expected result (excluded 123 because it was in stop):

   ISBN  discount
1   456       0.3
2   789       0.4

If you then want to combine this with the stop DataFrame you can use concat

merged = pd.concat((letsee, stop), ignore_index=True)
print(merged)

which gives you the combination of db7 and stop, and uses the values from stop where there were duplicates

   ISBN  discount
0   456      0.30
1   789      0.40
2   123      0.50
3   555      0.75