I have a pandas dataFrame column where I am looking to remove certain shared information from another pandas Dataframe and produce a dataset that doesn't contain either of the shared data. I have two databases and they each share the same data in some columns. I tried to take the data from one column and say if it doesn't match the data in the other pandas dataFrame, then produce a new dataset without the shared information. For example:
letsee = db7[db7['ISBN'] != 'stop']
DB7 contains data that looks like this:
ISBN retail discount
9780060084608 16.99 .4
9780060520762 23.99 .6
and stop contains data that looks like this
ISBN retail discount
9780060084608 15.99 .6
9780060548780 21.99 .3
db7 and stop are two pandas DataFrames. I thought that if the ISBN in db7 doesn't match the database in stop, then it would produce a dataset "letsee" that doesn't contain either of the shared values that are present in both dataFrames. The shared information is still showing up in the letsee dataFrame. How would I make a dataFrame that doesn't contain the shared information present in both DataFrames.
Basically, since the retail and/or discount price has changed from the old file (db7) and the new file (stop) has the new price/discount, I need to remove the information from the original database and have a separate dataset for the information which has changed. The ISBN is used as the identifier that I merged on and combine the two dataFrames. However, I want to have a Dataframe that doesn't contain the shared ISBN. I have tried to update the ISBN information in the original file by replacing the old price with the new price, but I have not been able to (don't know how. So, this step was just to remove the shared ISBN identifer and make a separate file for it.
Both datasets have different ISBN and some shared; I want to identify the shared ISBN and delete them from the entire dataset.
Or another solution would be this: notice that the first row of both dataFrames have the same ISBN. How could I take the stop ISBN information (price and discount) and substitute it and replace it in db7's price and discount? Either that or I have to strip the shared ISBN from the dataFrame and make a separate file for it (which reflects the new retail and discount price).
CodePudding user response:
Try this:
letsee = db7[db7['ISBN'].isin(stop['ISBN'].values)]
CodePudding user response:
You can concatenate the two dataframes and then drop duplicates along the ISBN
column.
import pandas as pd
db7 = pd.DataFrame({"ISBN":[9780060084608, 9780060520762], "retail":[16.99, 23.99], "discount": [.4,.6]})
stop = pd.DataFrame({"ISBN":[9780060084608, 9780060548780], "retail":[15.99, 21.99], "discount": [.6,.3]})
print(db7)
print(stop)
Output:
ISBN retail discount
0 9780060084608 16.99 0.4
1 9780060520762 23.99 0.6
ISBN retail discount
0 9780060084608 15.99 0.6
1 9780060548780 21.99 0.3
Now concatenate using pd.concat
and drop duplicates along the ISBN
column. Make sure to reset the index.
letsee = pd.concat([db7,stop]).drop_duplicates(subset="ISBN", keep=False).reset_index(drop=True)
print(letsee)
Output:
ISBN retail discount
0 9780060520762 23.99 0.6
1 9780060548780 21.99 0.3
CodePudding user response:
You can do this using ~isin
("not is in"). If you have two DataFrames, like this
import pandas as pd
db7 = pd.DataFrame({"ISBN":[123,456,789],"discount":[0.6,0.3,0.4]})
stop = pd.DataFrame({"ISBN":[123,555],"discount":[0.5,0.75]})
print(db7)
print(stop)
ISBN discount
0 123 0.6
1 456 0.3
2 789 0.4
ISBN discount
0 123 0.50
1 555 0.75
and you want to create a new DataFrame with only ISBN entries from the first one that aren't in the second one, you can do
letsee = db7[~db7["ISBN"].isin(stop["ISBN"])]
print(letsee)
which gives the expected result (excluded 123
because it was in stop
):
ISBN discount
1 456 0.3
2 789 0.4
If you then want to combine this with the stop
DataFrame you can use concat
merged = pd.concat((letsee, stop), ignore_index=True)
print(merged)
which gives you the combination of db7
and stop
, and uses the values from stop
where there were duplicates
ISBN discount
0 456 0.30
1 789 0.40
2 123 0.50
3 555 0.75