I was wondering if you could help me with the next question.
I have a data frame with 2 columns containing sets of strings [Column 1: Strings of reference, column 2: strings to check], my goal is to obtain a new column with the difference between those columns... meaning: the new column should contain the ONLY the strings from Column 2 not present in column 1.
My input:
import pandas as pd
data = [['Aa', {'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines','Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}], ['Bb', {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids'}, {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids', 'Miscellaneous meroterpenoids'}], ['Cc', {'Pyridine alkaloids'}, {'Luis', 'Pyridine alkaloids'}], ['Dd', {'Luis', 'Polyamines'}, {'Marco'}], ['Ee', {'Friedelane triterpenoids', 'Cucurbitane triterpenoids'}, {'Friedelane triterpenoids', 'Cucurbitane triterpenoids', 'Ansa macrolides'}]]
df = pd.DataFrame(data, columns=["Species", "Reference", "To_check"])
df
What I want to get is:
data = [['Aa', {'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines','Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines'}], ['Bb', {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids'}, {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids', 'Miscellaneous meroterpenoids'}, {'Miscellaneous meroterpenoids'}], ['Cc', {'Pyridine alkaloids'}, {'Luis', 'Pyridine alkaloids'}, {'Luis'}], ['Dd', {'Luis', 'Polyamines'}, {'Marco'}, {'Marco'}], ['Ee', {'Friedelane triterpenoids', 'Cucurbitane triterpenoids'}, {'Friedelane triterpenoids', 'Cucurbitane triterpenoids', 'Ansa macrolides'}, {'Ansa macrolides'}]]
df_out = pd.DataFrame(data, columns=["Species", "Reference", "To_check", 'New']) df_out
So far, what I tried was this but is not exactly what I want, and here I ask your help...
df_diff = df[~df['To_check'].isin(df['Reference'])]
df_diff
This is giving me a reduced dataframe with the rows having a difference, if I try to create a new column with de difference I got an error...
df['New'] = df[~df['To_check'].isin(df['Reference'])]
ValueError: Wrong number of items passed 3, placement implies 1
So far, I used sets to contain the strings in the columns, but I guess that could be the same if I use lists.
So, how can I get that new column with the results? and I'm also wondering if isin() do the comparison row wisely, or is there another method more appropriate.
I NEED TO KEEP THE DATAFRAME AS IT IS... just add the new column
Merci!!
CodePudding user response:
Use:
df["New"] = df["To_check"] - df["Reference"]
print(df)
Output
Species Reference To_check New
0 Aa {B, A, C, D} {A, C, D, E} {E}
1 Bb {B, A, C, D} {B, A, C, D} {}
2 Cc {F, A, G, E} {B, A, D, E} {B, D}
3 Dd {A, G} {A, C, D, E} {D, C, E}
4 Ee {D, C} {D, E} {E}
The expression:
df["To_check"] - df["Reference"]
applies the element-wise difference between values of df["To_check"]
and df["Reference"]
. Something similar to:
# notice that this could be an alternative solution
df["New"] = [check - reference for check, reference in zip(df["To_check"], df["Reference"])]
The difference between set values is (from the documentation):
Return a new set with elements in the set that are not in the others.