Home > OS >  Compare Sets contained in dataframe row wise
Compare Sets contained in dataframe row wise

Time:10-22

I was wondering if you could help me with the next question.

I have a data frame with 2 columns containing sets of strings [Column 1: Strings of reference, column 2: strings to check], my goal is to obtain a new column with the difference between those columns... meaning: the new column should contain the ONLY the strings from Column 2 not present in column 1.

My input:

    import pandas as pd


    data = [['Aa', {'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines','Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}], ['Bb', {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids'}, {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids', 'Miscellaneous meroterpenoids'}], ['Cc', {'Pyridine alkaloids'}, {'Luis', 'Pyridine alkaloids'}], ['Dd', {'Luis', 'Polyamines'}, {'Marco'}], ['Ee', {'Friedelane triterpenoids', 'Cucurbitane triterpenoids'}, {'Friedelane triterpenoids', 'Cucurbitane triterpenoids', 'Ansa macrolides'}]]
df = pd.DataFrame(data, columns=["Species", "Reference", "To_check"])
df

What I want to get is:

 data = [['Aa', {'Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines','Fatty alcohols', 'Agarofuran sesquiterpenoids', 'Phenylalanine-derived alkaloids', 'Pyridine alkaloids'}, {'Luis', 'Polyamines'}], ['Bb', {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids'}, {'Agarofuran sesquiterpenoids', 'Lupane triterpenoids', 'Stigmastane steroids', 'Miscellaneous meroterpenoids'}, {'Miscellaneous meroterpenoids'}], ['Cc', {'Pyridine alkaloids'}, {'Luis', 'Pyridine alkaloids'}, {'Luis'}], ['Dd', {'Luis', 'Polyamines'}, {'Marco'}, {'Marco'}], ['Ee', {'Friedelane triterpenoids', 'Cucurbitane triterpenoids'}, {'Friedelane triterpenoids', 'Cucurbitane triterpenoids', 'Ansa macrolides'}, {'Ansa macrolides'}]]

df_out = pd.DataFrame(data, columns=["Species", "Reference", "To_check", 'New']) df_out

So far, what I tried was this but is not exactly what I want, and here I ask your help...

df_diff = df[~df['To_check'].isin(df['Reference'])]
df_diff

This is giving me a reduced dataframe with the rows having a difference, if I try to create a new column with de difference I got an error...

df['New'] = df[~df['To_check'].isin(df['Reference'])]
ValueError: Wrong number of items passed 3, placement implies 1

So far, I used sets to contain the strings in the columns, but I guess that could be the same if I use lists.

So, how can I get that new column with the results? and I'm also wondering if isin() do the comparison row wisely, or is there another method more appropriate.

I NEED TO KEEP THE DATAFRAME AS IT IS... just add the new column

Merci!!

CodePudding user response:

Use:

df["New"] = df["To_check"] - df["Reference"]
print(df)

Output

  Species     Reference      To_check        New
0      Aa  {B, A, C, D}  {A, C, D, E}        {E}
1      Bb  {B, A, C, D}  {B, A, C, D}         {}
2      Cc  {F, A, G, E}  {B, A, D, E}     {B, D}
3      Dd        {A, G}  {A, C, D, E}  {D, C, E}
4      Ee        {D, C}        {D, E}        {E}

The expression:

df["To_check"] - df["Reference"]

applies the element-wise difference between values of df["To_check"] and df["Reference"]. Something similar to:

# notice that this could be an alternative solution
df["New"] = [check - reference for check, reference in zip(df["To_check"], df["Reference"])]

The difference between set values is (from the documentation):

Return a new set with elements in the set that are not in the others.

  • Related