Using all() to compare two dataframe columns with lists of strings

Time:04-22

I'm working with two dataframe columns that hold lists of strings, and I need to know, row by row, whether all elements of one list are contained in the other list.

Initially my values were strings, here's an example:

df1
       num
0     [10 2]
1     [120]
2     [2 5 8]
df2
       num
0     [10 2]
1     [60]
2     [2 5]

Then I used df1['num'].str.split() to get the elements in the string into a list:

df1
       num
0     [10, 2]
1     [120]
2     [2, 5, 8]

After that I tried using all(item in df1['num'].str.split() for item in df2['num'].str.split()) but it raises:

TypeError: unhashable type: 'list'

The desired output would be:

0     True
1     False
2     True

How can I do this?

CodePudding user response:

import pandas as pd

df1 = pd.DataFrame({
    'num': [['10', '2'], ['120'], ['2', '5', '8']]
})
df2 = pd.DataFrame({
    'num': [['10', '2'], ['60'], ['2', '5']]
})

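# Expand each list into its own column (one column per original row);
# shorter lists are padded with None.
df1_str = pd.DataFrame(df1['num'].tolist()).T
df2_str = pd.DataFrame(df2['num'].tolist()).T

# For every row, check that all non-missing items from df2 appear in df1.
lst = [all(df2_str[col].dropna().isin(df1_str[col])) for col in df2_str.columns]
print(lst)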

CodePudding user response:

You can use set operations here:

pd.Series([set(a)>=set(b) for a,b in zip(df1['num'], df2['num'])], index=df1.index)

output:

0     True
1    False
2     True
dtype: bool

Or to assign to one of the dataframes:

df1['test'] = [set(a)>=set(b) for a,b in zip(df1['num'], df2['num'])]

output:

         num   test
0    [10, 2]   True
1      [120]  False
2  [2, 5, 8]   True
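
To check that the set comparison matches the all() idea from the question, here is a minimal sketch that rebuilds the sample frames and runs both versions side by side. The helper name contains_all is hypothetical, introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'num': [['10', '2'], ['120'], ['2', '5', '8']]})
df2 = pd.DataFrame({'num': [['10', '2'], ['60'], ['2', '5']]})

def contains_all(container, items):
    # Hypothetical helper: True when every element of `items` is in `container`.
    return all(item in container for item in items)

# Set-based test: is the df1 list a superset of the df2 list?
by_set = [set(a) >= set(b) for a, b in zip(df1['num'], df2['num'])]

# Explicit all() test, applied to each pair of lists rather than to whole columns.
by_all = [contains_all(a, b) for a, b in zip(df1['num'], df2['num'])]

print(by_set)  # [True, False, True]
print(by_all)  # [True, False, True]
```

Both versions compare the lists pair by pair, which is what the original attempt was missing: it applied all() to the two columns as a whole instead of to each row.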

CodePudding user response:

Use the issubset method, converting the values from df2['num'] to sets:

df1['new'] = [set(b).issubset(a) for a,b in zip(df1['num'], df2['num'])]
print(df1)
         num    new
0    [10, 2]   True
1      [120]  False
2  [2, 5, 8]   True

If the values have not been split yet, modify the solution to:

df1['test'] = [set(b.split()).issubset(a.split()) for a,b in zip(df1['num'], df2['num'])]
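
As a rough end-to-end sketch of the unsplit case, assuming the raw values are plain space-separated strings such as '10 2' (the exact raw format is not shown in the question, so this is an assumption):

```python
import pandas as pd

# Assumed raw data: space-separated strings, before any str.split() call.
df1 = pd.DataFrame({'num': ['10 2', '120', '2 5 8']})
df2 = pd.DataFrame({'num': ['10 2', '60', '2 5']})

# Split each pair of strings on whitespace and test whether the
# df2 tokens form a subset of the df1 tokens.
result = pd.Series(
    [set(b.split()).issubset(a.split()) for a, b in zip(df1['num'], df2['num'])],
    index=df1.index,
)

print(result.tolist())  # [True, False, True]
```

The tokens are compared as strings, so '2' and '02' would count as different values; convert them to int first if numeric equality is what you need.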