Searching for strings in lists inside Pandas DataFrame

Time:11-26

I'm trying to search for strings within lists that are stored in a column of a pandas DataFrame. Here is an example:

       userAuthor     hashtagsMessage
post_1    nytimes            [#Emmys]
post_2        TMZ                  []
post_3     Forbes        [#BTSatUNGA]
post_4    nytimes            [#Emmys]
post_5     Forbes  [#BTS, #BTSatUNGA]

The column that holds the lists is 'hashtagsMessage'. I've tried the usual string-search methods, but none of them worked.

If I wanted an exact match for '#BTS' in a column of plain strings, I could use something like:

df['hashtagsMessage'].str.contains("#BTS", case=False)

or

df['hashtagsMessage']=="#BTS" 

or similar. Unfortunately, these approaches do not work for lists. I suppose I need an extra step to look inside each list while searching the DataFrame, but I'm not sure how to do that.

Any help is greatly appreciated!

CodePudding user response:

Use map or apply:

>>> df['hashtagsMessage'].map(lambda x: '#BTS' in x)

post_1    False
post_2    False
post_3    False
post_4    False
post_5     True
Name: hashtagsMessage, dtype: bool
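As a minimal sketch of how that boolean result might be used, the snippet below rebuilds the DataFrame from the question and filters the rows with the mask. The variable names (mask, mask_ci, matches) are illustrative, and the case-insensitive variant is an assumption about what the case=False in the question was meant to achieve:

```python
import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame(
    {
        "userAuthor": ["nytimes", "TMZ", "Forbes", "nytimes", "Forbes"],
        "hashtagsMessage": [
            ["#Emmys"],
            [],
            ["#BTSatUNGA"],
            ["#Emmys"],
            ["#BTS", "#BTSatUNGA"],
        ],
    },
    index=["post_1", "post_2", "post_3", "post_4", "post_5"],
)

# Exact, case-sensitive membership test on each list.
mask = df["hashtagsMessage"].map(lambda tags: "#BTS" in tags)

# Case-insensitive variant: lower-case every tag before comparing.
mask_ci = df["hashtagsMessage"].map(
    lambda tags: "#bts" in [t.lower() for t in tags]
)

# Use the mask to keep only the matching rows.
matches = df[mask]
print(matches)
```

Because the test is list membership rather than a substring search, '#BTSatUNGA' on its own does not count as a match.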

Update

A more vectorized approach using explode:

>>> df.loc[df['hashtagsMessage'].explode().eq('#BTS').loc[lambda x: x].index]

       userAuthor     hashtagsMessage
post_5     Forbes  [#BTS, #BTSatUNGA]
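The explode approach can also be reduced back to one boolean per original row, which avoids repeated rows if a list ever contained the same tag twice. This is only a sketch of one way to do it; the groupby step and the names exploded and mask are not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "userAuthor": ["nytimes", "TMZ", "Forbes", "nytimes", "Forbes"],
        "hashtagsMessage": [
            ["#Emmys"],
            [],
            ["#BTSatUNGA"],
            ["#Emmys"],
            ["#BTS", "#BTSatUNGA"],
        ],
    },
    index=["post_1", "post_2", "post_3", "post_4", "post_5"],
)

# One row per tag; the original index label is repeated for each tag,
# and an empty list becomes a single NaN row.
exploded = df["hashtagsMessage"].explode()

# Compare every tag, then collapse back to one value per original row:
# True if any tag in that row equals '#BTS'.
mask = exploded.eq("#BTS").groupby(level=0, sort=False).any()

print(df[mask])
```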

CodePudding user response:

Search using a raw string pattern.

If the column holds strings rather than actual lists, use:

df['hashtagsMessage'].str.contains(r'#BTS')

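If the column holds actual lists, convert them to strings first (note that this is a substring search, so '#BTSatUNGA' also matches):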

df['hashtagsMessage'].astype(str).str.contains(r'#BTS')
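To see how the string-conversion approach compares with an exact membership test, here is a small sketch on a Series shaped like the column in the question. The names tags, substring_mask and exact_mask are illustrative only:

```python
import pandas as pd

tags = pd.Series([["#Emmys"], [], ["#BTSatUNGA"], ["#BTS", "#BTSatUNGA"]])

# astype(str) turns each list into its text form, e.g. "['#BTSatUNGA']",
# so str.contains performs a substring search on that text.
substring_mask = tags.astype(str).str.contains(r"#BTS")

# Exact match: the tag must be an element of the list.
exact_mask = tags.map(lambda lst: "#BTS" in lst)

print(substring_mask.tolist())  # [False, False, True, True]
print(exact_mask.tolist())      # [False, False, False, True]
```

The third row differs between the two masks, since '#BTSatUNGA' contains '#BTS' as a substring but is not equal to it.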

CodePudding user response:

You could use a simple anonymous function with a generator expression and any(), e.g.:

Edit: I originally presumed you wanted any tag containing '#BTS'; I have since edited the answer to find only exact matches :)

In [10]: df = pd.DataFrame({'hashtagsMessage':[
                            [], ["#BTSatUNGA"],
                            ["#Emmys"], ['#BTS', '#BTSatUNGA']]})

In [18]: df['hashtagsMessage'].apply(lambda lst: any(s == "#BTS"
                                                     for s in lst))
Out[18]: 
0    False
1    False
2    False
3     True
Name: hashtagsMessage, dtype: bool