Regex for Hashtag but only return true if 5 or more hashtags are in the String

Time:04-22

There are a few different ways of writing a regex for hashtags that I've been able to find:

  • (#[a-z0-9_]+)
  • (#+[a-zA-Z0-9(_)]{1,})

I have some data where there might be too many hashtags or @ symbols present. Put simply: if there are more than 5 hashtags in a string, I'd like to drop that row in Pandas. I thought it wouldn't be that hard, maybe something like (#[A-z0-9_]+){5,}, but that doesn't work. Is this possible with regex?

CodePudding user response:

You can use pandas.Series.str.count to count the occurrences of a pattern in each string of the Series.

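out = df[df['col'].str.count('#[a-z0-9_]+').le(5)]

Example:

import pandas as pd

# The last row contains 6 hashtags, so it is dropped
df = pd.DataFrame({'hashtag': ['5', '6', '7', '8', '9', '#1#2#3#4#5#6']})

# Keep only rows with at most 5 hashtags
out = df[df['hashtag'].str.count('#[a-z0-9_]+').le(5)]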
print(out)

  hashtag
0       5
1       6
2       7
3       8
4       9
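
The question also mentions @ symbols. As a minimal sketch, assuming the limit applies to hashtags and mentions combined, the same str.count approach can count both with one character class. The column name 'text', the sample rows, and MAX_TAGS are made up for illustration; \w is used instead of [a-z0-9_] so that uppercase letters are matched too.

```python
import pandas as pd

# Hypothetical sample data; the column name 'text' is an assumption
df = pd.DataFrame({'text': ['#a #b @c', '#a #b #c @d @e @f', 'no tags here']})

MAX_TAGS = 5  # assumed threshold: keep rows with at most this many tags

# [#@]\w+ matches a '#' or '@' followed by one or more word characters
tag_counts = df['text'].str.count(r'[#@]\w+')

# Boolean indexing: keep rows whose combined tag count is within the limit
filtered = df[tag_counts.le(MAX_TAGS)]
print(filtered)
```

Here the second row has three hashtags and three mentions, so it is the only one dropped.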

CodePudding user response:

You can count the matches in each string and use the counts for boolean indexing:

N = 5  # maximum number of hashtags allowed per row
df[df['col'].str.count('#[a-z0-9_]+').le(N)]

Example:

# input
                                             col
0  #abc blah #def #ghi blah #jk_1 blah #lmn     
1  #abc blah #def #ghi blah #jk_1 blah #lmn #opq
2  #abc #def                                    

# output
                                         col
0  #abc blah #def #ghi blah #jk_1 blah #lmn 
2  #abc #def                                

Used input:

df = pd.DataFrame({'col': ['#abc blah #def #ghi blah #jk_1 blah #lmn ',
                           '#abc blah #def #ghi blah #jk_1 blah #lmn #opq',
                           '#abc #def']})
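
As for whether a single regex can do this: the attempt in the question, (#[A-z0-9_]+){5,}, only matches hashtags that sit directly next to each other, so any text between them breaks the repetition. A rough sketch of a pattern that allows text in between is shown below, using only the standard re module. LIMIT and has_too_many_hashtags are hypothetical names chosen for this illustration, and \w replaces [A-z0-9_] because the range A-z also covers a few punctuation characters.

```python
import re

# Assumed threshold: a string is over the limit once it has more than 5 hashtags
LIMIT = 5

# '#\w' is a '#' followed by a word character; '.*?' lazily skips any text
# up to the next hashtag. Repeating the group LIMIT + 1 times succeeds only
# if the string contains at least LIMIT + 1 hashtags.
too_many = re.compile(r'(?:#\w.*?){%d}' % (LIMIT + 1), re.DOTALL)

def has_too_many_hashtags(text):
    """Return True if text contains more than LIMIT hashtags."""
    return too_many.search(text) is not None

# 5 hashtags: within the limit
print(has_too_many_hashtags('#abc blah #def #ghi blah #jk_1 blah #lmn'))
# 6 hashtags: over the limit
print(has_too_many_hashtags('#abc blah #def #ghi blah #jk_1 blah #lmn #opq'))
```

The same pattern string should also work with pandas.Series.str.contains, negated with ~ to drop the matching rows, although counting with str.count is usually easier to read and to adjust.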