There are a few different ways of using regex to match hashtags that I've been able to find:
(#[a-z0-9_]+)
(#+[a-zA-Z0-9(_)]{1,})
I have some data where there might be too many hashtags or @ symbols present. Put simply: if a string contains more than 5 hashtags, I'd like to drop that row in pandas. I thought it wouldn't be that hard, maybe something like (#[A-z0-9_]+){5,}
but that doesn't work. Is this possible with regex?
CodePudding user response:
You can try pandas.Series.str.count
to count the occurrences of a pattern in each string of the Series, then keep only the rows with at most 5 matches.
out = df[df['col'].str.count('#[a-z0-9_]+').le(5)]
import pandas as pd
df = pd.DataFrame({'hashtag': ['5', '6', '7', '8', '9', '#1#2#3#4#5#6']})
out = df[df['hashtag'].str.count('#[a-z0-9_]+').le(5)]
print(out)
hashtag
0 5
1 6
2 7
3 8
4 9
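As for why the repeated-group pattern from the question fails: a quantifier on the group requires the hashtags to sit directly next to each other, so any text between them breaks the repetition. If you'd rather stay in pure regex, one possible sketch is to allow arbitrary characters between repetitions and test with `str.contains`. The column name `col` and the sample rows here are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; 'col' is an assumed column name.
df = pd.DataFrame({'col': ['#a #b #c #d #e',
                           '#a x #b y #c #d #e #f',
                           'no tags']})

# Matches when at least 6 hashtags occur, with any text in between.
# The non-capturing group (?:...) avoids the pandas warning about match groups.
too_many = r'(?:#[a-z0-9_]+.*?){6}'

# Keep only the rows that do NOT contain 6 or more hashtags.
out = df[~df['col'].str.contains(too_many, regex=True)]
print(out)
```

Note that `.` does not match newlines by default, so for multi-line strings the `str.count` approach is probably the safer choice.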
CodePudding user response:
You can count the matches per row and use the result for boolean indexing:
N = 5
df[df['col'].str.count('#[a-z0-9_]+').le(N)]
example:
# input
col
0 #abc blah #def #ghi blah #jk_1 blah #lmn
1 #abc blah #def #ghi blah #jk_1 blah #lmn #opq
2 #abc #def
# output
col
0 #abc blah #def #ghi blah #jk_1 blah #lmn
2 #abc #def
Used input:
df = pd.DataFrame({'col': ['#abc blah #def #ghi blah #jk_1 blah #lmn ',
'#abc blah #def #ghi blah #jk_1 blah #lmn #opq',
'#abc #def']})
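Since the question also mentions @ symbols, the same idea might be extended to count both kinds of tag. This is only a sketch: the sample rows are invented, `col` is an assumed column name, and it assumes the limit `N` applies to hashtags and mentions separately. It also passes `re.IGNORECASE`, because `[a-z0-9_]` on its own would not match a tag that starts with an uppercase letter:

```python
import re
import pandas as pd

N = 5  # maximum number of tags of each kind allowed per row

# Hypothetical sample data; 'col' is an assumed column name.
df = pd.DataFrame({'col': ['#Abc @x #def',
                           '@a @b @c @d @e @f',
                           '#a #b #c #d #e #f @x']})

# Count hashtags and mentions separately, ignoring case.
n_hash = df['col'].str.count(r'#[a-z0-9_]+', flags=re.IGNORECASE)
n_at = df['col'].str.count(r'@[a-z0-9_]+', flags=re.IGNORECASE)

# Keep rows where neither count exceeds N.
out = df[n_hash.le(N) & n_at.le(N)]
print(out)
```

If the limit should instead apply to the combined total, a single pattern such as `[#@][a-z0-9_]+` could be counted once.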