I have the dataset below, which has some NaN values in column "a". I need to replace only those NaN values with the number of hashtags in the same row of column "b", counted with a regex. I need to do it in place because my dataset is very large.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0, np.nan, np.nan], 'b': ["#hello world", "#hello #world", "hello #world"]})
print(df)
The result should be:
df = pd.DataFrame({'a': [0, 2, 1], 'b': ["#hello world", "#hello #world", "hello #world"]})
print(df)
I already have the regex approach for a single string:
regex_hashtag = "#[a-zA-Z0-9_]+"
num_hashtags = len(re.findall(regex_hashtag, text))
How can I apply it to the DataFrame?
CodePudding user response:
Use str.count on the rows where "a" is NaN:
regex_hashtag = "#[a-zA-Z0-9_]+" # or r'#\w+'
m = df['a'].isna()
df.loc[m, 'a'] = df.loc[m, 'b'].str.count(regex_hashtag)
output:
a b
0 0 #hello world
1 2 #hello #world
2 1 hello #world
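For reference, here is the same approach as a self-contained sketch you can run end to end. It assumes the regex was meant to have a `+` quantifier (one or more word characters after `#`), which is what the expected counts imply. The final `astype(int)` cast is an optional addition, not part of the answer above: column "a" stays float64 after the assignment because it originally held NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0, np.nan, np.nan],
    "b": ["#hello world", "#hello #world", "hello #world"],
})

# Assumed pattern: "#" followed by one or more letters, digits or underscores
regex_hashtag = r"#[a-zA-Z0-9_]+"

# Boolean mask selecting only the rows where "a" is missing
m = df["a"].isna()

# Count regex matches in "b" for those rows and write the counts into "a";
# rows where "a" already has a value are left untouched
df.loc[m, "a"] = df.loc[m, "b"].str.count(regex_hashtag)

# Optional: "a" is still float64 here, so cast once no NaN remains
df["a"] = df["a"].astype(int)

counts = df["a"].tolist()
print(df)
```

Because the mask restricts both the read from "b" and the write to "a", the regex only runs on the rows that need filling, which helps on a large dataset.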