I have the following dataframe in pandas:
publish_date headline_text
20030219 aba decides against community broadcasting
20030219 act fire witnesses must be aware of defamation
20030219 a g calls for infrastructure protection summit
20030219 air nz staff in aust strike for pay rise
20030219 air nz strike to affect australian travellers
I want to add one more column with the count of tokens (characters), excluding whitespace.
I am doing the following, but it gives me the count of tokens including the whitespace:
nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.len()
CodePudding user response:
You could always remove the whitespace before taking the length:
>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.replace(r'\s', '', regex=True).str.len()
>>> nlp_df
publish_date headline_text count_of_tokens
0 20030219 aba decides against community broadcasting 38
1 20030219 act fire witnesses must be aware of defamation 39
2 20030219 a g calls for infrastructure protection summit 40
3 20030219 air nz staff in aust strike for pay rise 32
4 20030219 air nz strike to affect australian travellers 39
Or subtract the number of whitespace characters from the total length:
>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.len() - nlp_df['headline_text'].str.count(r'\s')
>>> nlp_df
publish_date headline_text count_of_tokens
0 20030219 aba decides against community broadcasting 38
1 20030219 act fire witnesses must be aware of defamation 39
2 20030219 a g calls for infrastructure protection summit 40
3 20030219 air nz staff in aust strike for pay rise 32
4 20030219 air nz strike to affect australian travellers 39
\s
is the regex character class that matches any whitespace character. See the re docs:
Matches Unicode whitespace characters (which includes
[ \t\n\r\f\v]
, and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v]
is matched.
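A small demonstration of that quoted behavior: by default \s matches Unicode whitespace such as the non-breaking space, while re.ASCII restricts it to the ASCII set.

```python
import re

s = "a\u00a0b"  # contains a non-breaking space (U+00A0)

# By default, \s matches any Unicode whitespace, including U+00A0
print(re.findall(r"\s", s))            # ['\xa0']

# With re.ASCII, \s only matches [ \t\n\r\f\v]
print(re.findall(r"\s", s, re.ASCII))  # []
```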
Of course you can also use the non-whitespace class \S
as suggested by @mozway, which is even simpler:
>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.count(r'\S')
>>> nlp_df
publish_date headline_text count_of_tokens
0 20030219 aba decides against community broadcasting 38
1 20030219 act fire witnesses must be aware of defamation 39
2 20030219 a g calls for infrastructure protection summit 40
3 20030219 air nz staff in aust strike for pay rise 32
4 20030219 air nz strike to affect australian travellers 39
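For reference, here is a minimal, self-contained sketch that rebuilds a slice of the frame from the values in the post and checks that all three approaches above agree:

```python
import pandas as pd

# Rebuild part of the example frame (values copied from the question)
nlp_df = pd.DataFrame({
    "publish_date": [20030219] * 2,
    "headline_text": [
        "aba decides against community broadcasting",
        "air nz staff in aust strike for pay rise",
    ],
})

# 1. Strip whitespace, then take the length
strip_then_len = nlp_df["headline_text"].str.replace(r"\s", "", regex=True).str.len()
# 2. Total length minus the number of whitespace characters
len_minus_spaces = nlp_df["headline_text"].str.len() - nlp_df["headline_text"].str.count(r"\s")
# 3. Count non-whitespace characters directly
count_non_space = nlp_df["headline_text"].str.count(r"\S")

print(strip_then_len.tolist())  # [38, 32] -- all three give the same result
```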
CodePudding user response:
IIUC, you want to count the words? Or the non-space characters?
counting the words:
You can count the runs of non-space characters:
df['words'] = df['headline_text'].str.count(r'\S+')
or, split the string and get the list length:
df['words'] = df['headline_text'].str.split('\s ').apply(len)
or, count the separators and add 1:
df['words'] = df['headline_text'].str.count('\s ').add(1)
counting the letters:
df['letters'] = df['headline_text'].str.count(r'\S')
output:
publish_date headline_text words letters
0 20030219 aba decides against community broadcasting 5 38
1 20030219 act fire witnesses must be aware of defamation 8 39
2 20030219 a g calls for infrastructure protection summit 7 40
3 20030219 air nz staff in aust strike for pay rise 9 32
4 20030219 air nz strike to affect australian travellers 7 39
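One caveat worth knowing when choosing between the three word-count variants: they agree on clean headlines, but differ on strings with leading or trailing whitespace, where the split and separators-plus-one approaches over-count. A small sketch with a hypothetical padded string added to the sample values:

```python
import pandas as pd

df = pd.DataFrame({
    "headline_text": [
        "aba decides against community broadcasting",
        "act fire witnesses must be aware of defamation",
        "  padded headline  ",  # edge case: leading/trailing whitespace
    ],
})

by_tokens = df["headline_text"].str.count(r"\S+")            # count non-space runs
by_split = df["headline_text"].str.split(r"\s+").apply(len)  # split, then list length
by_seps = df["headline_text"].str.count(r"\s+").add(1)       # separators + 1

print(by_tokens.tolist())  # [5, 8, 2]
print(by_split.tolist())   # [5, 8, 4] -- empty strings from edge whitespace inflate the count
print(by_seps.tolist())    # [5, 8, 4] -- leading/trailing runs are counted as separators
```

Counting r'\S+' directly is the most robust of the three for untrimmed input.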