How to count tokens in pandas without any white spaces


I have the following dataframe in Pandas:

    publish_date    headline_text
    20030219        aba decides against community broadcasting 
    20030219        act fire witnesses must be aware of defamation
    20030219        a g calls for infrastructure protection summit
    20030219        air nz staff in aust strike for pay rise
    20030219        air nz strike to affect australian travellers

I want to add one more column showing the count of tokens, excluding white spaces.

I am doing the following, but it gives me the count of tokens including white space.

nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.len()

CodePudding user response:

You could always remove the whitespace before taking the length:

>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.replace(r'\s', '', regex=True).str.len()
>>> nlp_df
   publish_date                                   headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39

Or subtract the number of whitespace characters from the total length:

>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.len() - nlp_df['headline_text'].str.count(r'\s')
>>> nlp_df
   publish_date                                   headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39

\s is the regex class that matches any whitespace character. From the re docs:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
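
For instance, \s also matches a non-breaking space unless the ASCII flag is set. A quick sketch with the standard re module (the non-breaking space here is just an illustrative input):

import re

text = 'air\u00a0nz'  # '\u00a0' is a non-breaking space

print(re.findall(r'\s', text))            # ['\xa0'] -- matched by default
print(re.findall(r'\s', text, re.ASCII))  # []       -- only [ \t\n\r\f\v] with re.ASCII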

Of course you can also use the non-whitespace class \S as suggested by @mozway, which is even simpler:

>>> nlp_df['count_of_tokens'] = nlp_df['headline_text'].str.count(r'\S')
>>> nlp_df
   publish_date                                   headline_text  count_of_tokens
0      20030219      aba decides against community broadcasting               38
1      20030219  act fire witnesses must be aware of defamation               39
2      20030219  a g calls for infrastructure protection summit               40
3      20030219        air nz staff in aust strike for pay rise               32
4      20030219   air nz strike to affect australian travellers               39
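
As a quick sanity check, the three variants give the same result; a minimal sketch on a one-row Series taken from the sample data:

import pandas as pd

s = pd.Series(['aba decides against community broadcasting'])

a = s.str.replace(r'\s', '', regex=True).str.len()  # strip whitespace, then measure
b = s.str.len() - s.str.count(r'\s')                # total length minus whitespace
c = s.str.count(r'\S')                              # count non-whitespace directly

print(a.iloc[0], b.iloc[0], c.iloc[0])  # 38 38 38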

CodePudding user response:

IIUC, you want to count the words? Or the non-space characters?

counting the words:

You can count the runs of non-space characters:

df['words'] = df['headline_text'].str.count(r'\S+')

or, split the string and get the list length:

df['words'] = df['headline_text'].str.split(r'\s+').apply(len)

or, count the separators and add 1:

df['words'] = df['headline_text'].str.count(r'\s+').add(1)

counting the letters (i.e., the non-space characters):

df['letters'] = df['headline_text'].str.count(r'\S')

output:

   publish_date                                   headline_text  words  letters
0      20030219      aba decides against community broadcasting      5       38
1      20030219  act fire witnesses must be aware of defamation      8       39
2      20030219  a g calls for infrastructure protection summit      7       40
3      20030219        air nz staff in aust strike for pay rise      9       32
4      20030219   air nz strike to affect australian travellers      7       39
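
Putting this answer together, a minimal self-contained sketch that reproduces the words and letters columns (the DataFrame construction is just to make the snippet runnable with the sample data):

import pandas as pd

df = pd.DataFrame({
    'publish_date': [20030219] * 5,
    'headline_text': [
        'aba decides against community broadcasting',
        'act fire witnesses must be aware of defamation',
        'a g calls for infrastructure protection summit',
        'air nz staff in aust strike for pay rise',
        'air nz strike to affect australian travellers',
    ],
})

# note: the split-based and separator-counting variants can be off by one
# if a headline has leading or trailing whitespace; counting r'\S+' is not.
df['words'] = df['headline_text'].str.count(r'\S+')   # runs of non-space chars
df['letters'] = df['headline_text'].str.count(r'\S')  # individual non-space chars

print(df)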
