Capitalizing every first word after a period in a Pandas column

Time:07-29

I'm trying to capitalize the first letter (and ONLY the first one) of a new sentence in some body text stored in a Pandas DF.

Example: my dataframe has a Description column which may contain text like:

This product has several different features. it is also VERY cost effective. it is one of my favorite products.

I want my result to look like:

This product has several different features. It is also very cost effective. It is one of my favorite products.

.capitalize() doesn't work for me because it only capitalizes the very first letter of the text and leaves later sentences in the same body starting in lowercase (that is, whatever comes after a dot and a space, ". ").
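A small sketch of that behaviour on the example text, using plain str.capitalize() (in pandas, Series.str.capitalize() does the same thing for each element):

```python
text = ("This product has several different features. "
        "it is also VERY cost effective. it is one of my favorite products.")

# str.capitalize() uppercases only the first character and lowercases everything else,
# so the second and third sentences still start with a lowercase letter.
result = text.capitalize()
print(result)
```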

Any thoughts on how I can achieve this without iterating through the rows manually?

Thanks for your time,

CodePudding user response:

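re.finditer returns all the matches of a regex (in our case, the first letter of each sentence),

and you can lowercase the whole string before that.

Example (strings are immutable, so the characters are edited in a list and joined back):

import re
mystring = "SOOOme wEirdly capiTalised STRINg. Followed By CHARACTERS"
mystring = mystring.lower()
chars = list(mystring)
# Match the first letter of the string and every letter that follows ". "
for match in re.finditer(r"(?:^|\.\s+)([a-z])", mystring):
    chars[match.start(1)] = chars[match.start(1)].upper()
mystring = "".join(chars)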
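The same lowercase-then-uppercase idea can also be written as a reusable function with re.sub. This is only a sketch: the name capitalize_sentences is made up for illustration, and the pattern assumes sentences are separated by a period plus whitespace and start with an ASCII letter.

```python
import re

def capitalize_sentences(text: str) -> str:
    """Lowercase the text, then uppercase the first letter of each sentence."""
    lowered = text.lower()
    # Match a letter at the very start, or one that follows a period plus whitespace.
    return re.sub(r"(^|\.\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(),
                  lowered)

print(capitalize_sentences("SOOOme wEirdly capiTalised STRINg. Followed By CHARACTERS"))
```

On a DataFrame this could be applied with df["Description"].apply(capitalize_sentences), which avoids writing the row loop yourself.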

CodePudding user response:

Assuming that all your sentences are separated by a period and a space (". "), you can use split and join together with capitalize:

import pandas as pd
data = {'index' :[1, 2], "description": ["This product has several different features. it is also VERY cost effective. it is one of my favorite products.", "test sentence. another SENTENCE"]}
df = pd.DataFrame(data)

df["description"].apply(lambda x: ". ".join([sentence.capitalize() for sentence in x.lower().split(". ")]))

If you want to cover more complex cases, you can use the nltk or spaCy tokenizers to split the sentences.
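As a middle ground before reaching for a tokenizer, a regex split can at least handle "!" and "?" as sentence endings. This is a rough sketch rather than a replacement for nltk or spaCy: the function name is hypothetical, and abbreviations such as "e.g. " would still be treated as sentence boundaries.

```python
import re

def capitalize_sentences_multi(text: str) -> str:
    # Split after '.', '!' or '?' followed by whitespace; the capturing group
    # keeps each separator in the result list.
    parts = re.split(r"([.!?]\s+)", text.lower())
    # Even indexes hold sentences, odd indexes hold the captured separators.
    return "".join(p.capitalize() if i % 2 == 0 else p
                   for i, p in enumerate(parts))

print(capitalize_sentences_multi("is it cheap? YES! it is. really"))
```

It plugs into the same pattern as above: df["description"].apply(capitalize_sentences_multi).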

CodePudding user response:

Use a regex to get the sentences and map capitalize on them using str.replace:

df['capitalize'] = df["description"].str.replace(r'[a-zA-Z][^.]+', lambda m: m.group().capitalize(), regex=True)

Example output:

              description              capitalize
0  Abc def. gHi JKL. mno.  Abc def. Ghi jkl. Mno.

Used input:

import pandas as pd
data = {"description": ["Abc def. gHi JKL. mno."]}
df = pd.DataFrame(data)

Regex:

[a-zA-Z]  # match a letter
[^.]+     # match one or more characters that are not a period
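To try the pattern outside pandas, here is a minimal sketch with the re module, assuming the pattern is a letter followed by one or more non-period characters ([a-zA-Z][^.]+). The helper name capitalize_match is made up for illustration.

```python
import re

pattern = re.compile(r"[a-zA-Z][^.]+")

def capitalize_match(text: str) -> str:
    # Each match is one sentence without its closing period;
    # capitalize() uppercases its first letter and lowercases the rest.
    return pattern.sub(lambda m: m.group().capitalize(), text)

result = capitalize_match("Abc def. gHi JKL. mno.")
print(result)  # Abc def. Ghi jkl. Mno.
```

One limitation of this pattern: it needs at least two characters before a period, so a one-letter sentence such as "a." is left untouched.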