I have a dataframe:
import pandas as pd
d = {'domain': ['linkedin.com','aumniversal.tumblr.com','plasticdrea.ms','linkedin.com','s-lw.tumblr.com','newsonline.media','creshendo.co.vu','deadly-skz-gods-cb.tumblr.com','deo.progr.am']}
df = pd.DataFrame(d)
df
I want to extract the word immediately before the last dot-separated part of each domain (for example, the part before .com, although not every domain ends in .com). The expected result is:
   domain                         words
0  linkedin.com                   linkedin
1  aumniversal.tumblr.com         tumblr
2  plasticdrea.ms                 plasticdrea
3  linkedin.com                   linkedin
4  s-lw.tumblr.com                tumblr
5  newsonline.media               newsonline
6  creshendo.co.vu                co
7  deadly-skz-gods-cb.tumblr.com  tumblr
8  deo.progr.am                   progr
CodePudding user response:
Use str.extract:
df['words'] = df['domain'].str.extract(r'([^.]+)\.[^.]*$')
Output:
   domain                         words
0  linkedin.com                   linkedin
1  aumniversal.tumblr.com         tumblr
2  plasticdrea.ms                 plasticdrea
3  linkedin.com                   linkedin
4  s-lw.tumblr.com                tumblr
5  newsonline.media               newsonline
6  creshendo.co.vu                co
7  deadly-skz-gods-cb.tumblr.com  tumblr
8  deo.progr.am                   progr
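How the regex works:
([^.]+)   # capture the word: one or more characters that are not a dot
\.[^.]*   # followed by a literal dot and the final part (.xxx)
$         # anchored at the end of the string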
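As a rough check of the pattern above, here is a minimal sketch on a small sample. The value 'localhost' is a made-up edge case that is not in the question's data; it is there only to show that a string without a dot does not match and yields a missing value. Passing expand=False is an optional choice that returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Small sample; 'localhost' is a hypothetical edge case with no dot at all
domains = pd.Series(['linkedin.com', 'creshendo.co.vu', 'localhost'])

# expand=False returns a Series rather than a one-column DataFrame
words = domains.str.extract(r'([^.]+)\.[^.]*$', expand=False)

# Rows that do not match the pattern come back as missing values
print(words)
```

If missing values are not wanted, one option would be to follow up with fillna and a default of your choosing.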
CodePudding user response:
Use Series.str.split and select the second-to-last element by indexing:
df['words'] = df['domain'].str.split('.').str[-2]
print(df)
   domain                         words
0  linkedin.com                   linkedin
1  aumniversal.tumblr.com         tumblr
2  plasticdrea.ms                 plasticdrea
3  linkedin.com                   linkedin
4  s-lw.tumblr.com                tumblr
5  newsonline.media               newsonline
6  creshendo.co.vu                co
7  deadly-skz-gods-cb.tumblr.com  tumblr
8  deo.progr.am                   progr
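A possible variation on the split approach, sketched here as an alternative rather than as part of the answer above: Series.str.rsplit splits from the right, and n=2 limits it to two splits, so only the last two parts are separated from the rest of the string. The three-row frame below is a shortened sample of the question's data.

```python
import pandas as pd

# Shortened sample of the domains from the question
df = pd.DataFrame({'domain': ['linkedin.com', 's-lw.tumblr.com', 'creshendo.co.vu']})

# rsplit works from the right; n=2 performs at most two splits,
# so the second-to-last element is the word before the final part
df['words'] = df['domain'].str.rsplit('.', n=2).str[-2]
print(df)
```

The result should match the plain split version; the difference is only that long domains are not broken into more pieces than needed.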