Pandas extract word from link domain

Time:10-27

I have a DataFrame:

import pandas as pd    
d = {'domain': ['linkedin.com','aumniversal.tumblr.com','plasticdrea.ms','linkedin.com','s-lw.tumblr.com','newsonline.media','creshendo.co.vu','deadly-skz-gods-cb.tumblr.com','deo.progr.am']}
df = pd.DataFrame(d)
df

I want to extract the part of each domain that comes just before the last dot-separated part (for example, the part before .com, although not every domain ends in .com). The result should be:

    domain                            words
0   linkedin.com                    linkedin
1   aumniversal.tumblr.com          tumblr
2   plasticdrea.ms                  plasticdrea
3   linkedin.com                    linkedin
4   s-lw.tumblr.com                 tumblr
5   newsonline.media                newsonline
6   creshendo.co.vu                 co
7   deadly-skz-gods-cb.tumblr.com   tumblr
8   deo.progr.am                    progr

CodePudding user response:

Use str.extract

df['words'] = df['domain'].str.extract(r'([^.]+)\.[^.]*$')

output:

                          domain        words
0                   linkedin.com     linkedin
1         aumniversal.tumblr.com       tumblr
2                 plasticdrea.ms  plasticdrea
3                   linkedin.com     linkedin
4                s-lw.tumblr.com       tumblr
5               newsonline.media   newsonline
6                creshendo.co.vu           co
7  deadly-skz-gods-cb.tumblr.com       tumblr
8                   deo.progr.am        progr

regex demo

([^.]+)   # capture word
\.[^.]*   # followed by .xxx
$         # and end of line
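
To check the pattern outside pandas, here is a minimal sketch using Python's built-in re module. The helper name second_to_last_label and the extra 'localhost' input are assumptions added for illustration; they are not part of the original question.

```python
import re

# Same pattern as in the answer: one or more non-dot characters,
# then a literal dot, then a final run of non-dot characters at the end.
PATTERN = re.compile(r'([^.]+)\.[^.]*$')

def second_to_last_label(domain):
    """Return the label before the last dot, or None if nothing matches."""
    match = PATTERN.search(domain)
    return match.group(1) if match else None

# 'localhost' is a hypothetical input with no dot at all.
domains = ['linkedin.com', 'aumniversal.tumblr.com', 'creshendo.co.vu', 'localhost']
results = [second_to_last_label(d) for d in domains]
print(results)
```

A value without any dot does not match the pattern, so the helper returns None for it. In pandas, str.extract leaves such rows as missing values.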

CodePudding user response:

Use Series.str.split and select the second-to-last value by indexing:

df['words'] = df['domain'].str.split('.').str[-2]
print(df)
                          domain        words
0                   linkedin.com     linkedin
1         aumniversal.tumblr.com       tumblr
2                 plasticdrea.ms  plasticdrea
3                   linkedin.com     linkedin
4                s-lw.tumblr.com       tumblr
5               newsonline.media   newsonline
6                creshendo.co.vu           co
7  deadly-skz-gods-cb.tumblr.com       tumblr
8                   deo.progr.am        progr
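
A possible variation, sketched here as an assumption rather than as part of the answer above: Series.str.rsplit with n=2 splits from the right only as often as needed, which avoids cutting long subdomain chains into many pieces. The 'localhost' row is a hypothetical input added to show what happens when there is no dot.

```python
import pandas as pd

# Small sample; 'localhost' is a hypothetical single-label host.
s = pd.Series(['linkedin.com', 'deadly-skz-gods-cb.tumblr.com',
               'deo.progr.am', 'localhost'])

# Split from the right at most twice, then take the second-to-last piece.
words = s.str.rsplit('.', n=2).str[-2]
print(words.tolist())
```

For a value with no dot the list has only one element, so .str[-2] yields a missing value (NaN) instead of raising an error.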