How to remove characters after last period in df column python?

Time:12-23

I have a DataFrame with a column full of domain names. For example, I have records like this:

common_name
www.amazon.com
amazon.com 
subexample.amazon.com
walmart.en
walmart.uk
michigan.edu

I want to use Python to extract the part just before the last period, dropping anything that comes before an earlier period if there is one. The results would look like this:

common_name
amazon
amazon
amazon 
walmart
walmart
michigan

I found some examples of this here, but they operate on a single string and return everything before a given character, not the text between two of them. Running a string operation row by row may take a while, so I am wondering whether pandas has a function that works on the whole column at once?

CodePudding user response:

This should work:

df['col'] = df['col'].str.rsplit('.', n=1).str[0].str.split('.').str[-1]

Output:

>>> df
           col
0  common_name
1       amazon
2       amazon
3       amazon
4      walmart
5      walmart
6     michigan
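For anyone who wants to run this end to end, here is a minimal sketch of the same two-step split on the sample data from the question. The column name common_name comes from the question; the new column name and the str.strip() call (added because one sample value has a trailing space) are my own assumptions.

```python
import pandas as pd

# Sample data from the question (one value has a trailing space).
df = pd.DataFrame({
    "common_name": [
        "www.amazon.com",
        "amazon.com ",
        "subexample.amazon.com",
        "walmart.en",
        "walmart.uk",
        "michigan.edu",
    ]
})

# Step 1: drop everything after the last period.
without_suffix = df["common_name"].str.strip().str.rsplit(".", n=1).str[0]

# Step 2: of what remains, keep only the label after the last period.
# "name" is a hypothetical column name chosen for this sketch.
df["name"] = without_suffix.str.split(".").str[-1]

print(df)
```

Values with no period at all pass through unchanged, which is why the header string common_name survives in the output above.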

CodePudding user response:

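You could use pd.Series.apply along with a lambda function that returns the longest element after splitting (based on a comment on richardec's answer). Note that this is a heuristic: for subexample.amazon.com it returns subexample rather than amazon, as the output below shows.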

In [1]: import pandas as pd
In [2]: d = {
   ...:     'domains': [
   ...:         'common_name',
   ...:         'www.amazon.com',
   ...:         'amazon.com',
   ...:         'subexample.amazon.com',
   ...:         'walmart.en',
   ...:         'walmart.uk',
   ...:         'michigan.edu',
   ...:         'tkoutletstore.co.uk',
   ...:         'tillyandotto.com.au',
   ...:     ]
   ...: }
   ...: df = pd.DataFrame(data=d)
   ...: df
Out[2]: 
                 domains
0            common_name
1         www.amazon.com
2             amazon.com
3  subexample.amazon.com
4             walmart.en
5             walmart.uk
6           michigan.edu
7    tkoutletstore.co.uk
8    tillyandotto.com.au
In [3]: df['extracted'] = df['domains'].apply(lambda d: max(d.split('.'), key=len))

In [4]: df
Out[4]: 
                 domains      extracted
0            common_name    common_name
1         www.amazon.com         amazon
2             amazon.com         amazon
3  subexample.amazon.com     subexample
4             walmart.en        walmart
5             walmart.uk        walmart
6           michigan.edu       michigan
7    tkoutletstore.co.uk  tkoutletstore
8    tillyandotto.com.au   tillyandotto
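To see where the longest-label heuristic and a plain positional rule disagree, here is a small side-by-side sketch. The second_last column is my own addition for comparison and is not part of the answer above; neither rule is right for every row.

```python
import pandas as pd

domains = pd.Series([
    "www.amazon.com",
    "subexample.amazon.com",
    "tkoutletstore.co.uk",
])

# Heuristic from the answer: the longest label wins.
longest = domains.apply(lambda d: max(d.split("."), key=len))

# Positional rule (an assumption for comparison): the label just before
# the last period.
second_last = domains.str.split(".").str[-2]

comparison = pd.DataFrame({
    "domains": domains,
    "longest": longest,
    "second_last": second_last,
})
print(comparison)
```

The longest-label rule handles tkoutletstore.co.uk but picks subexample for the subdomain case, while the positional rule gets amazon right and returns co for the two-part suffix.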

CodePudding user response:

Pandas won't make things any faster computation-wise. This regex might work for you:

s.str.extract(r'(\w+)(\.\w{2,3})+$')[0]
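Here is a runnable sketch of that regex, assuming the pattern is meant to use + quantifiers (one or more word characters, followed by one or more short dotted suffixes at the end of the string). The sample Series s is built from values that appear elsewhere in this thread.

```python
import pandas as pd

s = pd.Series([
    "www.amazon.com",
    "amazon.com",
    "subexample.amazon.com",
    "walmart.uk",
    "tkoutletstore.co.uk",
    "common_name",
])

# Group 0: the label we want. Group 1: a 2-3 character suffix such as
# ".com" or ".uk", repeated one or more times up to the end of the string.
pattern = r"(\w+)(\.\w{2,3})+$"

# str.extract returns one column per capture group; column 0 is the label.
extracted = s.str.extract(pattern)[0]
print(extracted)
```

Rows that do not match the pattern, such as common_name, come back as NaN. The pattern can also misfire when a subdomain is itself only two or three characters long (for example www.abc.com would yield www), so treat it as a rough filter.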

But a better solution would be: Extract domain from URL in python
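As a rough illustration of why a suffix-aware approach is more robust, here is a minimal sketch that strips a known suffix before taking the last label. KNOWN_SUFFIXES and registered_name are hypothetical names made up for this example, and the suffix list is hand-picked from the sample data in this thread; real data would need a complete list such as the Public Suffix List.

```python
import pandas as pd

# Hypothetical, hand-picked suffixes for illustration only.
KNOWN_SUFFIXES = ("co.uk", "com.au", "com", "edu", "uk", "en")


def registered_name(domain: str) -> str:
    """Return the label immediately before the longest matching suffix."""
    domain = domain.strip().lower()
    # Try longer suffixes first so "co.uk" wins over "uk".
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if domain.endswith("." + suffix):
            remainder = domain[: -(len(suffix) + 1)]
            return remainder.rsplit(".", 1)[-1]
    # No known suffix: fall back to the label before the last period,
    # or the whole string if there is no period at all.
    return domain.rsplit(".", 2)[-2] if "." in domain else domain


domains = pd.Series([
    "www.amazon.com",
    "subexample.amazon.com",
    "michigan.edu",
    "tkoutletstore.co.uk",
    "tillyandotto.com.au",
    "common_name",
])
names = domains.apply(registered_name)
print(names)
```

This still runs row by row through apply, so it is not faster than the string methods above; the gain is in correctness for multi-part suffixes.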
