I have a DataFrame with a column full of domains. For example, I have records like this:
common_name
www.amazon.com
amazon.com
subexample.amazon.com
walmart.en
walmart.uk
michigan.edu
I want to use Python to extract the part before the last period, but after the first period if there is more than one. The results would look like this:
common_name
amazon
amazon
amazon
walmart
walmart
michigan
I found some examples of this here, but they operate on a single string and extract everything before a certain character, not the text between two of them. A per-string operation may take a while to run, so I am wondering whether there is a pandas function that works on the whole DataFrame?
CodePudding user response:
This should work:
df['col'] = df['col'].str.rsplit('.', n=1).str[0].str.split('.').str[-1]
Output:
>>> df
col
0 common_name
1 amazon
2 amazon
3 amazon
4 walmart
5 walmart
6 michigan
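For reference, here is a minimal, self-contained sketch of the same chain run against the sample data from the question. The column name `common_name` comes from the question; the `name` column and the intermediate variable `without_tld` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"common_name": [
    "www.amazon.com",
    "amazon.com",
    "subexample.amazon.com",
    "walmart.en",
    "walmart.uk",
    "michigan.edu",
]})

# Step 1: split once from the right and keep the left part (drops the last label).
without_tld = df["common_name"].str.rsplit(".", n=1).str[0]

# Step 2: split what is left on every period and keep the last piece.
df["name"] = without_tld.str.split(".").str[-1]

result = df["name"].tolist()
print(result)
```

Note that this treats only the final label as the suffix, so a domain such as tkoutletstore.co.uk would give co rather than tkoutletstore.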
CodePudding user response:
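You could use pd.Series.apply along with a lambda function that returns the longest element after splitting (based on a comment on richardec's answer). Note that this picks the longest label, so subexample.amazon.com gives subexample rather than amazon: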
In [1]: import pandas as pd
In [2]: d = {
...: 'domains': [
...: 'common_name',
...: 'www.amazon.com',
...: 'amazon.com',
...: 'subexample.amazon.com',
...: 'walmart.en',
...: 'walmart.uk',
...: 'michigan.edu',
...: 'tkoutletstore.co.uk',
...: 'tillyandotto.com.au',
...: ]
...: }
...: df = pd.DataFrame(data=d)
...: df
Out[2]:
domains
0 common_name
1 www.amazon.com
2 amazon.com
3 subexample.amazon.com
4 walmart.en
5 walmart.uk
6 michigan.edu
7 tkoutletstore.co.uk
8 tillyandotto.com.au
In [3]: df['extracted'] = df['domains'].apply(lambda d: max(d.split('.'), key=len))
In [4]: df
Out[4]:
domains extracted
0 common_name common_name
1 www.amazon.com amazon
2 amazon.com amazon
3 subexample.amazon.com subexample
4 walmart.en walmart
5 walmart.uk walmart
6 michigan.edu michigan
7 tkoutletstore.co.uk tkoutletstore
8 tillyandotto.com.au tillyandotto
CodePudding user response:
Pandas won't make things any faster computation-wise. This regex might work for you:
s.str.extract(r'(\w+)(\.\w{2,3})+$')[0]
But a better solution would be: Extract domain from URL in python
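To see what such a pattern captures without involving pandas, here is a rough sketch using only the standard `re` module. It assumes the intended pattern is `(\w+)(\.\w{2,3})+$`, that is, a run of word characters followed by one or more 2-3 character suffixes at the end of the string. The helper name `extract_name` is made up for this example.

```python
import re

# Assumed pattern: a name, then one or more ".xx" / ".xxx" suffixes up to the end.
PATTERN = re.compile(r"(\w+)(\.\w{2,3})+$")


def extract_name(domain):
    """Return the first captured group, or None if the pattern does not match."""
    match = PATTERN.search(domain)
    return match.group(1) if match else None


domains = [
    "www.amazon.com",
    "amazon.com",
    "subexample.amazon.com",
    "walmart.en",
    "walmart.uk",
    "michigan.edu",
    "tkoutletstore.co.uk",
    "tillyandotto.com.au",
]

names = [extract_name(d) for d in domains]
print(names)
```

Because the pattern only looks at label lengths, it can still be fooled: for www.bbc.co.uk every label after www is 2-3 characters long, so www is what gets captured. A string with no period at all, such as common_name, does not match and yields None.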