I'm trying to get an output of 3 headers.
host | url | URL Count
My code is as follows:
import pandas as pd
import numpy as np
data = {'host': ['1.1.1.1', '192.168.1.1', '192.168.1.1', '192.168.1.1', '172.16.1.2', '192.168.1.1', '172.16.1.2', '10.2.3.4', '10.0.0.1', '10.0.0.1'],
'url': ['capps.test.com', 'attilab-admin.test.com', 'emea-solutions-admin.test.com', 'ilab-admin.test.com', 'learning.test.com', 'mktextfw.test.com', 'sandbox.learning.test.com', 'mynetwork.test.com', 'www.letsdoit.com', 'www.mysite.com']}
df = pd.DataFrame(data)
df['URL Count'] = df.groupby('host')['url'].transform('count')
df.pivot(index='host', columns='url')
df2 = df.set_index(['host', 'url'])
df.groupby(['host', 'url']).size()
stacked = pd.DataFrame(df2.stack())
print(stacked)
The output looks like this:
I'm looking to group all URLs to the IP addresses they are associated to and move the URL Count to a column.
The output I'm going for is this:
CodePudding user response:
import pandas as pd
data = {'host': ['1.1.1.1', '192.168.1.1', '192.168.1.1', '192.168.1.1', '172.16.1.2', '192.168.1.1', '172.16.1.2', '10.2.3.4', '10.0.0.1', '10.0.0.1'], 'url': ['capps.test.com', 'attilab-admin.test.com', 'emea-solutions-admin.test.com', 'ilab-admin.test.com', 'learning.test.com', 'mktextfw.test.com', 'sandbox.learning.test.com', 'mynetwork.test.com', 'www.letsdoit.com', 'www.mysite.com']}
df = pd.DataFrame(data)
out = df.pivot_table(index='host', values='url', aggfunc=[list, 'count'], margins=True)
out.columns = ['urls', 'count']
out.loc['All', 'urls'] = ''
out = out.explode('urls').set_index('urls', append=True)
out = pd.concat([out.iloc[:-1].sort_values('count', ascending=False), out.iloc[-1:]])
print(out)
Output:
count
host urls
192.168.1.1 attilab-admin.test.com 4
emea-solutions-admin.test.com 4
ilab-admin.test.com 4
mktextfw.test.com 4
10.0.0.1 www.letsdoit.com 2
www.mysite.com 2
172.16.1.2 learning.test.com 2
sandbox.learning.test.com 2
1.1.1.1 capps.test.com 1
10.2.3.4 mynetwork.test.com 1
All 10
CodePudding user response:
Here is two propositions :
out1 = (df.assign(URL_Count = df.groupby('host')['url'].transform('count'))
.set_index('host')
.groupby(level=0)
.apply(lambda x: x.reset_index(drop=True))
)
Or,
out2 = (df.assign(URL_Count = df.groupby('host')['url'].transform('count'))
.groupby(['host', 'url']).sum()
)
# Outputs :
print(out1)
url URL_Count
host
1.1.1.1 0 capps.test.com 1
10.0.0.1 0 www.letsdoit.com 2
1 www.mysite.com 2
10.2.3.4 0 mynetwork.test.com 1
172.16.1.2 0 learning.test.com 2
1 sandbox.learning.test.com 2
192.168.1.1 0 attilab-admin.test.com 4
1 emea-solutions-admin.test.com 4
2 ilab-admin.test.com 4
3 mktextfw.test.com 4
print(out2)
URL_Count
host url
1.1.1.1 capps.test.com 1
10.0.0.1 www.letsdoit.com 2
www.mysite.com 2
10.2.3.4 mynetwork.test.com 1
172.16.1.2 learning.test.com 2
sandbox.learning.test.com 2
192.168.1.1 attilab-admin.test.com 4
emea-solutions-admin.test.com 4
ilab-admin.test.com 4
mktextfw.test.com 4
CodePudding user response:
df['URL Count'] = df.groupby('host')['url'].transform('count')
df.sort_values(['host','url'] , inplace=True)
host url URL Count
0 1.1.1.1 capps.test.com 1
8 10.0.0.1 www.letsdoit.com 2
9 10.0.0.1 www.mysite.com 2
7 10.2.3.4 mynetwork.test.com 1
4 172.16.1.2 learning.test.com 2
6 172.16.1.2 sandbox.learning.test.com 2
1 192.168.1.1 attilab-admin.test.com 4
2 192.168.1.1 emea-solutions-admin.test.com 4
3 192.168.1.1 ilab-admin.test.com 4
5 192.168.1.1 mktextfw.test.com 4
in case you're looking for the values be blank instead of repeating
df['host']=df['host'].mask(df['host'].eq(df['host'].shift(1)),'')
df
host url URL Count
0 1.1.1.1 capps.test.com 1
8 10.0.0.1 www.letsdoit.com 2
9 www.mysite.com 2
7 10.2.3.4 mynetwork.test.com 1
4 172.16.1.2 learning.test.com 2
6 sandbox.learning.test.com 2
1 192.168.1.1 attilab-admin.test.com 4
2 emea-solutions-admin.test.com 4
3 ilab-admin.test.com 4
5 mktextfw.test.com 4