Home > Software design >  Stack and groupby
Stack and groupby

Time:09-27

I'm trying to get an output of 3 headers.

host | url | URL Count

My code is as follows:

import pandas as pd
import numpy as np

data = {'host': ['1.1.1.1', '192.168.1.1', '192.168.1.1', '192.168.1.1', '172.16.1.2', '192.168.1.1', '172.16.1.2', '10.2.3.4', '10.0.0.1', '10.0.0.1'],
        'url': ['capps.test.com', 'attilab-admin.test.com', 'emea-solutions-admin.test.com', 'ilab-admin.test.com', 'learning.test.com', 'mktextfw.test.com', 'sandbox.learning.test.com', 'mynetwork.test.com', 'www.letsdoit.com', 'www.mysite.com']}

df = pd.DataFrame(data)
df['URL Count'] = df.groupby('host')['url'].transform('count')
df.pivot(index='host', columns='url')
df2 = df.set_index(['host', 'url'])
df.groupby(['host', 'url']).size()
stacked = pd.DataFrame(df2.stack())
print(stacked)

The output looks like this:

Output

I'm looking to group all URLs to the IP addresses they are associated to and move the URL Count to a column.

The output I'm going for is this: Desired output

CodePudding user response:

import pandas as pd
data = {'host': ['1.1.1.1', '192.168.1.1', '192.168.1.1', '192.168.1.1', '172.16.1.2', '192.168.1.1', '172.16.1.2', '10.2.3.4', '10.0.0.1', '10.0.0.1'], 'url': ['capps.test.com', 'attilab-admin.test.com', 'emea-solutions-admin.test.com', 'ilab-admin.test.com', 'learning.test.com', 'mktextfw.test.com', 'sandbox.learning.test.com', 'mynetwork.test.com', 'www.letsdoit.com', 'www.mysite.com']}
df = pd.DataFrame(data)
out = df.pivot_table(index='host', values='url', aggfunc=[list, 'count'], margins=True)
out.columns = ['urls', 'count']
out.loc['All', 'urls'] = ''
out = out.explode('urls').set_index('urls', append=True)
out = pd.concat([out.iloc[:-1].sort_values('count', ascending=False), out.iloc[-1:]])
print(out)

Output:

                                           count
host        urls
192.168.1.1 attilab-admin.test.com             4
            emea-solutions-admin.test.com      4
            ilab-admin.test.com                4
            mktextfw.test.com                  4
10.0.0.1    www.letsdoit.com                   2
            www.mysite.com                     2
172.16.1.2  learning.test.com                  2
            sandbox.learning.test.com          2
1.1.1.1     capps.test.com                     1
10.2.3.4    mynetwork.test.com                 1
All                                           10

CodePudding user response:

Here is two propositions :

out1 = (df.assign(URL_Count = df.groupby('host')['url'].transform('count'))
          .set_index('host')
          .groupby(level=0)
          .apply(lambda x: x.reset_index(drop=True))
        )

Or,

out2 = (df.assign(URL_Count = df.groupby('host')['url'].transform('count'))
          .groupby(['host', 'url']).sum()
        )

# Outputs :

print(out1)
                                         url  URL_Count
host                                                   
1.1.1.1     0                 capps.test.com          1
10.0.0.1    0               www.letsdoit.com          2
            1                 www.mysite.com          2
10.2.3.4    0             mynetwork.test.com          1
172.16.1.2  0              learning.test.com          2
            1      sandbox.learning.test.com          2
192.168.1.1 0         attilab-admin.test.com          4
            1  emea-solutions-admin.test.com          4
            2            ilab-admin.test.com          4
            3              mktextfw.test.com          4

print(out2)
                                           URL_Count
host        url                                     
1.1.1.1     capps.test.com                         1
10.0.0.1    www.letsdoit.com                       2
            www.mysite.com                         2
10.2.3.4    mynetwork.test.com                     1
172.16.1.2  learning.test.com                      2
            sandbox.learning.test.com              2
192.168.1.1 attilab-admin.test.com                 4
            emea-solutions-admin.test.com          4
            ilab-admin.test.com                    4
            mktextfw.test.com                      4

CodePudding user response:

df['URL Count'] = df.groupby('host')['url'].transform('count') 
df.sort_values(['host','url'] , inplace=True)
    host            url                      URL Count
0   1.1.1.1         capps.test.com                  1
8   10.0.0.1        www.letsdoit.com                2
9   10.0.0.1        www.mysite.com                  2
7   10.2.3.4        mynetwork.test.com              1
4   172.16.1.2      learning.test.com               2
6   172.16.1.2      sandbox.learning.test.com       2
1   192.168.1.1     attilab-admin.test.com          4
2   192.168.1.1     emea-solutions-admin.test.com   4
3   192.168.1.1     ilab-admin.test.com             4
5   192.168.1.1     mktextfw.test.com               4

in case you're looking for the values be blank instead of repeating

df['host']=df['host'].mask(df['host'].eq(df['host'].shift(1)),'')
df
    host        url                     URL Count
0   1.1.1.1     capps.test.com                  1
8   10.0.0.1    www.letsdoit.com                2
9               www.mysite.com                  2
7   10.2.3.4    mynetwork.test.com              1
4   172.16.1.2  learning.test.com               2
6               sandbox.learning.test.com       2
1   192.168.1.1 attilab-admin.test.com          4
2               emea-solutions-admin.test.com   4
3               ilab-admin.test.com             4
5               mktextfw.test.com               4
  • Related