What is a good way to categorize IP-addresses in pandas?

Time:01-24

I'm having a tough time coming up with a way to categorize IP addresses in a pandas DataFrame. Right now the IPV4_SRC_ADDR column has dtype object.

This is the composition of my dataset:

   IPV4_SRC_ADDR  L4_SRC_PORT  IPV4_DST_ADDR  L4_DST_PORT  PROTOCOL  L7_PROTO  IN_BYTES  OUT_BYTES  IN_PKTS  OUT_PKTS  TCP_FLAGS  FLOW_DURATION_MILLISECONDS  Label  Attack
0     59.166.0.3        11088  149.171.126.7         6881         6      37.0      1540       1644       16        18         27                         106      0  Benign
1     59.166.0.7        34968  149.171.126.4        12113         6      11.0      4352       2976       28        28         27                         313      0  Benign
2     59.166.0.3        34512  149.171.126.9        13754         6      11.0      4512       2456       18        18         27                           5      0  Benign

Let's say I want to add a separate column recording that all addresses in the ranges 59.166.0.X and 149.171.126.X are servers and that addresses in 10.40.85.X are clients. What would be the best way to do that?

I tried deriving a new column from the first one by removing the dots and treating the result as an integer/float, then using a lambda to label a value 'Server' if it falls between y and z. I then realized this does not work, because the numbers obtained by stripping the dots do not preserve the ordering of the addresses. I'm having a hard time coming up with another solution.
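
A quick check illustrates why dropping the dots breaks range comparisons. This is only a sketch: the helper name `strip_dots` is hypothetical and merely mimics the attempt described above, and the sample addresses are made up for illustration.

```python
def strip_dots(ip: str) -> int:
    """Hypothetical helper: naive conversion that just deletes the dots."""
    return int(ip.replace(".", ""))

# 10.40.85.200 comes before 59.166.0.3 in address order, but the naive
# integer is larger simply because it has more digits.
client = strip_dots("10.40.85.200")   # 104085200
server = strip_dots("59.166.0.3")     # 5916603
print(client > server)                # True: ordering is broken

# Different addresses can even collapse to the same integer.
print(strip_dots("1.23.4.5") == strip_dots("12.3.4.5"))  # True
```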

CodePudding user response:

You can convert your IP addresses to integers:

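import numpy as np

def ip2num(x):
    """Convert an IP address string, or a Series of them, to its integer value."""
    # Each octet is shifted into its byte position, then the four values are summed.
    # int64 avoids overflow on platforms where the default integer is 32-bit.
    shifts = [24, 16, 8, 0]
    if isinstance(x, str):
        return np.left_shift(np.array(x.split('.')).astype(np.int64), shifts).sum()
    else:
        return np.left_shift(x.str.split('.', expand=True).astype(np.int64), shifts).sum(axis=1)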

df['IPV4_SRC_NUM'] = ip2num(df['IPV4_SRC_ADDR'])
df['IPV4_DST_NUM'] = ip2num(df['IPV4_DST_ADDR'])

Output:

>>> df.filter(like='IPV4')
  IPV4_SRC_ADDR  IPV4_DST_ADDR  IPV4_SRC_NUM  IPV4_DST_NUM
0    59.166.0.3  149.171.126.7    1000734723    2511044103
1    59.166.0.7  149.171.126.4    1000734727    2511044100
2    59.166.0.3  149.171.126.9    1000734723    2511044105
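
Each number above is the four octets packed into one 32-bit value, e.g. 59·2^24 + 166·2^16 + 0·2^8 + 3 = 1000734723. As a sanity check, Python's standard `ipaddress` module should give the same integer. The snippet below is just a verification sketch, not part of the conversion code itself.

```python
import ipaddress

ip = "59.166.0.3"
a, b, c, d = (int(octet) for octet in ip.split("."))

# Manual packing: each octet occupies one byte of a 32-bit integer.
manual = (a << 24) + (b << 16) + (c << 8) + d

# The standard library performs the same conversion.
stdlib = int(ipaddress.IPv4Address(ip))

print(manual, stdlib)  # 1000734723 1000734723
```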

Now you can filter your DataFrame by address range:

>>> (ip2num('59.166.0.0') <= df['IPV4_SRC_NUM']) & (df['IPV4_SRC_NUM'] <= ip2num('59.166.0.255'))

0    True
1    True
2    True
Name: IPV4_SRC_NUM, dtype: bool
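
To turn such masks into the category column the question asks for, one option is `np.select`. The following is a minimal, self-contained sketch under a few assumptions: the column name `SRC_ROLE`, the helper `in_range`, the label `'Unknown'` for unmatched addresses, and the sample rows are all made up for illustration, and `ip2num` is redefined here in a Series-only form so the block runs on its own.

```python
import numpy as np
import pandas as pd

def ip2num(s: pd.Series) -> pd.Series:
    """Convert a Series of dotted-quad IPv4 strings to 64-bit integers."""
    octets = s.str.split(".", expand=True).astype(np.int64).to_numpy()
    packed = (octets << np.array([24, 16, 8, 0])).sum(axis=1)
    return pd.Series(packed, index=s.index)

def in_range(num: pd.Series, first: str, last: str) -> pd.Series:
    """Hypothetical helper: True where num lies in [first, last], inclusive."""
    lo, hi = ip2num(pd.Series([first, last]))
    return num.between(lo, hi)

# Illustrative sample data (not the original dataset).
df = pd.DataFrame(
    {"IPV4_SRC_ADDR": ["59.166.0.3", "149.171.126.7", "10.40.85.12", "8.8.8.8"]}
)
num = ip2num(df["IPV4_SRC_ADDR"])

# One condition per category; the first matching condition wins.
conditions = [
    in_range(num, "59.166.0.0", "59.166.0.255")
    | in_range(num, "149.171.126.0", "149.171.126.255"),
    in_range(num, "10.40.85.0", "10.40.85.255"),
]
df["SRC_ROLE"] = np.select(conditions, ["Server", "Client"], default="Unknown")
print(df)
```

The same conditions could be applied to `IPV4_DST_ADDR` to label the destination side.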