I'm having a tough time coming up with a way to categorize IP addresses in a pandas dataframe. Right now the IPV4_SRC_ADDR column is dtype object.
This is the composition of my dataset:
   IPV4_SRC_ADDR  L4_SRC_PORT  IPV4_DST_ADDR  L4_DST_PORT  PROTOCOL  L7_PROTO  IN_BYTES  OUT_BYTES  IN_PKTS  OUT_PKTS  TCP_FLAGS  FLOW_DURATION_MILLISECONDS  Label  Attack
0     59.166.0.3        11088  149.171.126.7         6881         6      37.0      1540       1644       16        18         27                         106      0  Benign
1     59.166.0.7        34968  149.171.126.4        12113         6      11.0      4352       2976       28        28         27                         313      0  Benign
2     59.166.0.3        34512  149.171.126.9        13754         6      11.0      4512       2456       18        18         27                           5      0  Benign
Let's say I want to record, in a separate column, that all addresses in the ranges 59.166.0.X and 149.171.126.X are servers and that those in 10.40.85.X are clients. What would be the best way to do this?
I have tried copying the first column to a new one, removing the dots, and treating the result as integers/floats. Then I tried a lambda that says: if x is between y and z, categorize it as 'Server'. But then I realized that this does not work, because numbers built by simply deleting the dots do not preserve the ordering of the addresses. I'm having a hard time coming up with another solution.
CodePudding user response:
You can convert your IP addresses to integers:
import numpy as np

def ip2num(x):
    """Convert an IP address string, or a Series of them, to an integer value."""
    # Shift the four octets left by 24, 16, 8 and 0 bits, then add them up.
    # int64 avoids overflow where NumPy's default integer is 32-bit.
    if isinstance(x, str):
        return np.left_shift(np.array(x.split('.')).astype(np.int64), [24, 16, 8, 0]).sum()
    else:
        return np.left_shift(x.str.split('.', expand=True).astype(np.int64), [24, 16, 8, 0]).sum(axis=1)

df['IPV4_SRC_NUM'] = ip2num(df['IPV4_SRC_ADDR'])
df['IPV4_DST_NUM'] = ip2num(df['IPV4_DST_ADDR'])
Output:
>>> df.filter(like='IPV4')
   IPV4_SRC_ADDR  IPV4_DST_ADDR  IPV4_SRC_NUM  IPV4_DST_NUM
0     59.166.0.3  149.171.126.7    1000734723    2511044103
1     59.166.0.7  149.171.126.4    1000734727    2511044100
2     59.166.0.3  149.171.126.9    1000734723    2511044105
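To make the arithmetic behind those numbers concrete, here is a minimal sketch that computes the same value for a single address in plain Python and cross-checks it against the standard library's ipaddress module. The helper name ip_to_int is only illustrative and is not part of the answer's code.

```python
import ipaddress

def ip_to_int(address):
    """Pack the four octets of a dotted-quad string into one integer."""
    a, b, c, d = (int(part) for part in address.split("."))
    # Each octet occupies 8 bits: a is the most significant, d the least.
    return (a << 24) + (b << 16) + (c << 8) + d

src = ip_to_int("59.166.0.3")      # 59*2**24 + 166*2**16 + 0*2**8 + 3
dst = ip_to_int("149.171.126.7")

# The stdlib gives the same integers, which is a handy sanity check.
src_matches = src == int(ipaddress.IPv4Address("59.166.0.3"))
dst_matches = dst == int(ipaddress.IPv4Address("149.171.126.7"))
print(src, dst, src_matches, dst_matches)
```

Because the first octet carries the most weight, comparing these integers orders addresses the same way as comparing them octet by octet, which is what makes range checks valid.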
Now you can filter your dataframe with a range comparison:
>>> (ip2num('59.166.0.0') <= df['IPV4_SRC_NUM']) & (df['IPV4_SRC_NUM'] <= ip2num('59.166.0.255'))
0    True
1    True
2    True
Name: IPV4_SRC_NUM, dtype: bool
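To get from a boolean mask to the category column the question asks for, one possible approach is to build one mask per range and combine them with np.select. The following is a self-contained sketch, not the only way to do it: the column name SRC_ROLE, the helpers ip_to_int and in_slash24, the 'Unknown' fallback, and the last two sample rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Two addresses from the question's ranges, plus a hypothetical client row
# and a hypothetical unmatched row so that every branch is exercised.
df = pd.DataFrame({
    "IPV4_SRC_ADDR": ["59.166.0.3", "149.171.126.4", "10.40.85.30", "8.8.8.8"],
})

def ip_to_int(series):
    """Convert a Series of dotted-quad strings to 64-bit integers."""
    octets = series.str.split(".", expand=True).astype(np.int64).to_numpy()
    values = (octets[:, 0] << 24) | (octets[:, 1] << 16) | (octets[:, 2] << 8) | octets[:, 3]
    return pd.Series(values, index=series.index)

def in_slash24(nums, prefix):
    """True where the numeric address lies in prefix.0 .. prefix.255."""
    low = ip_to_int(pd.Series([prefix + ".0"])).iloc[0]
    return nums.between(low, low + 255)  # inclusive on both ends

src = ip_to_int(df["IPV4_SRC_ADDR"])
is_server = in_slash24(src, "59.166.0") | in_slash24(src, "149.171.126")
is_client = in_slash24(src, "10.40.85")

# The first matching condition wins; rows matching none get the default.
df["SRC_ROLE"] = np.select([is_server, is_client], ["Server", "Client"], default="Unknown")
print(df)
```

The same masks can be built from IPV4_DST_ADDR if the destination side needs a label as well.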