Convert IP address to integer in dask dataframe


I need to train an ML model using a large dataset. For this I'm using the dask library.

My dataset contains IP addresses (columns 0 and 2). I'm trying to convert these IP addresses to integers using the Python ipaddress library. A sample of the dataset is given below:

IP Add Src    Port    IP Add Dest.    Port
              1305                    21
              1305                    21
              1305                    21
              1305                    21
              1305                    21
              1305                    21

Initially, when using a pandas dataframe, I used the following to convert the IP addresses:

df['IP Add Src'] = df['IP Add Src'].apply(lambda x: int(ipaddress.IPv4Address(x)))

From what I've read, dask provides the apply, map_partitions and map functions.

However, I'm still unsure how to use these functions to convert the IP addresses in place.

Any help on how I can implement this would be appreciated.

CodePudding user response:

The correct way to convert an IPv4 address to a 32-bit integer value is to left-shift each octet by the appropriate number of bits (24 for the first octet, 16 for the second, 8 for the third and 0 for the fourth) and then build the sum (or bitwise OR) of the four resulting numbers:

from functools import reduce
ipAddress = ''  # dotted-quad IPv4 string
ipAddressAsInt = reduce(lambda x,y: x + y, [int(b)<<(8*(3-a)) for a,b in enumerate(ipAddress.split('.'))])

The expression in the code builds the 32-bit value for each octet by applying the left-shift operator << to the octet value b, with a shift amount derived from the octet's position a as returned by enumerate. reduce then uses a simple lambda function to add the values. Note that lambda x,y: x|y also works and would technically be more correct.

Of course you could also do the left-shift directly in the lambda function, but imho this makes it less readable:

reduce(lambda x,y: x|int(y[1])<<(8*(3-y[0])), enumerate(ipAddress.split('.')), 0)

Update: A more readable version is this:

reduce(lambda x,y: x<<8|int(y), ipAddress.split('.'), 0)

Start with value 0, then for each octet in the IP-Address, left shift the current value by 8 and add the current octet.
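
As a sanity check, the fold can be traced by hand and compared against the standard library's ipaddress module, which the question already uses. This is only a minimal sketch; the helper name ip_to_int and the sample addresses are made up for illustration:

```python
import ipaddress
from functools import reduce


def ip_to_int(ip):
    # Start at 0; for each octet, shift the running value left by
    # 8 bits and OR in the octet.
    return reduce(lambda acc, octet: acc << 8 | int(octet), ip.split('.'), 0)


# Trace the fold step by step for one (hypothetical) address.
acc = 0
for octet in '192.168.1.1'.split('.'):
    acc = acc << 8 | int(octet)
    print(octet, acc)
# The running value goes 192, 49320, 12625921, 3232235777.

# Cross-check against the ipaddress module on a few sample addresses.
samples = ['0.0.0.1', '10.0.0.1', '192.168.1.1', '255.255.255.255']
for ip in samples:
    assert ip_to_int(ip) == int(ipaddress.IPv4Address(ip))
```

Both approaches should agree for any well-formed dotted-quad string; the reduce version simply skips the validation that ipaddress performs.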

To apply this to a dataframe column, use map as opposed to apply.

Small test/proof-of-concept:

>>> import pandas as pd
>>> from functools import reduce
>>> df = pd.DataFrame({'sourceHost': ['luke', 'leia', 'bb8'],
...                    'sourceIp': ['10.24.53.128', '10.24.125.44', '10.24.133.253'],
...                    'destHost': ['vader', 'palpatin', 'keylo'],
...                    'destIp': ['10.25.88.124', '10.25.230.12', '10.25.240.1']})
>>> df
  sourceHost       sourceIp  destHost        destIp
0       luke   10.24.53.128     vader  10.25.88.124
1       leia   10.24.125.44  palpatin  10.25.230.12
2        bb8  10.24.133.253     keylo   10.25.240.1
>>> df['sourceIp'] = df['sourceIp'].map(lambda ip: reduce(lambda x,y: x<<8|int(y), ip.split('.'), 0))
>>> df
  sourceHost   sourceIp  destHost        destIp
0       luke  169358720     vader  10.25.88.124
1       leia  169377068  palpatin  10.25.230.12
2        bb8  169379325     keylo   10.25.240.1
>>> df['destIp'] = df['destIp'].map(lambda ip: reduce(lambda x,y: x<<8|int(y), ip.split('.'), 0))
>>> df
  sourceHost   sourceIp  destHost     destIp
0       luke  169358720     vader  169433212
1       leia  169377068  palpatin  169469452
2        bb8  169379325     keylo  169472001

CodePudding user response:

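import pandas as pd
import dask.dataframe as dd
from functools import reduce

df = pd.DataFrame({'ip': ['9.166.0.1', '9.166.0.2', '9.166.0.3', '9.166.0.4', '9.166.0.5'],
                   'port': [80, 81, 82, 83, 84]})
ddf = dd.from_pandas(df, npartitions=2)

def strip_to_int(str_ip):
    # Fold the four octets into one integer; return None for malformed input
    arr_ip = str_ip.split('.')
    if len(arr_ip)==4:
        return reduce(lambda x,y: x<<8|int(y), arr_ip, 0)
    return None

# meta tells dask the name and dtype of the resulting series
series_int_ip = ddf.ip.apply(strip_to_int, meta=('ip', 'int64'))
ddf['ip'] = series_int_ip
print(ddf.compute())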


          ip  port
0  161873921    80
1  161873922    81
2  161873923    82
3  161873924    83
4  161873925    84
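
Note that checking only len(arr_ip)==4 lets through strings such as '300.1.1.1', whose octets are out of range. If stricter validation is wanted, one option (an assumption, not part of the answer above) is to let the standard ipaddress module do the parsing. The name strict_ip_to_int is hypothetical:

```python
import ipaddress


def strict_ip_to_int(str_ip):
    """Return the integer form of a dotted-quad string, or None if invalid."""
    try:
        return int(ipaddress.IPv4Address(str_ip))
    except ipaddress.AddressValueError:
        # Raised for out-of-range octets, wrong octet counts, etc.
        return None


print(strict_ip_to_int('9.166.0.1'))   # well-formed address
print(strict_ip_to_int('300.1.1.1'))   # octet out of range
print(strict_ip_to_int('1.2.3'))       # too few octets
```

This function could be passed to ddf.ip.apply in the same way as strip_to_int.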