I need to train an ML model on a large dataset, and I'm using the dask library for this.
My dataset contains IP addresses (column indices 0 and 2). I'm trying to convert these IP addresses into integers using Python's ipaddress library. A sample of the dataset is given below:
IP Add Src | Port | IP Add Dest. | Port |
---|---|---|---|
9.166.0.5 | 1305 | 149.17.12.8 | 21 |
9.166.0.5 | 1305 | 149.17.12.8 | 21 |
9.166.0.5 | 1305 | 149.17.12.8 | 21 |
9.166.0.5 | 1305 | 149.17.12.8 | 21 |
9.166.0.5 | 1305 | 149.17.12.8 | 21 |
9.166.0.5 | 1305 | 149.17.12.8 | 21 |
Initially, when using a pandas DataFrame, I used the following to convert the IP addresses:
df['IP Add Src'] = df['IP Add Src'].apply(lambda x: int(ipaddress.IPv4Address(x)))
From what I've read, dask provides the apply, map_partitions and map functions.
However, I'm still unsure how to use these functions to convert the IP addresses in place.
Any help on how I can implement this would be appreciated.
CodePudding user response:
One way to convert an IPv4 address to a 32-bit integer value is to left-shift each octet by the appropriate number of bits, i.e. 24 for the first, 16 for the second and 8 for the third octet (the fourth is not shifted), and build the sum (or bitwise OR) of the four resulting numbers:
from functools import reduce
ipAddress = '149.17.12.8'
ipAddressAsInt = reduce(lambda x,y: x + y, [int(b)<<(8*(3-a)) for a,b in enumerate(ipAddress.split('.'))])
The expression in the code builds the 32-bit value for each octet by applying the left-shift operator << to the octet value b, with a shift amount derived from the octet's position a as returned by enumerate. reduce then uses a simple lambda function to add the values. Note that lambda x,y: x|y also works and would technically be more correct.
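As a quick check of the sum-versus-OR remark, here is a minimal sketch (the variable names are illustrative, not taken from the answer). The shifted octets occupy disjoint 8-bit ranges, so adding them and OR-ing them should give the same result:

```python
from functools import reduce
from operator import add, or_

ip_address = '149.17.12.8'

# Shift each octet into its own 8-bit slot: positions 0..3 get shifts 24, 16, 8, 0.
shifted = [int(octet) << (8 * (3 - pos))
           for pos, octet in enumerate(ip_address.split('.'))]

as_sum = reduce(add, shifted)
as_or = reduce(or_, shifted)

# The slots do not overlap, so no carries occur and both results agree.
print(as_sum == as_or)
```

This only holds for valid octets in the range 0 to 255; a larger value would spill into the neighbouring slot, and the sum and the OR could then differ.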
Of course you could also do the left-shift directly in the lambda function, but imho this makes it less readable:
reduce(lambda x,y: x|int(y[1])<<(8*(3-y[0])), enumerate(ipAddress.split('.')), 0)
Update: A more readable version is this:
reduce(lambda x,y: x<<8|int(y), ipAddress.split('.'), 0)
Start with value 0, then for each octet in the IP-Address, left shift the current value by 8 and add the current octet.
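To make the running value visible at each step, and to cross-check the result against the ipaddress module from the question, here is a small sketch. The helper name ip_to_int is made up for illustration:

```python
import ipaddress
from functools import reduce

def ip_to_int(ip):
    """Fold the dotted-quad octets into one integer, most significant first."""
    return reduce(lambda acc, octet: acc << 8 | int(octet), ip.split('.'), 0)

# Trace the accumulator for one address: shift left by 8, then OR in the next octet.
acc = 0
steps = []
for octet in '10.24.53.128'.split('.'):
    acc = acc << 8 | int(octet)
    steps.append(acc)
print(steps)

# Cross-check against the standard library conversion used in the question.
print(ip_to_int('149.17.12.8') == int(ipaddress.IPv4Address('149.17.12.8')))
```

Unlike ipaddress.IPv4Address, this helper does no validation, so malformed input is not rejected.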
To apply this to a dataframe column, use map as opposed to apply.
Small test/proof-of-concept:
>>> import pandas as pd
>>> from functools import reduce
>>> df = pd.DataFrame({'sourceHost': ['luke', 'leia', 'bb8'],
...                    'sourceIp': ['10.24.53.128', '10.24.125.44', '10.24.133.253'],
...                    'destHost': ['vader', 'palpatin', 'keylo'],
...                    'destIp': ['10.25.88.124', '10.25.230.12', '10.25.240.1']})
>>> df
sourceHost sourceIp destHost destIp
0 luke 10.24.53.128 vader 10.25.88.124
1 leia 10.24.125.44 palpatin 10.25.230.12
2 bb8 10.24.133.253 keylo 10.25.240.1
>>> df['sourceIp'] = df['sourceIp'].map(lambda ip: reduce(lambda x,y: x<<8|int(y), ip.split('.'), 0))
>>> df
sourceHost sourceIp destHost destIp
0 luke 169358720 vader 10.25.88.124
1 leia 169377068 palpatin 10.25.230.12
2 bb8 169379325 keylo 10.25.240.1
>>> df['destIp'] = df['destIp'].map(lambda ip: reduce(lambda x,y: x<<8|int(y), ip.split('.'), 0))
>>> df
sourceHost sourceIp destHost destIp
0 luke 169358720 vader 169433212
1 leia 169377068 palpatin 169469452
2 bb8 169379325 keylo 169472001
CodePudding user response:
import pandas as pd
import dask.dataframe as dd
from functools import reduce
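df = pd.DataFrame({'ip': ['9.166.0.1', '9.166.0.2', '9.166.0.3', '9.166.0.4', '9.166.0.5'],
                   'port': [80, 81, 82, 83, 84]})
# Split the pandas frame into 2 dask partitions.
ddf = dd.from_pandas(df, npartitions=2)

def strip_to_int(str_ip):
    # Convert a dotted-quad string to an integer; return None if it does not have 4 parts.
    arr_ip = str_ip.split('.')
    if len(arr_ip)==4:
        return reduce(lambda x,y: x<<8|int(y), arr_ip, 0)
    return None

# meta declares the output name and dtype so dask does not have to guess it;
# int64 assumes every row is a valid address (a None result would not fit that dtype).
series_int_ip = ddf.ip.apply(strip_to_int, meta=('ip', 'int64'))
ddf.assign(ip=series_int_ip).compute()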
result:
ip port
0 161873921 80
1 161873922 81
2 161873923 82
3 161873924 83
4 161873925 84