Is there a more efficient way to apply this custom function to the entire dataset?


I have a dataset that looks like this with IP addresses (for security's sake, these are all made up):

             0            1         2
0  100.0.200.0  160.60.30.0       NaN
1          NaN  101.60.10.0  10.0.0.1

I want to apply a function that takes these IP addresses (where they exist) and returns a truncated version with the fourth octet removed, so the result should look like this:

           0          1       2
0  100.0.200  160.60.30     NaN
1        NaN  101.60.10  10.0.0

I have written the code below, which does the job, but it is very slow since it loops over every element one at a time, and I want to be able to do this faster.

def sliceip(row):
    # Note: str() also turns NaN cells into the string "nan"
    row = str(row)
    return row.rsplit(".", 1)[0]

def applysliceip(rowx):
    for i, item in enumerate(rowx):
        rowx[i] = sliceip(item)
    return rowx


# And I apply this to the entire dataframe as such:

split_IPs = IPs.apply(lambda row: applysliceip(row))

So my question is: is there a more Pythonic and faster way to accomplish the above and return the same output without using so much memory?

CodePudding user response:

You can use a regular expression to match and replace instead of using a custom function.

IPs.replace(r"(\d+\.\d+\.\d+)\.\d+", r"\1", regex=True)
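As a quick check of that one-liner (on a small made-up frame mirroring the question's data), replace only substitutes where the pattern matches, so the NaN cells pass through unchanged:

```python
import numpy as np
import pandas as pd

# Small illustrative frame mirroring the question's made-up IPs
IPs = pd.DataFrame({
    0: ["100.0.200.0", np.nan],
    1: ["160.60.30.0", "101.60.10.0"],
    2: [np.nan, "10.0.0.1"],
})

# Capture the first three octets and drop the trailing ".<digits>"
split_IPs = IPs.replace(r"(\d+\.\d+\.\d+)\.\d+", r"\1", regex=True)
print(split_IPs)
```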

CodePudding user response:

A possible solution uses pandas.DataFrame.applymap with a regex that replaces the final dot and trailing digits with an empty string; na_action='ignore' skips the NaN cells, which would otherwise raise a TypeError inside re.sub:

import re

df.applymap(lambda x: re.sub(r'\.\d+$', '', x), na_action='ignore')

Output:

           0          1       2
0  100.0.200  160.60.30     NaN
1        NaN  101.60.10  10.0.0

A faster solution, based on numpy:

import re

import numpy as np
import pandas as pd

# Guard with isinstance: np.where evaluates v(df) on every cell,
# including the NaN floats, which re.sub cannot handle.
v = np.vectorize(lambda x: re.sub(r'\.\d+$', '', x) if isinstance(x, str) else x)
pd.DataFrame(np.where(pd.notnull(df), v(df), df), columns=df.columns, index=df.index)
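For comparison, a regex-free alternative (not part of either answer above) is pandas' vectorized string accessor, which skips NaN automatically: rsplit each cell on its last dot and keep the left piece.

```python
import numpy as np
import pandas as pd

# Same made-up frame as in the question
df = pd.DataFrame({
    0: ["100.0.200.0", np.nan],
    1: ["160.60.30.0", "101.60.10.0"],
    2: [np.nan, "10.0.0.1"],
})

# .str.rsplit(".", n=1) splits each cell on its last "." only;
# .str[0] keeps the part before it. NaN propagates through both steps.
trimmed = df.apply(lambda col: col.str.rsplit(".", n=1).str[0])
```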