Making python code faster for processing 24 million records

Time:11-09

I am trying to process a pandas DataFrame by applying a function to one of its columns.

The function is:

def separate_string(sentence):
    string_even = ""
    if sentence is not None:
        l = list(sentence)
        list_even = list()
        index = 0
        for letter in l:
            if index % 2 != 0:
                if abs(ord(letter)-3) < 1114111:
                    list_even.append((chr(abs(ord(letter)-3))))
                string_even = "".join(list_even)
            index += 1
    return(str(string_even))

Pandas dataframe:

df['re'] = df.col1.apply(separate_string)

I am running this on a PC with 64GB RAM and a 2.19GHz 7 processor. Why does the code never complete?

CodePudding user response:

If I were you, I'd try Cythonizing your Python code. Essentially that would make it C code that would run (hopefully) orders of magnitude faster.

CodePudding user response:

I think this does what you want. Note that it no longer checks for None: the original returned an empty string in that case, so you might have to handle None explicitly if your column can contain it.

A number of things have been removed: the unneeded casts, the manual maintenance of an index, the join that was repeated on every odd index, and the test that code points are less than 1114111, which they always will be.

def separate_string(sentence):
    return "".join(chr(abs(ord(letter) - 3)) for letter in sentence[1::2])
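
Before timing anything, it may be worth checking that the two versions agree. The sketch below is only an illustration: `separate_string_loop` is the loop from the question, assuming the counter was meant to be incremented with `index += 1`, and `separate_string_safe` is a hypothetical name for the one-liner with a None check added. The sample strings are made up.

```python
def separate_string_loop(sentence):
    # Loop-based version from the question, kept as the reference.
    # Assumption: the counter is incremented with `index += 1`.
    string_even = ""
    if sentence is not None:
        list_even = []
        index = 0
        for letter in sentence:
            if index % 2 != 0:
                if abs(ord(letter) - 3) < 1114111:
                    list_even.append(chr(abs(ord(letter) - 3)))
                string_even = "".join(list_even)
            index += 1
    return string_even


def separate_string_safe(sentence):
    # Hypothetical None-safe wrapper around the one-liner:
    # returns "" for None, as the loop version does.
    if sentence is None:
        return ""
    return "".join(chr(abs(ord(letter) - 3)) for letter in sentence[1::2])


# Made-up samples, including None and very short strings.
samples = [None, "", "a", "ab", "Hello, world!", "This eBook is for the use of anyone"]
for s in samples:
    assert separate_string_loop(s) == separate_string_safe(s)
```

Both versions only guard against None. Missing values in a pandas column are often NaN (a float) instead, and neither function handles that, so such rows would need to be filtered or filled first.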

We can use timeit to see whether this improves things:

import timeit

setup_orig = '''
test = "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever."
def separate_string(sentence):
    string_even = ""
    if sentence is not None:
        l = list(sentence)
        list_even = list()
        index = 0
        for letter in l:
            if index % 2 != 0:
                if abs(ord(letter)-3) < 1114111:
                    list_even.append((chr(abs(ord(letter)-3))))
                string_even = "".join(list_even)
            index += 1
    return(str(string_even))
'''

setup_new = '''
test = "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever."
def separate_string(sentence):
    return "".join(chr(abs(ord(letter) - 3)) for letter in sentence[1::2])
'''

print(timeit.timeit('separate_string(test)', setup=setup_orig, number=100_000))
print(timeit.timeit('separate_string(test)', setup=setup_new, number=100_000))

On my laptop that gives results (in seconds, for 100,000 calls each) like:

5.33
0.95

So the rewrite is more than five times faster on this input, which seems worth exploring as part of your solution.
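
To get a feel for what those timings mean at the scale in the question, here is a rough extrapolation to 24 million rows. It assumes every row is about as long as the test sentence and that per-call cost stays constant, neither of which may hold for the real data, and it ignores pandas overhead entirely.

```python
# Back-of-the-envelope extrapolation from the timeit results above.
# Assumption: each row costs about as much as one call on the test sentence.
calls = 100_000          # number of calls timed by timeit
rows = 24_000_000        # rows in the DataFrame, from the question
timings = {"original": 5.33, "rewritten": 0.95}  # seconds, as measured above

estimates = {}
for name, total_seconds in timings.items():
    per_call = total_seconds / calls          # seconds per call
    estimates[name] = per_call * rows         # seconds for all rows
    print(f"{name}: about {estimates[name] / 60:.1f} minutes")
```

Under those assumptions the original comes out at roughly 21 minutes and the rewrite at under 4 minutes. Because the original rebuilds the joined string inside the loop, its cost grows faster than linearly with string length, so much longer rows would widen the gap.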
