Using python generators with lots of data

I have a dataset consisting of 250k items that need to meet certain criteria before being added to a list/generator. To speed up the processing, I want to use generators, but I am uncertain about whether to filter the data with a function that yields the filtered sample, or if I should just return the filtered sample to a generator object, etc. I would like the final object to only include samples that met the filter criteria, but by default python will return/yield a NoneType object. I have included example filter functions, data (the real problem uses strings, but for simplicity I use floats from random normal distribution), and what I intend to do with the data below.

How should I efficiently use generators in this instance? Is it even logical/efficient to use generators for this purpose? I know that I can check to see if an element from the return function is None to exclude it from the container (list/generator), but how can I do this with the function that yields values?

# For random data and the DataFrame at the end
import numpy as np
import pandas as pd

# Functions
def filter_and_yield(item_in_data):
    if item_in_data > 0.0:
        yield item_in_data

def filter_and_return(item_in_data):
    if item_in_data > 0.0:
        return item_in_data

# Arbitrary data
num_samples = 250 * 10**3
data = np.random.normal(size=(num_samples,))

# Should I use this: generator with generator elements?
filtered_data_as_gen_with_gen_elements = (filter_and_yield(item) for item in data)

# Should I use this: list with generator elements?
filtered_data_as_lst_with_gen_elements = [filter_and_yield(item) for item in data]

# Should I use this: generator with non-generator elements?
filtered_data_as_gen_with_non_gen_elements = (
    filter_and_return(item) for item in data if filter_and_return(item) is not None)

# Should I use this: list with non-generator elements?
filtered_data_as_lst_with_non_gen_elements = [
    filter_and_return(item) for item in data if filter_and_return(item) is not None]

# Saving the data as csv -- note, `filtered_data` is NOT defined
# but is a placeholder for whatever the correct way of filtering the data is
df = pd.DataFrame({'filtered_data': filtered_data})
df.to_csv('./filtered_data.csv')

CodePudding user response：

The short answer is that none of these are best. Numpy and pandas include a lot of C and Fortran code that works on hardware-level data types stored in contiguous arrays. Python objects, even low-level ones like int and float, are relatively bulky. They include the standard python object header and are allocated on the heap. And even a simple operation like > requires a call to one of the object's methods.
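As a rough illustration of that size difference, the sketch below compares the per-element storage of a float64 array with the size CPython reports for a single float object. The variable names are made up for this example, and the exact size of a Python float depends on the interpreter build, so no figure is claimed for it.

```python
import sys
import numpy as np

# A contiguous array of 1000 hardware-level 64-bit floats.
values = np.zeros(1000, dtype=np.float64)

# Each array element occupies exactly 8 bytes inside the array buffer.
bytes_per_array_element = values.itemsize

# A standalone Python float also carries the object header,
# so the interpreter reports a larger size for it.
bytes_per_python_float = sys.getsizeof(1.0)

print(bytes_per_array_element, bytes_per_python_float)
```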

It's better to use numpy/pandas functions and operators as much as possible. These packages have overloaded the standard python operators to work on entire sets of data in one call.

df = pd.DataFrame({'filtered_data': data[data > 0.0]})

Here, data > 0.0 created a new numpy array of true/false values, one per element of data. data[...] then created a new array holding only the values of data at the positions where the comparison was true.
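To make the two steps concrete, here is a minimal sketch on a tiny hand-written array. The values and the names small, mask, and kept are made up for illustration and are not part of the question's data.

```python
import numpy as np

# Small illustrative array; the values are arbitrary.
small = np.array([-1.5, 0.25, 0.0, 2.0, -0.1])

# Step 1: the comparison runs over the whole array and gives a boolean array.
mask = small > 0.0

# Step 2: indexing with the boolean array keeps only the True positions.
kept = small[mask]

print(mask)
print(kept)
```

Note that 0.0 itself is dropped, because the comparison is strict.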

Other notes

filter_and_yield is a generator function; each call gives a generator that produces 0 or 1 values. Python turned it into a generator function because it contains a yield. When the function body finishes (implicitly returning None), python raises StopIteration to signal the end of iteration. The consumer of this generator will not see the None.
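A short sketch of that behaviour, reusing filter_and_yield from the question (the variable names are only for this example):

```python
def filter_and_yield(item_in_data):
    # Same function as in the question; the `yield` makes it a generator function.
    if item_in_data > 0.0:
        yield item_in_data

# A value that passes the filter is yielded exactly once.
passing = list(filter_and_yield(1.5))

# A value that fails the filter yields nothing; no None shows up.
failing = list(filter_and_yield(-1.5))

# Asking the empty generator for a value raises StopIteration right away.
gen = filter_and_yield(-1.5)
try:
    next(gen)
    exhausted_immediately = False
except StopIteration:
    exhausted_immediately = True

print(passing, failing, exhausted_immediately)
```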

(filter_and_yield(item) for item in data) is a generator that yields generators. If you use it, you'll end up with a dataframe column of generators.
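The sketch below checks that the elements really are generator objects. It also shows one way to flatten them with itertools.chain.from_iterable from the standard library; that flattening step is an addition for illustration, not something the answer proposes, and the sample list is made up.

```python
import types
from itertools import chain

def filter_and_yield(item_in_data):
    # Same generator function as in the question.
    if item_in_data > 0.0:
        yield item_in_data

# Tiny stand-in for the real data.
sample = [-1.0, 0.5, 2.0, -3.0]

# The outer generator hands back inner generators, not numbers.
outer = (filter_and_yield(item) for item in sample)
first = next(outer)
is_generator = isinstance(first, types.GeneratorType)

# Chaining the inner generators together recovers the filtered values.
flattened = list(chain.from_iterable(filter_and_yield(item) for item in sample))

print(is_generator, flattened)
```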

[filter_and_yield(item) for item in data] is a list of generators (because filter_and_yield is a generator function). When pandas creates a column, it needs to know the column size, so it expands generators into lists like you've done here. You can build the list for pandas yourself; it doesn't really matter, except that pandas deletes its own list when done, which reduces memory usage.

(filter_and_return(item) for item in data if filter_and_return(item) is not None) This one works, but it's pretty slow. data holds a hardware-level array of floats. for item in data has to convert each of those floats into a python-level object, and then filter_and_return(item) is a relatively expensive function call. This could be rewritten as (value for value in (filter_and_return(item) for item in data) if value is not None) so that filter_and_return is called once per item, rather than a second time for every item that passes the filter.
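To see the difference in call counts, here is a small sketch that wraps filter_and_return with a counter. The call_counter dictionary and the sample list are hypothetical helpers added only for this measurement.

```python
# Hypothetical counter used only to observe how often the function runs.
call_counter = {"n": 0}

def filter_and_return(item_in_data):
    call_counter["n"] += 1
    if item_in_data > 0.0:
        return item_in_data

# Tiny stand-in for the real data: two items pass, two fail.
sample = [-1.0, 0.5, 2.0, -3.0]

# Original form: one call in the condition, plus one more per passing item.
call_counter["n"] = 0
twice = list(filter_and_return(item) for item in sample
             if filter_and_return(item) is not None)
calls_twice = call_counter["n"]

# Nested form: the inner generator calls the function once per item.
call_counter["n"] = 0
once = list(value for value in (filter_and_return(item) for item in sample)
            if value is not None)
calls_once = call_counter["n"]

print(calls_twice, calls_once)
```

Both forms give the same filtered values; how many calls are saved depends on the share of items that pass the filter.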

[filter_and_return(item) for item in data if filter_and_return(item) is not None] As mentioned above, it's okay to do this, but delete the list when done to conserve memory.