Transform values in a dataset more quickly

I need to replace every value greater than 100 with 0, but the dataset I need to transform has 2 billion values, and that is the problem: it takes a lot of time (and I need to do this transformation 5 times).

I am using a for loop with the ".replace" function.

So, is there another function or approach that would solve this problem?

CodePudding user response：

You could let pandas handle it for you by selecting the rows that match a condition and assigning a value to some or all of their columns. Here is an example with a basic dataframe.

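import pandas as pd

df = pd.DataFrame({
    'a': list(range(200)),
    'b': list(range(200)),
})
# Set column 'a' to 0 in every row where 'a' is greater than 100.
df.loc[df['a'] > 100, 'a'] = 0
print(df['a'].unique())  # only the values 0 through 100 remain
print(df['b'].unique())  # unchanged: 0 through 199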

Here we replace all values in column a greater than 100 with 0. The last two statements print the unique values in each series just to show the result of the performed action.

If your intent is to modify all columns for the matching records, you can omit specifying a column: df.loc[df['a'] > 100] = 0.

And if you want to modify multiple columns at once, just use a list of column names like this: df.loc[df['a'] > 100, ['a', 'b']] = 0.
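
To make the difference between these variants concrete, here is a small sketch on a toy dataframe. The names mask, one_col and all_cols are made up for illustration; the copies are only there so both variants can start from the same data.

```python
import pandas as pd

df = pd.DataFrame({
    'a': list(range(98, 104)),  # 98, 99, 100, 101, 102, 103
    'b': list(range(6)),        # 0, 1, 2, 3, 4, 5
})

# Boolean Series: True for the rows where column 'a' exceeds 100.
mask = df['a'] > 100

# Variant 1: name a column, so only 'a' changes in the matching rows.
one_col = df.copy()
one_col.loc[mask, 'a'] = 0

# Variant 2: omit the column, so every column changes in the matching rows.
all_cols = df.copy()
all_cols.loc[mask] = 0

print(one_col)
print(all_cols)
```

Note that in the second variant column b is zeroed in rows where a is above 100, regardless of the value b itself holds.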

Don't forget to adapt the conditions to your application logic.

CodePudding user response：

Not entirely sure what you want to do. Do you have a single array or tabular data? And if the latter, do you want this to apply to all columns or just some of them?

Anyway, in case you have just an array:

import numpy as np

a = np.array([10, 100, 101, 301, 10, 43])
a[a>100] = 0
print(a)
# --> [ 10 100   0   0  10  43]
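
With 2 billion values, the temporary boolean mask created by a > 100 takes about one byte per element, so roughly 2 GB on top of the data. If that is a concern, one option is to apply the same assignment in slices. This is only a sketch: the function name zero_above_inplace and the chunk size are assumptions, it expects a one-dimensional array, and whether it is actually faster depends on your memory situation.

```python
import numpy as np

def zero_above_inplace(values, threshold=100, chunk_size=1_000_000):
    """Set entries greater than `threshold` to 0, one slice at a time."""
    for start in range(0, len(values), chunk_size):
        # Basic slicing of a 1-D array returns a view, so the assignment
        # below writes into `values` itself rather than into a copy.
        chunk = values[start:start + chunk_size]
        chunk[chunk > threshold] = 0
    return values

a = np.array([10, 100, 101, 301, 10, 43])
zero_above_inplace(a, chunk_size=4)  # tiny chunk size just for the demo
print(a)
```

The boolean mask now only ever covers one chunk, which keeps the extra memory bounded by chunk_size.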

In case you have a dataframe:

import pandas as pd

df = pd.DataFrame({'a': np.arange(30, 120, 10),
                   'b': np.arange(50, 59),
                   'c': np.arange(95, 104),
                   'd': np.arange(101, 110)})

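If you want to apply it to a single column, use .loc (chained indexing such as df['a'][...] = 0 may assign to a temporary copy and leave df unchanged):

df.loc[df['a'] > 100, 'a'] = 0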

If you want to apply it to more than one column, one way is:

apply_to_cols = ['a','c']

def all_or_nothing(v):
    if v > 100:
        return 0
    else:
        return v

df[apply_to_cols] = np.vectorize(all_or_nothing)(df[apply_to_cols])
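
Keep in mind that np.vectorize is a convenience wrapper: according to the NumPy documentation it is essentially a for loop, so it calls all_or_nothing once per element and will be slow on billions of values. A possible alternative, sketched here on the same example dataframe, is DataFrame.mask, which replaces the values where a condition is True using array operations:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(30, 120, 10),
                   'b': np.arange(50, 59),
                   'c': np.arange(95, 104),
                   'd': np.arange(101, 110)})

apply_to_cols = ['a', 'c']

# Where the condition is True (value > 100), use 0; elsewhere keep the value.
# Columns 'b' and 'd' are not selected, so they stay untouched.
df[apply_to_cols] = df[apply_to_cols].mask(df[apply_to_cols] > 100, 0)
print(df)
```

Here only the last value of a (110) and the last three values of c (101, 102, 103) become 0; d keeps its values above 100 because it is not in apply_to_cols.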