pandas apply with assignment on large dataframe

Time:05-20

I have a very long dataframe dfWeather with the column TEMP. Because of its size, I want to keep only the relevant information. Concretely, I want to keep only those entries where the temperature changed by more than 1 since the last entry I kept. I want to use dfWeather.apply, since it seems to iterate over the rows much faster (about 10x) than a for-loop over dfWeather.iloc. I tried the following:

dfTempReduced = pd.DataFrame(columns = dfWeather.columns)    
dfTempReduced.append(dfWeather.iloc[0])    
dfWeather.apply(lambda x: dfTempReduced = dfTempReduced.append(x) if np.abs(TempReduced[-1].TEMP - x.TEMP) >= 1 else None, axis = 1)

Unfortunately, I get the error:

SyntaxError: expression cannot contain assignment, perhaps you meant "=="? 

Is there a fast way to get that desired result? Thanks!

EDIT: Here is some example data:

dfWeather[200:220].TEMP
Out[208]: 
200    12.28
201    12.31
202    12.28
203    12.28
204    12.24
205    12.21
206    12.17
207    11.93
208    11.83
209    11.76
210    11.66
211    11.55
212    11.48
213    11.43
214    11.37
215    11.33
216    11.36
217    11.33
218    11.29
219    11.27

For this sample, the desired result contains only the first and the last entry, since the last entry is the only one that differs from the first by more than 1 (12.28 vs. 11.27). The first entry is always included.

CodePudding user response:

If you don't need this to be recursive (recursive meaning: given [1, 2, 3] you keep [1, 3], because 2 is only 1 degree larger than 1, whereas 3 is more than 1 degree larger than 1, even though it is not more than 1 degree larger than 2), then you can simply use diff, which compares each value with its direct predecessor.

However, this doesn't work if consecutive changes stay below the 1°C threshold for a long time, because a slow drift is never detected. To overcome this limitation, you could round the values first (to whatever precision you like, but a 1°C threshold suggests that rounding to zero decimal places is a good idea ;) ).
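If you do need the recursive behaviour (every row compared with the last row that was kept), a plain loop over a NumPy array is one way to sketch it. This is only a minimal sketch under assumptions: the function name reduce_by_threshold and its parameters are made up for illustration, the comparison uses >= as in the question's attempt, and the sample values are the ones posted in the question.

```python
import numpy as np
import pandas as pd


def reduce_by_threshold(df, column="TEMP", threshold=1.0):
    """Keep rows whose value differs by >= threshold from the last kept row.

    Hypothetical helper, not part of pandas.
    """
    values = df[column].to_numpy()
    if len(values) == 0:
        return df
    keep = np.zeros(len(values), dtype=bool)
    keep[0] = True            # the first entry is always kept
    last_kept = values[0]
    for i in range(1, len(values)):
        if abs(values[i] - last_kept) >= threshold:
            keep[i] = True
            last_kept = values[i]   # later rows are compared with this one
    return df[keep]


# Sample data from the question (rows 200 to 219)
sample = pd.DataFrame(
    {"TEMP": [12.28, 12.31, 12.28, 12.28, 12.24, 12.21, 12.17, 11.93, 11.83,
              11.76, 11.66, 11.55, 11.48, 11.43, 11.37, 11.33, 11.36, 11.33,
              11.29, 11.27]},
    index=range(200, 220),
)
reduced = reduce_by_threshold(sample)
print(reduced)  # rows 200 and 219
```

The loop runs over a plain array rather than over dfWeather.iloc, which avoids most of the per-row pandas overhead, but it is still a Python loop and has not been benchmarked here.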

Let us create an example:

import pandas as pd
import numpy as np

df = pd.DataFrame()
df['TEMP'] = np.random.rand(100) * 2

So now, if you are OK with using diff, it can be done very efficiently:

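# either slice: keep rows whose rounded value differs from the previous row's
rounded = df['TEMP'].round()
keep = rounded.diff().abs() >= 1
keep.iloc[0] = True  # diff() is NaN for the first row; always keep it
df_sliced = df[keep]

# or drop: remove rows whose rounded value equals the previous row's
repeated = rounded.diff().abs() < 1
df.drop(index=repeated[repeated].index, inplace=True)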

So you even have two options for the reduction. I guess that drop takes slightly longer but is more memory efficient than slicing.
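To see what the rounding trick does on the sample from the question, here is a small sketch. It assumes that a change of at least one rounded degree counts as relevant and that the first row is forced in; the variable names are chosen for illustration only.

```python
import pandas as pd

# Sample data from the question (rows 200 to 219)
temps = pd.Series(
    [12.28, 12.31, 12.28, 12.28, 12.24, 12.21, 12.17, 11.93, 11.83, 11.76,
     11.66, 11.55, 11.48, 11.43, 11.37, 11.33, 11.36, 11.33, 11.29, 11.27],
    index=range(200, 220),
    name="TEMP",
)

rounded = temps.round()            # 12.0 up to row 211, 11.0 from row 212 on
keep = rounded.diff().abs() >= 1   # True where the rounded value changes
keep.iloc[0] = True                # always keep the first entry
kept = temps[keep]
print(kept)  # rows 200 and 212
```

Note that this keeps rows 200 and 212, not 200 and 219 as asked for in the question: rounding marks the point where the value crosses 11.5, not the point where it is 1 degree away from the last kept entry. So it is an approximation of the recursive result, not a replacement for it.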
