I have a series s whose entries are lists, for example [1, 2, 3, NaN, NaN] or [4, 5]. These lists may contain NaNs as the last few elements, and I want to drop all entries in this series that contain NaN. So far I have used s.transform(lambda x: np.nan if np.isnan(x).any() else x).dropna(), but this takes over a minute on just 21 million rows, and I eventually plan to do this with tens of billions of rows, so I need something fast. Thank you!
To emphasize, each entry in the series is a list, so I cannot just use s.dropna(): no entry is itself NaN, since the entries are lists. I want to delete the lists (entries) that CONTAIN NaN. This is what the series s might look like: pd.Series([[1, 2, 3, NaN, NaN], [4, 5], ...]).
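For reference, here is the current approach as a runnable sketch on a two-row sample (the real series has 21 million rows):
import numpy as np
import pandas as pd

# Small stand-in for the real series of lists
s = pd.Series([[1, 2, 3, np.nan, np.nan], [4, 5]])

# Current (slow) approach: map any list containing NaN to NaN, then drop it
s = s.transform(lambda x: np.nan if np.isnan(x).any() else x).dropna()
print(s)
# 1    [4, 5]
# dtype: object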
CodePudding user response:
Assuming this input:
import pandas as pd
from numpy import NaN

s = pd.Series([[1, 2, 3, NaN, NaN], [4, 5]])
You can use:
s2 = s[s.explode().notna().groupby(level=0).all()]
Or, with a list comprehension:
s2 = s[[pd.Series(x).notna().all() for x in s]]
output:
1 [4, 5]
dtype: object
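The trick here is that explode() flattens every list element onto its own row while repeating the original index, so grouping on level 0 maps each element back to its source entry. A quick illustration with the same input:
import pandas as pd
from numpy import NaN

s = pd.Series([[1, 2, 3, NaN, NaN], [4, 5]])
print(s.explode())
# 0      1
# 0      2
# 0      3
# 0    NaN
# 0    NaN
# 1      4
# 1      5
# dtype: object

# notna() flags each element, and groupby(level=0).all() is True only
# for indices whose every element is non-null, i.e. the rows to keep.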
CodePudding user response:
You can identify the index positions that hold a NaN in the exploded series and then filter the original series for the entries whose index is not in that array:
import numpy as np
import pandas as pd

ser = pd.DataFrame(data={"col": [[1, 2, 3, np.nan, np.nan], [3, 4, 5], [3, 9], [np.nan, 10]]})['col']
ser_exploded = ser.explode()
ser[~ser.index.isin(np.unique(ser_exploded[ser_exploded.isna()].index))]
--------------------------------------
1 [3, 4, 5]
2 [3, 9]
Name: col, dtype: object
--------------------------------------
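For illustration, the intermediate index array on the same sample data holds the labels of the entries to drop:
import numpy as np
import pandas as pd

ser = pd.Series([[1, 2, 3, np.nan, np.nan], [3, 4, 5], [3, 9], [np.nan, 10]], name="col")
ser_exploded = ser.explode()

# Index labels of the exploded rows that are NaN; np.unique collapses
# duplicates (entry 0 contributes two NaNs but appears only once)
print(np.unique(ser_exploded[ser_exploded.isna()].index))
# [0 3]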
CodePudding user response:
An alternative with multiprocessing:
import pandas as pd
import numpy as np
import multiprocessing as mp
import time
def check_nan(s):
    # True for each original index whose list contains at least one NaN
    return s.explode().isna().groupby(level=0).max()

if __name__ == '__main__':  # Do not remove this line! Mandatory
    # Setup a minimal reproducible example
    N = 10_000_000
    s = pd.Series([[1, 2, 3, np.NaN, np.NaN], [4, 5]]).repeat(N)
    s = s.sample(frac=1, ignore_index=True)

    CHUNKSIZE = 10_000
    start = time.time()
    with mp.Pool(mp.cpu_count() - 1) as p:
        results = p.map(check_nan, (s[i:i+CHUNKSIZE] for i in range(0, len(s), CHUNKSIZE)))
    m = pd.concat(results)
    s = s[~m]
    end = time.time()
    print(f"Elapsed time: {end - start:.2f} seconds")
For 20,000,000 records:
[...]$ python mp.py
Elapsed time: 1.58 seconds
Note: without mp, the execution time is 6.07 seconds:
start = time.time()
m = s.explode().isna().groupby(level=0).max()
s1 = s[~m]
end = time.time()
print(f"Elapsed time: {end - start:.2f} seconds")
CodePudding user response:
You can convert the series to a DataFrame, explode it, drop all the NA rows, and then aggregate the elements back into lists keyed by the original index. Note that this strips the NaN values from each list rather than dropping the whole entry:
s.to_frame().reset_index().explode(0).dropna().groupby('index')[0].agg(list)
INPUT
0 [1, 2, 3, nan, nan]
1 [4, 5]
RESULT
0 [1, 2, 3]
1 [4, 5]
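A runnable sketch of the same chain with each step annotated, assuming the sample series from the question:
import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3, np.nan, np.nan], [4, 5]])
out = (
    s.to_frame()        # single column named 0
     .reset_index()     # keep the original row labels in an 'index' column
     .explode(0)        # one element per row
     .dropna()          # drop the rows holding NaN elements
     .groupby('index')[0]
     .agg(list)         # rebuild a list per original row
)
print(out)
# index
# 0    [1, 2, 3]
# 1       [4, 5]
# Name: 0, dtype: object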
CodePudding user response:
If I understood your question properly, filtering rows using a mask that tests for NaN values should work:
import pandas as pd
from numpy import nan as NaN
s = pd.Series([[1, 2, 3, NaN, NaN], [4, 5]])
s = s[~s.apply(lambda list1: any(pd.isna(x) for x in list1))]
print(s)
1 [4, 5]
dtype: object