Home > front end >  Is there a fast way to slice a Pandas dataframe by index so that missing rows are filled with NaNs?
Is there a fast way to slice a Pandas dataframe by index so that missing rows are filled with NaNs?

Time:10-13

I need to assign a stack of rows from a pd.DataFrame by index to a matrix of known size as efficiently as possible. Many if the indices will not exist in the dataframe, and these rows must be filled with NaNs. This operation will happen inside a loop iterating over each row in the dataframe and must be as fast as possible.

In short, I want to speed up the following sequence:

# DF to iterate over
df = pd.DataFrame({'a': [1,2,3,4], 'b': [2,4,6,8], 'c': [3,6,9,12]})

# (Relative) DF indices to return if possible
indices = [-2, 0, 2, 4]

len_df = len(df)
len_indices = len(indices)
len_cols = len(df.columns)

values = np.zeros((len_df, len_indices, len_cols))

for i in range(len_df):
   for j, n in enumerate(indices):
      idx = i   n
      try:
         assert idx > 0  # avoid wrapping around
         values[i, j] = df.iloc[idx]
      except:
         values[i, j] = np.nan

values  

Returns

[[[nan, nan, nan],
  [  1,   2,   3],
  [  3,   6,   9],
  [nan, nan, nan]],

 [[nan, nan, nan],
   [  2,   4,   6],
   [  4,   8,  12],
   [nan, nan, nan]],

  [[  1,   2,   3],
   [  3,   6,   9],
   [nan, nan, nan],
   [nan, nan, nan]],

  [[  2,   4,   6],
   [  4,   8,  12],
   [nan, nan, nan],
   [nan, nan, nan]]]

As desired, but is quite slow. Any suggestions?

As desired

CodePudding user response:

Here's a solution that builds a lookup table:

# repeat indices into new dimension
rep_ids = np.tile(indices, (len(df), 1))

# add row index to the tiled ids
range_ids = rep_ids   np.arange(len(df))[:, np.newaxis]

# check which ids make a valid lookup in the df
# TODO: do you really mean > 0 here in the original code, not >= 0?
valid_test = (range_ids > 0) & (range_ids < len(df))

# lookup table, using 0 for invalid ids (will be overwritten later)
lookup_ids = np.where(valid_test, range_ids, 0)

# lookup all values
values = df.values[lookup_ids].astype(np.float64)

# set the invalids to nan
values[~valid_test] = np.nan
  • Related