What is the efficient way to find missing rows of a dataframe and put NaN for columns?

Time:11-30

Consider a dataframe whose first column is a datetime and whose other columns hold the data recorded at that datetime. Data is collected hourly, so the first column of every row is one hour after the previous row. In this dataframe, data for some datetimes is missing. I want to build a new dataframe in which the missing rows are filled in with the corresponding datetime and NaN for the other columns.

I tried reading the dataframe from a CSV as the first DF, then building a second, empty DF in a loop, creating a datetime for every hour chronologically. For each hour I copy the data from the first DF into the second, and if the first DF has no data for that datetime I put NaN in the row.

This works for me, but it is very slow: it takes 3 days to run for 70,000 rows, and I suspect there is an efficient, pythonic way to do this.

I guess there is a better way, like this one, but I need it to work with datetimes.

I'm looking for an answer like the one in Replacing one data frame value from another based on timestamp Criterion, but for datetimes.

CodePudding user response:

I think you could create a df with the timestamp as its index.

You can then use pd.date_range to create a full datetime range for every hour from min to max.

You can then use Index.difference to efficiently find any timestamps that are missing from your original dataframe; these become the index of a new df holding the missing rows.

Then just fill in the missing columns with NaN.

import pandas as pd
import numpy as np

# name of your datetime column
datetime_col = 'datetime'
 
# mock up some data
data = {
    datetime_col: [
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00'],
    'extra_col1': ['b', 'c', 'd', 'e'],
    'extra_col2': ['g', 'h', 'i', 'j'],
}

df = pd.DataFrame(data)
 
# Setting the Date values as index
df = df.set_index(datetime_col)
 
# to_datetime() method converts string
# format to a DateTime object
df.index = pd.to_datetime(df.index)
 
# create df of missing dates from the sequence
# starting from the min datetime to the max, at hourly intervals
new_df = pd.DataFrame(
    pd.date_range(
        start=df.index.min(), 
        end=df.index.max(),
        freq='H'
    ).difference(df.index)
)

# you will need to add these columns to your df
missing_columns = [col for col in df.columns if col!=datetime_col]

# add null data
new_df[missing_columns] = np.nan

# fix column names
new_df.columns = [datetime_col] + missing_columns

new_df
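Note that `new_df` above holds only the missing rows. To get the complete hourly dataframe the question asks for, those rows still have to be combined with the original data. A minimal sketch of that last step (not part of the original answer), reusing the same mock data and stacking the two pieces with `pd.concat`:

```python
import pandas as pd
import numpy as np

datetime_col = 'datetime'

# same mock data as above, indexed by datetime
df = pd.DataFrame({
    datetime_col: [
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00'],
    'extra_col1': ['b', 'c', 'd', 'e'],
    'extra_col2': ['g', 'h', 'i', 'j'],
}).set_index(datetime_col)
df.index = pd.to_datetime(df.index)

# timestamps present in the full hourly range but absent from df
missing_index = pd.date_range(
    start=df.index.min(), end=df.index.max(), freq='h'
).difference(df.index)

# all-NaN rows for the missing hours, same columns as df
missing_df = pd.DataFrame(np.nan, index=missing_index, columns=df.columns)

# stack original and missing rows, then restore chronological order
full_df = pd.concat([df, missing_df]).sort_index()
print(full_df)
```

With the mock data this yields seven rows (00:00 through 06:00), with NaN in the data columns at 02:00, 04:00, and 05:00.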

CodePudding user response:

I am not sure I follow exactly what you need, i.e. what the frequency of the datetimes you are trying to complete is, but assuming it is hourly, you could try something along these lines:

  1. Use the pd.date_range(start_date, end_date, freq='H') function from pandas to create a DataFrame with all the hourly times you need (one column, with the same name as the datetime column in your initial DataFrame). See the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
  2. Use the pd.merge(initial_df, complete_df, how='outer') function to perform an outer merge between the two dataframes. If I am not mistaken, all columns of the rows that had no match in the initial DataFrame are filled with NaNs by default.

Reproducible example below using Matt's example:

import pandas as pd
import numpy as np
 
# mock up some data
data = {
    'date': [
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00'],
    'extra_col1': ['b', 'c', 'd', 'e'],
    'extra_col2': ['g', 'h', 'i', 'j'],
}

df = pd.DataFrame(data)
 
# Use to_datetime() method to convert string
# format to a DateTime object
df['date'] = pd.to_datetime(df['date'])
 
# Create df with missing dates from the sequence
# starting from the min datetime to the max, at hourly intervals
new_df = pd.DataFrame(
    {'date': pd.date_range(
        start=df['date'].min(), 
        end=df['date'].max(),
        freq='H'
    )}
)

# Use the merge function to perform an outer merge
# and reorder the date column
result_df = pd.merge(df, new_df, how='outer')
result_df.sort_values(by='date', ascending=True, inplace=True)
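As a side note (not from either answer above): assuming the datetime values are unique, `DataFrame.reindex` can produce the same completed frame in a single chained call, since reindexing against the full hourly range inserts an all-NaN row for every missing timestamp:

```python
import pandas as pd

# same mock data as in the answers above
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2021-01-18 00:00:00', '2021-01-18 01:00:00',
        '2021-01-18 03:00:00', '2021-01-18 06:00:00']),
    'extra_col1': ['b', 'c', 'd', 'e'],
    'extra_col2': ['g', 'h', 'i', 'j'],
})

# the complete hourly range from first to last observation
full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='h')

result_df = (
    df.set_index('date')
      .reindex(full_range)   # adds an all-NaN row for each missing hour
      .rename_axis('date')
      .reset_index()
)
print(result_df)
```

This avoids the separate merge-and-sort step because `reindex` already returns the rows in the order of `full_range`.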