I have a CSV file containing values for 1988-2020. The first two columns contain the year and month (1988,1 ... 2020,12). The remaining columns contain the values for each day of the month. However, values are missing for particular dates, and some whole months are missing as well. The missing values need to be replaced with "NA" or "-99", and leap years must be taken into account. I need to convert this data into the following format, without skipping any date from 01-01-1988 to 31-12-2020:
Column1,Column2
"dd-mm-yyyy","Value"
CodePudding user response:
Here is my attempt at reshaping your data into one row per day from 01-01-1988 to 31-12-2020, with the days of every missing month filled with NaN.
I may have missed something, so please check the result yourself and let me know if anything is wrong:
import pandas as pd
import numpy as np
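# parse the first two columns (year, month) as a single date column
# (note: recent pandas releases deprecate this nested-list form of parse_dates)
df = pd.read_csv('/path_to_file/Data.csv', parse_dates=[[0, 1]], header=None)
# rename columns: 0 is the date, 1..31 are the days, 32 is a trailing column
df.columns = range(0, 33)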
# inspecting the df has shown that the row with index 2016-12-01 is invalid
# and the column 32 is empty
df = df.drop(columns=32).set_index(0).drop(index='2016-12-01')
# fill missing values on days 1-28 with -99; days 29-31 stay NaN so that
# nonexistent dates (e.g. 30 February) can be dropped after flattening
df.iloc[:, 0:28] = df.iloc[:, 0:28].fillna(value=-99, axis=0)
# reformat index
df.index = df.index.strftime('%m-%Y')
print(df)
Intermediate output:
1 2 3 4 5 6 7 8 9 10 ... 22 \
0 ...
01-1988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
02-1988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
03-1988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
04-1988 0.0 0.0 0.0 0.0 0.0 0.0 36.1 0.0 0.0 0.0 ... 0.0
01-1990 9.8 13.0 0.0 0.0 0.0 0.0 16.8 26.0 10.0 0.0 ... 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
08-2020 0.0 8.0 15.0 9.0 8.0 13.0 0.0 12.0 0.0 0.0 ... 0.0
09-2020 49.0 15.0 7.0 9.0 5.0 22.0 0.0 0.0 5.0 3.0 ... 0.0
10-2020 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
11-2020 32.0 4.0 26.0 20.0 10.0 0.0 10.0 0.0 0.0 0.0 ... 0.0
12-2020 0.0 0.0 3.0 2.0 0.0 8.0 6.0 23.0 3.0 0.0 ... 0.0
23 24 25 26 27 28 29 30 31
0
01-1988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
02-1988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
03-1988 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
04-1988 0.0 35.3 48.0 0.0 4.0 0.0 0.0 0.0 NaN
01-1990 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
08-2020 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 36.0
09-2020 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
10-2020 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11-2020 0.0 0.0 0.0 0.0 32.0 2.0 5.0 0.0 NaN
12-2020 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[199 rows x 31 columns]
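The expansion step that follows depends on `days_in_month` returning the right month length, including 29 days for February in leap years. A small check, as a sketch (the ISO-style month labels are my own choice for illustration, not the index format used above):

```python
import pandas as pd

# Month labels are illustrative; ISO 'YYYY-MM' strings parse unambiguously
february_leap = pd.Period('1988-02', freq='M').days_in_month
february_common = pd.Period('1989-02', freq='M').days_in_month
april = pd.Period('1988-04', freq='M').days_in_month

# Total number of calendar days the final result should contain
n_days = len(pd.date_range('1988-01-01', '2020-12-31', freq='D'))
print(february_leap, february_common, april, n_days)
```

The last number is the row count to expect from the daily frame once every date in the period is present.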
# flatten df into 1-d array
flattened = df.to_numpy().flatten()
# expand months in df into days
valid_days = [pd.date_range(start=m, periods=pd.Period(m).days_in_month, freq='D')
              for m in df.index]
# create datetime index by days
valid_days = pd.to_datetime([item for sublist in valid_days
                             for item in sublist]).strftime('%d-%m-%Y')
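# create a new DataFrame from the flattened values and the daily index;
# the remaining NaNs (days 29-31 that do not exist in a month) are dropped,
# which assumes no real value is missing on days 29-31 of an existing month
df_days = pd.DataFrame(flattened[~np.isnan(flattened)],
                       index=valid_days,
                       columns=['value'])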
# create the range of all days within the period (ISO dates avoid day/month ambiguity)
total_days = pd.date_range('1988-01-01', '2020-12-31', freq='D').strftime('%d-%m-%Y')
# reindex df_days with all days within the period
df_days = df_days.reindex(total_days)
print(df_days)
Output:
value
01-01-1988 0.0
02-01-1988 0.0
03-01-1988 0.0
04-01-1988 0.0
05-01-1988 0.0
... ...
27-12-2020 0.0
28-12-2020 0.0
29-12-2020 0.0
30-12-2020 0.0
31-12-2020 0.0
12054 rows × 1 columns
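The question asks for gaps to be written as "NA" or "-99", while the frame above still holds NaN for the missing months. A minimal sketch of both options, using a small hypothetical stand-in for `df_days` (the four dates and values below are made up for illustration):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for df_days: four days with one gap
days = pd.date_range('2020-02-27', '2020-03-01', freq='D').strftime('%d-%m-%Y')
df_days = pd.DataFrame({'value': [0.0, np.nan, 3.5, 1.0]}, index=days)

# Option 1: keep NaN in memory and write it out as the text "NA"
buf = io.StringIO()  # a real file path could be passed to to_csv instead
df_days.to_csv(buf, index_label='date', na_rep='NA')
csv_text = buf.getvalue()

# Option 2: replace NaN with the numeric sentinel -99
df_filled = df_days.fillna(-99)
print(csv_text)
print(df_filled)
```

Writing "NA" only at export time keeps the column numeric in memory, which is usually easier to work with than a column mixing numbers and strings.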
I would be glad to know how to solve this in a much more elegant way.
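One possibly more compact route, offered as a sketch rather than a tested replacement: melt the day columns into rows, let `pd.to_datetime` turn impossible dates into NaT, and reindex over the full period. The tiny `wide` frame and the column names `year`, `month`, `day` and `value` are assumptions made for this illustration, not the layout of the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: year, month, then one column per day (1..31).
# Only February and April 1988 are present; March is missing entirely.
wide = pd.DataFrame(
    [[1988, 2] + [0.0] * 29 + [np.nan, np.nan],
     [1988, 4] + [1.0] * 30 + [np.nan]],
    columns=['year', 'month'] + list(range(1, 32)),
)
wide.loc[0, 5] = np.nan  # a genuinely missing value on 05-02-1988

# One row per (year, month, day) combination
tidy = wide.melt(id_vars=['year', 'month'], var_name='day', value_name='value')
tidy['day'] = tidy['day'].astype(int)

# Impossible dates (e.g. 30 February) become NaT and are dropped
tidy['date'] = pd.to_datetime(tidy[['year', 'month', 'day']], errors='coerce')
tidy = tidy.dropna(subset=['date']).set_index('date').sort_index()

# Reindex over the whole period so that missing months show up as NaN
all_days = pd.date_range('1988-02-01', '1988-04-30', freq='D')
daily = tidy['value'].reindex(all_days)
daily.index = daily.index.strftime('%d-%m-%Y')
print(daily)
```

Because nonexistent dates are identified by the calendar rather than by NaN, a real gap on day 29, 30 or 31 is kept as a missing value instead of shifting the alignment. For the full data set the period would run from '1988-01-01' to '2020-12-31'.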