I have started learning about pandas
quite recently. I have a dataframe of the following form:
From Date To Date PM2.5 PM10 NO NO2
0 01-03-2017 00:00 01-03-2017 00:30 12.04 10.61 15.12 5.81
1 01-03-2017 00:30 01-03-2017 01:00 9.26 9.5 17.11 6.74
2 01-03-2017 01:00 01-03-2017 01:30 8.68 7.93 18.36 7.51
3 01-03-2017 01:30 01-03-2017 02:00 7.57 7.22 19.63 7.7
.
.
.
100 01-01-2018 00:00 01-01-2018 00:30 7.32 6.34 21.01 8.44
,,,,,
<Garbage row 1>, <val>, <val>
<Garbage row 2>, <val>, <val>
From Date To Date Ozone CO2 NOx SO2
0 01-03-2017 00:00 01-03-2017 00:30 12.04 10.61 15.12 5.81
1 01-03-2017 00:30 01-03-2017 01:00 9.26 9.5 17.11 6.74
2 01-03-2017 01:00 01-03-2017 01:30 8.68 7.93 18.36 7.51
3 01-03-2017 01:30 01-03-2017 02:00 7.57 7.22 19.63 7.7
.
.
.
100 01-01-2018 00:00 01-01-2018 00:30 7.32 6.34 21.01 8.44
,,,,,
<Garbage row 1>, <val>, <val>
<Garbage row 2>, <val>, <val>
.
.
.
And this "structure" repeats. Notice that the dates of structures are well aligned. First I want to delete the garbage rows and then on the basis of the dates, I would like to remove the latter rows and instead make them columns. For example, the output should be:
From Date To Date PM2.5 PM10 NO NO2 Ozone CO2 Nox SO2 ...
0 01-03-2017 00:00 01-03-2017 00:30 12.04 10.61 15.12 5.81 12.04 10.61 15.12 5.81 ...
1 01-03-2017 00:30 01-03-2017 01:00 9.26 9.5 17.11 6.74 9.26 9.5 17.11 6.74 ...
2 01-03-2017 01:00 01-03-2017 01:30 8.68 7.93 18.36 7.51 8.68 7.93 18.36 7.51 ...
3 01-03-2017 01:30 01-03-2017 02:00 7.57 7.22 19.63 7.7 7.57 7.22 19.63 7.7 ...
.
.
.
100 01-01-2018 00:00 01-01-2018 00:30 7.32 6.34 21.01 8.44 7.32 6.34 21.01 8.44 ...
I thought about doing this using basic string manipulations but it was turning out to be quite hard and I am not aware of how to do this using pandas
. It would be great if there was any built-in functionality in pandas to do the same.
CodePudding user response:
With the following CSV file:
df = pd.read_csv("file.csv", sep=";", on_bad_lines="skip")
print(df)
# Output
From Date To Date PM2.5 PM10 NO NO2
0 0 01-03-2017 00:00 01-03-2017 00:30 12.04 10.61 15.12 5.81
1 1 01-03-2017 00:30 01-03-2017 01:00 9.26 9.5 17.11 6.74
2 2 01-03-2017 01:00 01-03-2017 01:30 8.68 7.93 18.36 7.51
3 3 01-03-2017 01:30 01-03-2017 02:00 7.57 7.22 19.63 7.7
4 ,,,,, NaN NaN NaN NaN
5 <Garbage row 1>, <val>, <val> NaN NaN NaN NaN
6 <Garbage row 2>, <val>, <val> NaN NaN NaN NaN
7 From Date To Date Ozone CO2 NOx SO2
8 0 01-03-2017 00:00 01-03-2017 00:30 12.04 10.61 15.12 5.81
9 1 01-03-2017 00:30 01-03-2017 01:00 9.26 9.5 17.11 6.74
10 2 01-03-2017 01:00 01-03-2017 01:30 8.68 7.93 18.36 7.51
11 3 01-03-2017 01:30 01-03-2017 02:00 7.57 7.22 19.63 7.7
12 ,,,,, NaN NaN NaN NaN
13 <Garbage row 1>, <val>, <val> NaN NaN NaN NaN
14 <Garbage row 2>, <val>, <val> NaN NaN NaN NaN
15 . NaN NaN NaN NaN
16 . NaN NaN NaN NaN
17 . NaN NaN NaN NaN
Here is one way to do it:
# Cleanup
df = df.dropna(how="any").reset_index(drop=True)
# Deal with the merged column
df["From Date To Date"] = df["From Date To Date"].str.split(" ")
df["From Date"] = df["From Date To Date"].apply(lambda x: x[1])
df["To Date"] = df["From Date To Date"].apply(lambda x: x[-1])
df = df.drop(columns="From Date To Date")
# Separate the data in different dataframes
dfs = []
start = 0
for idx in df[df["To Date"].str.contains("To Date")].index:
dfs.append(df.iloc[start:idx, :].reset_index(drop=True))
start = start idx
else:
dfs.append(df.iloc[start:, :].reset_index(drop=True))
# Get rid of useless rows and columns
for i, df_ in enumerate(dfs[1:]):
temp = df_.loc[0, :]
df_.columns = temp
dfs[i 1] = df_.iloc[1:, 0:4].reset_index(drop=True)
# Merge back all the dataframes and reorder columns
df = pd.concat(dfs, axis=1)
df = df.reindex(
columns=["From Date", "To Date"]
[col for col in df.columns if col not in ["From Date", "To Date"]]
)
Finally:
print(df)
# Output
From Date To Date PM2.5 PM10 NO NO2 Ozone \
0 01-03-2017 00:00 01-03-2017 00:30 12.04 10.61 15.12 5.81 12.04
1 01-03-2017 00:30 01-03-2017 01:00 9.26 9.5 17.11 6.74 9.26
2 01-03-2017 01:00 01-03-2017 01:30 8.68 7.93 18.36 7.51 8.68
3 01-03-2017 01:30 01-03-2017 02:00 7.57 7.22 19.63 7.7 7.57
CO2 NOx SO2
0 10.61 15.12 5.81
1 9.5 17.11 6.74
2 7.93 18.36 7.51
3 7.22 19.63 7.7