How do I split rows to add more columns in pandas dataframe?-CodePudding

I have started learning about pandas quite recently. I have a dataframe of the following form:

          From Date           To Date  PM2.5   PM10     NO   NO2
0  01-03-2017 00:00  01-03-2017 00:30  12.04  10.61  15.12  5.81
1  01-03-2017 00:30  01-03-2017 01:00   9.26    9.5  17.11  6.74  
2  01-03-2017 01:00  01-03-2017 01:30   8.68   7.93  18.36  7.51  
3  01-03-2017 01:30  01-03-2017 02:00   7.57   7.22  19.63   7.7   
.
.
.
100  01-01-2018 00:00  01-01-2018 00:30   7.32   6.34  21.01  8.44
,,,,,
<Garbage row 1>, <val>, <val> 
<Garbage row 2>, <val>, <val>
          From Date           To Date  Ozone    CO2    NOx   SO2
0  01-03-2017 00:00  01-03-2017 00:30  12.04  10.61  15.12  5.81
1  01-03-2017 00:30  01-03-2017 01:00   9.26    9.5  17.11  6.74  
2  01-03-2017 01:00  01-03-2017 01:30   8.68   7.93  18.36  7.51  
3  01-03-2017 01:30  01-03-2017 02:00   7.57   7.22  19.63   7.7   
.
.
.
100  01-01-2018 00:00  01-01-2018 00:30   7.32   6.34  21.01  8.44 
,,,,,
<Garbage row 1>, <val>, <val> 
<Garbage row 2>, <val>, <val>
.
.
.

And this "structure" repeats. Notice that the dates of structures are well aligned. First I want to delete the garbage rows and then on the basis of the dates, I would like to remove the latter rows and instead make them columns. For example, the output should be:

          From Date           To Date  PM2.5   PM10     NO   NO2  Ozone   CO2   Nox   SO2 ...
0  01-03-2017 00:00  01-03-2017 00:30  12.04  10.61  15.12  5.81  12.04  10.61  15.12  5.81  ...
1  01-03-2017 00:30  01-03-2017 01:00   9.26    9.5  17.11  6.74   9.26    9.5  17.11  6.74  ...
2  01-03-2017 01:00  01-03-2017 01:30   8.68   7.93  18.36  7.51   8.68   7.93  18.36  7.51  ...
3  01-03-2017 01:30  01-03-2017 02:00   7.57   7.22  19.63   7.7   7.57   7.22  19.63   7.7  ...
.
.
.
100  01-01-2018 00:00  01-01-2018 00:30   7.32   6.34  21.01  8.44   7.32   6.34  21.01  8.44  ...

I thought about doing this using basic string manipulations but it was turning out to be quite hard and I am not aware of how to do this using pandas. It would be great if there was any built-in functionality in pandas to do the same.

CodePudding user response：

With the following CSV file:

df = pd.read_csv("file.csv", sep=";", on_bad_lines="skip")

print(df)
# Output
              From Date           To Date  PM2.5   PM10     NO   NO2
0   0  01-03-2017 00:00  01-03-2017 00:30  12.04  10.61  15.12  5.81
1   1  01-03-2017 00:30  01-03-2017 01:00   9.26    9.5  17.11  6.74
2   2  01-03-2017 01:00  01-03-2017 01:30   8.68   7.93  18.36  7.51
3   3  01-03-2017 01:30  01-03-2017 02:00   7.57   7.22  19.63   7.7
4                                   ,,,,,    NaN    NaN    NaN   NaN
5           <Garbage row 1>, <val>, <val>    NaN    NaN    NaN   NaN
6           <Garbage row 2>, <val>, <val>    NaN    NaN    NaN   NaN
7             From Date           To Date  Ozone    CO2    NOx   SO2
8   0  01-03-2017 00:00  01-03-2017 00:30  12.04  10.61  15.12  5.81
9   1  01-03-2017 00:30  01-03-2017 01:00   9.26    9.5  17.11  6.74
10  2  01-03-2017 01:00  01-03-2017 01:30   8.68   7.93  18.36  7.51
11  3  01-03-2017 01:30  01-03-2017 02:00   7.57   7.22  19.63   7.7
12                                  ,,,,,    NaN    NaN    NaN   NaN
13          <Garbage row 1>, <val>, <val>    NaN    NaN    NaN   NaN
14          <Garbage row 2>, <val>, <val>    NaN    NaN    NaN   NaN
15                                      .    NaN    NaN    NaN   NaN
16                                      .    NaN    NaN    NaN   NaN
17                                      .    NaN    NaN    NaN   NaN

Here is one way to do it:

# Cleanup
df = df.dropna(how="any").reset_index(drop=True)

# Deal with the merged column
df["From Date           To Date"] = df["From Date           To Date"].str.split("  ")
df["From Date"] = df["From Date           To Date"].apply(lambda x: x[1])
df["To Date"] = df["From Date           To Date"].apply(lambda x: x[-1])
df = df.drop(columns="From Date           To Date")

# Separate the data in different dataframes
dfs = []
start = 0
for idx in df[df["To Date"].str.contains("To Date")].index:
    dfs.append(df.iloc[start:idx, :].reset_index(drop=True))
    start = start   idx
else:
    dfs.append(df.iloc[start:, :].reset_index(drop=True))

# Get rid of useless rows and columns
for i, df_ in enumerate(dfs[1:]):
    temp = df_.loc[0, :]
    df_.columns = temp
    dfs[i   1] = df_.iloc[1:, 0:4].reset_index(drop=True)

# Merge back all the dataframes and reorder columns
df = pd.concat(dfs, axis=1)
df = df.reindex(
    columns=["From Date", "To Date"]
      [col for col in df.columns if col not in ["From Date", "To Date"]]
)

Finally:

print(df)
# Output
          From Date           To Date  PM2.5   PM10     NO   NO2  Ozone  \
0  01-03-2017 00:00  01-03-2017 00:30  12.04  10.61  15.12  5.81  12.04   
1  01-03-2017 00:30  01-03-2017 01:00   9.26    9.5  17.11  6.74   9.26   
2  01-03-2017 01:00  01-03-2017 01:30   8.68   7.93  18.36  7.51   8.68   
3  01-03-2017 01:30  01-03-2017 02:00   7.57   7.22  19.63   7.7   7.57   

     CO2    NOx   SO2  
0  10.61  15.12  5.81  
1    9.5  17.11  6.74  
2   7.93  18.36  7.51  
3   7.22  19.63   7.7