Fixing dataframe with two data formats for rows-CodePudding

I have data that is in this inconvenient format. Simple reproducible example below:

26/9/21 26/9/21
10:00     Paul
12:00     John
27/9/21 27/9/21
1:00      Ringo

As you can see, the dates have not been entered as a column. Instead, the dates repeat across rows as a "header" row for the rows below it. Each date then has a variable number of data rows beneath it, before the next date "header" row.

The output I would like would be:

26/9/21 10:00     Paul
26/9/21 12:00     John
27/9/21 1:00      Ringo

How can I do this in Python and Pandas?

Code for data entry below:

import pandas as pd

df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'], 
                   'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})

df

CodePudding user response：

Convert your column a to datetime with errors='coerce' then fill forward. Now you can add the time offset rows.

sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
msk = sra.isnull()
sra = sra.ffill()   pd.to_timedelta(df.loc[msk, 'a']   ':00')

out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)

>>> out
                    a      b
1 2021-09-26 10:00:00   John
2 2021-09-26 12:00:00   Paul
4 2021-09-27 01:00:00  Ringo

Step by step:

>>> sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
0   2021-09-26
1          NaT
2          NaT
3   2021-09-27
4          NaT
Name: a, dtype: datetime64[ns]

>>> msk = sra.isnull()
0    False
1     True
2     True
3    False
4     True
Name: a, dtype: bool

>>> sra = sra.ffill()   pd.to_timedelta(df.loc[msk, 'a']   ':00')
0                   NaT
1   2021-09-26 10:00:00
2   2021-09-26 12:00:00
3                   NaT
4   2021-09-27 01:00:00
Name: a, dtype: datetime64[ns]

>>> out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
                    a      b
1 2021-09-26 10:00:00   John
2 2021-09-26 12:00:00   Paul
4 2021-09-27 01:00:00  Ringo

CodePudding user response：

Following is simple to understand code, reading original dataframe row by row and creating a new dataframe:

df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'], 
                   'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
dflen = len(df)
newrow = []; newdata = []
for i in range(dflen):              # read each row one by one
    if '/' in df.iloc[i,0]:         # if date found
        item0 = df.iloc[i,0]            # get new date
        newrow = [item0]                # put date as first entry of new row
        continue                        # go to next row
    newrow.append(df.iloc[i,0])     # add time
    newrow.append(df.iloc[i,1])     # add name
    newdata.append(newrow)          # add row to new data 
    newrow = [item0]                # create new row with same date entry

newdf = pd.DataFrame(newdata, columns=['Date','Time','Name'])  # create new dataframe; 
print(newdf)

Output:

      Date   Time   Name
0  26/9/21  10:00   Paul
1  26/9/21  12:00   John
2  27/9/21   1:00  Ringo