I have data that is in this inconvenient format. Simple reproducible example below:
26/9/21 26/9/21
10:00 Paul
12:00 John
27/9/21 27/9/21
1:00 Ringo
As you can see, the dates have not been entered as a column. Instead, the dates repeat across rows as a "header" row for the rows below it. Each date then has a variable number of data rows beneath it, before the next date "header" row.
The output I would like would be:
26/9/21 10:00 Paul
26/9/21 12:00 John
27/9/21 1:00 Ringo
How can I do this in Python and Pandas?
Code for data entry below:
import pandas as pd
df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
df
CodePudding user response:
Convert your column a
to datetime with errors='coerce'
then fill forward. Now you can add the time offset rows.
sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
msk = sra.isnull()
sra = sra.ffill() pd.to_timedelta(df.loc[msk, 'a'] ':00')
out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
>>> out
a b
1 2021-09-26 10:00:00 John
2 2021-09-26 12:00:00 Paul
4 2021-09-27 01:00:00 Ringo
Step by step:
>>> sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
0 2021-09-26
1 NaT
2 NaT
3 2021-09-27
4 NaT
Name: a, dtype: datetime64[ns]
>>> msk = sra.isnull()
0 False
1 True
2 True
3 False
4 True
Name: a, dtype: bool
>>> sra = sra.ffill() pd.to_timedelta(df.loc[msk, 'a'] ':00')
0 NaT
1 2021-09-26 10:00:00
2 2021-09-26 12:00:00
3 NaT
4 2021-09-27 01:00:00
Name: a, dtype: datetime64[ns]
>>> out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
a b
1 2021-09-26 10:00:00 John
2 2021-09-26 12:00:00 Paul
4 2021-09-27 01:00:00 Ringo
CodePudding user response:
Following is simple to understand code, reading original dataframe row by row and creating a new dataframe:
df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
dflen = len(df)
newrow = []; newdata = []
for i in range(dflen): # read each row one by one
if '/' in df.iloc[i,0]: # if date found
item0 = df.iloc[i,0] # get new date
newrow = [item0] # put date as first entry of new row
continue # go to next row
newrow.append(df.iloc[i,0]) # add time
newrow.append(df.iloc[i,1]) # add name
newdata.append(newrow) # add row to new data
newrow = [item0] # create new row with same date entry
newdf = pd.DataFrame(newdata, columns=['Date','Time','Name']) # create new dataframe;
print(newdf)
Output:
Date Time Name
0 26/9/21 10:00 Paul
1 26/9/21 12:00 John
2 27/9/21 1:00 Ringo