I have a dataframe whose data are strings and different information are mixed in a single column. Like this:
0 | Place: House |
1 | Date/Time: 01/02/03 at 09:30 |
2 | Color:Yellow |
3 | Place: Street |
4 | Date/Time: 12/12/13 at 13:21:21 |
5 | Color:Red |
df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
I need the dataframe like this:
Place | Date/Time | Color |
---|---|---|
House | 01/02/03 | Yellow |
Street | 21/12/13 | Red |
I started by converting the excel file to csv, and then I tried to open it as follows:
df = pd.read_csv(filename, sep=":")
I tried using the ":" to separate the columns, but the time formatting also uses ":", so it didn't work. The time is not important information so I even tried to delete it and keep the date, but I couldn't find a way that wouldn't affect the other information in the column either.
CodePudding user response:
Given the values in your data, you will need to limit the split to just happen once, which you can do with n
parameter of split. You can expand the split values into two columns then pivot.
The trick here is to create a grouping by taking the df.index // 3
as the index, so that every 3 lines is in a new group.
df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
df = df[0].str.split(':', n=1, expand=True)
df['idx'] = df.index//3
df.pivot(index='idx', columns=0, values=1).reset_index().drop(columns='idx')[['Place','Date/Time','Color']]
Output
0 Place Date/Time Color
0 House 01/02/03 at 09:30 Yellow
1 Street 21/12/13 at 13:21:21 Red
CodePudding user response:
Your data is all strings, IMO you are likely to get better performance wrangling it within vanilla python, before bringing it back into Pandas; the only time you are likely to get better performance for strings in Pandas is if you are using the pyarrow string data type.
from collections import defaultdict
out = df.squeeze().tolist() # this works since it is just one column
frame = defaultdict(list)
for entry in out:
key, value = entry.split(':', maxsplit=1)
if key == "Date/Time":
value = value.split('at')[0]
value = value.strip()
key = key.strip() # not really necessary
frame[key].append(value)
pd.DataFrame(frame)
Place Date/Time Color
0 House 01/02/03 Yellow
1 Street 21/12/13 Red