How to relocate different data that is in a single column to their respective columns?-CodePudding

I have a dataframe whose data are strings and different information are mixed in a single column. Like this:


0	Place: House
1	Date/Time: 01/02/03 at 09:30
2	Color:Yellow
3	Place: Street
4	Date/Time: 12/12/13 at 13:21:21
5	Color:Red

df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])

I need the dataframe like this:

Place	Date/Time	Color
House	01/02/03	Yellow
Street	21/12/13	Red

I started by converting the excel file to csv, and then I tried to open it as follows:

df = pd.read_csv(filename, sep=":")

I tried using the ":" to separate the columns, but the time formatting also uses ":", so it didn't work. The time is not important information so I even tried to delete it and keep the date, but I couldn't find a way that wouldn't affect the other information in the column either.

CodePudding user response：

Given the values in your data, you will need to limit the split to just happen once, which you can do with n parameter of split. You can expand the split values into two columns then pivot.

The trick here is to create a grouping by taking the df.index // 3 as the index, so that every 3 lines is in a new group.

df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])

df = df[0].str.split(':', n=1, expand=True)
df['idx'] = df.index//3
df.pivot(index='idx', columns=0, values=1).reset_index().drop(columns='idx')[['Place','Date/Time','Color']]

Output

0    Place              Date/Time   Color
0    House      01/02/03 at 09:30  Yellow
1   Street   21/12/13 at 13:21:21     Red

CodePudding user response：

Your data is all strings, IMO you are likely to get better performance wrangling it within vanilla python, before bringing it back into Pandas; the only time you are likely to get better performance for strings in Pandas is if you are using the pyarrow string data type.

from collections import defaultdict
out = df.squeeze().tolist() # this works since it is just one column
frame = defaultdict(list)
for entry in out:
    key, value = entry.split(':', maxsplit=1)
    if key == "Date/Time":
        value = value.split('at')[0]
    value = value.strip()
    key = key.strip() # not really necessary
    frame[key].append(value)
pd.DataFrame(frame)
    Place Date/Time   Color
0   House  01/02/03  Yellow
1  Street  21/12/13     Red