Home > Mobile >  How to relocate different data that is in a single column to their respective columns?
How to relocate different data that is in a single column to their respective columns?

Time:09-20

I have a dataframe whose data are strings and different information are mixed in a single column. Like this:

0 Place: House
1 Date/Time: 01/02/03 at 09:30
2 Color:Yellow
3 Place: Street
4 Date/Time: 12/12/13 at 13:21:21
5 Color:Red
df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])

I need the dataframe like this:

Place Date/Time Color
House 01/02/03 Yellow
Street 21/12/13 Red

I started by converting the excel file to csv, and then I tried to open it as follows:

df = pd.read_csv(filename, sep=":")

I tried using the ":" to separate the columns, but the time formatting also uses ":", so it didn't work. The time is not important information so I even tried to delete it and keep the date, but I couldn't find a way that wouldn't affect the other information in the column either.

CodePudding user response:

Given the values in your data, you will need to limit the split to just happen once, which you can do with n parameter of split. You can expand the split values into two columns then pivot.

The trick here is to create a grouping by taking the df.index // 3 as the index, so that every 3 lines is in a new group.

df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])

df = df[0].str.split(':', n=1, expand=True)
df['idx'] = df.index//3
df.pivot(index='idx', columns=0, values=1).reset_index().drop(columns='idx')[['Place','Date/Time','Color']]

Output

0    Place              Date/Time   Color
0    House      01/02/03 at 09:30  Yellow
1   Street   21/12/13 at 13:21:21     Red

CodePudding user response:

Your data is all strings, IMO you are likely to get better performance wrangling it within vanilla python, before bringing it back into Pandas; the only time you are likely to get better performance for strings in Pandas is if you are using the pyarrow string data type.

from collections import defaultdict
out = df.squeeze().tolist() # this works since it is just one column
frame = defaultdict(list)
for entry in out:
    key, value = entry.split(':', maxsplit=1)
    if key == "Date/Time":
        value = value.split('at')[0]
    value = value.strip()
    key = key.strip() # not really necessary
    frame[key].append(value)
pd.DataFrame(frame)
    Place Date/Time   Color
0   House  01/02/03  Yellow
1  Street  21/12/13     Red
  • Related