Python pandas - can I sort 1 column dataset into rows of matching data in another dataset-CodePudding

I have half written a code and got stuck at 2nd half.

I have pulled info from a text doc and I have placed the info into pandas dataset column with data like

Job : one
Time :17:00
Date : Tuesday
Cost:200
Job:Two
Time :18:00
Location :England
Job :Three
Time :12:00
Date :Monday
Location :England
Cost:1200

I have pulled all unique categories into another dataset and made them into columns

Job|Time|Date|Location|Cost

Now I want to cycle through the 1st dataset and fill out Each new job per row into the relevant columns, happy for the columns that don't have data to be blank...

Job|Time|Date|Location|Cost
One|17:00|Tuesday|    |200
Two|18:00|       |England|
Three|12:00|Monday|England|1200

This is example. I actually have 33 unique categories.

I have no idea how to start coding this part (or if it's possible). New to scripting so been happy to get this far..

I am using pandas as putting info into tables helps my brain because I am pulling info from a text doc and python likes to squash it all up but not sure if using pandas is limiting my options.

Any help /tips appreciated

Not sure if helps but my code so far

CodePudding user response：

You can split data column by the first occurrence of ; into two columns, and group by the Job delimiter then transpose each group.

df[['key', 'val']] = df['data'].str.split(':', 1, expand=True)
df['key'] = df['key'].str.strip()
df['val'] = df['val'].str.strip()

m = (df['key'] == 'Job').cumsum()

df_ = df[['key', 'val']].groupby(m).apply(lambda group: group.set_index('key').T).reset_index(drop=True)

print(df_)

key    Job   Time     Date  Cost Location
0      one  17:00  Tuesday   200      NaN
1      Two  18:00      NaN   NaN  England
2    Three  12:00   Monday  1200  England

CodePudding user response：

You can split, compute an index with cumsum and pivot.

(df['col']
 .str.split(r'\s*\:\s*', n=1, expand=True)
 .assign(index=lambda d: d[0].eq('Job').cumsum())
 .pivot('index', 0, 1)
 # lines below are optional
 .reset_index(drop=True)
 .rename_axis(columns=None)
 .reindex(columns=['Job','Time','Date','Location','Cost'])
 )

Output:

     Job   Time     Date Location  Cost
0    one  17:00  Tuesday      NaN   200
1    Two  18:00      NaN  England   NaN
2  Three  12:00   Monday  England  1200

The other (optional) lines of code are just to clean up the output (axes names, index, order of columns...). I used reindex to reorder the columns to prevent an error if a label is missing, but if you are sure the names are present, a simple slice would work: [['Job','Time','Date','Location','Cost']]