I have half written a code and got stuck at 2nd half.
I have pulled info from a text doc and I have placed the info into pandas dataset column with data like
Job : one
Time :17:00
Date : Tuesday
Cost:200
Job:Two
Time :18:00
Location :England
Job :Three
Time :12:00
Date :Monday
Location :England
Cost:1200
I have pulled all unique categories into another dataset and made them into columns
Job|Time|Date|Location|Cost
Now I want to cycle through the 1st dataset and fill out Each new job per row into the relevant columns, happy for the columns that don't have data to be blank...
Job|Time|Date|Location|Cost
One|17:00|Tuesday| |200
Two|18:00| |England|
Three|12:00|Monday|England|1200
This is example. I actually have 33 unique categories.
I have no idea how to start coding this part (or if it's possible). New to scripting so been happy to get this far..
I am using pandas as putting info into tables helps my brain because I am pulling info from a text doc and python likes to squash it all up but not sure if using pandas is limiting my options.
Any help /tips appreciated
Not sure if helps but my code so far
CodePudding user response:
You can split data
column by the first occurrence of ;
into two columns, and group by the Job
delimiter then transpose each group.
df[['key', 'val']] = df['data'].str.split(':', 1, expand=True)
df['key'] = df['key'].str.strip()
df['val'] = df['val'].str.strip()
m = (df['key'] == 'Job').cumsum()
df_ = df[['key', 'val']].groupby(m).apply(lambda group: group.set_index('key').T).reset_index(drop=True)
print(df_)
key Job Time Date Cost Location
0 one 17:00 Tuesday 200 NaN
1 Two 18:00 NaN NaN England
2 Three 12:00 Monday 1200 England
CodePudding user response:
You can split
, compute an index with cumsum
and pivot
.
(df['col']
.str.split(r'\s*\:\s*', n=1, expand=True)
.assign(index=lambda d: d[0].eq('Job').cumsum())
.pivot('index', 0, 1)
# lines below are optional
.reset_index(drop=True)
.rename_axis(columns=None)
.reindex(columns=['Job','Time','Date','Location','Cost'])
)
Output:
Job Time Date Location Cost
0 one 17:00 Tuesday NaN 200
1 Two 18:00 NaN England NaN
2 Three 12:00 Monday England 1200
The other (optional) lines of code are just to clean up the output (axes names, index, order of columns...). I used reindex
to reorder the columns to prevent an error if a label is missing, but if you are sure the names are present, a simple slice would work: [['Job','Time','Date','Location','Cost']]