I am using Python in CoLab and I am trying to find something that will allow me to move any cells from a subset of a data frame into a new/different column in the same data frame OR sort the cells of the dataframe into the correct columns.
The original column in the CSV looked like this:
and using
Users[['Motorbike', 'Car', 'Bus', 'Train', 'Tram', 'Taxi']] = Users['What distance did you travel in the last month by:'].str.split(',', expand=True)
I was able to split the column into 6 new series to give this
However, now I would like all the cells with 'Motorbike' in the motorbike column, all the cells wih 'Car' in the Car column and so on, without overwriting any other cells OR if this cannot be done, to just assign any occurances of Motorbike, Car etc into the new columns 'Motorbike1', 'Car1' etc. that I have added to the dataframe as shown below. Can anyone help please? new columns
I have tried to copy the cells in original columns to the new columns and then get rid of values containing say not 'Car' However repeating for the next original column into the same first new column it overwrites. There are no repeats of any mode of transport in any row. i.e there is only one or less occurrence of each mode of transport in every row.
CodePudding user response:
Use list comprehension with split for dictionaries, then pass to DataFrame constructor:
L = [dict([y.split() for y in x.split(',')])
for x in df['What distance did you travel in the last month by:']]
df = pd.DataFrame(L)
print (df)
Taxi Motorbike Car Train Bus Tram
0 (km)(20) NaN NaN NaN NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN NaN (km)(1000) NaN NaN NaN
3 NaN NaN (km)(100) (km)(20) NaN NaN
4 (km)(25) NaN NaN (km)(700) (km)(150) NaN
5 NaN (km)(0) (km)(0) NaN (km)(40) NaN
6 (km)(100) NaN (km)(300) NaN NaN NaN
7 NaN NaN (km)(300) NaN NaN NaN
8 NaN NaN NaN (km)(80) NaN (km)(300)
9 NaN NaN (km)(700) NaN (km)(50) (km)(50)
CodePudding user response:
You can use a regex to extract the xxx (yyy)(yyy)
parts, then reshape:
out = (df['col_name']
.str.extractall(r'([^,] ) (\([^,]*\))')
.set_index(0, append=True)[1]
.droplevel('match')
.unstack(0)
)
output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN (km)(20) NaN NaN
1 NaN (km)(500) (km)(500) NaN NaN NaN
2 NaN (km)(1000) NaN NaN NaN NaN
3 NaN (km)(100) NaN NaN (km)(20) NaN
4 (km)(150) NaN NaN (km)(25) (km)(700) NaN
5 (km)(40) (km)(0) (km)(0) NaN NaN NaN
6 NaN (km)(300) NaN (km)(100) NaN NaN
7 NaN (km)(300) NaN NaN NaN NaN
8 NaN NaN NaN NaN (km)(80) (km)(300)
9 (km)(50) (km)(700) NaN NaN NaN (km)(50)
If you only need the numbers, you can change the regex:
(df['col_name'].str.extractall(r'([^,] )\s \(km\)\((\d )\)')
.set_index(0, append=True)[1]
.droplevel('match')
.unstack(0).rename_axis(columns=None)
)
Output:
Bus Car Motorbike Taxi Train Tram
0 NaN NaN NaN 20 NaN NaN
1 NaN 500 500 NaN NaN NaN
2 NaN 1000 NaN NaN NaN NaN
3 NaN 100 NaN NaN 20 NaN
4 150 NaN NaN 25 700 NaN
5 40 0 0 NaN NaN NaN
6 NaN 300 NaN 100 NaN NaN
7 NaN 300 NaN NaN NaN NaN
8 NaN NaN NaN NaN 80 300
9 50 700 NaN NaN NaN 50