How to pull any cells from a table/dataframe into a column if they contain specific string?-CodePudding

I am using Python in CoLab and I am trying to find something that will allow me to move any cells from a subset of a data frame into a new/different column in the same data frame OR sort the cells of the dataframe into the correct columns.

The original column in the CSV looked like this:

and using

Users[['Motorbike', 'Car', 'Bus', 'Train', 'Tram', 'Taxi']] = Users['What distance did you travel in the last month by:'].str.split(',', expand=True)

I was able to split the column into 6 new series to give this

However, now I would like all the cells with 'Motorbike' in the motorbike column, all the cells wih 'Car' in the Car column and so on, without overwriting any other cells OR if this cannot be done, to just assign any occurances of Motorbike, Car etc into the new columns 'Motorbike1', 'Car1' etc. that I have added to the dataframe as shown below. Can anyone help please? new columns

I have tried to copy the cells in original columns to the new columns and then get rid of values containing say not 'Car' However repeating for the next original column into the same first new column it overwrites. There are no repeats of any mode of transport in any row. i.e there is only one or less occurrence of each mode of transport in every row.

CodePudding user response：

Use list comprehension with split for dictionaries, then pass to DataFrame constructor:

L = [dict([y.split() for y in x.split(',')])
      for x in df['What distance did you travel in the last month by:']]

df = pd.DataFrame(L)
print (df)
        Taxi  Motorbike         Car      Train        Bus       Tram
0   (km)(20)        NaN         NaN        NaN        NaN        NaN
1        NaN  (km)(500)   (km)(500)        NaN        NaN        NaN
2        NaN        NaN  (km)(1000)        NaN        NaN        NaN
3        NaN        NaN   (km)(100)   (km)(20)        NaN        NaN
4   (km)(25)        NaN         NaN  (km)(700)  (km)(150)        NaN
5        NaN    (km)(0)     (km)(0)        NaN   (km)(40)        NaN
6  (km)(100)        NaN   (km)(300)        NaN        NaN        NaN
7        NaN        NaN   (km)(300)        NaN        NaN        NaN
8        NaN        NaN         NaN   (km)(80)        NaN  (km)(300)
9        NaN        NaN   (km)(700)        NaN   (km)(50)   (km)(50)

CodePudding user response：

You can use a regex to extract the xxx (yyy)(yyy) parts, then reshape:

out = (df['col_name']
 .str.extractall(r'([^,] ) (\([^,]*\))')
 .set_index(0, append=True)[1]
 .droplevel('match')
 .unstack(0)
)

output:

         Bus         Car  Motorbike       Taxi      Train       Tram
0        NaN         NaN        NaN   (km)(20)        NaN        NaN
1        NaN   (km)(500)  (km)(500)        NaN        NaN        NaN
2        NaN  (km)(1000)        NaN        NaN        NaN        NaN
3        NaN   (km)(100)        NaN        NaN   (km)(20)        NaN
4  (km)(150)         NaN        NaN   (km)(25)  (km)(700)        NaN
5   (km)(40)     (km)(0)    (km)(0)        NaN        NaN        NaN
6        NaN   (km)(300)        NaN  (km)(100)        NaN        NaN
7        NaN   (km)(300)        NaN        NaN        NaN        NaN
8        NaN         NaN        NaN        NaN   (km)(80)  (km)(300)
9   (km)(50)   (km)(700)        NaN        NaN        NaN   (km)(50)

If you only need the numbers, you can change the regex:

(df['col_name'].str.extractall(r'([^,] )\s \(km\)\((\d )\)')
 .set_index(0, append=True)[1]
 .droplevel('match')
 .unstack(0).rename_axis(columns=None)
)

Output:

   Bus   Car Motorbike Taxi Train Tram
0  NaN   NaN       NaN   20   NaN  NaN
1  NaN   500       500  NaN   NaN  NaN
2  NaN  1000       NaN  NaN   NaN  NaN
3  NaN   100       NaN  NaN    20  NaN
4  150   NaN       NaN   25   700  NaN
5   40     0         0  NaN   NaN  NaN
6  NaN   300       NaN  100   NaN  NaN
7  NaN   300       NaN  NaN   NaN  NaN
8  NaN   NaN       NaN  NaN    80  300
9   50   700       NaN  NaN   NaN   50