Split cell values and put them into different lists

I have a data frame in which one column holds a variable number of values per cell: some cells contain one value, others two, and so on. The values are separated with '-'. I want to take each value, depending on its position in the cell, and put it in a list.

For example:

import pandas as pd

df = pd.DataFrame()
df['Name'] = ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam', 'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa']
df['IQ'] = [120, 100, 90, 80, 110]
df['Scores'] = [80, 75, 100, 77, 100]

print(df)

I want to separate the names, so that for example, the first list would contain only the first names: ['Sam', 'Sam', 'Robert', 'Jack', 'Billa']

And the second list would have the second names, in row order: ['Joe', 'Sam', 'Daniel']

How can I do that? Thanks!

CodePudding user response：

new columns

Use regex with str.extract:

df[['First', 'Second']] = df['Name'].str.extract(r'([^-]+)(?:-([^-]+))?')

Or take a subset of the str.split output (split is worthwhile if you need to extract more than two names; otherwise prefer extract, which should be more efficient):

N = 2 # number of names to extract
# adapt the assignment below to the number of columns
df[['First', 'Second']] = df['Name'].str.split('-', expand=True, n=N)[range(N)]

output:

                               Name   IQ  Scores   First  Second
0                               Sam  120      80     Sam     NaN
1                 Sam-Joe-Ron-Tania  100      75     Sam     Joe
2                        Robert-Sam   90     100  Robert     Sam
3  Jack-Daniel-Sam-Joe-Billy-Robert   80      77    Jack  Daniel
4                             Billa  110     100   Billa     NaN
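
As a rough sketch of how the split approach might scale to more names, the column labels can be generated rather than typed by hand. The value of N and the labels Name1, Name2, ... are assumptions made for illustration; they do not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam',
             'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa'],
})

N = 3  # number of leading names to keep (hypothetical choice)

# Split on every '-', then keep only the first N positional columns.
parts = df['Name'].str.split('-', expand=True).iloc[:, :N]

# Generate the column labels (made-up names) from the number of columns kept.
parts.columns = [f'Name{i + 1}' for i in range(parts.shape[1])]

out = df.join(parts)
print(out)
```

Rows with fewer than N names are left with missing values in the trailing columns.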

python lists

If you really want lists:

d = df['Name'].str.extract(r'([^-]+)(?:-([^-]+))?')

l1 = d[0].dropna().to_list()
# ['Sam', 'Sam', 'Robert', 'Jack', 'Billa']

l2 = d[1].dropna().to_list()
# ['Joe', 'Sam', 'Daniel']

Or in one command:

l1, l2 = (df['Name'].str.extract(r'([^-]+)(?:-([^-]+))?')
          .apply(lambda s: s.dropna().to_list())
         )
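
If the names are already available as a plain Python list, the same positional grouping can be sketched without pandas, using itertools.zip_longest from the standard library. This is a different technique from the str.extract approach above, and the variable names are hypothetical.

```python
from itertools import zip_longest

names = ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam',
         'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa']

# One list of names per row.
split_rows = [s.split('-') for s in names]

# zip_longest transposes the rows and pads short rows with None;
# the inner comprehension drops that padding again.
by_position = [
    [name for name in column if name is not None]
    for column in zip_longest(*split_rows)
]

l1, l2 = by_position[:2]
print(l1)
print(l2)
```

Here by_position holds one list per position, so later positions are available as by_position[2], by_position[3], and so on.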

CodePudding user response：

Solutions that generate nested lists by splitting the Name column with Series.str.split (each of the three statements below produces the same L):

L = [[y for y in x if pd.notna(y)] for x in 
                      df['Name'].str.split('-', expand=True).to_numpy().T]

L = df['Name'].str.split('-', expand=True).stack().groupby(level=1).agg(list).tolist()

L = [v.dropna().tolist() for k, v in 
         df['Name'].str.split('-', expand=True).to_dict('series').items()]

print (L)
[['Sam', 'Sam', 'Robert', 'Jack', 'Billa'], 
 ['Joe', 'Sam', 'Daniel'], 
 ['Ron', 'Sam'], 
 ['Tania', 'Joe'], 
 ['Billy'], 
 ['Robert']]

To select one of the lists, use indexing:

print (L[0])
['Sam', 'Sam', 'Robert', 'Jack', 'Billa']

print (L[1])
['Joe', 'Sam', 'Daniel']

print (L[2])
['Ron', 'Sam']

Details:

df1 = df['Name'].str.split('-', expand=True)
print (df1)
        0       1     2      3      4       5
0     Sam    None  None   None   None    None
1     Sam     Joe   Ron  Tania   None    None
2  Robert     Sam  None   None   None    None
3    Jack  Daniel   Sam    Joe  Billy  Robert
4   Billa    None  None   None   None    None
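
To make the link between this intermediate frame and the lists explicit, here is a small sketch: column k of the expanded frame holds the k-th name of every row, and dropping the missing entries of a column yields the corresponding list. The names second and counts are introduced only for illustration.

```python
import pandas as pd

names = pd.Series(['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam',
                   'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa'], name='Name')

# One column per position; rows with fewer names are padded with missing values.
df1 = names.str.split('-', expand=True)

# Dropping the padding from column 1 gives the list of second names.
second = df1[1].dropna().tolist()

# Number of real (non-missing) names found at each position.
counts = df1.notna().sum().tolist()

print(second)
print(counts)
```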

To add the split names as new columns:

df = df.join(df['Name'].str.split('-', expand=True).add_prefix('names'))
print (df)
                               Name   IQ  Scores  names0  names1 names2  \
0                               Sam  120      80     Sam    None   None   
1                 Sam-Joe-Ron-Tania  100      75     Sam     Joe    Ron   
2                        Robert-Sam   90     100  Robert     Sam   None   
3  Jack-Daniel-Sam-Joe-Billy-Robert   80      77    Jack  Daniel    Sam   
4                             Billa  110     100   Billa    None   None   

  names3 names4  names5  
0   None   None    None  
1  Tania   None    None  
2   None   None    None  
3    Joe  Billy  Robert  
4   None   None    None