I have a DataFrame in which one column holds one or more values per cell, separated by '-'. I want to collect the values by their position within the cell, with one list per position.
For example:
import pandas as pd
df = pd.DataFrame()
df['Name'] = ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam', 'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa']
df['IQ'] = [120, 100, 90, 80, 110]
df['Scores'] = [80, 75, 100, 77, 100]
print(df)
I want to separate the names so that, for example, the first list contains only the first names: ['Sam', 'Sam', 'Robert', 'Jack', 'Billa']
And the second list contains the second names, in order: ['Joe', 'Sam', 'Daniel']
How can I do that? Thanks!
CodePudding user response:
new columns
Use a regex with str.extract:
df[['First', 'Second']] = df['Name'].str.extract('([^-]+)(?:-([^-]+))?')
Or take a subset of the str.split result (split is worth using if you need more than two names; otherwise prefer extract, which will be noticeably more efficient):
N = 2 # number of names to extract
# adapt the assignment below to the number of columns
df[['First', 'Second']] = df['Name'].str.split('-', expand=True, n=N)[range(N)]
output:
                               Name   IQ  Scores   First  Second
0                               Sam  120      80     Sam     NaN
1                 Sam-Joe-Ron-Tania  100      75     Sam     Joe
2                        Robert-Sam   90     100  Robert     Sam
3  Jack-Daniel-Sam-Joe-Billy-Robert   80      77    Jack  Daniel
4                             Billa  110     100   Billa     NaN
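To avoid hand-writing the column list when N changes, the split variant can be sketched as below. The value N = 3 and the column names name_1, name_2, ... are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam',
             'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa'],
})

N = 3  # number of leading names to keep (illustrative value)

# Split at most N times; anything after the Nth separator stays in one extra column.
parts = df['Name'].str.split('-', n=N, expand=True)

# Keep exactly N columns, creating empty ones if no row has that many names.
parts = parts.reindex(columns=list(range(N)))

# Hypothetical column names: name_1, name_2, ...
parts.columns = [f'name_{i + 1}' for i in range(N)]

out = df.join(parts)
print(out)
```

The reindex step is there so that the assignment does not fail when N is larger than the number of names in any row.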
python lists
If you really want lists:
d = df['Name'].str.extract('([^-]+)(?:-([^-]+))?')
l1 = d[0].dropna().to_list()
# ['Sam', 'Sam', 'Robert', 'Jack', 'Billa']
l2 = d[1].dropna().to_list()
# ['Joe', 'Sam', 'Daniel']
Or in one command:
l1, l2 = (df['Name'].str.extract('([^-]+)(?:-([^-]+))?')
          .apply(lambda s: s.dropna().to_list())
          )
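If the pattern looks opaque, it can be tried on its own with Python's re module. This is only a quick check, assuming the pattern is '([^-]+)(?:-([^-]+))?': one or more non-dash characters for the first name, then an optional group made of a dash followed by the second name.

```python
import re

# Group 1: first name. Group 2 (optional): the name after the first dash.
pattern = re.compile(r'([^-]+)(?:-([^-]+))?')

samples = ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam']
groups = [pattern.match(s).groups() for s in samples]

for s, g in zip(samples, groups):
    print(s, '->', g)
```

A cell with a single name gives None for the second group, which is why str.extract produces NaN there and dropna() is needed before building the lists.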
CodePudding user response:
Solutions that generate nested lists from the split Name column, using Series.str.split:
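# Option 1: transpose the expanded frame and drop missing values
L = [[y for y in x if pd.notna(y)] for x in
     df['Name'].str.split('-', expand=True).to_numpy().T]
# Option 2: stack, then group by the column (position) level
L = df['Name'].str.split('-', expand=True).stack().groupby(level=1).agg(list).tolist()
# Option 3: iterate over the columns as Series
L = [v.dropna().tolist() for k, v in
     df['Name'].str.split('-', expand=True).to_dict('series').items()]
print(L)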
[['Sam', 'Sam', 'Robert', 'Jack', 'Billa'],
['Joe', 'Sam', 'Daniel'],
['Ron', 'Sam'],
['Tania', 'Joe'],
['Billy'],
['Robert']]
To select one list, use indexing:
print (L[0])
['Sam', 'Sam', 'Robert', 'Jack', 'Billa']
print (L[1])
['Joe', 'Sam', 'Daniel']
print (L[2])
['Ron', 'Sam']
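If only one position is needed, building the whole nested list may be unnecessary. A minimal sketch, assuming a hypothetical helper called names_at (not from the answer), splits each cell into a list and picks one element with .str.get:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam',
             'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa'],
})

def names_at(series, position, sep='-'):
    """Hypothetical helper: names found at a 0-based position, missing ones dropped."""
    # Without expand=True, str.split yields one list per row;
    # .str.get returns NaN for rows that are too short.
    return series.str.split(sep).str.get(position).dropna().tolist()

third = names_at(df['Name'], 2)
last = names_at(df['Name'], 5)
print(third)
print(last)
```

A position past the longest row simply gives an empty list.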
Details:
df1 = df['Name'].str.split('-', expand=True)
print (df1)
        0       1     2      3      4       5
0     Sam    None  None   None   None    None
1     Sam     Joe   Ron  Tania   None    None
2  Robert     Sam  None   None   None    None
3    Jack  Daniel   Sam    Joe  Billy  Robert
4   Billa    None  None   None   None    None
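The same idea (split each row, transpose so that each list holds one position, drop the padding) can also be sketched without pandas. This swaps in itertools.zip_longest from the standard library and is not one of the pandas solutions above; the variable names are illustrative.

```python
from itertools import zip_longest

names = ['Sam', 'Sam-Joe-Ron-Tania', 'Robert-Sam',
         'Jack-Daniel-Sam-Joe-Billy-Robert', 'Billa']

# One list of names per row
split_rows = [s.split('-') for s in names]

# zip_longest transposes rows into positions, padding short rows with None;
# the inner comprehension then removes that padding.
by_position = [[x for x in col if x is not None]
               for col in zip_longest(*split_rows)]

print(by_position[0])
print(by_position[1])
```

This may be enough when the names are already in a plain Python list and the DataFrame is not otherwise needed.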
For new columns:
df = df.join(df['Name'].str.split('-', expand=True).add_prefix('names'))
print (df)
                               Name   IQ  Scores  names0  names1  names2  \
0                               Sam  120      80     Sam    None    None
1                 Sam-Joe-Ron-Tania  100      75     Sam     Joe     Ron
2                        Robert-Sam   90     100  Robert     Sam    None
3  Jack-Daniel-Sam-Joe-Billy-Robert   80      77    Jack  Daniel     Sam
4                             Billa  110     100   Billa    None    None

   names3  names4  names5
0    None    None    None
1   Tania    None    None
2    None    None    None
3     Joe   Billy  Robert
4    None    None    None