I have a pandas dataframe where some of the rows have multiple entries. I would like to match up a list I have to the third column. I have tried different things, but it isn't working for some reason.
Current df
username_list= ["charles23", "ems12", "", "sam34", "jon134", "", "jy19"]
ID1 ID2 passcode
01 01 Charlie233, Emily13
01 02
01 03 Sam310, John12
01 04
01 05 Jake42
Desired df
ID1 ID2 passcode username
01 01 Charlie233, Emily13 charles23, ems12
01 02
01 03 Sam310, John12 sam34, jon134
01 04
01 05 Jake42 jy19
What I tried
df = df.assign(passcode = df["passcode"].str.split(",")).explode(column="passcode").assign(username=username_list).groupby(["ID1", "ID2"])["passcode", "username"].agg(list)
df.assign(
    passcode=df["passcode"].apply(lambda x: ", ".join(x) if x else ""),
    username=df["username"].apply(lambda x: ", ".join(x))
).reset_index()
ValueError: Length of values (1000) does not match length of index (1008)
I don't know why this error keeps happening given that I checked len(username_list) == len(df["passcode"])
Answer:
You can do:
df['pl'] = df['passcode'].str.split(',').str.len()
df['pi'] = df['pl'].cumsum() - df['pl']
df['username'] = df.apply(lambda x: username_list[x['pi']:x['pi'] + x['pl']],
                          axis=1).str.join(',')
df.drop(['pi', 'pl'], axis=1, inplace=True)
output (print(df)):
ID1 ID2 passcode username
0 1 1 Charlie233, Emily13 charles23,ems12
1 1 2
2 1 3 Sam310, John12 sam34,jon134
3 1 4
4 1 5 Jake42 jy19
Explanation:
Let's look at username_list first. Its entries are aligned with the comma-separated values in passcode, with an empty string standing in for each empty row. What we still need to know is where each row's slice of the list starts and stops. The trick below gives us that:
df['pl'] = df['passcode'].str.split(',').str.len()
df['pi'] = df['pl'].cumsum()-df['pl']
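where pl is the number of comma-separated passcodes in the row, and pi is the index in username_list at which that row's usernames start (the running total of pl over all previous rows). Note that splitting an empty string still yields one element, so an empty row gets pl = 1, which matches the single "" placeholder in username_list. This assumes the empty cells hold empty strings rather than NaN.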
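To make pl and pi concrete, here is a small sketch that computes them for the sample data from the question. It assumes the empty passcode cells are empty strings (not NaN) and that ID1/ID2 are stored as strings; neither is confirmed by the question. The two values are kept as standalone Series here rather than as columns, purely for illustration.

```python
import pandas as pd

# Sample frame from the question; empty cells are assumed to be "" (not NaN).
df = pd.DataFrame({
    "ID1": ["01"] * 5,
    "ID2": ["01", "02", "03", "04", "05"],
    "passcode": ["Charlie233, Emily13", "", "Sam310, John12", "", "Jake42"],
})

# pl: how many comma-separated entries each row holds ("" still counts as one).
pl = df["passcode"].str.split(",").str.len()

# pi: where each row's slice of username_list starts.
pi = pl.cumsum() - pl

print(pl.tolist())  # [2, 1, 2, 1, 1]
print(pi.tolist())  # [0, 2, 3, 5, 6]
```

So row 0 takes username_list[0:2], row 1 takes username_list[2:3], and so on.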
Then use pi and pl to slice username_list for each row, and join each resulting slice with ',' using the pandas str.join method:
df['username'] = df.apply(
    lambda x: username_list[x['pi']:x['pi'] + x['pl']], axis=1).str.join(',')
Finally, drop the helper columns pi and pl:
df.drop(['pi', 'pl'], axis=1, inplace=True)
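For reference, here is a self-contained sketch that puts the steps together on the sample data. The helper name attach_usernames and the length check are my own additions, not part of pandas. The sketch also swaps df.apply for a plain list comprehension, which avoids the temporary columns. The check is there because the ValueError in the question (1000 values against 1008 exploded rows) suggests the real list may not have one entry per passcode slot, and without a check the slicing would silently misalign. It assumes the passcode column contains no NaN.

```python
import pandas as pd


def attach_usernames(df, usernames, sep=","):
    """Hypothetical helper: give each row its slice of a flat username list."""
    # Number of separated entries per row; an empty string counts as one slot.
    counts = df["passcode"].str.split(sep).str.len()

    # Fail loudly if the list cannot be lined up with the passcode slots.
    total = int(counts.sum())
    if total != len(usernames):
        raise ValueError(
            f"passcode column has {total} slots, list has {len(usernames)} entries"
        )

    # Start offset of each row's slice = running total of the previous counts.
    starts = counts.cumsum() - counts

    out = df.copy()
    out["username"] = [
        sep.join(usernames[s:s + n]) for s, n in zip(starts, counts)
    ]
    return out


username_list = ["charles23", "ems12", "", "sam34", "jon134", "", "jy19"]
df = pd.DataFrame({
    "ID1": ["01"] * 5,
    "ID2": ["01", "02", "03", "04", "05"],
    "passcode": ["Charlie233, Emily13", "", "Sam310, John12", "", "Jake42"],
})

result = attach_usernames(df, username_list)
print(result)
```

If the check raises on your real data, comparing len(username_list) with df["passcode"].str.split(",").str.len().sum() (rather than with len(df["passcode"])) should show how many entries are missing.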