Home > database >  How to deal with long names in data cleaning?
How to deal with long names in data cleaning?

Time:01-13

I have a users database. I want to separate them into two columns to have user1 and user2. The way I was solving this was to split the names into multiple columns then merge the names to have the two columns of users.

The issue I run into is some names are long and after the split. Those names take some spot on the data frame which makes it harder to merge properly.

Users
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi

After I split the user columns

0 1 2 3 4 5 6 7 8
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi

The expected result is the following

User1 User2
Maria Melinda Del valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi

CodePudding user response:

What database server are you using ? I suggest to write a store procedure, that will have two local variables with string (nvarchar) type and a new temp table. You will need to use the split columns, that you produced earlier. Each time you will find empty string in a column you will append to the new variable. In each row you will store the two local variables in the new temp table. Finally, you can select the results from temp table as the store procedure result

CodePudding user response:

You can use:

def f(sr):
   m = sr.isna().cumsum().loc[lambda x: x < 2]
   return sr.dropna().groupby(m).apply(' '.join)

out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x 1}')

Output:

>>> out
                     User1               User2
0  Maria Melinda Del Valle       Justin Howard
1          Devin Craig Jr.  Michael Carter III
2       Jeanne De Bordeaux             Alhamdi

As suggested by @Barmar, If you know where to put the blank columns in the first split, you should know how to create both columns.

  • Related