how to remove white space from strings of data frame column?-CodePudding

I am trying to loop through a column in a pandas data frame to remove unnecessary white space in the beginning and end of the strings within the column. My data frame looks like this:

df={'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl '], 'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'] }


    c1  b2
0   ab, fg
1   ac, hj-jk   
2   ac, df,gh   
3   gh, be
4   ab, be
5   ad, jk-pl
6   ae, kl-kl

I tried the this answer here, but did not work either. The reason I need to remove the white space from the strings in this column is that I want to one hot encode this column using get.dummies() function. My idea was to use the strip() function to remove the white space from each value and then I use .str.get_dummies(','):

#function to remove white space from strings
def strip_string(dataframe, column_name):
  for id, item in dataframe[column_name].items():
    a=item.strip()

#removing the white space from the values of the column
strip_string(df, 'c1')

#creating one hot-encoded columns from the values using split(",")

df1=df['c1'].str.get_dummies(',')

but my code returns duplicate columns and I don't want this...I suppose the function to remove the white space is not working well? Can anyone help? My current output is:

   ab   ac  df  fg  gh  hj-jk   jk-pl   kl-kl   ab  ac  ad  ae  gh
0   1   0   0   1   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   1   0   0   0   1   0   0   0
2   0   1   1   0   1   0   0   0   0   0   0   0   0
3   0   0   0   0   0   0   0   0   0   0   0   0   1
4   0   0   0   0   0   0   0   0   1   0   0   0   0
5   0   0   0   0   0   0   1   0   0   0   1   0   0
6   0   0   0   0   0   0   0   1   0   0   0   1   0

columns 'ac' and 'ab' are duplicated. I want to remove the duplicated columns

CodePudding user response：

I would stack, strip, get_dummies, and groupby.max:

If the separator is ', ':

df.stack().str.strip().str.get_dummies(sep=', ').groupby(level=0).max()

else:

df.stack().str.replace(r'\s', '', regex=True).str.get_dummies(sep=',').groupby(level=0).max()

output:

   ab  ac  ba  bc  bd  be  df  fg  gh  hj-jk
0   1   0   1   0   0   0   0   0   0      0
1   0   0   0   1   0   0   0   1   0      0
2   0   1   0   0   1   0   0   0   0      0
3   0   0   0   0   0   1   0   0   0      1
4   0   1   0   0   0   1   0   0   0      0
5   0   0   0   0   0   1   1   0   1      0
6   0   0   1   0   0   0   0   0   1      0

CodePudding user response：

UPDATED:

I think you need to handle spaces around commas as well as at the start/end of a string in order for Series.str.get_dummies() to work correctly for your example:

df = df.apply(lambda x: x.str.strip().str.replace(' *, *', ','))

Input:

        c1   b2
0       ab  foo
1       fg  foo
2      ac   foo
3   hj-jk   foo
4       ac  foo
5   df, gh  foo
6       gh  foo
7       ab  foo
8       ad  foo
9    jk-pl  foo
10      ae  foo
11  kl-kl   foo

Intermediate dataframe (after removing spaces at start and end and adjacent to commas):

       c1   b2
0      ab  foo
1      fg  foo
2      ac  foo
3   hj-jk  foo
4      ac  foo
5   df,gh  foo
6      gh  foo
7      ab  foo
8      ad  foo
9   jk-pl  foo
10     ae  foo
11  kl-kl  foo

Output:

    ab  ac  ad  ae  df  fg  gh  hj-jk  jk-pl  kl-kl
0    1   0   0   0   0   0   0      0      0      0
1    0   0   0   0   0   1   0      0      0      0
2    0   1   0   0   0   0   0      0      0      0
3    0   0   0   0   0   0   0      1      0      0
4    0   1   0   0   0   0   0      0      0      0
5    0   0   0   0   1   0   1      0      0      0
6    0   0   0   0   0   0   1      0      0      0
7    1   0   0   0   0   0   0      0      0      0
8    0   0   1   0   0   0   0      0      0      0
9    0   0   0   0   0   0   0      0      1      0
10   0   0   0   1   0   0   0      0      0      0
11   0   0   0   0   0   0   0      0      0      1

If you just use strip() (as in my earlier answer below), you will get something like this with a duplicate for gh:

     gh  ab  ac  ad  ae  df  fg  gh  hj-jk  jk-pl  kl-kl
0     0   1   0   0   0   0   0   0      0      0      0
1     0   0   0   0   0   0   1   0      0      0      0
2     0   0   1   0   0   0   0   0      0      0      0
3     0   0   0   0   0   0   0   0      1      0      0
4     0   0   1   0   0   0   0   0      0      0      0
5     1   0   0   0   0   1   0   0      0      0      0
6     0   0   0   0   0   0   0   1      0      0      0
7     0   1   0   0   0   0   0   0      0      0      0
8     0   0   0   1   0   0   0   0      0      0      0
9     0   0   0   0   0   0   0   0      0      1      0
10    0   0   0   0   1   0   0   0      0      0      0
11    0   0   0   0   0   0   0   0      0      0      1

EARLIER ANSWER:

Either of the following should work:

df = df.applymap(lambda x: x.strip())

... or:

df = df.apply(lambda x: x.str.strip())

CodePudding user response：

See if this helps:

import numpy as np
import pandas as pd
data={'c1': [' ab ', 'fg', 'ac ', 'hj-jk '], 'b2': ['ba', 'bc', 'bd', 'be'] }
df=pd.DataFrame(data)
print(df.head())
df=df.apply(lambda x: x.map(str.strip))
print(df.head())