I am trying to loop through a column in a pandas data frame to remove unnecessary white space in the beginning and end of the strings within the column. My data frame looks like this:
df={'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl '], 'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'] }
c1 b2
0 ab, fg
1 ac, hj-jk
2 ac, df,gh
3 gh, be
4 ab, be
5 ad, jk-pl
6 ae, kl-kl
I tried the this answer here, but did not work either. The reason I need to remove the white space from the strings in this column is that I want to one hot encode this column using get.dummies() function. My idea was to use the strip() function to remove the white space from each value and then I use .str.get_dummies(','):
#function to remove white space from strings
def strip_string(dataframe, column_name):
for id, item in dataframe[column_name].items():
a=item.strip()
#removing the white space from the values of the column
strip_string(df, 'c1')
#creating one hot-encoded columns from the values using split(",")
df1=df['c1'].str.get_dummies(',')
but my code returns duplicate columns and I don't want this...I suppose the function to remove the white space is not working well? Can anyone help? My current output is:
ab ac df fg gh hj-jk jk-pl kl-kl ab ac ad ae gh
0 1 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 1 0 0 0
2 0 1 1 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 1 0 0
6 0 0 0 0 0 0 0 1 0 0 0 1 0
columns 'ac' and 'ab' are duplicated. I want to remove the duplicated columns
CodePudding user response:
I would stack
, strip
, get_dummies
, and groupby.max
:
If the separator is ', '
:
df.stack().str.strip().str.get_dummies(sep=', ').groupby(level=0).max()
else:
df.stack().str.replace(r'\s', '', regex=True).str.get_dummies(sep=',').groupby(level=0).max()
output:
ab ac ba bc bd be df fg gh hj-jk
0 1 0 1 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 1 0 0
2 0 1 0 0 1 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 1
4 0 1 0 0 0 1 0 0 0 0
5 0 0 0 0 0 1 1 0 1 0
6 0 0 1 0 0 0 0 0 1 0
CodePudding user response:
UPDATED:
I think you need to handle spaces around commas as well as at the start/end of a string in order for Series.str.get_dummies()
to work correctly for your example:
df = df.apply(lambda x: x.str.strip().str.replace(' *, *', ','))
Input:
c1 b2
0 ab foo
1 fg foo
2 ac foo
3 hj-jk foo
4 ac foo
5 df, gh foo
6 gh foo
7 ab foo
8 ad foo
9 jk-pl foo
10 ae foo
11 kl-kl foo
Intermediate dataframe (after removing spaces at start and end and adjacent to commas):
c1 b2
0 ab foo
1 fg foo
2 ac foo
3 hj-jk foo
4 ac foo
5 df,gh foo
6 gh foo
7 ab foo
8 ad foo
9 jk-pl foo
10 ae foo
11 kl-kl foo
Output:
ab ac ad ae df fg gh hj-jk jk-pl kl-kl
0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0
4 0 1 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 1 0 0 0
6 0 0 0 0 0 0 1 0 0 0
7 1 0 0 0 0 0 0 0 0 0
8 0 0 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 1 0
10 0 0 0 1 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 1
If you just use strip()
(as in my earlier answer below), you will get something like this with a duplicate for gh
:
gh ab ac ad ae df fg gh hj-jk jk-pl kl-kl
0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 0 0 1 0 0 0 0 0 0 0 0
5 1 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 0 0 1 0 0 0
7 0 1 0 0 0 0 0 0 0 0 0
8 0 0 0 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 1 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 1
EARLIER ANSWER:
Either of the following should work:
df = df.applymap(lambda x: x.strip())
... or:
df = df.apply(lambda x: x.str.strip())
CodePudding user response:
See if this helps:
import numpy as np
import pandas as pd
data={'c1': [' ab ', 'fg', 'ac ', 'hj-jk '], 'b2': ['ba', 'bc', 'bd', 'be'] }
df=pd.DataFrame(data)
print(df.head())
df=df.apply(lambda x: x.map(str.strip))
print(df.head())