How to strip blank values on multiple columns of a pandas DataFrame

Time:03-09

I have data in a DataFrame like the one below, where the values are spread across different columns with NaN appearing in between. If I use df.dropna(''), it leaves behind an empty cell in the column, which I don't want. Instead, I want to remove the NaN values and strip the blanks, so that only the host* values remain, stacked together at the top of each column, with everything else stripped.

Actual post related to this is here

This is my dataframe:

df = pd.read_csv("server.csv", usecols=['name', 'managed_by'])
df = df.pivot(columns='managed_by', values='name')

The above code produces the following:

Sam         Peter   Jesse   Patrick     Banu
host1       host5   host7   host9       host10
host2       host6   host8               host11
host3       Nan     Nan                 Nan
host4       Nan     Nan                 Nan
Nan         host22  Nan                 Nan
host24      Nan     Nan                 Nan
host23      Nan     Nan                 Nan
    

I want the following:

Sam         Peter   Jesse   Patrick     Banu
host1       host5   host7   host9       host10
host2       host6   host8               host11
host3       host22                      
host4                           
host23      

Any help will be much appreciated.

CodePudding user response:

If you have real NaNs, use apply with dropna and reset_index:

df.apply(lambda c: c.dropna().reset_index(drop=True))

or, with concat:

pd.concat([df[c].dropna().reset_index(drop=True) for c in df], axis=1)

output:

      Sam   Peter  Jesse Patrick    Banu
0   host1   host5  host7   host9  host10
1   host2   host6  host8     NaN  host11
2   host3  host22    NaN     NaN     NaN
3   host4     NaN    NaN     NaN     NaN
4  host24     NaN    NaN     NaN     NaN
5  host23     NaN    NaN     NaN     NaN
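As a self-contained check of the apply/dropna/reset_index idea, here is a minimal sketch. The sample frame is hypothetical data shaped like the question's pivot result (only three of the columns), and compact_columns is an illustrative helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring part of the question's pivoted frame.
df = pd.DataFrame({
    "Sam":   ["host1", "host2", "host3", "host4", np.nan, "host24", "host23"],
    "Peter": ["host5", "host6", np.nan, np.nan, "host22", np.nan, np.nan],
    "Jesse": ["host7", "host8", np.nan, np.nan, np.nan, np.nan, np.nan],
})


def compact_columns(frame):
    """Push the non-null values of each column to the top.

    Each column is handled on its own: dropna() removes the gaps and
    reset_index(drop=True) renumbers what is left from 0, so the columns
    line up again when apply() reassembles them into one frame.
    """
    return frame.apply(lambda col: col.dropna().reset_index(drop=True))


compact = compact_columns(df)
print(compact)
```

Shorter columns are padded with NaN at the bottom, so the result has as many rows as the longest column.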

For "blank" cells, fill with empty string:

df.apply(lambda c: c.dropna().reset_index(drop=True)).fillna('')

output:

      Sam   Peter  Jesse Patrick    Banu
0   host1   host5  host7   host9  host10
1   host2   host6  host8          host11      
2   host3  host22                       
3   host4                               
4  host24                               
5  host23                               

NB. If the cells contain the string 'Nan' rather than real NaNs, first replace them using df.replace('Nan', float('nan')) or df.mask(df.eq('Nan')).

CodePudding user response:

Update

From this dataframe:

>>> df
   managed_by     name
0       host1      Sam
1       host2      Sam
2       host3      Sam
3       host4      Sam
4       host5    Peter
5       host6    Peter
6       host7    Jesse
7       host8    Jesse
8       host9  Patrick
9      host10     Banu
10     host11     Banu

Use (a slight variation of my old answer):

out = (
  df.assign(index=lambda x: x.groupby('name').cumcount())
    .pivot_table('managed_by', 'index', 'name', aggfunc='first', fill_value='')
    [df['name'].unique()].rename_axis(index=None, columns=None)
)

Output:

>>> out
     Sam  Peter  Jesse Patrick    Banu
0  host1  host5  host7   host9  host10
1  host2  host6  host8          host11
2  host3                              
3  host4                              
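To make the role of cumcount clearer, here is a small sketch on made-up long-format data. The column names owner, host and slot are hypothetical, and it uses plain pivot instead of pivot_table with aggfunc='first', which is enough when each (slot, owner) pair occurs only once.

```python
import pandas as pd

# Hypothetical long-format data: one row per (owner, host) pair.
long_df = pd.DataFrame({
    "owner": ["Sam", "Sam", "Sam", "Peter", "Jesse", "Peter"],
    "host":  ["host1", "host2", "host3", "host5", "host7", "host6"],
})

# cumcount numbers each owner's rows 0, 1, 2, ... in order of appearance;
# that number becomes the row position in the wide result.
long_df = long_df.assign(slot=long_df.groupby("owner").cumcount())

wide = (
    long_df.pivot(index="slot", columns="owner", values="host")
    [list(long_df["owner"].unique())]        # restore first-seen column order
    .rename_axis(index=None, columns=None)   # drop the axis labels
    .fillna("")                              # show gaps as blanks
)
print(wide)
```

Because every owner's hosts are numbered from 0, no gaps can appear in the middle of a column; only the tail of shorter columns is blank.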

Old answer

You can use melt to flatten your dataframe and pivot_table to reshape it:

out = (
  df.melt().dropna().assign(index=lambda x: x.groupby('variable').cumcount())
    .pivot_table('value', 'index', 'variable', aggfunc='first', fill_value='')
    [df.columns].rename_axis(index=None, columns=None)
)

Output:

>>> out
      Sam   Peter  Jesse Patrick    Banu
0   host1   host5  host7   host9  host10
1   host2   host6  host8          host11
2   host3  host22                       
3   host4                               
4  host24                               
5  host23                               
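The intermediate step of the melt approach can be inspected on its own. This sketch uses a small made-up frame; variable and value are the default column names that melt() produces, while slot is a hypothetical name for the per-column counter.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse frame with gaps in both columns.
sparse = pd.DataFrame({
    "Sam":   ["host1", np.nan, "host2"],
    "Peter": [np.nan, "host5", np.nan],
})

# melt() stacks the columns into (variable, value) pairs, dropna() removes
# the gaps, and cumcount gives each surviving value its position within
# its original column.
long_form = (
    sparse.melt()
    .dropna()
    .assign(slot=lambda x: x.groupby("variable").cumcount())
)
print(long_form)
```

Pivoting long_form back with slot as the index then yields the compacted wide frame.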

CodePudding user response:

import numpy as np

df = df.replace('', np.nan)  # turn empty strings into real NaNs
# Backfill the NaNs, then keep only rows that have at least 2 non-null values
s = df.bfill().dropna(thresh=2)
# Mask values repeated within each column (left over from the backfill), then fill the NaNs with empty strings
s.mask(s.apply(lambda x: x.duplicated())).fillna('')

Outcome

     Sam   Peter  Jesse Patrick    Banu
0   host1   host5  host7   host9  host10
1   host2   host6  host8          host11
2   host3  host22                       
3   host4                               
4  host24 

    