Home > Software design >  Replace NaN values in Dataframe column by shifting columns over to the right
Replace NaN values in Dataframe column by shifting columns over to the right

Time:09-27

I'm looking to split and shift data I have in my dataframe over into a column that contains NaNs

Upon reading in the data from a pdf and creating the table, the output is:

   Output
   |       Summary       |  Prior Years|1|2|3|4| 5 |6|7|8|9|10|  Total  |
   |---------------------|-------------|-|-|-|-|---|-|-|-|-|--|---------|
   |Total Value 3,700,000|     110     |-|-|-|5|NaN|-|-|-|-|--|3,815,000|  
   |Total Value 320,000  |     110     |-|-|-|5|NaN|-|-|-|-|--| 435,000 |

Due to the PDF format, reading in the data causes the first column to combine 'Total Value' and the expected value for 'Prior Years'

Expected Output
   |  Summary  |Prior Years| 1 |2|3|4|5|6|7|8|9|10|  Total  |
   |-----------|-----------|---|-|-|-|-|-|-|-|-|--|---------|
   |Total Value| 3,700,000 |110|-|-|-|5|-|-|-|-|--|3,815,000|  
   |Total Value|  320,000  |110|-|-|-|5|-|-|-|-|--| 435,000 |

All the data after the '5' column that contains the NaNs is as expected

Is there a way to split the data in the 'Summary' column and have all the data shift over until the NaN columns are then occupied?

CodePudding user response:

Just ffill through axis = 1

df.ffill(axis=1)

After that you can easily replace your 4 columns to np.nan

df['4'] = df['4'].replace(5,np.nan)

CodePudding user response:

Did you try using df.shift.

df.shift(periods_to_be_shifted, axis = 1)

in your case the periods_to_be_shifted = 1

try ,

df.shift(1, axis = 1)

CodePudding user response:

Not sure what dtype all of your columns are, but something like this should work:

import pandas as pd
import numpy as np

# example dataframe (leaves off last few columns, which aren't relevant)
df = pd.DataFrame({'Summary': ['Total Value 3,700,000', 'Total Value 320,000'], 'Prior Years': [110, 110],
                   '1': ['-', '-'], '2': ['-', '-'], '3': ['-', '-'], '4': [5, 5], '5': [np.nan, np.nan],
                   '6': ['-', '-']})

# create list of column names, drop na column, and rename relevant columns (cols 1 - 5, just shift each name back by one)
columns = df.columns.to_list()
new_col_dict = {columns[i]: columns[i   1] for i in range(1,6)}
df.drop(columns=['5'], inplace=True)
df.rename(columns=new_col_dict, inplace=True)

# split up Summary column (based on spaces)
df.loc[:, 'Prior Years'] = df.Summary.str.split(" ").apply(lambda x: x[2])
df.loc[:, 'Summary'] = df.Summary.str.split(" ").apply(lambda x: x[0])   " "   df.Summary.str.split(" ").apply(lambda x: x[1])

# if you want the "Prior Years" column to be int type:
df.loc[:, 'Prior Years'] = df['Prior Years'].str.replace(',', '')
df.loc[:, 'Prior Years'] = df['Prior Years'].astype(int)

# re-order dataframe columns, if you care to
df = df[['Summary', 'Prior Years', '1', '2', '3', '4', '5', '6']]

CodePudding user response:

here is one way to do it. based on the stated assumption that values upto column 5 are to be shifted from left, and Summary column is to break into two columns

# shift the values from 'Prior Years' thur column 5 to right and assign to col '1' thru '5'
df.loc[:,"1":"5"]=df.loc[:,"Prior Years":"5"].shift(axis=1)

# split the summary into text (Total Value) and the value, and assign to
# Summary and Prior Years columns

df[['Summary','Prior Years']]=df['Summary'].str.strip().str.extract(r'(\D*).*?([\d\,\.]*)' )
df

    Summary       Prior Years     1     2   3   4   5   6   7   8   9   10  Total
0   Total Value     3,700,000   110     -   -   -   5   -   -   -   -   --  3,815,000
1   Total Value       320,000   110     -   -   -   5   -   -   -   -   --  435,000
  • Related