I'm looking to split and shift data I have in my dataframe over into a column that contains NaNs
Upon reading in the data from a pdf and creating the table, the output is:
Output
| Summary | Prior Years|1|2|3|4| 5 |6|7|8|9|10| Total |
|---------------------|-------------|-|-|-|-|---|-|-|-|-|--|---------|
|Total Value 3,700,000| 110 |-|-|-|5|NaN|-|-|-|-|--|3,815,000|
|Total Value 320,000 | 110 |-|-|-|5|NaN|-|-|-|-|--| 435,000 |
Due to the PDF format, reading in the data causes the first column to combine 'Total Value' and the expected value for 'Prior Years'
Expected Output
| Summary |Prior Years| 1 |2|3|4|5|6|7|8|9|10| Total |
|-----------|-----------|---|-|-|-|-|-|-|-|-|--|---------|
|Total Value| 3,700,000 |110|-|-|-|5|-|-|-|-|--|3,815,000|
|Total Value| 320,000 |110|-|-|-|5|-|-|-|-|--| 435,000 |
All the data after the '5' column that contains the NaNs is as expected
Is there a way to split the data in the 'Summary' column and have all the data shift over until the NaN columns are then occupied?
CodePudding user response:
Just ffill through axis = 1
df.ffill(axis=1)
After that you can easily replace your 4 columns to np.nan
df['4'] = df['4'].replace(5,np.nan)
CodePudding user response:
Did you try using df.shift.
df.shift(periods_to_be_shifted, axis = 1)
in your case the periods_to_be_shifted = 1
try ,
df.shift(1, axis = 1)
CodePudding user response:
Not sure what dtype all of your columns are, but something like this should work:
import pandas as pd
import numpy as np
# example dataframe (leaves off last few columns, which aren't relevant)
df = pd.DataFrame({'Summary': ['Total Value 3,700,000', 'Total Value 320,000'], 'Prior Years': [110, 110],
'1': ['-', '-'], '2': ['-', '-'], '3': ['-', '-'], '4': [5, 5], '5': [np.nan, np.nan],
'6': ['-', '-']})
# create list of column names, drop na column, and rename relevant columns (cols 1 - 5, just shift each name back by one)
columns = df.columns.to_list()
new_col_dict = {columns[i]: columns[i 1] for i in range(1,6)}
df.drop(columns=['5'], inplace=True)
df.rename(columns=new_col_dict, inplace=True)
# split up Summary column (based on spaces)
df.loc[:, 'Prior Years'] = df.Summary.str.split(" ").apply(lambda x: x[2])
df.loc[:, 'Summary'] = df.Summary.str.split(" ").apply(lambda x: x[0]) " " df.Summary.str.split(" ").apply(lambda x: x[1])
# if you want the "Prior Years" column to be int type:
df.loc[:, 'Prior Years'] = df['Prior Years'].str.replace(',', '')
df.loc[:, 'Prior Years'] = df['Prior Years'].astype(int)
# re-order dataframe columns, if you care to
df = df[['Summary', 'Prior Years', '1', '2', '3', '4', '5', '6']]
CodePudding user response:
here is one way to do it. based on the stated assumption that values upto column 5 are to be shifted from left, and Summary column is to break into two columns
# shift the values from 'Prior Years' thur column 5 to right and assign to col '1' thru '5'
df.loc[:,"1":"5"]=df.loc[:,"Prior Years":"5"].shift(axis=1)
# split the summary into text (Total Value) and the value, and assign to
# Summary and Prior Years columns
df[['Summary','Prior Years']]=df['Summary'].str.strip().str.extract(r'(\D*).*?([\d\,\.]*)' )
df
Summary Prior Years 1 2 3 4 5 6 7 8 9 10 Total
0 Total Value 3,700,000 110 - - - 5 - - - - -- 3,815,000
1 Total Value 320,000 110 - - - 5 - - - - -- 435,000