Cleaning data in pandas

Time:03-21

Background: I load data into pandas from a CSV/XLSX file created by a text-to-data app. While this saves time, the automatic read is only so accurate. Below is a simplified load that illustrates a specific problem I am struggling to solve:

import pandas as pd
from tabulate import tabulate

df_is = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,"",""],
"2022":[125,-55,70,-15,-10,45,-10,25],
"":["","","","","","",-15,30]}

df_want = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,-10,25],
"2022":[125,-55,70,-15,-10,45,-15,30]}

print(tabulate(df_is))
print()
print(tabulate(df_want))

Problem: As can be seen by running the code, the app has not read the data in the first table properly. The last two data points of the second and third columns have ended up in the third and last columns, respectively.

The second table shows how I want it to appear. The real problem is more complex and general, so local fixes that overwrite individual values are not feasible. A solution like the one in Excel, where I would delete the empty cells in the second column and simultaneously shift all other data in those rows to the left or right (depending on the task), would be good.

Tried: Being a novice, I have searched for solutions, but none of my search terms seem to lead to a relevant one.

I have also used df.iloc[] to select the four misplaced data cells into a variable, then tried to append them to columns 1 and 2. That only added copies of the last two rows.

Grateful for advice!

Versions: conda 4.11.0, Python 3.9.7, pandas 1.3.4

CodePudding user response:

Please try this:

import pandas as pd
import numpy as np
f_is = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,"",""],
"2022":[125,-55,70,-15,-10,45,-10,25],
"":["","","","","","",-15,30]}
input_df = pd.DataFrame(f_is)

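# Transpose so each original row becomes a column, turn empty strings into NaN,
# drop the NaNs in each column so the remaining values close up,
# then transpose back: the surviving values are now shifted to the left.
output_df = input_df.T.replace('', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy())).T
output_df.columns = ['Var','2021','2022']
print(output_df)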
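
Since the question also asks about shifting to the right and about a more general case, here is a minimal sketch of the same idea written as a reusable helper. The names shift_non_empty, side and empty are hypothetical (they are not part of pandas), and the small raw frame is a cut-down version of the question's data. It assumes that empty cells are either empty strings or NaN, and that which columns should be numeric is known in advance.

```python
import numpy as np
import pandas as pd


def shift_non_empty(df, side="left", empty=""):
    """Pack the non-empty cells of each row to one side, padding with NaN.

    Hypothetical helper: a cell counts as empty if it is NaN or equals `empty`.
    """
    n_cols = df.shape[1]

    def pack(row):
        # Keep only the cells that actually hold a value, in their original order.
        values = [v for v in row.tolist() if not (pd.isna(v) or v == empty)]
        padding = [np.nan] * (n_cols - len(values))
        packed = values + padding if side == "left" else padding + values
        return pd.Series(packed, index=row.index)

    return df.apply(pack, axis=1)


# Cut-down version of the misread table from the question.
raw = pd.DataFrame({
    "Var": ["Sales", "Tax", "PAT"],
    "2021": [100, "", ""],
    "2022": [125, -10, 25],
    "": ["", -15, 30],
})

shifted = shift_non_empty(raw, side="left")

# Drop columns that ended up completely empty after the shift.
result = shifted.dropna(axis=1, how="all").copy()

# The row-wise apply leaves every column as object dtype, so convert
# the year columns (assumed to be numeric) back to numbers.
for col in ["2021", "2022"]:
    result[col] = pd.to_numeric(result[col])

print(result)
```

Passing side="right" packs the values against the last column instead. Note that the helper shifts every row that contains an empty cell, so it only suits tables where an empty cell always means "misplaced" rather than "genuinely missing".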