I've scraped a PDF table and it came with an annoying formatting feature.
The table has two columns. In some cases, one row stayed with what should be the column A value and the next stayed with what should be the column B value. Like this:
df = pd.DataFrame()
df['names'] = ['John','Mary',np.nan,'George']
df['numbers'] = ['1',np.nan,'2','3']
I want to reformat that database so wherever there is an empty cell on df['numbers'] it fills it with the value of the next line. Then I apply .dropna() to eliminate the still-wrong cells.
I thied this:
for i in range(len(df)):
if df['numbers'][i] == np.nan:
df['numbers'][i] = df['numbers'][i 1]
No change on the dataframe, though. No error message, too.
What am I missing?
CodePudding user response:
While I don't think this solves all your problems, the reason why you are not updating the dataframe is the line
if df['numbers'][i] == np.nan:
, since this always evaluates to False.
To implement a vlaid test for nan in this case you must use
if pd.isnull(df['numbres'][i]):
this will evaluate to True or False depending on the cell contents.
CodePudding user response:
This is the solution I found:
df[['numbers']] = df[['numbers']].fillna(method='bfill')
df = df[~df['names'].isna()]
It's probably not the most elegant, but it worked.