I am relatively new to pandas and python. I am trying to cleaning up some scraped PDF data read into a data frame that isn't isn't correctly parsed. where a few of the single row in the source PDF table is split into three rows like below:
index | Item | Quantity | Price |
---|---|---|---|
0 | some string that is split... | NaN | NaN |
1 | NaN | 4 | 20 |
2 | ... rest of string | NaN | NaN |
3 | some string that parsed as it should | 1 | 22 |
4 | another string that is split... | NaN | NaN |
5 | NaN | 3 | 121 |
6 | ... rest of the other string | NaN | NaN |
7 | 3rd string that is split... | NaN | NaN |
8 | NaN | 15 | 5 |
9 | ... rest of the 3rd string | NaN | NaN |
When ever there is this pattern I basically need to merge the three lines into one line.
so the desired output is
index | Item | Quantity | Price |
---|---|---|---|
0 | some string that is split... ... rest of string | 4 | 20 |
3 | some string that parsed as it should | 1 | 22 |
4 | another string that is split... ... rest of the other string | 3 | 121 |
7 | 3rd string that is split... ...rest of the 3rd string | 15 | 5 |
What would be the best way to do this?
what I have done so far is very clumsy and feels quite wrong even though it sometime works on some of the scraped data. It just can't trust it.
# Get the rows that have NaN in the quantity column
split_rows = df[df['Quantity'].isna()]
# every other line where 'Total' is NaN needs to be removed later, but not before the dtata is copied
rows_to_remove = split_rows.iloc[1::2,:].index
#
__i = 0
while __i < len(split_rows):
df.at[split_rows.index[__i],'Quantity'] = df.at[split_rows.index[__i] 1,'Quantity']
df.at[split_rows.index[__i],'Price'] = df.at[split_rows.index[__i] 1,'Price']
df.at[split_rows.index[__i],'Item'] = str(df.at[split_rows.index[__i],'Item']) " " str(df.at[split_rows.index[__i] 2,'Item'])
__i = 1
# remove the rows like row 3
df = df.drop(rows_to_remove, axis=0)
# drops rows like row 2
df.dropna(inplace=True)
Is there a better way to do this. What would be the canonical way to approach something like this?
much obliged
EDIT:
made a table of the desired output, made the input more like the ones I am dealing with. clarified the question.
CodePudding user response:
The solution is a bit tricky. First, create a boolean match to exclude right rows. Next, with this mask create virtual groups of bad rows then reindex the output with the original index of your dataframe and use interpolate to create new groups for good rows. The rest is pretty standard: group rows the aggregate data:
m = df.isna().any(axis=1)
g = df.loc[m, 'Item'].isna().cumsum().shift(-1).ffill().reindex(df.index).interpolate()
out = (df.fillna({'Item': ' '}).groupby(g, as_index=False)
.agg({'Item': ''.join, 'Quantity': max, 'Price': max}))
Output:
>>> out
Item Quantity Price
0 some string that is split... ... rest of string 4.0 20.0
1 some string that parsed as it should 1.0 22.0
2 another string that is split... ... rest of the other string 3.0 121.0
3 3rd string that is split... ... rest of the 3rd string 15.0 5.0
>>> g
0 1.0 # bad rows, group 1
1 1.0 # bad rows, group 1
2 1.0 # bad rows, group 1
3 1.5 # good row, group 1.5 due to interpolation
4 2.0 # bad rows, group 2
5 2.0 # bad rows, group 2
6 2.0 # bad rows, group 2
7 3.0 # bad rows, group 3
8 3.0 # bad rows, group 3
9 3.0 # bad rows, group 3
Name: Item, dtype: float64
CodePudding user response:
Index is set before grouping df.set_index('index'). A function 'func_data' is created to process the values. First, the values of 'np.nan' are checked by the isin function. Next, values are extracted where there are no empty values and all columns are converted to string so that it is possible to combine all values into one line in 'join'.
def func_data(x):
ind = x.isin([np.nan])
a = x.values[~ind]
a = a.astype(str)
return (" ".join(a))
df = df.set_index('index')
df['new'] = df.groupby(['index'])[['Item','Quantity', 'Price']].apply(func_data)
df = df.reset_index('index')
print(df)
Output
index Item ... Price new
0 1 some string that is split... ... NaN some string that is split...
1 2 NaN ... 20.0 4.0 20.0
2 3 ... rest of string ... NaN ... rest of string