Cleaning Messy Scraped Table Data Rows in Pandas-CodePudding

I am relatively new to pandas and python. I am trying to cleaning up some scraped PDF data read into a data frame that isn't isn't correctly parsed. where a few of the single row in the source PDF table is split into three rows like below:

index	Item	Quantity	Price
0	some string that is split...	NaN	NaN
1	NaN	4	20
2	... rest of string	NaN	NaN
3	some string that parsed as it should	1	22
4	another string that is split...	NaN	NaN
5	NaN	3	121
6	... rest of the other string	NaN	NaN
7	3rd string that is split...	NaN	NaN
8	NaN	15	5
9	... rest of the 3rd string	NaN	NaN

When ever there is this pattern I basically need to merge the three lines into one line.

so the desired output is

index	Item	Quantity	Price
0	some string that is split... ... rest of string	4	20
3	some string that parsed as it should	1	22
4	another string that is split... ... rest of the other string	3	121
7	3rd string that is split... ...rest of the 3rd string	15	5

What would be the best way to do this?

what I have done so far is very clumsy and feels quite wrong even though it sometime works on some of the scraped data. It just can't trust it.


# Get the rows that have NaN in the quantity column
split_rows = df[df['Quantity'].isna()]

# every other line where 'Total' is NaN needs to be removed later, but not before the dtata is copied        
rows_to_remove = split_rows.iloc[1::2,:].index
        
# 
__i = 0
while __i < len(split_rows):
   df.at[split_rows.index[__i],'Quantity'] = df.at[split_rows.index[__i] 1,'Quantity']
   df.at[split_rows.index[__i],'Price'] = df.at[split_rows.index[__i] 1,'Price']
   df.at[split_rows.index[__i],'Item'] =  str(df.at[split_rows.index[__i],'Item'])   " "   str(df.at[split_rows.index[__i] 2,'Item'])
    __i  = 1 
        

# remove the rows like row 3
df = df.drop(rows_to_remove, axis=0)

# drops rows like row 2    
df.dropna(inplace=True)

Is there a better way to do this. What would be the canonical way to approach something like this?

much obliged

EDIT:

made a table of the desired output, made the input more like the ones I am dealing with. clarified the question.

CodePudding user response：

The solution is a bit tricky. First, create a boolean match to exclude right rows. Next, with this mask create virtual groups of bad rows then reindex the output with the original index of your dataframe and use interpolate to create new groups for good rows. The rest is pretty standard: group rows the aggregate data:

m = df.isna().any(axis=1)
g = df.loc[m, 'Item'].isna().cumsum().shift(-1).ffill().reindex(df.index).interpolate()

out = (df.fillna({'Item': ' '}).groupby(g, as_index=False)
         .agg({'Item': ''.join, 'Quantity': max, 'Price': max}))

Output:

>>> out
                                                           Item  Quantity  Price
0               some string that is split... ... rest of string       4.0   20.0
1                          some string that parsed as it should       1.0   22.0
2  another string that is split... ... rest of the other string       3.0  121.0
3        3rd string that is split... ... rest of the 3rd string      15.0    5.0

>>> g
0    1.0  # bad rows, group 1
1    1.0  # bad rows, group 1
2    1.0  # bad rows, group 1
3    1.5  # good row, group 1.5 due to interpolation
4    2.0  # bad rows, group 2
5    2.0  # bad rows, group 2
6    2.0  # bad rows, group 2
7    3.0  # bad rows, group 3
8    3.0  # bad rows, group 3
9    3.0  # bad rows, group 3
Name: Item, dtype: float64

CodePudding user response：

Index is set before grouping df.set_index('index'). A function 'func_data' is created to process the values. First, the values of 'np.nan' are checked by the isin function. Next, values are extracted where there are no empty values and all columns are converted to string so that it is possible to combine all values into one line in 'join'.

def func_data(x):
    ind = x.isin([np.nan])
    a = x.values[~ind]
    a = a.astype(str)
    return (" ".join(a))


df = df.set_index('index')

df['new'] = df.groupby(['index'])[['Item','Quantity', 'Price']].apply(func_data)
df = df.reset_index('index')
print(df)

Output

   index                          Item  ...  Price                           new
0      1  some string that is split...  ...    NaN  some string that is split...
1      2                           NaN  ...   20.0                      4.0 20.0
2      3            ... rest of string  ...    NaN            ... rest of string