I have to read multiple large Excel files and attempt to clean the data.
I am down to the last problem: some cells have multiple rows within them, or I guess some cells span multiple rows.
It is something like this:
Index  Col1  Col2    Col3
1      row1  row1    row1
2            row1.1
3            row1.2
4      row2  row2    row2
When I use pandas.read_excel(filename), or pandas.ExcelFile followed by sheet.parse(sheetname), it of course reads in indexes 2 and 3 as mostly blank lines.
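For example, something like this (the filename is just a placeholder):

import pandas as pd

df = pd.read_excel('myfile.xlsx')
print(df)
#    Index  Col1    Col2  Col3
# 0      1  row1    row1  row1
# 1      2   NaN  row1.1   NaN
# 2      3   NaN  row1.2   NaN
# 3      4  row2    row2  row2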
How would I go about merging indexes 2 and 3 into index 1, based on what Col1 spans?
To be clear, my question is: how can I read in an Excel file and merge rows based on which rows the first column spans? Is this even possible?
Thanks
CodePudding user response:
I don't know that this functionality is built into Pandas, since frankly Excel is not intended to be used like this, but people still tend to abuse the heck out of it. Man, I hate Excel... but that's a topic for another thread.
I think your best bet here is to define a custom function based on the logic that you know applies to these files. As I am currently in the middle of a project dealing with diverse and poorly-formatted Excel files, I'm all too familiar with this kind of garbage.
This is my suggestion, based on my understanding of the data and what you're asking. It may need to be changed depending on the specifics of your files.
# df is assumed to have been loaded already, e.g. df = pd.read_excel(filename)
last_valid = None
check_cols = ['Col1']  # if you only need to check a subset of cols for validity, list them here

for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If the whole row is valid, we want to keep it as a reference in case
    # the following rows are not
    if all(s[check_cols].notna()):
        lvi, last_valid = i, s
        # need to store both the index and the series so we can go back and replace it
        continue
    else:  # here is the critical part
        extra_vals = s[s.notna()]  # find cells in the row that have actual values
        for col in extra_vals.index:
            # Collect the values in a list, since I don't know your values or how
            # they need to be handled exactly; note that list.append() returns None,
            # so build the list first and assign it afterwards
            merged = last_valid[col] if isinstance(last_valid[col], list) else [last_valid[col]]
            merged.append(extra_vals[col])
            last_valid[col] = merged
        # replace that row in the dataframe (use .loc, since iterrows yields index labels)
        df.loc[lvi, :] = last_valid

# drop the extra rows that have now been merged upward
df = df.dropna(axis=0, subset=check_cols)
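For what it's worth, here is a vectorized sketch of the same idea, assuming the first column's value is distinct for each block of merged rows (column names are taken from the example above):

merge_key = df['Col1'].ffill()  # propagate the last valid Col1 value down through the blank rows
merged = (
    df.groupby(merge_key, sort=False)        # one group per block of merged rows
      .agg(lambda c: c.dropna().tolist())    # collect each column's non-NaN values as a list
      .reset_index(drop=True)
)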
Hope this works for ya!
CodePudding user response:
@LiamFiddler's answer is correct but needed some adjustment to work in my situation, as I am combining numbers on the same line and will be going out to a CSV as strings. I am posting mine in case it helps someone who comes here.
last_valid = None
check_cols = ['Col1']  # if you only need to check a subset of cols for validity, list them here

df = df.astype(str)         # convert all columns to strings, as I have to combine numbers in the same cell
df = df.replace('nan', '')  # turn the 'nan' strings created by astype back into blank strings

for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If the whole row is valid, we want to keep it as a reference in case
    # the following rows are not
    if all(s[check_cols] != ''):
        lvi, last_valid = i, s
        # need to store both the index and the series so we can go back and replace it
        continue
    else:  # here is the critical part
        extra_vals = s[s != '']  # find cells in the row that have actual values
        for col in extra_vals.index:
            # concatenate as strings, separated by whatever you wish;
            # building a list was causing issues in my case
            last_valid[col] = last_valid[col] + ',' + extra_vals[col]
        # replace that row in the dataframe (use .loc, since iterrows yields index labels)
        df.loc[lvi, :] = last_valid

# drop the extra rows that have now been merged upward
df = df[df['Col1'] != ''].reset_index(drop=True)
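Since everything is a string at this point, writing the cleaned frame out to CSV is straightforward (the output filename here is just a placeholder):

df.to_csv('cleaned.csv', index=False)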