How to read excel files when some cells have multiple rows


I have to read multiple large Excel files and attempt to clean the data.

I am down to the last problem: some cells have multiple rows within them, or I guess some cells span multiple rows.

It is something like this:

Index Col1 Col2   Col3
1     row1 row1   row1
2          row1.1
3          row1.2
4     row2 row2   row3

When I use pandas.read_excel(filename), or pandas.ExcelFile followed by sheet.parse(sheetname), it of course reads in index 2 and 3 as mostly blank rows.
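For illustration, this is roughly what comes back (a minimal sketch; "data.xlsx" is a hypothetical file laid out like the example above):

import pandas as pd

# Cells that span multiple rows come back as NaN in the extra rows
df = pd.read_excel("data.xlsx")
print(df)
#    Index  Col1    Col2  Col3
# 0      1  row1    row1  row1
# 1      2   NaN  row1.1   NaN
# 2      3   NaN  row1.2   NaN
# 3      4  row2    row2  row3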

How would I go about merging index 2 and 3 into index 1, based on what Col1 spans?

To be clear, my question is: how could I read in an Excel file and merge rows based on which rows the first column spans? Is this even possible?

Thanks

CodePudding user response:

I don't know that this functionality is built into Pandas, since frankly Excel is not intended to be used like this, but people still tend to abuse the heck out of it. Man, I hate Excel... but that's a topic for another thread.

I think your best bet here is to define a custom function based on the logic that you know applies to these files. As I am currently in the middle of a project dealing with diverse and poorly-formatted Excel files, I'm all too familiar with this kind of garbage.

This is my suggestion, based on my understanding of the data and what you're asking. It may need to be changed depending on the specifics of your files.

last_valid = None
check_cols = list(df.columns)  # narrow this down if only a subset of cols determines validity

for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If the whole row is valid, we want to keep it as a reference in case
    # the following rows are not
    if s[check_cols].notna().all():
        lvi, last_valid = i, s
        # need to store the index and the Series so we can go back and replace the row
        continue
    else:  # here is the critical part
        extra_vals = s[s.notna()]  # find cells in the row that have actual values
        for col in extra_vals.index:
            # Collect the values in a list; I'm appending here since I don't
            # know your values or how they need to be handled exactly
            prev = last_valid[col]
            vals = prev if isinstance(prev, list) else [prev]
            vals.append(extra_vals[col])
            last_valid[col] = vals
        # replace that row in the dataframe (lvi is an index label, so use .loc)
        df.loc[lvi] = last_valid

# drop the leftover continuation rows:
df = df.dropna(axis=0, subset=check_cols)
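As a quick end-to-end illustration, here is the same idea condensed and run on a toy frame shaped like the question's example (hypothetical data; this variant writes the merged cells back with df.at instead of replacing the whole row):

import pandas as pd

# Toy frame shaped like the question's example (hypothetical data)
df = pd.DataFrame({
    "Col1": ["row1", None, None, "row2"],
    "Col2": ["row1", "row1.1", "row1.2", "row2"],
    "Col3": ["row1", None, None, "row3"],
})

check_cols = ["Col1"]  # Col1 is the column that spans the merged rows
lvi = None  # assumes the first row is a valid one

for i, s in df.iterrows():
    if s[check_cols].notna().all():
        lvi = i  # remember the last fully valid row
        continue
    for col in s[s.notna()].index:
        prev = df.at[lvi, col]
        cell = prev if isinstance(prev, list) else [prev]
        cell.append(s[col])
        df.at[lvi, col] = cell  # fold the stray value into the row above

df = df.dropna(subset=check_cols).reset_index(drop=True)
print(df)
#    Col1                    Col2  Col3
# 0  row1  [row1, row1.1, row1.2]  row1
# 1  row2                    row2  row3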

Hope this works for ya!

CodePudding user response:

@LiamFiddler's answer is correct but needed some adjustment to work in my situation, as I am combining numbers on the same line and the output is going to a CSV as strings. I am posting mine in case it helps someone who comes here.

last_valid = None
check_cols = ['Col1']  # if you only need to check a subset of cols for validity, list them here

df = df.astype(str)  # convert all columns to strings, as I have to combine numbers in the same cell
df = df.replace('nan', '')  # turn the 'nan' strings this creates back into blank strings

for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If the whole row is valid, we want to keep it as a reference in case
    # the following rows are not
    if all(s[check_cols] != ''):
        lvi, last_valid = i, s
        # need to store the index and the Series so we can go back and replace the row
        continue
    else:  # here is the critical part
        extra_vals = s[s != '']  # find cells in the row that have actual values
        for col in extra_vals.index:
            # Join as strings, separated by a comma (or whatever you wish);
            # lists were causing issues in my case
            last_valid[col] = last_valid[col] + "," + extra_vals[col]
        # replace that row in the dataframe (lvi is an index label, so use .loc)
        df.loc[lvi] = last_valid

# drop the leftover continuation rows:
df = df[df['Col1'] != ''].reset_index(drop=True)
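For reference, on the question's example data this string-based variant should produce comma-joined cells rather than lists (a sketch; the frame below just mimics what read_excel would return):

import numpy as np
import pandas as pd

# Mimics the frame read_excel would return for the question's example
df = pd.DataFrame({
    "Col1": ["row1", np.nan, np.nan, "row2"],
    "Col2": ["row1", "row1.1", "row1.2", "row2"],
    "Col3": ["row1", np.nan, np.nan, "row3"],
})

# ...running the snippet above on this frame should give:
#    Col1                 Col2  Col3
# 0  row1  row1,row1.1,row1.2  row1
# 1  row2                 row2  row3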