Home > OS >  Columns getting appended to wrong row in pandas
Columns getting appended to wrong row in pandas

Time:10-03

So I have a dataframe like this:-

     0        1            2       ...
0  Index   Something   Something2  ...
1    1        5            8       ...
2    2        6            9       ...
3    3        7            10      ...

Now, I want to append some columns in between those "Something" column names, for which I have used this code:-

        j = 1
        for i in range(2, 51):
            if i % 2 != 0 and i != 4:
                df.insert(i, f"% Difference {j}", " ")
                j  = 1

where df is the dataframe. Now what happens is that the columns do get inserted but like this:-

     0        1         Difference 1           2       ...
0  Index   Something         NaN           Something2  ...
1    1        5              NaN                8       ...
2    2        6              NaN                9       ...
3    3        7              NaN               10      ...

whereas what I wanted was this:-

     0        1               2                 3       ...
0  Index   Something    Difference 1          Something2  ...
1    1        5              NaN                8       ...
2    2        6              NaN                9       ...
3    3        7              NaN               10      ...

Edit 1 Using jezrael's logic:-

df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)

Output of that is still this:-

     0        1            2       ...
0  Index   Something   Something2  ...
1    1        5            8       ...
2    2        6            9       ...
3    3        7            10      ...

Any ideas or suggestions as to where or how I am going wrong?

CodePudding user response:

I want to append some columns in between those "Something" column names

No, there are no columns names Something, for it need set first row of data to columns names:

print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']

df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
  Index Something Something2
0     1         5          8
1     2         6          9
2     3         7         10

print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')

Then your solution create columns Difference, but output is different - no columns 0,1,2,3.

CodePudding user response:

If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.

Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.

A couple potential solutions to this:

  1. If you'd like the actual column names to be Index, Something, etc. then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure to NOT use the header = None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want to have a range of integer values as your column names rather than the more descriptive names that you have listed.

  2. Alternatively, you can do what @jezrael suggested and convert your first row of data to column names then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:

     df.columns = df.iloc[0].tolist()
    

    df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.

     df = df.iloc[1:].reset_index(drop = True)
    

    This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.

  3. If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:

     df.insert(i, i, np.nan, allow_duplicates = True) 
     df.iat[0, i] = f"%Difference {j}"
     df.columns = np.arange(len(df.columns))
    

    The first line inserts a column with an integer label filled with NaN values to start with (assuming you have numpy imported). You need to allow duplicates otherwise you'll get an error since the integer value will be the name of a pre-existing column

    The second line changes the value in the 1st row of the newly-created column to what you want.

    The third line resets the column names to be a range of integers like you had to start with.

As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!

  • Related