How to automatically set an index to a Pandas DataFrame when reading a CSV with or without an index-CodePudding

Say I have two CSV files. The first one, input_1.csv, has an index column, so when I run:

import pandas as pd
df_1 = pd.read_csv("input_1.csv")
df_1

I get a DataFrame with an index column, as well as a column called Unnamed: 0, which is the same as the index column. I can prevent this duplication by adding the argument index_col=0 and everything is fine.

The second file, input_2.csv, has no index column, i.e., it looks like this:

|   stuff |   things |
|--------:|---------:|
|       1 |       10 |
|       2 |       20 |
|       3 |       30 |
|       4 |       40 |
|       5 |       50 |

Running pd.read_csv("input_2.csv") gives me a DataFrame with an index column. In this case, adding the index_col=0 argument will set in the index column to stuff, as in the CSV file itself.

My problem is that I have a function that contains the read_csv part, and I want it to return a DataFrame with an index column in either case. Is there a way to detect whether the input file has an index column or not, set one if it doesn't, and do nothing if it does?

CodePudding user response：

CSV has no built-in notion of an "index" column, so the answer I think is that this isn't possible in general.

It would be nice if you could say "use 0 as index only if unnamed", but Pandas does not give us that option.

Therefore you will probably need to just check if an Unnamed: column appears, and set those columns to be the index.

CodePudding user response：

By index, I hope you mean a column with serial number either starting at 0 or 1.

You can have some kind of post-import logic to decide, if first column qualifies as an index column:

The logic is, if difference between default index and first column is same for all rows, then the first column contains increasing sequence (starting at any number). Pre-condition is that the column should be numeric.

For example:

   idx value
0    1     a
1    2     b
2    3     c
3    4     d
4    5     e
5    6     f

pd.api.types.is_numeric_dtype(df[df.columns[0]])
>> True

np.array(df.index) - df.iloc[:,0].values
>> array([-1, -1, -1, -1, -1, -1])

# If all values are equal
len(pd.Series(np.array(df.index) - df.iloc[:,0].values).unique()) == 1
>> True