Home > Blockchain >  How to automatically set an index to a Pandas DataFrame when reading a CSV with or without an index
How to automatically set an index to a Pandas DataFrame when reading a CSV with or without an index

Time:10-20

Say I have two CSV files. The first one, input_1.csv, has an index column, so when I run:

import pandas as pd
df_1 = pd.read_csv("input_1.csv")
df_1

I get a DataFrame with an index column, as well as a column called Unnamed: 0, which is the same as the index column. I can prevent this duplication by adding the argument index_col=0 and everything is fine.

The second file, input_2.csv, has no index column, i.e., it looks like this:

|   stuff |   things |
|--------:|---------:|
|       1 |       10 |
|       2 |       20 |
|       3 |       30 |
|       4 |       40 |
|       5 |       50 |

Running pd.read_csv("input_2.csv") gives me a DataFrame with an index column. In this case, adding the index_col=0 argument will set in the index column to stuff, as in the CSV file itself.

My problem is that I have a function that contains the read_csv part, and I want it to return a DataFrame with an index column in either case. Is there a way to detect whether the input file has an index column or not, set one if it doesn't, and do nothing if it does?

CodePudding user response:

CSV has no built-in notion of an "index" column, so the answer I think is that this isn't possible in general.

It would be nice if you could say "use 0 as index only if unnamed", but Pandas does not give us that option.

Therefore you will probably need to just check if an Unnamed: column appears, and set those columns to be the index.

CodePudding user response:

By index, I hope you mean a column with serial number either starting at 0 or 1.

You can have some kind of post-import logic to decide, if first column qualifies as an index column:

The logic is, if difference between default index and first column is same for all rows, then the first column contains increasing sequence (starting at any number). Pre-condition is that the column should be numeric.

For example:

   idx value
0    1     a
1    2     b
2    3     c
3    4     d
4    5     e
5    6     f

pd.api.types.is_numeric_dtype(df[df.columns[0]])
>> True

np.array(df.index) - df.iloc[:,0].values
>> array([-1, -1, -1, -1, -1, -1])

# If all values are equal
len(pd.Series(np.array(df.index) - df.iloc[:,0].values).unique()) == 1
>> True
  • Related