Say I have two CSV files. The first one, input_1.csv
, has an index column, so when I run:
import pandas as pd
df_1 = pd.read_csv("input_1.csv")
df_1
I get a DataFrame with an index column, as well as a column called Unnamed: 0
, which is the same as the index column. I can prevent this duplication by adding the argument index_col=0
and everything is fine.
The second file, input_2.csv
, has no index column, i.e., it looks like this:
| stuff | things |
|--------:|---------:|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
Running pd.read_csv("input_2.csv")
gives me a DataFrame with an index column. In this case, adding the index_col=0
argument will set in the index column to stuff
, as in the CSV file itself.
My problem is that I have a function that contains the read_csv
part, and I want it to return a DataFrame with an index column in either case. Is there a way to detect whether the input file has an index column or not, set one if it doesn't, and do nothing if it does?
CodePudding user response:
CSV has no built-in notion of an "index" column, so the answer I think is that this isn't possible in general.
It would be nice if you could say "use 0 as index only if unnamed", but Pandas does not give us that option.
Therefore you will probably need to just check if an Unnamed:
column appears, and set those columns to be the index.
CodePudding user response:
By index, I hope you mean a column with serial number either starting at 0
or 1
.
You can have some kind of post-import logic to decide, if first column qualifies as an index column:
The logic is, if difference between default index and first column is same for all rows, then the first column contains increasing sequence (starting at any number). Pre-condition is that the column should be numeric.
For example:
idx value
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
pd.api.types.is_numeric_dtype(df[df.columns[0]])
>> True
np.array(df.index) - df.iloc[:,0].values
>> array([-1, -1, -1, -1, -1, -1])
# If all values are equal
len(pd.Series(np.array(df.index) - df.iloc[:,0].values).unique()) == 1
>> True