Home > database >  Taking the column names from the first row that has less than x Nan's
Taking the column names from the first row that has less than x Nan's

Time:05-23

I have data as follows:

import pandas as pd

url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
print(df_cities.iloc[0:20,])

The column names can be found in row 15, but I would like this row number to be automatically determined. I thought the best way would be to take the first row for which the values are non-Nan for less than 10 items.

I combined this answer, to find this answer to do the following:

amount_Nan = df_cities.shape[1] - df_cities.count(axis=1)
# OR df.isnull().sum(axis=1).tolist()

print(amount_Nan)
col_names_index = next(i for i in amount_Nan if i < 3)
print(col_names_index)

df_cities.columns = df_cities.iloc[col_names_index]

The problem is that col_names_index keeps returning 0, while it should be 15. I think it is because amount_Nan returns rows and columns because of which next(i for i in amount_Nan if i < 3) works differently than expected.

The thing is that I do not really understand why. Can anyone help?

CodePudding user response:

IIUC you can get first index of non missing value per second column by DataFrame.iloc with Series.notna and Series.idxmax, set columns names by this row and filter out values before this row by index:

i = df_cities.iloc[:, 1].notna().idxmax()
              
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i 1:]
  • Related