Home > other >  Forward filling missing columns with a cumulative counter
Forward filling missing columns with a cumulative counter

Time:04-06

I have an excel file. It has columns without a name. This columns are the next instance of the last named column. I would the columns without a name to be named after the last named column a counter of how many empty columns there were since the last named one.

I have something like this:

enter image description here

If I just read the .csv normally I get:

   a  b  Unnamed: 2  Unnamed: 3  c  Unnamed: 5  d
0  1  1           1           1  1           1  1
1  1  1           1           1  1           1  1
2  1  1           1           1  1           1  1
3  1  1           1           1  1           1  1
4  1  1           1           1  1           1  1

So I read with header=None, than I get the columns in the first row, where I can use ffill to fill them like I wish. The only I still want to add is a counter.

I would want my output to be like:

   a b0          b1          b2 c0          c1  d
0  1  1           1           1  1           1  1
1  1  1           1           1  1           1  1
2  1  1           1           1  1           1  1
3  1  1           1           1  1           1  1
4  1  1           1           1  1           1  1

CodePudding user response:

First forward filling missing values create by Series.mask with ffill and then add counter:

s = df.columns.to_series()
df.columns = s.mask(s.str.startswith('Unnamed')).ffill()

#https://stackoverflow.com/a/61853830/2901002
from collections import defaultdict

renamer = defaultdict()

for column_name in df.columns[df.columns.duplicated(keep=False)].tolist():
    if column_name not in renamer:
        renamer[column_name] = [column_name '0']
    else:
        renamer[column_name].append(column_name   str(len(renamer[column_name])))

df = df.rename(
    columns=lambda column_name: renamer[column_name].pop(0)
    if column_name in renamer 
    else column_name
)
print (df)
   a  b0  b1  b2  c0  c1  d
0  1   1   1   1   1   1  1
1  1   1   1   1   1   1  1
2  1   1   1   1   1   1  1
3  1   1   1   1   1   1  1
4  1   1   1   1   1   1  1

Another idea is use default pandas function for deduplicate columns names:

s = df.columns.to_series()

df.columns = s.mask(s.str.startswith('Unnamed')).ffill()

#https://stackoverflow.com/a/43792894/2901002
df.columns = pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
print (df)
   a  b  b.1  b.2  c  c.1  d
0  1  1    1    1  1    1  1
1  1  1    1    1  1    1  1
2  1  1    1    1  1    1  1
3  1  1    1    1  1    1  1
4  1  1    1    1  1    1  1
  • Related