Home > Back-end >  Pandas copy value from one column to another based on a value third column that is not constant
Pandas copy value from one column to another based on a value third column that is not constant

Time:12-04

I have a large dataset that is one huge table that actually should be many tables. The headers are buried in the rows for the subsets.

My goal is to pull those headers out into a new column such that I can filter by that column to get the data I want (one header at a time). I've created an empty header column for this. There is always a series of 3 NaN values in the SCORE column where the first value in the NAME column in that series is the HEADER I want. So I'm thinking something about that relationship could be leveraged.

Current Pandas data frame has this structure:

s = '''HEADER,NAME,SCORE
NaN,Header 1,NaN
NaN,Random Junk,NaN
NaN,Random Junk,NaN
NaN,Ed,98
NaN,Gary,78
NaN,Floyd,89
NaN,Header 2,NaN
NaN,Random Junk,NaN
NaN,Random Junk,NaN
NaN,Mary,96
NaN,Steve,78'''

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(s))
HEADER   NAME              SCORE
NaN      Header 1          NaN
NaN      Random Junk       NaN
NaN      Random Junk       NaN
NaN      Ed                98
NaN      Gary              78
NaN      Floyd             89
...      ...               ...
NaN      Header 2          NaN
NaN      Random Junk       NaN
NaN      Random Junk       NaN
NaN      Mary              96
NaN      Steve             78

and I want this:

HEADER        NAME              SCORE
Header 1      Header 1          NaN
Header 1      Random Junk       NaN
Header 1      Random Junk       NaN
Header 1      Ed                98
Header 1      Gary              78
Header 1      Floyd             89
...           ...               ...
Header 2      Header 2          NaN
Header 2      Random Junk       NaN
Header 2      Random Junk       NaN
Header 2      Mary              96
Header 2      Steve             78

so I can then remove the NaN rows and get what I'm truly after which is this:

HEADER        NAME              SCORE
Header 1      Ed                98
Header 1      Gary              78
Header 1      Floyd             89
...           ...               ...
Header 2      Mary              96
Header 2      Steve             78

After much searching, I'm not able to figure out how to do this conditional editing like this. Would appreciate any help you can give.

CodePudding user response:

The header lines occur when SCORE has 3 NaN and 1 non-NaN in sequence, so:

  1. Check for this condition using shift, isna, and notna.
  2. mask the HEADER column as NAME when this condition is met.
  3. ffill (forward fill) the new HEADER.
  4. dropna based on the SCORE.
is_header = df.SCORE.isna() & df.SCORE.shift(-1).isna() & df.SCORE.shift(-2).isna() & df.SCORE.shift(-3).notna()
df.HEADER = df.HEADER.mask(is_header, df.NAME).ffill()
df = df.dropna(subset=['SCORE'])

#       HEADER   NAME  SCORE
# 3   Header 1     Ed   98.0
# 4   Header 1   Gary   78.0
# 5   Header 1  Floyd   89.0
# 9   Header 2   Mary   96.0
# 10  Header 2  Steve   78.0
  • Related