I have a large dataset I am working through with Pandas that is pulled from Excel.
The data contains a header for each section and the relevant data underneath for each unit, i.e.
<index> | Date/Time| UnitNo. | Reading
0 | NaN | UnitBc36 | NaN
1 | 1/1/2021 | NaN | 100
2 | 1/1/2021 | NaN | 300
...
260253 | NaN | UnitAN95 | NaN
272644 | 3/1/2021 | NaN | 150
I have been able to locate all indexes of the unit headers as below by identifying where the UnitNo. columns is not NaN:
Loc = df[df["UnitNo."].notnull()]
In: Loc.index
Out: ([ 0, 12393, 24786, 37179, 49572, 61965, 74358, 86751,
99144, 111537, 123930, 136323, 148716, 161109, 173502, 185895,
198288, 210681, 223074, 235467, 247860, 260253],
dtype='int64')
I also have a dataframe that matches each index to the relevant header value.
I want to use this information to attach the Unit number as a new column to each row of data, so that later I can groupby this information and analyse each unit.
For example, all rows between index 0 and 12393 (Loc.index[0] and Loc.index[1] above), should have UnitBc36 as a new column. All units between 260253 and the final line (Loc.index[21] above) should have UnitAN95 in the new column. See below:
<index> | Date/Time| UnitNo. | Reading | UnitNo._new_col
0 | NaN | UnitBc36 | NaN | UnitBc36
1 | 1/1/2021 | NaN | 100 | UnitBc36
2 | 1/1/2021 | NaN | 300 | UnitBc36
...
260253 | NaN | UnitAN95 | NaN | UnitAN95
272644 | 3/1/2021 | NaN | 150 | UnitAN95
Could anyone please provide assistance on the best way to approach the above?
I currently am manually doing this by using a large number of condition statements but this is not scalable for different data sizes with different numbers of units.
Any help would be much appreciated and happy to clarify further if required.
Thanks, Jack
CodePudding user response:
Use ffill
(or fillna(method='ffill')
):
df['UnitNo._new_col'] = df['UnitNo.'].ffill()
print(df)
# Output:
Date/Time UnitNo. Reading UnitNo._new_col
0 NaN UnitBc36 NaN UnitBc36
1 1/1/2021 NaN 100.0 UnitBc36
2 1/1/2021 NaN 300.0 UnitBc36
260253 NaN UnitAN95 NaN UnitAN95
272644 3/1/2021 NaN 150.0 UnitAN95
To go further, maybe you want:
out = df.assign(**{'UnitNo.': df['UnitNo.'].ffill()}) \
.loc[lambda x: x['Date/Time'].notna()]
print(out)
# Output:
Date/Time UnitNo. Reading
1 1/1/2021 UnitBc36 100.0
2 1/1/2021 UnitBc36 300.0
272644 3/1/2021 UnitAN95 150.0