Iterate over a column in pandas and dynamically change its value until a new one is found [Pandas]

I'm working with Pandas and have the following dataset (yes, all values are String type):

import re
import pandas as pd

data = {'stage':['1', '1', '1', '1','2','2','2','4','4','4','4','4'],
        'hour':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4'],
        'location':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4']
       }
 
df = pd.DataFrame(data)

df

Output:

    stage   hour        location
0   1       Berlim      Berlim
1   1       1           1
2   1       2           2
3   1       3           3
4   2       Munich      Munich
5   2       1           1
6   2       2           2
7   4       Leipzig     Leipzig
8   4       1           1
9   4       2           2
10  4       3           3
11  4       4           4

The goal is:

1) Repeat the last value of df['location'] that contains letters on every following row where df['location'] has no letters in it.
2) At the end, apply a filter to remove the rows where df['hour'] == df['location']. I decided not to ask about this part, since I haven't tried it yet.

So for 1) the desired output is:

    stage   hour        location
0   1       Berlim      Berlim
1   1       1           Berlim
2   1       2           Berlim
3   1       3           Berlim
4   2       Munich      Munich
5   2       1           Munich
6   2       2           Munich
7   4       Leipzig     Leipzig
8   4       1           Leipzig
9   4       2           Leipzig
10  4       3           Leipzig
11  4       4           Leipzig

and for 2) desired output is:

    stage   hour        location
0   1       1           Berlim
1   1       2           Berlim
2   1       3           Berlim
3   2       1           Munich
4   2       2           Munich
5   4       1           Leipzig
6   4       2           Leipzig
7   4       3           Leipzig
8   4       4           Leipzig

So I started by trying to fill df['location'], and that's the part I can't get working. Running the code below, I always get "Berlim" for all records.

for index, row in df.iterrows():
    isHeader = bool(re.search('[A-Z]', row['location']))
    print('>>>> evaluation(location, isHeader) - ',row['location'], ' , ', isHeader)
    if isHeader == True:
        currentHeader = row['location']
        print("> new header to be used on the next rows: ", currentHeader)
        df['currentHeader'] = currentHeader
    else:
        print('> not a header, so ',currentHeader, 'will be used')
        df['location'] = row['location']
        print('> new pair: ', row['location'], currentHeader)
        df['currentHeader'] = currentHeader
        
df

Current Output:

    stage   hour        location
0   1       Berlim      Berlim
1   1       1           Berlim
2   1       2           Berlim
3   1       3           Berlim
4   2       Munich      Berlim
5   2       1           Berlim
6   2       2           Berlim
7   4       Leipzig     Berlim
8   4       1           Berlim
9   4       2           Berlim
10  4       3           Berlim
11  4       4           Berlim

Can anyone help me, please? There's a logic error somewhere and I just can't see where. If there's a better way to do this, feel free to share.

Thank you!

EDIT: I also tried the version below, but it stores the last value of df['location'] (Leipzig in this case) and applies it to all records.

for index, row in df.iterrows():
    isHeader = bool(re.search('[A-Z]', row['location']))
    print('>>>> evaluation(location, isHeader) - ',row['location'], ' , ', isHeader)
    if isHeader == True:
        currentHeader = row['location']
        print("> new header to be used on the next rows: ", currentHeader)
        df['currentHeader'] = currentHeader
    else:
        print('> not a header, so ',df['currentHeader'], 'will be used')
        row['location'] = row['location']
        print("else: header", row['location'], currentHeader)
        df['currentHeader'] = currentHeader

CodePudding user response：

import pandas as pd

data = {'stage':['1', '1', '1', '1','2','2','2','4','4','4','4','4'],
        'hour':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4'],
        'location':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4']
       }
 
df = pd.DataFrame(data)

df.loc[df.location.str.isnumeric(),'location'] = None
df.ffill(inplace=True)

And if you'd like to drop the non-numeric rows for your second output:

df.loc[df.hour.str.isnumeric()]

Output

   stage hour location
1      1    1   Berlim
2      1    2   Berlim
3      1    3   Berlim
5      2    1   Munich
6      2    2   Munich
8      4    1  Leipzig
9      4    2  Leipzig
10     4    3  Leipzig
11     4    4  Leipzig
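
Note that the filtered frame above keeps the original index labels (1, 2, 3, 5, ...), while the desired output in the question is numbered 0 to 8. A minimal sketch of both steps chained together with a renumbered index, assuming the same data as in the question; the names is_numeric_loc, filled and result are only illustrative:

```python
import pandas as pd

data = {'stage': ['1', '1', '1', '1', '2', '2', '2', '4', '4', '4', '4', '4'],
        'hour': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4'],
        'location': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4']}
df = pd.DataFrame(data)

# True for rows whose location is purely numeric (i.e. not a city name)
is_numeric_loc = df['location'].str.isnumeric()

# Hide the numeric locations (they become NaN), then forward-fill the city names
filled = df.assign(location=df['location'].mask(is_numeric_loc).ffill())

# Keep only the rows with a numeric hour and renumber the index from 0
result = filled[filled['hour'].str.isnumeric()].reset_index(drop=True)
print(result)
```

Using assign leaves the original df untouched, which makes it easier to re-run the steps while experimenting.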

CodePudding user response：

I didn't read everything in detail but, IIUC, you can compute a boolean mask, then use it both to mask and ffill the non-word rows and to slice the output.

output #1:

# make a mask of rows that "have letters in it"
mask = df['location'].str.contains('[a-z]', case=False)

# use the mask to hide the non-match and fill with previous value
out = df.assign(location=df['location'].where(mask).ffill())

output:

   stage     hour location
0      1   Berlim   Berlim
1      1        1   Berlim
2      1        2   Berlim
3      1        3   Berlim
4      2   Munich   Munich
5      2        1   Munich
6      2        2   Munich
7      4  Leipzig  Leipzig
8      4        1  Leipzig
9      4        2  Leipzig
10     4        3  Leipzig
11     4        4  Leipzig
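
To see what each step of the mask approach does, here is a small sketch that lays the intermediate results side by side. It only uses the location values from the question; the names masked, filled and steps are illustrative and not part of the original answer:

```python
import pandas as pd

location = pd.Series(['Berlim', '1', '2', '3', 'Munich', '1', '2',
                      'Leipzig', '1', '2', '3', '4'])

# True where the value contains at least one letter (a city name)
mask = location.str.contains('[a-z]', case=False)

# where() keeps values where the mask is True and puts NaN everywhere else
masked = location.where(mask)

# ffill() replaces each NaN with the last non-missing value above it
filled = masked.ffill()

steps = pd.DataFrame({'location': location, 'mask': mask,
                      'masked': masked, 'filled': filled})
print(steps)
```

The masked column should contain only the three city names with missing values in between, which is exactly what ffill needs in order to carry each name down to the next header row.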

output #2:

Identical, but also use the (inverted) mask to slice the output:

mask = df['location'].str.contains('[a-z]', case=False)
out2 = df.assign(location=df['location'].where(mask).ffill())[~mask]

or from the previous output:

out2 = out[~mask]

output:

   stage hour location
1      1    1   Berlim
2      1    2   Berlim
3      1    3   Berlim
5      2    1   Munich
6      2    2   Munich
8      4    1  Leipzig
9      4    2  Leipzig
10     4    3  Leipzig
11     4    4  Leipzig

CodePudding user response：

import pandas as pd
data = {'stage':['1', '1', '1', '1','2','2','2','4','4','4','4','4'],
        'hour':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4'],
        'location':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4']
       }
df = pd.DataFrame(data)
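# replace purely numeric locations with None so they can be filled
df['location'] = df['location'].apply(lambda x: None if x.isdigit() else x)
# forward-fill; fillna(method='ffill') is deprecated in recent pandas versions
df['location'] = df['location'].ffill()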
df
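
As for the loop in the question: an assignment such as df['location'] = some_value or df['currentHeader'] = some_value with a scalar writes that value to every row of the column, not just the current one, so each iteration overwrites the whole column. If an explicit loop is preferred over ffill, a minimal sketch could collect the values in a list and assign the column once at the end. The names current_header and filled are hypothetical, and the regex assumes that headers are the only values containing letters:

```python
import re
import pandas as pd

data = {'stage': ['1', '1', '1', '1', '2', '2', '2', '4', '4', '4', '4', '4'],
        'hour': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4'],
        'location': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4']}
df = pd.DataFrame(data)

current_header = None  # last location seen that contains letters
filled = []            # one entry per row, in order

for value in df['location']:
    if re.search('[A-Za-z]', value):
        # this row is a header: remember it for the following rows
        current_header = value
    filled.append(current_header)

# assign the whole column once, with one value per row
df['location'] = filled
print(df)
```

The vectorized ffill versions above are usually shorter and faster; the loop is mainly useful for seeing the logic step by step.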