I'm working with Pandas and have the following dataset (yes, all values are String
type):
import pandas as pd

data = {'stage': ['1', '1', '1', '1', '2', '2', '2', '4', '4', '4', '4', '4'],
        'hour': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4'],
        'location': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4']
        }
df = pd.DataFrame(data)
df
Output:
    stage     hour  location
0       1   Berlim    Berlim
1       1        1         1
2       1        2         2
3       1        3         3
4       2   Munich    Munich
5       2        1         1
6       2        2         2
7       4  Leipzig   Leipzig
8       4        1         1
9       4        2         2
10      4        3         3
11      4        4         4
The goal is:
1) Repeat the last value found in df['location'] that contains letters for as long as df['location'] does not contain letters.
2) At the end, apply a filter to remove the rows of df where df['hour'] == df['location']. I decided not to ask about this part since I haven't tried it yet.
So for 1) the desired output is:
    stage     hour  location
0       1   Berlim    Berlim
1       1        1    Berlim
2       1        2    Berlim
3       1        3    Berlim
4       2   Munich    Munich
5       2        1    Munich
6       2        2    Munich
7       4  Leipzig   Leipzig
8       4        1   Leipzig
9       4        2   Leipzig
10      4        3   Leipzig
11      4        4   Leipzig
and for 2) desired output is:
   stage  hour  location
0      1     1    Berlim
1      1     2    Berlim
2      1     3    Berlim
3      2     1    Munich
4      2     2    Munich
5      4     1   Leipzig
6      4     2   Leipzig
7      4     3   Leipzig
8      4     4   Leipzig
So I started by trying to fill df['location'] first, and that's the part I can't get working. Running the code below, I always get "Berlim" for all records.
import re

for index, row in df.iterrows():
    isHeader = bool(re.search('[A-Z]', row['location']))
    print('>>>> evaluation(location, isHeader) - ',row['location'], ' , ', isHeader)
    if isHeader == True:
        currentHeader = row['location']
        print("> new header to be used on the next rows: ", currentHeader)
        df['currentHeader'] = currentHeader
    else:
        print('> not a header, so ',currentHeader, 'will be used')
        df['location'] = row['location']
        print('> new pair: ', row['location'], currentHeader)
        df['currentHeader'] = currentHeader
df
Current Output:
    stage     hour  location
0       1   Berlim    Berlim
1       1        1    Berlim
2       1        2    Berlim
3       1        3    Berlim
4       2   Munich    Berlim
5       2        1    Berlim
6       2        2    Berlim
7       4  Leipzig    Berlim
8       4        1    Berlim
9       4        2    Berlim
10      4        3    Berlim
11      4        4    Berlim
Can anyone help me, please? There is a logic error in my code, and I just can't see where it is. If there's a better way to do this, feel free to share.
Thank you!
EDIT
I also tried this one, but it stores only the last header value found in df['location'] and applies it to all records (Leipzig in this case).
for index, row in df.iterrows():
    isHeader = bool(re.search('[A-Z]', row['location']))
    print('>>>> evaluation(location, isHeader) - ',row['location'], ' , ', isHeader)
    if isHeader == True:
        currentHeader = row['location']
        print("> new header to be used on the next rows: ", currentHeader)
        df['currentHeader'] = currentHeader
    else:
        print('> not a header, so ',df['currentHeader'], 'will be used')
        row['location'] = row['location']
        print("else: header", row['location'], currentHeader)
        df['currentHeader'] = currentHeader
CodePudding user response:
import pandas as pd
data = {'stage':['1', '1', '1', '1','2','2','2','4','4','4','4','4'],
'hour':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4'],
'location':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4']
}
df = pd.DataFrame(data)
df.loc[df.location.str.isnumeric(),'location'] = None
df.ffill(inplace=True)
And if you'd like to drop the non-numeric rows for your second output:
df.loc[df.hour.str.isnumeric()]
Output
    stage  hour  location
1       1     1    Berlim
2       1     2    Berlim
3       1     3    Berlim
5       2     1    Munich
6       2     2    Munich
8       4     1   Leipzig
9       4     2   Leipzig
10      4     3   Leipzig
11      4     4   Leipzig
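As for why the loop in the question gives every row the same value: an assignment like df['location'] = value (or df['currentHeader'] = value) writes a single scalar to the whole column, not to the current row. If you want to keep a loop, here is a minimal sketch that writes one cell at a time with df.at. The name current_header and the pattern '[A-Za-z]' are my own choices, not from the question, and the ffill version above is still the better option for anything but small frames.

```python
import re
import pandas as pd

data = {'stage': ['1', '1', '1', '1', '2', '2', '2', '4', '4', '4', '4', '4'],
        'hour': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4'],
        'location': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4']}
df = pd.DataFrame(data)

current_header = None  # hypothetical name: last location seen that contains letters
for index, row in df.iterrows():
    if re.search('[A-Za-z]', row['location']):
        # this row is a header: remember it for the rows that follow
        current_header = row['location']
    else:
        # write to a single cell only; df['location'] = ... would overwrite the whole column
        df.at[index, 'location'] = current_header

print(df)
```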
CodePudding user response:
I didn't read everything in detail but, IIUC, you can compute a boolean mask and use it twice: once to mask and ffill the rows that don't contain letters, and once to slice the output.
output #1:
# make a mask of rows that "have letters in it"
mask = df['location'].str.contains('[a-z]', case=False)
# use the mask to hide the non-match and fill with previous value
out = df.assign(location=df['location'].where(mask).ffill())
output:
    stage     hour  location
0       1   Berlim    Berlim
1       1        1    Berlim
2       1        2    Berlim
3       1        3    Berlim
4       2   Munich    Munich
5       2        1    Munich
6       2        2    Munich
7       4  Leipzig   Leipzig
8       4        1   Leipzig
9       4        2   Leipzig
10      4        3   Leipzig
11      4        4   Leipzig
output #2:
Identical, but also use the (inverted) mask to slice the output:
mask = df['location'].str.contains('[a-z]', case=False)
out2 = df.assign(location=df['location'].where(mask).ffill())[~mask]
or from the previous output:
out2 = out[~mask]
output:
    stage  hour  location
1       1     1    Berlim
2       1     2    Berlim
3       1     3    Berlim
5       2     1    Munich
6       2     2    Munich
8       4     1   Leipzig
9       4     2   Leipzig
10      4     3   Leipzig
11      4     4   Leipzig
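Note that the desired output #2 in the question is numbered 0 to 8, while the slice keeps the original row labels. If that matters, a small sketch of the same mask approach with a reset_index(drop=True) added at the end (that call is the only thing new here) could look like this:

```python
import pandas as pd

data = {'stage': ['1', '1', '1', '1', '2', '2', '2', '4', '4', '4', '4', '4'],
        'hour': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4'],
        'location': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4']}
df = pd.DataFrame(data)

# True for rows whose location contains at least one letter
mask = df['location'].str.contains('[a-z]', case=False)

out2 = (
    df.assign(location=df['location'].where(mask).ffill())  # fill down the last header
      .loc[~mask]                                           # keep only the non-header rows
      .reset_index(drop=True)                               # renumber the rows from 0
)
print(out2)
```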
CodePudding user response:
import pandas as pd
data = {'stage':['1', '1', '1', '1','2','2','2','4','4','4','4','4'],
'hour':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4'],
'location':['Berlim','1','2','3', 'Munich','1','2','Leipzig','1','2','3','4']
}
df = pd.DataFrame(data)
df['location'] = df['location'].apply(lambda x: None if x.isdigit() else x)
# ffill() is the direct equivalent of fillna(method='ffill'), which is deprecated in recent pandas
df['location'] = df['location'].ffill()
df
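For the second output, the filter described in the question (drop the rows where df['hour'] equals df['location']) can be applied once the location column is filled. A minimal sketch, assuming the same data and the same isdigit test as above; the name out is just for illustration:

```python
import pandas as pd

data = {'stage': ['1', '1', '1', '1', '2', '2', '2', '4', '4', '4', '4', '4'],
        'hour': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4'],
        'location': ['Berlim', '1', '2', '3', 'Munich', '1', '2', 'Leipzig', '1', '2', '3', '4']}
df = pd.DataFrame(data)

# blank out the all-digit locations, then carry the last city name forward
df['location'] = df['location'].apply(lambda x: None if x.isdigit() else x)
df['location'] = df['location'].ffill()

# header rows are now the only ones where hour and location match, so drop them
out = df[df['hour'] != df['location']]
print(out)
```

This keeps the original row labels (1, 2, 3, 5, ...); chain .reset_index(drop=True) if you want them renumbered from 0.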