Not sure how to return the desired output for k given the condition on the string "START".
The column names don't need to be used/retained.
NOTE: I will not know in which row or column my string will be, hence the use of the for/if statements.
Input:
df = pd.DataFrame({'A': ['43', 23, 'Ndfg', 34, 0, 56],
                   'B': ['5', 23, 'START', 89, 0, 4],
                   'C': ['65', 7, 'dsfgA', 65, 47, 3],
                   'D': ['65', 7, 'gfd', 3, 0, 7],
                   'E': ['76', 7, 'Start', 5, 12, 1],
                   'F': ['65', 7, 'sdfA', 5, 0, 4],
                   'G': ['12', 7, 'START', 5, 8, 9],
                   'H': ['89', 7, 'gfA', 5, 0, 8],
                   'I': ['23', 7, 'sdfA', 5, 7, 23]})
k = []
for rw in range(df.shape[0]):
    for colm in range(df.shape[1]):
        if df.iloc[rw, colm] == 'START':
            row = rw + 1
            col = colm
            k.append(df.iloc[row:, col:])
            break
Output: df and k
df
A B C D E F G H I
0 43 5 65 65 76 65 12 89 23
1 23 23 7 7 7 7 7 7 7
2 Ndfg START dsfgA gfd Start sdfA START gfA sdfA
3 34 89 65 3 5 5 5 5 5
4 0 0 47 0 12 0 8 0 7
5 56 4 3 7 1 4 9 8 23
k
[ B C D E F G H I
3 89 65 3 5 5 5 5 5
4 0 47 0 12 0 8 0 7
5 4 3 7 1 4 9 8 23]
Desired Output for k: Only the columns of numbers under "Start"
[ B E G
3 89 5 5
4 0 12 8
5 4 1 9]
CodePudding user response:
(df[df.astype(str)
    .apply(lambda s: s.str.lower())
    .eq("start")
    .cummax()]
   .dropna(how="all", axis=0)
   .dropna(how="all", axis=1)
   .iloc[1:])
get a True/False mask dataframe indicating whether each value is "start" or not; to do this
- convert to str
- convert to lowercase
- check equality against "start"
now we have
In [350]: df
Out[350]:
A B C D E F G H I
0 43 5 65 65 76 65 12 89 23
1 23 23 7 7 7 7 7 7 7
2 Ndfg START dsfgA gfd Start sdfA START gfA sdfA
3 34 89 65 3 5 5 5 5 5
4 0 0 47 0 12 0 8 0 7
5 56 4 3 7 1 4 9 8 23
In [351]: df.astype(str).apply(lambda s: s.str.lower()).eq("start")
Out[351]:
A B C D E F G H I
0 False False False False False False False False False
1 False False False False False False False False False
2 False True False False True False True False False
3 False False False False False False False False False
4 False False False False False False False False False
5 False False False False False False False False False
- now the "region" marking: want below parts of True
- False is 0, True is 1 in numeric context
- what if we took cumulative maximum?
In [352]: _.cummax()
Out[352]:
A B C D E F G H I
0 False False False False False False False False False
1 False False False False False False False False False
2 False True False False True False True False False
3 False True False False True False True False False
4 False True False False True False True False False
5 False True False False True False True False False
(_ is the last output above this)
- ok we got the Trues and Falses in place
- index into df with it: values at True positions are kept, False positions become NaN
In [357]: df[_]
Out[357]:
A B C D E F G H I
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN START NaN NaN Start NaN START NaN NaN
3 NaN 89 NaN NaN 5 NaN 5 NaN NaN
4 NaN 0 NaN NaN 12 NaN 8 NaN NaN
5 NaN 4 NaN NaN 1 NaN 9 NaN NaN
- almost there; NaNs need to go
- dropna over rows, but with how="all", so a row is dropped only if it is all NaNs
- do the same for columns too
In [358]: _.dropna(how="all", axis=0).dropna(how="all", axis=1)
Out[358]:
B E G
2 START Start START
3 89 5 5
4 0 12 8
5 4 1 9
- last thing: the "start" marker row itself is not needed, so
.iloc[1:]
it out
In [359]: _.iloc[1:]
Out[359]:
B E G
3 89 5 5
4 0 12 8
5 4 1 9
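Putting the steps above together as a self-contained script (using the same df as in the question):

```python
import pandas as pd

df = pd.DataFrame({'A': ['43', 23, 'Ndfg', 34, 0, 56],
                   'B': ['5', 23, 'START', 89, 0, 4],
                   'C': ['65', 7, 'dsfgA', 65, 47, 3],
                   'D': ['65', 7, 'gfd', 3, 0, 7],
                   'E': ['76', 7, 'Start', 5, 12, 1],
                   'F': ['65', 7, 'sdfA', 5, 0, 4],
                   'G': ['12', 7, 'START', 5, 8, 9],
                   'H': ['89', 7, 'gfA', 5, 0, 8],
                   'I': ['23', 7, 'sdfA', 5, 7, 23]})

# case-insensitive "start" mask, propagated downwards with cummax
mask = df.astype(str).apply(lambda s: s.str.lower()).eq("start").cummax()

# keep only the masked cells, drop all-NaN rows/columns, drop the marker row
k = (df[mask]
     .dropna(how="all", axis=0)
     .dropna(how="all", axis=1)
     .iloc[1:])

print(k)
```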
extra: time comparison for a frame with ~50_000 entries
# this is 6000 x 9
In [364]: df = pd.concat([df] * 1000, ignore_index=True)
In [366]: %%timeit
...: (
...: df
...: .applymap(lambda x: type(x) == str and x.upper() == "START")
...: .stack()
...: .loc[lambda x: x]
...: .reset_index()
...: .apply(lambda x: df.loc[slice(x["level_0"] + 1, None), x['level_1']], axis=1)
...: .T
...: )
...:
...:
4.11 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [367]: %%timeit
...: (df[df.astype(str)
...: .apply(lambda s: s.str.lower())
...: .eq("start").cummax()]
...: .dropna(how="all", axis=0)
...: .dropna(how="all", axis=1)
...: .iloc[1:])
...:
27.1 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
~150 times faster
CodePudding user response:
One way to do it:
df2 = df.apply(lambda x: x.str.upper())  # first convert str values to upper
# find the row where START appears - assuming there is only one such row
df2 = df2.query('|'.join(df2.columns.map('{}=="START"'.format).tolist()))
row_index = df2.index[0]  # get the index where START appears
s = df2.eq('START').iloc[0]  # find 'START'
col_index = s[s].index  # get the columns where START appears
df.iloc[row_index + 1:][col_index]  # then locate
output:
B E G
3 89 5 5
4 0 12 8
5 4 1 9
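As a self-contained run of this approach (df as defined in the question; note that query's default pandas parser gives | the same precedence as or, so the unparenthesized comparisons joined here parse as intended):

```python
import pandas as pd

df = pd.DataFrame({'A': ['43', 23, 'Ndfg', 34, 0, 56],
                   'B': ['5', 23, 'START', 89, 0, 4],
                   'C': ['65', 7, 'dsfgA', 65, 47, 3],
                   'D': ['65', 7, 'gfd', 3, 0, 7],
                   'E': ['76', 7, 'Start', 5, 12, 1],
                   'F': ['65', 7, 'sdfA', 5, 0, 4],
                   'G': ['12', 7, 'START', 5, 8, 9],
                   'H': ['89', 7, 'gfA', 5, 0, 8],
                   'I': ['23', 7, 'sdfA', 5, 7, 23]})

df2 = df.apply(lambda x: x.str.upper())          # non-strings become NaN
df2 = df2.query('|'.join(df2.columns.map('{}=="START"'.format).tolist()))
row_index = df2.index[0]                         # row where START appears
s = df2.eq('START').iloc[0]
col_index = s[s].index                           # columns where START appears
k = df.iloc[row_index + 1:][col_index]
print(k)
```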
CodePudding user response:
(
    df
    .applymap(lambda x: type(x) == str and x.upper() == "START")
    .stack()
    .loc[lambda x: x]
    .reset_index()
    .apply(lambda x: df.loc[slice(x["level_0"] + 1, None), x['level_1']], axis=1)
    .T
)
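For completeness, a runnable version of this answer (same df as in the question). One caveat: because apply(axis=1) expands each returned Series into columns, the transposed result carries integer column labels (0, 1, 2) rather than the original B, E, G:

```python
import pandas as pd

df = pd.DataFrame({'A': ['43', 23, 'Ndfg', 34, 0, 56],
                   'B': ['5', 23, 'START', 89, 0, 4],
                   'C': ['65', 7, 'dsfgA', 65, 47, 3],
                   'D': ['65', 7, 'gfd', 3, 0, 7],
                   'E': ['76', 7, 'Start', 5, 12, 1],
                   'F': ['65', 7, 'sdfA', 5, 0, 4],
                   'G': ['12', 7, 'START', 5, 8, 9],
                   'H': ['89', 7, 'gfA', 5, 0, 8],
                   'I': ['23', 7, 'sdfA', 5, 7, 23]})

res = (
    df
    # True where the cell is a string equal to "START" (case-insensitive)
    .applymap(lambda x: type(x) == str and x.upper() == "START")
    # stack to a (row, column) MultiIndex Series and keep only the Trues
    .stack()
    .loc[lambda x: x]
    # for each hit, slice the original column from the row below the hit
    .reset_index()
    .apply(lambda x: df.loc[slice(x["level_0"] + 1, None), x['level_1']], axis=1)
    .T
)
print(res)
```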