Home > front end >  Pandas drop rows appearing above/below string match
Pandas drop rows appearing above/below string match

Time:10-18

I have a dataframe from a .txt file, and I am only interested in the data that appears between the <Header> tags.

          0          1  
0 webmaster      @.com
1  <Header>     121112
2  ReportID       5353
3      Date   20210630
4      Type      DMV13
5 </Header>       None
6       ZIP      90279
7     State         WV

I therefore want to drop all other rows, but can't do it on position because the <Header> tags appear in different rows depending on the file. So is there a way to drop every row above/below a string match, so that the output looks like:

          0          1  
0  ReportID       5353
1      Date   20210630
2      Type      DMV13

CodePudding user response:

You could extract the index of the rows where the sign < or > is contained using str.contains(r'..|..'), and then filter your dataframe with iloc:

# Extract the min and max index of rows that contain '<' or '>'
min_i = df[df['0'].str.contains(r'<|>')].index.min()
max_i = df[df['0'].str.contains(r'<|>')].index.max()

# Filter by index location
df.iloc[min_i 1:max_i,:].reset_index(drop=True)

          0         1  t
0  ReportID      5353  0
1      Date  20210630  0
2      Type     DMV13  0 

Note that iloc uses is an integer position indexing based lower boundary being an inclusive boundary and upper boundary being an exclusive boundary. Hence the 1 on min_i no addition on max_i

CodePudding user response:

This as well:

start = df[df['0']=='<Header>'].index[0]   1
end = df[df['0']=='</Header>'].index[0]
df.iloc[start:end].reset_index(drop=True)

          0         1
0  ReportID      5353
1      Date  20210630
2      Type     DMV13

Or the comment I left produces the same result

df[df['0'].isin(['ReportID', 'Date', 'Type'])]
  • Related