I have a dataframe from a .txt file, and I am only interested in the data that appears between the <Header>
tags.
0 1
0 webmaster @.com
1 <Header> 121112
2 ReportID 5353
3 Date 20210630
4 Type DMV13
5 </Header> None
6 ZIP 90279
7 State WV
I therefore want to drop all other rows, but can't do it on position because the <Header>
tags appear in different rows depending on the file. So is there a way to drop every row above/below a string match, so that the output looks like:
0 1
0 ReportID 5353
1 Date 20210630
2 Type DMV13
CodePudding user response:
You could extract the index
of the rows where the sign <
or >
is contained using str.contains(r'..|..')
, and then filter your dataframe with iloc
:
# Extract the min and max index of rows that contain '<' or '>'
min_i = df[df['0'].str.contains(r'<|>')].index.min()
max_i = df[df['0'].str.contains(r'<|>')].index.max()
# Filter by index location
df.iloc[min_i 1:max_i,:].reset_index(drop=True)
0 1 t
0 ReportID 5353 0
1 Date 20210630 0
2 Type DMV13 0
Note that iloc
uses is an integer position indexing based lower boundary being an inclusive boundary and upper boundary being an exclusive boundary. Hence the 1 on min_i
no addition on max_i
CodePudding user response:
This as well:
start = df[df['0']=='<Header>'].index[0] 1
end = df[df['0']=='</Header>'].index[0]
df.iloc[start:end].reset_index(drop=True)
0 1
0 ReportID 5353
1 Date 20210630
2 Type DMV13
Or the comment I left produces the same result
df[df['0'].isin(['ReportID', 'Date', 'Type'])]