I'd like to drop all data in a .txt file that I've converted to a data frame after the second instance of a column value. In this case a delimiter "---".
Dataframe is constructed as follows:
15 Leading Causes of Death 15 Code Deaths Population Crude Rate Crude Rate Lower 95% Confidence Interval Crude Rate Upper 95% Confidence Interval
#Accidents (unintentional injuries) (V01-X59,Y85-Y86) GR113-112 21 152430 13.8 8.5 21.1
#Intentional self-harm (suicide) (*U03,X60-X84,Y87.0) GR113-124 15 152430 Unreliable 5.5 16.2
---
Dataset: Underlying Cause of Death, 1999-2019
Query Parameters:
States: Marin County, CA (06041)
Ten-Year Age Groups: 25-34 years
Year/Month: 1999; 2000; 2001; 2002; 2003
Group By: 15 Leading Causes of Death
Show Totals: Disabled
Show Zero Values: Disabled
Show Suppressed: Disabled
Calculate Rates Per: 100,000
Rate Options: Default intercensal populations for years 2001-2009 (except Infant Age Groups)
---
Help: See http://wonder.cdc.gov/wonder/help/ucd.html for more information.
---
Query Date: Sep 23, 2021 6:51:59 PM
I've seen plenty of solutions for how to do so after the first instance of a column value or NaN etc. but nothing for the second or nth for that matter...
Here's the simple code I have so far to read in the file.
import pandas as pd
dl = pd.read_csv('Underlying Cause of Death, 1999-2019(3).txt', sep = '\t')
dl.to_csv('test.csv', index = False)
CodePudding user response:
Find rows start with '---' and apply a cumulative sum then get the index of the first row equals to 2 and slice your dataframe to this index.
>>> df.iloc[:df.iloc[:, 0].str.startswith('---').cumsum().eq(2).idxmax()]
0 #Accidents (unintentional injuries) (V01-X59,Y... GR113-112 21.0 152430.0 13.8 8.5 21.1
1 #Intentional self-harm (suicide) (*U03,X60-X84... GR113-124 15.0 152430.0 Unreliable 5.5 16.2
2 --- NaN NaN NaN NaN NaN NaN
3 Dataset: Underlying Cause of Death, 1999-2019 NaN NaN NaN NaN NaN NaN
4 Query Parameters: NaN NaN NaN NaN NaN NaN
5 States: Marin County, CA (06041) NaN NaN NaN NaN NaN NaN
6 Ten-Year Age Groups: 25-34 years NaN NaN NaN NaN NaN NaN
7 Year/Month: 1999 2000 2001.0 2002.0 2003 NaN NaN
8 Group By: 15 Leading Causes of Death NaN NaN NaN NaN NaN NaN
9 Show Totals: Disabled NaN NaN NaN NaN NaN NaN
10 Show Zero Values: Disabled NaN NaN NaN NaN NaN NaN
11 Show Suppressed: Disabled NaN NaN NaN NaN NaN NaN
12 Calculate Rates Per: 100,000 NaN NaN NaN NaN NaN NaN
13 Rate Options: Default intercensal populations ... NaN NaN NaN NaN NaN NaN