Please help. I have two tables: 1 report and 1 data file.
The data table is presented as follows:
PATIENTS_ID | POL | Age | ICD10 |
---|---|---|---|
10848754 | 0 | 22 | H52 |
10848754 | 0 | 22 | R00 |
10848754 | 0 | 22 | Z01 |
10848754 | 0 | 22 | Z02 |
10850478 | 1 | 26 | H52 |
And etc.
The report file asks to collect the following data:
ICD10 | Male (20-29) | Male (30-39) | Female (20-29) | Female (30-39) |
---|---|---|---|---|
C00 - C97 | ||||
E10 - E14 | ||||
I00 - I99 |
So... I need to collect all "ICD10" data which include the gap between C00 to C99, and aggregate together with gender and age span. I know that in SQL there is a "BETWEEN " that will quite easily build a range and select values like this without additional conditions: "C00, C01, C02". Is there something similar in python/pandas?
Logical expressions like ">= C00 <= C99" will include other letters, already tried. I would be grateful for help. Creating a separate parser/filter seems too massive for such a job.
CodePudding user response:
I'll assume that Excel can be used for a solution.
Lets say PATIENS_ID is column A.... ICD10 is column D. You can use this expression to count in range:
=COUNTIFS(D:D,">=C00",D:D,"<=C99")
I'll assume that "POL" is gender and 0 is male. So formula for Male (20-29) and ICD10 C00 - C99 would be:
=COUNTIFS(D:D,">=C00",D:D,"<=C99",C:C,">=20",C:C,"<=29",B:B,"0")
CodePudding user response:
If there is only one letter as "identifier", like C02, E34, etc. you can split your column ICD10 into two columns, first one is the first character of ICD10, and second are the numbers.
df.loc[:, "Letter_identifier"] = df["ICD10"].str[0]
df.loc[:, "Number_identifier"] = df["ICD10"].str[1:].astype(int)
Then you can create a masks like:
(df["Letter_identifier"] == "C") & (df["Number_identifier"] > 0) & (df["Number_identifier"] <= 99)
You can split your dataframe as shown, aggregate on those sub-dataframes and concat your result.