Selection of a condition by a range that includes strings (letter numbers)-CodePudding

Please help. I have two tables: 1 report and 1 data file.

The data table is presented as follows:

PATIENTS_ID	POL	Age	ICD10
10848754	0	22	H52
10848754	0	22	R00
10848754	0	22	Z01
10848754	0	22	Z02
10850478	1	26	H52

And etc.

The report file asks to collect the following data:

ICD10	Male (20-29)	Male (30-39)	Female (20-29)	Female (30-39)
C00 - C97
E10 - E14
I00 - I99

So... I need to collect all "ICD10" data which include the gap between C00 to C99, and aggregate together with gender and age span. I know that in SQL there is a "BETWEEN " that will quite easily build a range and select values like this without additional conditions: "C00, C01, C02". Is there something similar in python/pandas?

Logical expressions like ">= C00 <= C99" will include other letters, already tried. I would be grateful for help. Creating a separate parser/filter seems too massive for such a job.

CodePudding user response：

I'll assume that Excel can be used for a solution.

Lets say PATIENS_ID is column A.... ICD10 is column D. You can use this expression to count in range:

=COUNTIFS(D:D,">=C00",D:D,"<=C99")

I'll assume that "POL" is gender and 0 is male. So formula for Male (20-29) and ICD10 C00 - C99 would be:

=COUNTIFS(D:D,">=C00",D:D,"<=C99",C:C,">=20",C:C,"<=29",B:B,"0")

CodePudding user response：

If there is only one letter as "identifier", like C02, E34, etc. you can split your column ICD10 into two columns, first one is the first character of ICD10, and second are the numbers.

df.loc[:, "Letter_identifier"] = df["ICD10"].str[0]

df.loc[:, "Number_identifier"] = df["ICD10"].str[1:].astype(int)

Then you can create a masks like:

(df["Letter_identifier"] == "C") & (df["Number_identifier"] > 0) & (df["Number_identifier"] <= 99)

You can split your dataframe as shown, aggregate on those sub-dataframes and concat your result.