I asked a similar question on here before, but I realize now that the data I am working with has an additional layer of complexity.
Suppose I have a data set that looks like this
Unique_Identifier Score1 Score2
110 10 60
112,114-116 50 70
119 30 20
120-122 90 90
125,127 70 80
Notice how some of my unique identifiers are listed as ranges, rather than exact values. I want to split up those rows such that each Unique Identifier has it's own row with the same data so that it would look like the following table. "," represents a separation of values, whereas "-" represents a range of values.
Unique_Identifier Score1 Score2
110 10 60
112 50 70
114 50 70
115 50 70
116 50 70
119 30 20
120 90 90
121 90 90
122 90 90
125 70 80
127 70 80
How would I go about doing this in Python using Pandas? I believe the answers to the more simple version the last time I asked this involved an explode function. I should also note that there are indeed no spaces in the Unique Identifier column when , or - are present.
CodePudding user response:
split
and explode
in two steps: once on the "," and once on the "-":
df["Unique_Identifier"] = df["Unique_Identifier"].str.split(",")
df = df.explode("Unique_Identifier")
df["Unique_Identifier"] = df["Unique_Identifier"].apply(lambda x: list(range(int(x.split("-")[0]),int(x.split("-")[1]) 1)) if "-" in x else [int(x)])
df = df.explode("Unique_Identifier")
>>> df
Unique_Identifier Score1 Score2
0 110 10 60
1 112 50 70
1 114 50 70
1 115 50 70
1 116 50 70
2 119 30 20
3 120 90 90
3 121 90 90
3 122 90 90
4 125 70 80
4 127 70 80
CodePudding user response:
You can use a custom function to split the ranges to actual numbers, then explode
the column:
def torange(s):
for x in s.split(','):
if '-' in x:
a,b = x.split('-')
yield from range(int(a), int(b) 1)
else:
yield int(x)
df2 = (df
.assign(**{'Unique_Identifier': df['Unique_Identifier']
.apply(lambda x: list(torange(x)))})
.explode('Unique_Identifier')
)
NB. the torange
function only accepts integers, commas, and ranges in the form X-Y
where X and Y are integers, else it will fail
output:
Unique_Identifier Score1 Score2
0 110 10 60
1 112 50 70
1 114 50 70
1 115 50 70
1 116 50 70
2 119 30 20
3 120 90 90
3 121 90 90
3 122 90 90
4 125 70 80
4 127 70 80
edit: previous more barbaric function:
def torange(s):
return [y for x in s.split(',') for y in
(list(*(range(a, b 1) for a,b in [map(int, x.split('-'))]))
if '-' in x else [int(x)])]