Home > front end >  Pandas/Python: Splitting data into duplicated rows based on commas and dashes
Pandas/Python: Splitting data into duplicated rows based on commas and dashes

Time:05-04

I asked a similar question on here before, but I realize now that the data I am working with has an additional layer of complexity.

Suppose I have a data set that looks like this

Unique_Identifier  Score1 Score2
110                   10     60 
112,114-116           50     70 
119                   30     20 
120-122               90     90 
125,127               70     80 

Notice how some of my unique identifiers are listed as ranges, rather than exact values. I want to split up those rows such that each Unique Identifier has it's own row with the same data so that it would look like the following table. "," represents a separation of values, whereas "-" represents a range of values.

Unique_Identifier  Score1 Score2
110                   10     60 
112                   50     70
114                   50     70
115                   50     70 
116                   50     70
119                   30     20 
120                   90     90
121                   90     90 
122                   90     90
125                   70     80 
127                   70     80  

How would I go about doing this in Python using Pandas? I believe the answers to the more simple version the last time I asked this involved an explode function. I should also note that there are indeed no spaces in the Unique Identifier column when , or - are present.

CodePudding user response:

split and explode in two steps: once on the "," and once on the "-":

df["Unique_Identifier"] = df["Unique_Identifier"].str.split(",")
df = df.explode("Unique_Identifier")
df["Unique_Identifier"] = df["Unique_Identifier"].apply(lambda x: list(range(int(x.split("-")[0]),int(x.split("-")[1]) 1)) if "-" in x else [int(x)])
df = df.explode("Unique_Identifier")

>>> df
 Unique_Identifier  Score1  Score2
0               110      10      60
1               112      50      70
1               114      50      70
1               115      50      70
1               116      50      70
2               119      30      20
3               120      90      90
3               121      90      90
3               122      90      90
4               125      70      80
4               127      70      80

CodePudding user response:

You can use a custom function to split the ranges to actual numbers, then explode the column:

def torange(s):
    for x in s.split(','):
        if '-' in x:
            a,b = x.split('-')
            yield from range(int(a), int(b) 1)
        else:
            yield int(x)

df2 = (df
 .assign(**{'Unique_Identifier': df['Unique_Identifier']
                                   .apply(lambda x: list(torange(x)))})
 .explode('Unique_Identifier')
)

NB. the torange function only accepts integers, commas, and ranges in the form X-Y where X and Y are integers, else it will fail

output:

  Unique_Identifier  Score1  Score2
0               110      10      60
1               112      50      70
1               114      50      70
1               115      50      70
1               116      50      70
2               119      30      20
3               120      90      90
3               121      90      90
3               122      90      90
4               125      70      80
4               127      70      80

edit: previous more barbaric function:

def torange(s):
    return [y for x in s.split(',') for y in 
            (list(*(range(a, b 1) for a,b in [map(int, x.split('-'))]))
            if '-' in x else [int(x)])]
  • Related