Parsing the data on the table-CodePudding

I have a table in which one column contains a range of values, and another column contains the value associated with the range of values.

0-0.26      3
0.27-0.78   2.9
0.79-1.31   2.8
1.32-1.83   2.7
1.84-2.36   2.6
2.37-2.88   2.5
2.89-3.41   2.4
3.42-3.93   2.3
3.94-4.46   2.2
4.47-4.98   2.1
4.99-5.51   2
5.52-6.03   1.9
6.04-6.56   1.8
6.57-7.08   1.7
7.09-7.61   1.6
7.62-8.13   1.5
8.14-8.66   1.4
8.67-9.18   1.3
9.19-9.71   1.2
9.72-10     1.1
>10         1

I want to create a method to accept a value that lies inside the range of values and return the corresponding value in column 2.

I can implement this using if-else, but my column is lengthy, and I have multiple tables containing similar data sets.

I'm trying to build a solution using Python with nearly zero experience. Can anyone guide me in the right direction?

CodePudding user response：

You can use this, if there is no requirement to the time complexity(with a binary search this will be much faster). This approach uses a dict to store the 2nd column as the key and a tuple to store the range associated with the key. It does some string manipulation to acquire the floats from the string in the first column. Iterates through all key - range pairs to compare the values.

def string_to_tuple(dictionary, string, value):  # string: your range, value: 2 column

    if ">" in string:

        dictionary[value] = (float(string.replace(">","")), None)
        return
    if "<" in string:
        dictionary[value] = (None, float(string.replace("<","")))
        return
    value_range = string.split('-')
    dictionary[value] = (float(value_range[0]), float(value_range[1]))


def get_key_of_value(_dict, value): # value: value to get the key for
    for key in _dict.keys():
        if check_if_boundaries(_dict[key], value):
            return key
    return None  # if not found


def check_boundaries(tuple, value):
    if tuple[0] == None:
        if value < tuple[1]:
            return True
    if tuple[1] == None:
        if value > tuple[0]:
            return True
    if tuple[0] <= value and tuple[1] >= value:
        return True
    return False


if __name__ == '__main__':
    _dict = {}
    x = [["8.14-8.66", "1.4"], ["8.67-9.18", "1.3"],[">10", "1"]]
    for entry in x:
        string_to_tuple(_dict, entry[0], entry[1])

    print(get_range(_dict, 11))# your number to check

CodePudding user response：

Here is a way to do it efficiently.

First, we need to modify your conditions to make them precise (e.g., how "closed" are intervals, etc.). In the following, we transform the string of column 'interval' into a left bound. Our convention will be that each interval is between the left bound (inclusive) and the following next bound (exclusive) or inf for the last interval:

df['left'] = df['interval'].replace('>10', '10-').apply(lambda s: float(s.split('-')[0]))
>>> df.tail()
     interval  value   left
16  8.14-8.66    1.4   8.14
17  8.67-9.18    1.3   8.67
18  9.19-9.71    1.2   9.19
19    9.72-10    1.1   9.72
20        >10    1.0  10.00

Then, we can use cut to efficiently sift a bunch of values into bins, and then use the desired value as label:

# example
np.random.seed(0)
df2 = pd.DataFrame({'x': np.random.uniform(0, 12, 10)})

# now, do the mapping:
df2['value'] = pd.cut(
    df2['x'],
    bins=df['left'].tolist()   [float('Inf')],
    labels=df['value'],
    right=False
).astype(float)
>>> df2
           x  value
0   6.585762    1.7
1   8.582272    1.4
2   7.233161    1.6
3   6.538598    1.8
4   5.083858    2.0
5   7.750729    1.5
6   5.251047    2.0
7  10.701276    1.0
8  11.563953    1.0
9   4.601298    2.1

CodePudding user response：

Your data suggests continuous intervals, so just use a fall-through lookup, based on what each row ends on: 5.2 is greater than the end values for each row until we get to the row that ends on 5.51, so we return that row's second column value. We also observe that >10 just means greater than the end value of the last interval, so we don't need that row "as a row", we just need to use its value if an input didn't match any of the real rows.

So: first you remove the last row and capture its return value, then you turn the left column "range" values into just the end values, and then you just run through the list to match an input to an output:

# Remove the last row and capture its value
default_value = your_data.pop()[1]

# Get a mapped list of end values given your data
end_values = [float(r.split("-")[1]) for r in your_data]
row_indices = range(0, len(end_values))

def find_mapped_value(input):
  for i in row_indices:
    if input > table[i]:
      return your_data[i][1]
  return default_value

And note that this will behave in an "undefined" manner for values between your intervals, because your table doesn't specify what should happen for things like 0.265 for example. If your data is only ever rounded to 2 decimal places then that's not an issue, but then you'll have to make sure to catch violations of that assumption either earlier in the code, or as test at the start of this function.