How to associate repeated strings with values from a dictionary in a dataframe?

Time:12-03

I'm trying to associate each value in a list of numbers with its corresponding string in a dataframe. Here's the problem:

import pandas as pd
categories = {"key1":["string1", "string2", "string3"], "key2": ["string1", "str1", "str2"]}
strings= ["string1", "string2", "string3", "string1", "str1", "str2"]
numbers = [1,2,3,4,5,6]

array = []
expected_fields = []

# Build the two index levels: the first level holds the keys of categories,
# the second level holds the strings listed under each key
for key, value in categories.items():
    array.extend([key]* len(value))
    expected_fields.extend(value)
    
arrays = [array ,expected_fields]

#Creation of the dataframe
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df =  pd.Series(dtype='float', index=index)

for key, values in categories.items():
    for value in values:
        for i in range(len(strings)):
            if strings[i] == value:
                df[key, value] = numbers[i] 
print(df)

Output:

key1  string1    4.0   <--------- 
      string2    2.0
      string3    3.0
key2  string1    4.0
      str1       5.0
      str2       6.0

Expected output:

key1  string1    1.0   <---------
      string2    2.0
      string3    3.0
key2  string1    4.0
      str1       5.0
      str2       6.0

The assignment always ends up with the last matching element, because "string1" appears twice in strings and the loop overwrites the value on every match. What I want instead is for the first repeated string to get the first matching element of numbers, and the second repeated string to get the next one.
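To make the overwrite visible, here is a minimal sketch that isolates what the inner loop does for "string1". The names matches and assigned are only illustrative and do not appear in the code above.

```python
strings = ["string1", "string2", "string3", "string1", "str1", "str2"]
numbers = [1, 2, 3, 4, 5, 6]

# Every position where "string1" occurs in strings.
matches = [i for i, s in enumerate(strings) if s == "string1"]
print(matches)  # [0, 3]

# Without a break, each match overwrites the previous assignment,
# so only the number at the last matching position survives.
assigned = None
for i in matches:
    assigned = numbers[i]
print(assigned)  # 4
```

Because the same scan runs for both ("key1", "string1") and ("key2", "string1"), both entries end up with 4.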

I could count the number of strings under each key in categories, keep a running offset in the loop over strings, and add an if inside that loop that checks the lower and upper bounds of each key's range. However, I can't use this approach due to technical limitations.
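For reference, the offset-based idea described above might look like the sketch below. It assumes that strings lists its entries in the same order as the values of categories laid end to end, and it collects the result in a plain dict (the names result, offset and window are hypothetical) rather than in the Series.

```python
categories = {"key1": ["string1", "string2", "string3"], "key2": ["string1", "str1", "str2"]}
strings = ["string1", "string2", "string3", "string1", "str1", "str2"]
numbers = [1, 2, 3, 4, 5, 6]

result = {}
offset = 0  # start of the current key's range within strings/numbers
for key, values in categories.items():
    # Only this key's slice of strings is searched, so a repeated
    # string that belongs to another key is never matched.
    window = strings[offset:offset + len(values)]
    for value in values:
        i = offset + window.index(value)
        result[key, value] = numbers[i]
    offset += len(values)

print(result)
```

Here ("key1", "string1") maps to 1 and ("key2", "string1") maps to 4, which matches the expected output.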

CodePudding user response:

import pandas as pd
categories = {"key1":["string1", "string2", "string3"], "key2": ["string1", "str1", "str2"]}
strings= ["string1", "string2", "string3", "string1", "str1", "str2"]
numbers = [1,2,3,4,5,6]

array = []
expected_fields = []

# Build the two index levels: the first level holds the keys of categories,
# the second level holds the strings listed under each key
for key, value in categories.items():
    array.extend([key]* len(value))
    expected_fields.extend(value)
    
arrays = [array ,expected_fields]

#Creation of the dataframe
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df =  pd.Series(dtype='float', index=index)

strings_copy = strings.copy()
for key, values in categories.items():
    for value in values:
        for i in range(len(strings_copy)):
            if strings_copy[i] == value:
                strings_copy[i] = None
                df[key, value] = numbers[i]
                break
print(df)

Output:

key1  string1    1.0
      string2    2.0
      string3    3.0
key2  string1    4.0
      str1       5.0
      str2       6.0
dtype: float64
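The answer above works because each matched entry of strings_copy is set to None and the loop breaks, so a repeated string can only pick up the next unused position. The same "consume each occurrence once" idea can be sketched without rescanning the list, assuming it is acceptable to precompute a queue of positions per string. The names positions and result are hypothetical, and the result is kept in a plain dict.

```python
from collections import defaultdict, deque

categories = {"key1": ["string1", "string2", "string3"], "key2": ["string1", "str1", "str2"]}
strings = ["string1", "string2", "string3", "string1", "str1", "str2"]
numbers = [1, 2, 3, 4, 5, 6]

# Map each string to a queue of the positions where it occurs, in order.
positions = defaultdict(deque)
for i, s in enumerate(strings):
    positions[s].append(i)

result = {}
for key, values in categories.items():
    for value in values:
        # popleft() takes the earliest occurrence that has not been used yet.
        i = positions[value].popleft()
        result[key, value] = numbers[i]

print(result)
```

Note that popleft() raises IndexError if categories asks for a string more often than it occurs in strings. If a pandas object is needed, pd.Series(result) should build the two-level index from the tuple keys, although the values would then be integers unless a float dtype is requested.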

CodePudding user response:

Do you need a solution with pandas? How about this solution:

from collections import OrderedDict

categories = OrderedDict([("key1", ["string1", "string2", "string3"]), ("key2", ["string1", "str1", "str2"])])

def category_strings(ordered_dict):
    current_id = 1
    for key, strings in ordered_dict.items():
        for string in strings:
            yield current_id, key, string
            current_id += 1
    
for id, key, string in category_strings(categories):
    print(id, key, string)

Output:

1 key1 string1
2 key1 string2
3 key1 string3
4 key2 string1
5 key2 str1
6 key2 str2
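The generator only numbers the (key, string) pairs, so to answer the original question the positions still have to be tied back to numbers. A minimal sketch of that step follows, assuming numbers is ordered exactly like the flattened values of categories. It uses a plain dict, which preserves insertion order in Python 3.7+, and the name result is hypothetical.

```python
categories = {"key1": ["string1", "string2", "string3"], "key2": ["string1", "str1", "str2"]}
numbers = [1, 2, 3, 4, 5, 6]

def category_strings(categories):
    """Yield (position, key, string), with positions counted from 1."""
    position = 1
    for key, strings in categories.items():
        for string in strings:
            yield position, key, string
            position += 1

# Positions start at 1, so subtract 1 to index into numbers.
result = {
    (key, string): numbers[position - 1]
    for position, key, string in category_strings(categories)
}

print(result)
```

This never compares strings at all, so repeated strings cause no ambiguity, but it gives wrong results if numbers is in any other order.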