How to manage duplicate keys while reading a text file in Python

Time:03-03

The source data is as follows:

Name : ValueA
Age: 23
Height: 178cm
Friend : A
Friend : B
Name : ValueB
Age: 22
Height: 168cm
Weight: 80Kg
Friend : A
Friend : C
Name : ValueC
Age: 40
Height: 188cm
IQ: 150
Friend : D

Required Output

Name Age Height Weight IQ Friend1 Friend2
ValueA 23 178cm NA NA A B
ValueB 22 168cm 80kg NA A C
ValueC 40 188cm NA 150 D NA

The code I am using is

import pandas as pd

with open("test.txt", "r") as f:
    t = [line.strip() for line in f.readlines()]

# Organise data
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    else:
        stuff[current_key][key] = value

# create dataframe
result_df = pd.DataFrame(stuff).T

The issue is caused by duplicate keys: for "Friend", only the last value is kept. I am getting the output below:

Name Age Height Weight IQ Friend
ValueA 23 178cm NA NA A
ValueB 22 168cm 80kg NA C
ValueC 40 188cm NA 150 D

Please note that other keys can be duplicated as well. Name is the primary key; every other key should be able to hold duplicate values.

CodePudding user response:

To generalize my previous answer to any number of friends, we could store all the values of "Friend" in a list, and assign each element of that list to its own column right before creating the dataframe:

  1. Add this to your for loop to store "Friend"(s) in a list:

    elif "Friend" in key:
        if "Friend" not in stuff[current_key]:
            stuff[current_key]["Friend"] = list()
        stuff[current_key]["Friend"].append(value)
    
  2. After your for loop, create a new key for each "Friend" element:

    for key in stuff:
        for i, friend in enumerate(stuff[key]["Friend"], start=1):
            stuff[key][f"Friend{i}"] = friend
        del stuff[key]["Friend"]
    

Resulting code:

# Organise data
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    elif "Friend" in key:
        if "Friend" not in stuff[current_key]:
            stuff[current_key]["Friend"] = list()
        stuff[current_key]["Friend"].append(value)
    else:
        stuff[current_key][key] = value

for key in stuff:
    for i, friend in enumerate(stuff[key]["Friend"], start=1):
        stuff[key][f"Friend{i}"] = friend
    del stuff[key]["Friend"]

# create dataframe
result_df = pd.DataFrame(stuff).T
print(result_df)

Part 2: Generalization to any key

stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": [value]}  # Always use a list as value
        current_key = index
        index += 1
    elif key not in stuff[current_key]:  # If key does not exist
        stuff[current_key][key] = [value]  # Create key with value in a list
    else:
        stuff[current_key][key].append(value)  # If key exists, append the value

for key in stuff:
    for subkey in list(stuff[key].keys()):
        for i, element in enumerate(stuff[key][subkey], start=1):
            stuff[key][f"{subkey}{i}"] = element
        del stuff[key][subkey]

# create dataframe
result_df = pd.DataFrame(stuff).T
print(result_df)

Output:

    Name1 Age1 Height1 Friend 1 Friend 2 Weight1  IQ1
0  ValueA   23   178cm        A        B     NaN  NaN
1  ValueB   22   168cm        A        C    80Kg  NaN
2  ValueC   40   188cm        D      NaN     NaN  150
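The generalized loop numbers every column (Name1, Age1, ...) and keeps the space before the colon as part of the key, which is why the output shows "Friend 1". A minimal sketch of one possible variation follows, assuming the file always starts with a Name line and that only keys repeated in at least one record should get a numeric suffix. The names SAMPLE, parse_records and flatten are made up for this illustration.

```python
import io

# Inline copy of the question's data so the sketch runs without test.txt.
SAMPLE = """\
Name : ValueA
Age: 23
Height: 178cm
Friend : A
Friend : B
Name : ValueB
Age: 22
Height: 168cm
Weight: 80Kg
Friend : A
Friend : C
Name : ValueC
Age: 40
Height: 188cm
IQ: 150
Friend : D
"""


def parse_records(lines):
    """Group 'key: value' lines into one dict of lists per Name block."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only and strip both sides,
        # so "Friend : A" and "Age: 23" are handled the same way.
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Name":
            records.append({})  # a Name line starts a new record
        records[-1].setdefault(key, []).append(value)
    return records


def flatten(records):
    """Add a numeric suffix only to keys that repeat in some record."""
    repeated = {k for rec in records for k, vals in rec.items() if len(vals) > 1}
    rows = []
    for rec in records:
        row = {}
        for key, vals in rec.items():
            if key in repeated:
                for i, val in enumerate(vals, start=1):
                    row[f"{key}{i}"] = val
            else:
                row[key] = vals[0]
        rows.append(row)
    return rows


rows = flatten(parse_records(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) builds one column per key and fills the missing cells with NaN.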

Old answer (not scalable)

You could add this elif inside your loop. The idea is to identify the "Friend" key but store two possible keys ("Friend1" and "Friend2"). We only use "Friend2" if "Friend1" already exists:

elif "Friend" in key:
    if "Friend1" not in stuff[current_key]:
        stuff[current_key]["Friend1"] = value
    elif "Friend2" not in stuff[current_key]:
        stuff[current_key]["Friend2"] = value

Notice that, if a row only has "Friend1", pandas should fill "Friend2" with a NaN value automatically.

CodePudding user response:

To handle duplicate keys in Python, use a data structure that permits multiple values, like a dict of lists. For example:

# Organise data
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    else:
        stuff[current_key].setdefault(key, []).append(value)

In this code, setdefault sets the value for key to an empty list (if not already set), and then you immediately append to it. This means that all keys (except Name) will have a list of values instead of a single value.
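To see the setdefault pattern in isolation, here is a small sketch on a hand-written list of pairs (the pairs list is invented for the example). It also shows collections.defaultdict(list), a standard-library alternative that gives the same grouping.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs standing in for one parsed record.
pairs = [("Age", "23"), ("Friend", "A"), ("Friend", "B")]

# setdefault inserts an empty list the first time a key is seen,
# then returns the stored list so we can append to it.
with_setdefault = {}
for key, value in pairs:
    with_setdefault.setdefault(key, []).append(value)

# defaultdict(list) creates the empty list automatically on first access.
with_defaultdict = defaultdict(list)
for key, value in pairs:
    with_defaultdict[key].append(value)

print(with_setdefault)
print(dict(with_defaultdict))
```

Both loops group the two "Friend" values under one key instead of overwriting the first with the second.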

Another approach, which I prefer, is not to dictify things. You have a schema, so you can have a dataclass.

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    height: str
    weight: str
    friends: list
    iq: int

people = []
cur_person = None
for line in t:
    key, value = [i.strip() for i in line.split(":", 1)]
    if key == 'Name':
        cur_person = Person(value, age=None, height=None, weight=None,
                            friends=[], iq=None)
        people.append(cur_person)
    elif key in {'Age', 'IQ'}:
        setattr(cur_person, key.lower(), int(value))
    elif key == 'Friend':
        cur_person.friends.append(value)
    else:
        setattr(cur_person, key.lower(), value)
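If the dataclass route is taken, the objects still have to be turned into rows to get the Friend1/Friend2 layout from the question. A rough sketch of that step, assuming a slightly modified Person with default values and a hypothetical helper to_row (neither is part of the answer above):

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # Variation on the answer's class: optional fields default to None.
    name: str
    age: Optional[int] = None
    height: Optional[str] = None
    weight: Optional[str] = None
    iq: Optional[int] = None
    friends: List[str] = field(default_factory=list)


def to_row(person, max_friends):
    """Turn a Person into a flat dict with friend1..friendN columns."""
    row = asdict(person)
    friends = row.pop("friends")
    for i in range(max_friends):
        # Pad with None when this person has fewer friends than the maximum.
        row[f"friend{i + 1}"] = friends[i] if i < len(friends) else None
    return row


people = [
    Person("ValueA", age=23, height="178cm", friends=["A", "B"]),
    Person("ValueC", age=40, height="188cm", iq=150, friends=["D"]),
]
max_friends = max(len(p.friends) for p in people)
rows = [to_row(p, max_friends) for p in people]
for row in rows:
    print(row)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) in the same way as the dict-based answers.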

CodePudding user response:

You can turn the value of a key into a list when the key repeats. If a key has no value, you can set it to None.

stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    else:
        if key in stuff[current_key]:  # Check existing value
            new_value = []
            if type(stuff[current_key][key]) is list:
                for x in stuff[current_key][key]:
                    new_value.append(x)
            else:
                new_value.append(stuff[current_key][key])
            new_value.append(value)
            stuff[current_key][key] = new_value
        else: #No existing value
            stuff[current_key][key] = value
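With this approach a value is either a plain string or a list, so it has to be normalized before each element can get its own column. A minimal sketch, using the made-up helpers as_list and expand and a hand-written record:

```python
def as_list(value):
    """Wrap a scalar in a list; leave lists unchanged."""
    return value if isinstance(value, list) else [value]


def expand(record):
    """Give each element of a list value its own numbered key."""
    row = {}
    for key, value in record.items():
        values = as_list(value)
        if len(values) == 1:
            row[key] = values[0]  # single value: keep the plain key
        else:
            for i, val in enumerate(values, start=1):
                row[f"{key}{i}"] = val
    return row


# Hypothetical record in the mixed scalar/list shape built by the loop above.
record = {"Name": "ValueA", "Age": "23", "Friend": ["A", "B"]}
print(expand(record))
```

One caveat of this sketch: a record with a single friend keeps the key "Friend" rather than "Friend1", so the column names would need to be aligned across records before building the dataframe.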