Source Data is as below:
Name : ValueA
Age: 23
Height: 178cm
Friend : A
Friend : B
Name : ValueB
Age: 22
Height: 168cm
Weight: 80Kg
Friend : A
Friend : C
Name : ValueC
Age: 40
Height: 188cm
IQ: 150
Friend : D
Required Output
Name | Age | Height | Weight | IQ | Friend1 | Friend2 |
---|---|---|---|---|---|---|
ValueA | 23 | 178cm | NA | NA | A | B |
ValueB | 22 | 168cm | 80Kg | NA | A | C |
ValueC | 40 | 188cm | NA | 150 | D | NA |
The code I am using is:
import pandas as pd

with open("test.txt", "r") as f:
    t = [line.strip() for line in f.readlines()]
# Organise data
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    else:
        stuff[current_key][key] = value
# create dataframe
result_df = pd.DataFrame(stuff).T
The issue is caused by duplicate keys such as "Friend": only the last value is kept. I am getting the output below:
Name | Age | Height | Weight | IQ | Friend |
---|---|---|---|---|---|
ValueA | 23 | 178cm | NA | NA | A |
ValueB | 22 | 168cm | 80Kg | NA | C |
ValueC | 40 | 188cm | NA | 150 | D |
Please note there are other duplicate key values as well. Name is the primary key. The rest of them should be able to handle duplicate key values.
CodePudding user response:
To generalize my previous answer to any number of friends, we could store all the values of "Friend" in a list, and assign the elements of that list to their own columns right before creating the dataframe.
Add this to your for loop to store "Friend"(s) in a list:
elif "Friend" in key:
    if "Friend" not in stuff[current_key]:
        stuff[current_key]["Friend"] = list()
    stuff[current_key]["Friend"].append(value)
After your for loop, to create a new key for each "Friend" element:
for key in stuff:
    for i, friend in enumerate(stuff[key]["Friend"], start=1):
        stuff[key][f"Friend{i}"] = friend
    del stuff[key]["Friend"]
Resulting code:
# Organise data
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    elif "Friend" in key:
        # Collect every "Friend" value of the current record in a list
        if "Friend" not in stuff[current_key]:
            stuff[current_key]["Friend"] = list()
        stuff[current_key]["Friend"].append(value)
    else:
        stuff[current_key][key] = value
# Expand each "Friend" list into Friend1, Friend2, ... keys
for key in stuff:
    for i, friend in enumerate(stuff[key]["Friend"], start=1):
        stuff[key][f"Friend{i}"] = friend
    del stuff[key]["Friend"]
# create dataframe
result_df = pd.DataFrame(stuff).T
print(result_df)
Part 2: Generalization to any key
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": [value]}  # Always use a list as value
        current_key = index
        index += 1
    elif key not in stuff[current_key]:  # If key does not exist
        stuff[current_key][key] = [value]  # Create key with value in a list
    else:
        stuff[current_key][key].append(value)  # If key exists, append the value
for key in stuff:
    # Iterate over a copy of the keys, since new keys are added in the loop
    for subkey in list(stuff[key].keys()):
        for i, element in enumerate(stuff[key][subkey], start=1):
            stuff[key][f"{subkey}{i}"] = element
        del stuff[key][subkey]
# create dataframe
result_df = pd.DataFrame(stuff).T
print(result_df)
Output:
    Name1 Age1 Height1 Friend 1 Friend 2 Weight1  IQ1
0  ValueA   23   178cm        A        B     NaN  NaN
1  ValueB   22   168cm        A        C    80Kg  NaN
2  ValueC   40   188cm        D      NaN     NaN  150
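The output above numbers every column (Name1, Age1) and keeps the trailing space of "Friend " because the line is split on ": ". Here is a minimal sketch of one possible variation, assuming you would rather strip whitespace around keys and number only the keys that repeat somewhere in the data. The helpers `parse_records` and `flatten_records` are hypothetical names introduced for illustration, not part of the answer above.

```python
def parse_records(lines):
    """Group 'Key : Value' lines into one dict of lists per 'Name' entry."""
    records = []
    for line in lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Name":
            records.append({})  # every Name line starts a new record
        if not records:
            continue  # ignore anything that appears before the first Name
        records[-1].setdefault(key, []).append(value)
    return records


def flatten_records(records):
    """Number a key's columns only if at least one record repeats that key."""
    repeated = {
        key for rec in records for key, vals in rec.items() if len(vals) > 1
    }
    rows = []
    for rec in records:
        row = {}
        for key, vals in rec.items():
            if key in repeated:
                for i, val in enumerate(vals, start=1):
                    row[f"{key}{i}"] = val
            else:
                row[key] = vals[0]
        rows.append(row)
    return rows


# Shortened sample in the same shape as the question's data
sample = [
    "Name : ValueA", "Age: 23", "Friend : A", "Friend : B",
    "Name : ValueC", "Age: 40", "IQ: 150", "Friend : D",
]
rows = flatten_records(parse_records(sample))
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per name, with NaN wherever a record lacks a column.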
Old answer (not scalable)
You could add this elif inside your loop. The idea is to identify the "Friend" key but store two possible keys ("Friend1" and "Friend2"). We only use "Friend2" if "Friend1" already exists:
elif "Friend" in key:
    if "Friend1" not in stuff[current_key]:
        stuff[current_key]["Friend1"] = value
    elif "Friend2" not in stuff[current_key]:
        stuff[current_key]["Friend2"] = value
Notice that, if a row only has "Friend1", pandas should fill "Friend2" with a NaN value automatically.
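A small sketch of that NaN-filling behaviour, using a hand-written `stuff` dict (made-up for illustration, in the same shape the loop builds) where the second record has no "Friend2" key:

```python
import pandas as pd

# Hand-written stand-in for the dict the parsing loop would produce
stuff = {
    0: {"Name": "ValueA", "Friend1": "A", "Friend2": "B"},
    1: {"Name": "ValueC", "Friend1": "D"},  # no "Friend2" here
}

# Outer keys become columns, so transpose to get one row per record
result_df = pd.DataFrame(stuff).T

# The missing cell is filled with NaN rather than raising an error
print(result_df.loc[1, "Friend2"])
```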
CodePudding user response:
To handle duplicate keys in Python, use a data structure that permits multiple values, like a dict of lists. For example:
# Organise data
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    else:
        stuff[current_key].setdefault(key, []).append(value)
In this code, setdefault sets the value for key to an empty list (if not already set), and then you immediately append to it. This means that all keys (except Name) will have a list of values instead of a single value.
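To make the resulting structure concrete, here is a short run on a few sample lines. As an assumption that differs slightly from the snippet above, this sketch splits on ":" and strips both parts, so the key comes out as "Friend" rather than "Friend " with a trailing space.

```python
lines = ["Name : ValueA", "Age: 23", "Friend : A", "Friend : B"]

stuff = {}
index = 0
for line in lines:
    # Split once on ":" and strip whitespace from key and value
    key, value = (part.strip() for part in line.split(":", 1))
    if key == "Name":
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    else:
        # setdefault returns the existing list, or inserts and returns a new one
        stuff[current_key].setdefault(key, []).append(value)

print(stuff)
# {0: {'Name': 'ValueA', 'Age': ['23'], 'Friend': ['A', 'B']}}
```

Note that single-valued keys such as Age also end up as one-element lists, so the lists still need to be expanded or unwrapped before building a dataframe.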
Another approach, which I prefer, is not to dictify things. You have a schema, so you can have a dataclass.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    height: str
    weight: str
    friends: list
    iq: int

people = []
cur_person = None
for line in t:
    # Split on the first ":" only and strip whitespace around key and value
    key, value = [i.strip() for i in line.split(":", 1)]
    if key == 'Name':
        # A Name line starts a new person; unknown fields stay None for now
        cur_person = Person(value, age=None, height=None, weight=None,
                            friends=[], iq=None)
        people.append(cur_person)
    elif key in {'Age', 'IQ'}:
        setattr(cur_person, key.lower(), int(value))
    elif key == 'Friend':
        cur_person.friends.append(value)
    else:
        setattr(cur_person, key.lower(), value)
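The dataclass approach stops at a list of Person objects, so one more step is needed to get the Friend1, Friend2 columns the question asks for. A minimal sketch of that step: `to_row` is a hypothetical helper, and this Person variant with default values is an assumption made to keep the example self-contained, not the class from the answer.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Person:
    # Same fields as above, but with defaults (an assumption for brevity)
    name: str
    age: Optional[int] = None
    height: Optional[str] = None
    weight: Optional[str] = None
    iq: Optional[int] = None
    friends: List[str] = field(default_factory=list)


def to_row(person):
    """Turn a Person into a flat dict with Friend1, Friend2, ... keys."""
    row = asdict(person)
    friends = row.pop("friends")
    for i, friend in enumerate(friends, start=1):
        row[f"Friend{i}"] = friend
    return row


people = [
    Person("ValueA", age=23, height="178cm", friends=["A", "B"]),
    Person("ValueC", age=40, height="188cm", iq=150, friends=["D"]),
]
rows = [to_row(p) for p in people]
```

The list of dicts in `rows` can be handed to `pd.DataFrame(rows)`; a person with fewer friends simply has fewer FriendN keys, which pandas should fill with NaN.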
CodePudding user response:
You can turn the value of a key into a list when the key repeats; if a key has no value, you can set it to None.
stuff = {}
index = 0
for line in t:
    key, value = line.split(": ")
    if "Name" in key:
        stuff[index] = {"Name": value}
        current_key = index
        index += 1
    else:
        if key in list(stuff[current_key].keys()):  # Check existing value
            new_value = []
            if type(stuff[current_key][key]) is list:
                for x in stuff[current_key][key]:
                    new_value.append(x)
            else:
                new_value.append(stuff[current_key][key])
            new_value.append(value)
            stuff[current_key][key] = new_value
        else:  # No existing value
            stuff[current_key][key] = value
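The nested branch above can be folded into a small helper. This is only a sketch of the same idea (keep a scalar on the first occurrence, switch to a list from the second one on); `add_value` is a hypothetical name introduced here.

```python
def add_value(record, key, value):
    """Store value under key; promote to a list once the key repeats."""
    if key not in record:
        record[key] = value                  # first occurrence: plain scalar
    elif isinstance(record[key], list):
        record[key].append(value)            # third or later occurrence
    else:
        record[key] = [record[key], value]   # second occurrence: make a list


record = {"Name": "ValueA"}
pairs = [("Age", "23"), ("Friend", "A"), ("Friend", "B"), ("Friend", "C")]
for key, value in pairs:
    add_value(record, key, value)

print(record)
# {'Name': 'ValueA', 'Age': '23', 'Friend': ['A', 'B', 'C']}
```

With this approach a dataframe cell holds either a scalar or a list, so repeated keys still need to be expanded into numbered columns if one value per column is required.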