I have a pandas DataFrame that is generated from events. Each event has a unique ID and produces repeated rows in the DataFrame.
The problem is that some of these repeated rows contain random values that differ from each other.
I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value per event_id.
The Salary column also has a trailing underscore (for example 1975.42_) that needs to be removed.
Thanks in advance.
Input data:
print(df)
ID  event_id  Month  Name    Age  Occupation  Salary
1   1_a       Jan    andrew  23               13414.12
2   1_a       Feb            NaN  teacher     13414.12
3   1_a       Mar    ___                      13414.12
4   1_a       Apr    andrew  23   teacher     13414.12
5   1_a       May    andrew  24   principle   25000
6   1_b       Jan    Ash     45   scientist   1975.42_
7   1_b       Feb    #$%6         scientist   1975.42
8   1_b       Mar    Ash     45   ^#3a2g4     1975.42
9   1_b       Apr    Ash     45   scientist   1975.42
Desired output:
print(df)
ID  event_id  Month  Name    Age  Occupation  Salary
1   1_a       Jan    andrew  24   principle   25000
2   1_a       Feb    andrew  24   principle   25000
3   1_a       Mar    andrew  24   principle   25000
4   1_a       Apr    andrew  24   principle   25000
5   1_a       May    andrew  24   principle   25000
6   1_b       Jan    Ash     45   scientist   1975.42
7   1_b       Feb    Ash     45   scientist   1975.42
8   1_b       Mar    Ash     45   scientist   1975.42
9   1_b       Apr    Ash     45   scientist   1975.42
CodePudding user response:
First I had to recreate the DataFrame. Because I built it by splitting a raw string on whitespace, I could not represent the blank cells, so I filled them with placeholder values. With your actual DataFrame that shouldn't be a problem.
Ok, now the logic:
The code collects the unique event IDs, then loops over the columns for each event. collections.Counter gives a dictionary-like count of how often each value appears in that event's column, and the most frequent value is then written to every row of the event.
This only fails when junk outnumbers the good value. For example, if a column filtered by event has 30 different junk values and the good value appears twice, the good value still wins and becomes the replacement.
But if the good value appears only once among 30 different junk values, every count is tied, and the first value encountered (which may well be junk) becomes the replacement.
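To see how the counting behaves in these two cases, here is a minimal sketch. The helper name most_frequent and the sample values are made up for illustration. Note that max over a Counter returns the first key with the highest count, so a tie resolves to the first value in row order.

```python
from collections import Counter


def most_frequent(values):
    """Return the most common value; ties go to the first value seen."""
    counts = Counter(values)
    return max(counts, key=counts.get)


# The good value appears twice among distinct junk values, so it wins.
print(most_frequent(["#$%6", "Ash", "___", "Ash", "^#3"]))  # Ash

# Every value appears once, so the first one (junk here) is returned.
print(most_frequent(["#$%6", "Ash", "___"]))  # #$%6
```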
Here is the code:
import pandas as pd
import collections
data = """ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 - 13414.12
2 1_a Feb - NA teacher 13414.12
3 1_a Mar ___ - z 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 - scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42"""
# Skip the header line and split each remaining row on whitespace.
rows = [line.split() for line in data.split('\n')[1:]]
df = pd.DataFrame(rows, columns=['ID', 'event_id', 'Month', 'Name', 'Age', 'Occupation', 'Salary'])

columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in df['event_id'].unique():
    mask = df['event_id'] == event  # rows that belong to this event
    for column in columns:
        # Count how often each value occurs in this event's column.
        counter = collections.Counter(df.loc[mask, column])
        # Most frequent value; on a tie, the first one encountered wins.
        new_value = max(counter, key=counter.get)
        # Assign through .loc instead of chained indexing (df[column][i] = ...).
        df.loc[mask, column] = new_value
print(df)
Output:
  ID event_id Month    Name Age Occupation    Salary
0  1      1_a   Jan  andrew  23    teacher  13414.12
1  2      1_a   Feb  andrew  23    teacher  13414.12
2  3      1_a   Mar  andrew  23    teacher  13414.12
3  4      1_a   Apr  andrew  23    teacher  13414.12
4  5      1_a   May  andrew  23    teacher  13414.12
5  6      1_b   Jan     Ash  45  scientist   1975.42
6  7      1_b   Feb     Ash  45  scientist   1975.42
7  8      1_b   Mar     Ash  45  scientist   1975.42
8  9      1_b   Apr     Ash  45  scientist   1975.42
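As a possible alternative to the explicit loops, the same idea can be sketched with groupby and transform. This is only a sketch under a few assumptions: the small sample frame is made up, the helper most_frequent is a hypothetical name, and the Salary values are stored as strings. It also strips the trailing underscore from Salary first, so that 1975.42_ and 1975.42 are counted as the same value.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data.
df = pd.DataFrame({
    "event_id": ["1_a", "1_a", "1_a", "1_b", "1_b", "1_b"],
    "Name": ["andrew", "___", "andrew", "Ash", "#$%6", "Ash"],
    "Salary": ["13414.12", "13414.12", "25000", "1975.42_", "1975.42", "1975.42"],
})

# Remove trailing underscores so "1975.42_" and "1975.42" count as one value.
df["Salary"] = df["Salary"].str.rstrip("_")


def most_frequent(series):
    # Series.mode() returns the most common value(s) in sorted order; take the first.
    return series.mode().iloc[0]


columns = ["Name", "Salary"]
# transform broadcasts each group's most frequent value back to all of its rows.
df[columns] = df.groupby("event_id")[columns].transform(most_frequent)
print(df)
```

Two differences from the loop version are worth keeping in mind. On a tie, mode() yields the smallest value in sorted order rather than the first one encountered. And mode() ignores NaN by default, so a group whose column is entirely NaN would make .iloc[0] raise an IndexError and would need separate handling.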