I have a pandas dataframe that looks something like this:
| id | col1   | col2     | value1 | value2 | value3 |
|----|--------|----------|--------|--------|--------|
| 1  | 123456 | 1234ABC  | 1      | 2      | nan    |
| 1  | 123456 | 1234567  | 1      | 2      | nan    |
| 1  | 124567 | 1234568  | 1      | 2      | nan    |
| 1  | 124567 | 2345678  | nan    | 2      | nan    |
| 2  | 123456 | 1234564  | nan    | 2      | nan    |
| 2  | 123456 | 2132534  | nan    | 2      | nan    |
| 2  | 543210 | 10580701 | nan    | 2      | nan    |
I want to write a function that runs through the whole dataset and cleans it according to these conditions.
For every unique id, do the following:
- If col1 has a 6-digit code and col2 has a combination of numbers and letters, keep the row.
- If col1 has a 6-digit code and col2 has something other than a number-and-letter combination, keep only the first row with that same 6-digit code in col1.
So in this table example, after running the function, these rows would still be in the dataset:
| id | col1   | col2     | value1 | value2 | value3 |
|----|--------|----------|--------|--------|--------|
| 1  | 123456 | 1234ABC  | 1      | 2      | nan    |
| 1  | 123456 | 1234567  | 1      | 2      | nan    |
| 1  | 124567 | 2345678  | nan    | 2      | nan    |
| 2  | 123456 | 1234564  | nan    | 2      | nan    |
| 2  | 543210 | 10580701 | nan    | 2      | nan    |
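For anyone who wants to reproduce this, here is a minimal sketch of the sample data as a DataFrame (column names as in the table above; col1 and col2 are kept as strings so the letter check used later works and any leading zeros in the codes survive):

import pandas as pd
import numpy as np

# Sample data from the table above
df = pd.DataFrame({
    'id':     [1, 1, 1, 1, 2, 2, 2],
    'col1':   ['123456', '123456', '124567', '124567', '123456', '123456', '543210'],
    'col2':   ['1234ABC', '1234567', '1234568', '2345678', '1234564', '2132534', '10580701'],
    'value1': [1, 1, 1, np.nan, np.nan, np.nan, np.nan],
    'value2': [2, 2, 2, 2, 2, 2, 2],
    'value3': [np.nan] * 7,
})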
At first I tried something like this:
def process_df(df):
    # Sort the dataframe by column 1 and column 2
    df = df.sort_values(by=['col1', 'col2'])
    # Create a new column that indicates whether a row has a letter in column 2
    df['has_letter'] = df['col2'].str.contains('[a-zA-Z]')

    # Group the dataframe by column 1 and apply the following function to each group
    def group_func(group):
        # If there are any rows with a letter in column 2, keep all of them
        if group['has_letter'].any():
            return group
        # If there are no rows with a letter in column 2, keep only the first row
        else:
            return group.iloc[0:1]

    df = df.groupby('col1').apply(group_func)
    # Drop the has_letter column
    df = df.drop(columns=['has_letter'])
    df = df.reset_index(drop=True)
    return df
But it didn't work, because different ids can have the same 6-digit code in col1, so grouping by col1 alone mixes rows from different ids.
So somehow I have to apply this logic to each id separately for it to work.
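To make that concrete, the same col1 code shows up under more than one id in the sample data, so a plain groupby('col1') merges rows that belong to different ids (a quick check, assuming the df sketched above):

# col1 == '123456' appears under id 1 and id 2, so grouping by col1 alone
# would treat these four rows as a single group
print(df[df['col1'] == '123456'][['id', 'col1', 'col2']])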
EDIT:
I changed the line

df = df.groupby('col1').apply(group_func)

to

df = df.groupby(['id', 'col1']).apply(group_func)

This seems to do the job.
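One thing to watch with this version: depending on the pandas version, groupby(...).apply(group_func) may prepend the group keys (id, col1) to the result's index, which is why the reset_index(drop=True) at the end matters. If you prefer to keep the original row index instead, a sketch of the same call using the group_keys parameter of DataFrame.groupby:

# Keep the original row index instead of adding (id, col1) index levels
df = df.groupby(['id', 'col1'], group_keys=False).apply(group_func)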
CodePudding user response:
First I grouped by id, then extracted the rows that had letters in col2. Iterating over the id groups, I further grouped the remaining rows by col1 and kept the first occurrence of each col1 value. I hope this helps.
import pandas as pd

# Sort the dataframe by column 1 and column 2
df4 = df4.sort_values(by=['col1', 'col2'])

# Create a new column that indicates whether a row has a letter in column 2
df4['has_letter'] = df4['col2'].str.contains('[a-zA-Z]')
# str.contains returns NaN for missing values, so treat those as "no letter"
df4['has_letter'] = df4['has_letter'].fillna(False)

grouped_id = df4.groupby('id')
parts = []

# Iterate over the id groups
for name, group_id in grouped_id:
    # Keep every row of this id whose col2 contains a letter
    parts.append(group_id[group_id['has_letter']])
    # For the remaining rows, keep only the first row of each col1 value
    no_col2_letter = group_id[~group_id['has_letter']]
    grouped_col1 = no_col2_letter.groupby('col1')
    for name, group_col1 in grouped_col1:
        parts.append(group_col1[:1])

output_df = pd.concat(parts)
output_df
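Collecting the pieces in a list and concatenating once at the end is cheaper than growing the frame row-block by row-block, and it also works on pandas 2.x, where DataFrame.append is no longer available. As a quick sanity check (a sketch, assuming df4 was built from the sample table, e.g. df4 = df.copy() using the sketch earlier in the question), the helper column can be dropped before inspecting the result:

result = output_df.drop(columns='has_letter').sort_values(['id', 'col1']).reset_index(drop=True)
print(result)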