I have a DataFrame that is the result of a large SQL query. I am trying to sort the DataFrame into 2 separate DataFrames. NVI and Main. They are both a list of repairs to trucks. I need to sort it based on if there is a specific profile id which is 7055. Which will go into the NVI DataFrame
If that job is encountered I need to grab the values from the "RO" "Unit Number" and Repair Date column. I then need to take those values and search the DataFrame again and grab any rows that have a matching RO and Unit number or a matching Unit number and a Repair date that is equal to or earlier than the date value in the the row that the 7055 was found. Those rows then need to go into the NVI df. Any remaining rows that do not match will go into the Main df.
The only static value is the profile id of 7055. The RO Unit Number and Repair date will all be different.
class nvi_dict(dict):
def __setitem__(self, key, value):
key = key.profile()
super().__setitem__(key, value)
nvisort = pd.DataFrame()
def sort_nvi_dict(row, component):
if row ['PROFILE_ID'] in cfg[component]['nvi']:
nvi_ro = nvi_dict()
nvi_ro ['RO'] = row ['RO']
nvi_ro ['UnitNum'] = row ['VFUNIT']
nvi_ro ['date']= row['REPAIR_DATE']
nvisort = nvidf.apply(lambda x: sort_nvi_dict(x, 'nvi_ro'), axis=1, result_type='expand')
I thought about trying to use a class to create a temp dict object to store the values from RO, UnitNum and Date. Which I can then call on to iterate over the df again looking for matching values.
I am using a .yml file to store dictionaries. That I am using to further sort each of the NVI and Main df's after they have been sorted out. Because they will then need to each be sorted by truck manufacturer
CodePudding user response:
From what i understand of your question, you want to divide a dataframe into 2 based on certain conditions?
df1 = df[<condition>]
condition can be - df[profile id] == 7055 and Allunits.contains(df[unit])
CodePudding user response:
I think this might work, unable to test without the test data though...
df1 = nvisort[nvisort['profile_id'] = 7055]
df2 = pd.merge(nvisort,df1[['RO','Unit Number']],on=['RO','Unit number'],how='right')
df3 = pd.merge(nvisort,df1[['Unit Number','Repair Date']],on='Unit Number'],how='right')
df3 = df3[df3['Repair Date_x'] <= df3['Repair Date_y']]
df3 = df3.drop(columns='Repair Date_y']
df3 = df3.rename(columns={'Repair Date_x':'Repair Date'})
NVI = pd.concat([df1,df2,df3])
Main = pd.concat([NVI,nvisort]).drop_duplicates(keep=False)
I'm assuming that your original/starting dataframe here is the nvisort
, and then we filter that just to get profile_id of 7055 and call that df1
Then we are going to get your two different pieces of criteria into df2 and df3.
df2 is just a filter on the original dataframe where RO and Unit Number match, so we can use pd.merge()
to effectively get that filter.
df3 is a more complicated filter since it is the less than or equal, not the equal. So first we do the merge to filter on matching unit numbers, but we also bring over the Repair Date from both tables into df3, and these get appended _x
and _y
on the column names. So then we filter where the date on the _x
is less than on _y
and then clean it up.
Last, you get Main by finding everything from the original nvisort that is not in NVI. Since NVI is a subset of nvisort, you can just concat them and drop all duplicates, leaving only data that exists in one of the dataframes.