Solved: Thanks to the help of @mozway I solved my problem. Basically I merged the DataFrames, grouped them, joined the matching codes into one column, and dropped the duplicates.
df2 = df2.merge(df1, how='left', left_on='code2_firstTwoDigits', right_on=df1['code'] // 100)
df2['code'] = df2.groupby('code2_firstTwoDigits')['code'].transform(lambda x: ', '.join(x.dropna().astype(int).astype(str)))
df2 = df2.drop_duplicates()
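For completeness, a self-contained sketch of that fix using the sample data from the question below. The explicit `prefix` helper column, the `dropna()`/`astype` handling of unmatched rows, and the final column cleanup are my additions, not part of the original solution:

```python
import pandas as pd

df1 = pd.DataFrame({
    "code": [1234, 2345, 2342, 2344, 2345, 3456, 3422, 3666, 4567, 4511, 4721],
    "random_values": ["house", "car", "white", "wolf", "black",
                      "house", "car", "white", "wolf", "black", "red"],
    "code_Values": [3.21, 5.32, 1.00, 2.23, 5.00, 5.99, 3.21, 5.32, 1.00, 2.23, 5.00],
})
df2 = pd.DataFrame({
    "code2": [1264, 2397, 3498, 4588, 6388],
    "random_values": ["house", "car", "white", "wolf", "black"],
    "code2_firstTwoDigits": [12, 23, 34, 45, 63],
})

# merge key: the first two digits of each 4-digit code in df1
df1["prefix"] = df1["code"] // 100

# one row per (df2 row, matching df1 row); unmatched df2 rows keep NaNs
merged = df2.merge(df1, how="left",
                   left_on="code2_firstTwoDigits", right_on="prefix",
                   suffixes=("", "_1"))

# collapse each group's matches into the two requested columns
# (the unmatched group 63 ends up with "" and 0.0)
grp = merged.groupby("code2_firstTwoDigits")
merged["SameCodeStructure"] = grp["code"].transform(
    lambda x: ", ".join(x.dropna().astype(int).astype(str)))
merged["SameCodeStructure_SumOfValues"] = grp["code_Values"].transform("sum")

# drop the per-match columns, then the duplicate rows left by the 1:n merge
result = (merged.drop(columns=["random_values_1", "code", "code_Values", "prefix"])
          .drop_duplicates()
          .reset_index(drop=True))
```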
I want to execute an operation row-wise, but as a Python newbie I've read that I shouldn't use loops in Python.
I have two DataFrames which basically look like this:
df1 =
code random_values code_Values
0 1234 house 3.21
1 2345 car 5.32
2 2342 white 1.00
3 2344 wolf 2.23
4 2345 black 5.00
5 3456 house 5.99
6 3422 car 3.21
7 3666 white 5.32
8 4567 wolf 1.00
9 4511 black 2.23
10 4721 red 5.00
df2 =
code2 random_values code2_firstTwoDigits
0 1264 house 12
1 2397 car 23
2 3498 white 34
3 4588 wolf 45
4 6388 black 63
I'm probably trying to do too many operations at once, and I have a feeling my brain hasn't adjusted to Python yet, coming from C#. What I'm trying to do is:
- Create columns "SameCodeStructure" and "SameCodeStructure_SumOfValues" in df2
- Iterate through all rows of df2 and find the values in df1["code"] whose first two digits match the value of "code2_firstTwoDigits"
- Save those matching codes from df1["code"] in df2["SameCodeStructure"]
- Add up the "code_Values" of those matching rows into df2["SameCodeStructure_SumOfValues"]
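The steps above can be made concrete with a small sketch. Note that `apply` is still a loop under the hood, so this is only to pin down the logic, not a vectorized solution; all column names are taken from the question:

```python
import pandas as pd

df1 = pd.DataFrame({
    "code": [1234, 2345, 2342, 2344, 2345, 3456, 3422, 3666, 4567, 4511, 4721],
    "code_Values": [3.21, 5.32, 1.00, 2.23, 5.00, 5.99, 3.21, 5.32, 1.00, 2.23, 5.00],
})
df2 = pd.DataFrame({
    "code2": [1264, 2397, 3498, 4588, 6388],
    "code2_firstTwoDigits": [12, 23, 34, 45, 63],
})

prefix = df1["code"] // 100  # first two digits of each 4-digit code

def lookup(two_digits):
    # all df1 rows whose code starts with the given two digits
    matches = df1.loc[prefix == two_digits]
    return pd.Series({
        "SameCodeStructure": ", ".join(matches["code"].astype(str)),
        "SameCodeStructure_SumOfValues": matches["code_Values"].sum(),
    })

# one lookup per df2 row; apply(lookup) returns a two-column DataFrame
df2[["SameCodeStructure", "SameCodeStructure_SumOfValues"]] = (
    df2["code2_firstTwoDigits"].apply(lookup))
```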
It's hard to explain without getting confused again, so here is a visual representation of how I would like the result to look:
df3 =
code2 random_values code2_firstTwoDigits SameCodeStructure SameCodeStructure_SumOfValues
0 1264 house 12 1234 3.21
1 2397 car 23 2345, 2342, 2344, 2345 13.55
2 3498 white 34 3456, 3422 9.2
3 4588 wolf 45 4567, 4511 3.23
4 6388 black 63 - -
I managed to look up the matching codes for a single row (row 3), but I'm struggling to turn this into an overall solution for all rows of df2:
df1["code"][df1["code"].astype(str).str.match(str(df2.iloc[3]["code2_firstTwoDigits"]))].astype(str).str.cat(sep=", ")
I'm thankful for any advice. Python seems easier in terms of the amount of code written, but it's really difficult to learn to think "in less code".
CodePudding user response:
The exact logic of the output is unclear, but you want merge and groupby.agg:
(df1.merge(df2, left_on=df1['code']//100, right_on='code2_firstTwoDigits', suffixes=('', '_2'))
.groupby('code2_firstTwoDigits')
.agg({'code_Values': 'sum',
'code': list,
'random_values': 'first'})
)
output:
code_Values code random_values
code2_firstTwoDigits
12 3.21 [1234] house
23 13.55 [2345, 2342, 2344, 2345] car
34 9.20 [3456, 3422] house
45 3.23 [4567, 4511] wolf
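For reference, the pipeline above is reproducible from the sample data in the question. The default inner merge is why the unmatched "63" row of df2 disappears from the output; the DataFrame construction below is mine:

```python
import pandas as pd

df1 = pd.DataFrame({
    "code": [1234, 2345, 2342, 2344, 2345, 3456, 3422, 3666, 4567, 4511, 4721],
    "random_values": ["house", "car", "white", "wolf", "black",
                      "house", "car", "white", "wolf", "black", "red"],
    "code_Values": [3.21, 5.32, 1.00, 2.23, 5.00, 5.99, 3.21, 5.32, 1.00, 2.23, 5.00],
})
df2 = pd.DataFrame({
    "code2": [1264, 2397, 3498, 4588, 6388],
    "random_values": ["house", "car", "white", "wolf", "black"],
    "code2_firstTwoDigits": [12, 23, 34, 45, 63],
})

# df1['code'] // 100 is the two-digit prefix used as the left merge key;
# suffixes keep df1's random_values unsuffixed, df2's as random_values_2
out = (df1.merge(df2, left_on=df1["code"] // 100,
                 right_on="code2_firstTwoDigits", suffixes=("", "_2"))
       .groupby("code2_firstTwoDigits")
       .agg({"code_Values": "sum", "code": list, "random_values": "first"}))
```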