Merge 2 dataset python (Closed)-CodePudding

I have an issue merging 2 CSV into a new dataset. So far, I have merged them. Let's say I have csv1 and csv2. I created a 'df' from csv1 and I merged csv2 with a specific column. But here is my problem:

I need to create a new column in df (e.g. 'new_foo', because both CSV has the same column name, but different data, so it's not possible to merge the 2 foo columns) and relate the two columns from merged csv2 column X. This is because in csv1 the common column is X.

Current Input:

import pandas as pd

df = pd.read_csv(csv1)
csv2_df = pd.read_csv(csv2)
csv2_column_x=csv2_df['x']
new_foo=csv2_df['foo']
df['new_foo']
df = df.merge(csv2_column_x, how="outer").merge(new_foo, how="outer")

Current Output:

X	Y	foo	new_foo
1	'bla1'	'qwer'	12
1	'bla2'	'aswd'	34
1	'bla3'	'asdf'
2	'bla4'	'azer'

Desired Output:

X	Y	foo	new_foo
1	'bla1'	'qwer'	12
1	'bla2'	'aswd'	12
1	'bla3'	'asdf'	12
2	'bla4'	'azer'	34

This is my csv1:

X	Y	foo
1	'bla1'	'qwer'
1	'bla2'	'aswd'
1	'bla3'	'asdf'
2	'bla4'	'azer'

This is my csv2:

X	foo
1	12
2	34

NOTE: The CSV files have different numbers of rows.

CodePudding user response：

You can use Pandas join method the dataframes on column X.

Code:

import pandas as pd

csv1_df = pd.read_csv("csv1.csv")
csv2_df = pd.read_csv("csv2.csv")
merged = csv1_df.join(csv2_df.set_index('X'), on='X', lsuffix='_caller', rsuffix='_other')
print(merged)

Output:

   X       Y foo_caller  foo_other
0  1  'bla1'     'qwer'         12
1  1  'bla2'     'aswd'         12
2  1  'bla3'     'asdf'         12
3  2  'bla4'     'azer'         34

csv1.csv:

X,Y,foo
1,'bla1','qwer'
1,'bla2','aswd'
1,'bla3','asdf'
2,'bla4','azer'

csv2.csv:

X,foo
1,12
2,34

References:

Pandas documentation on join

CodePudding user response：

I finally did what Arsho told me and it's working perfectly. Instead of using the .merge() method, I used the .join() method to concatenate the CSV as I desired.