I have an issue merging 2 CSV into a new dataset. So far, I have merged them. Let's say I have csv1 and csv2. I created a 'df' from csv1 and I merged csv2 with a specific column. But here is my problem:
I need to create a new column in df (e.g. 'new_foo', because both CSV has the same column name, but different data, so it's not possible to merge the 2 foo columns) and relate the two columns from merged csv2 column X. This is because in csv1 the common column is X.
Current Input:
import pandas as pd
df = pd.read_csv(csv1)
csv2_df = pd.read_csv(csv2)
csv2_column_x=csv2_df['x']
new_foo=csv2_df['foo']
df['new_foo']
df = df.merge(csv2_column_x, how="outer").merge(new_foo, how="outer")
Current Output:
X | Y | foo | new_foo |
---|---|---|---|
1 | 'bla1' | 'qwer' | 12 |
1 | 'bla2' | 'aswd' | 34 |
1 | 'bla3' | 'asdf' | |
2 | 'bla4' | 'azer' |
Desired Output:
X | Y | foo | new_foo |
---|---|---|---|
1 | 'bla1' | 'qwer' | 12 |
1 | 'bla2' | 'aswd' | 12 |
1 | 'bla3' | 'asdf' | 12 |
2 | 'bla4' | 'azer' | 34 |
This is my csv1:
X | Y | foo |
---|---|---|
1 | 'bla1' | 'qwer' |
1 | 'bla2' | 'aswd' |
1 | 'bla3' | 'asdf' |
2 | 'bla4' | 'azer' |
This is my csv2:
X | foo |
---|---|
1 | 12 |
2 | 34 |
NOTE: The CSV files have different numbers of rows.
CodePudding user response:
You can use Pandas join method the dataframes on column X
.
Code:
import pandas as pd
csv1_df = pd.read_csv("csv1.csv")
csv2_df = pd.read_csv("csv2.csv")
merged = csv1_df.join(csv2_df.set_index('X'), on='X', lsuffix='_caller', rsuffix='_other')
print(merged)
Output:
X Y foo_caller foo_other
0 1 'bla1' 'qwer' 12
1 1 'bla2' 'aswd' 12
2 1 'bla3' 'asdf' 12
3 2 'bla4' 'azer' 34
csv1.csv
:
X,Y,foo
1,'bla1','qwer'
1,'bla2','aswd'
1,'bla3','asdf'
2,'bla4','azer'
csv2.csv
:
X,foo
1,12
2,34
References:
CodePudding user response:
I finally did what Arsho told me and it's working perfectly. Instead of using the .merge() method, I used the .join() method to concatenate the CSV as I desired.