I have a folder with many csv files. They all look similar: they have the same column and row names, and every cell contains a string. I want to concatenate them along columns AND rows so that all the strings end up concatenated in their respective cells.
Example:
file1.csv
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
b1 | peter | house | ash | plane |
b2 | carl | horse | paul | knife |
b3 | mary | apple | linda | carrot |
b4 | hank | car | herb | beer |
file2.csv
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
b1 | mark | green | hello | band |
b2 | no | phone | spoon | goodbye |
b3 | red | cherry | charly | hammer |
b4 | good | yes | ok | simon |
What I want is this result, with the string values joined in each cell:
concatenated.csv
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
b1 | peter mark | house green | ash hello | plane band |
b2 | carl no | horse phone | paul spoon | knife goodbye |
b3 | mary red | apple cherry | linda charly | carrot hammer |
b4 | hank good | car yes | herb ok | beer simon |
And I don't know how to do this with pandas in a Jupyter notebook.
I have tried a couple of things, but all of them either kept a separate set of rows or of columns.
CodePudding user response:
If this is your first dataframe:
import pandas as pd

df1_data = {
    1: ['peter', 'carl', 'mary', 'hank'],
    2: ['house', 'horse', 'apple', 'car']
}
df1 = pd.DataFrame(df1_data)
print(df1)
Output:
1 2
0 peter house
1 carl horse
2 mary apple
3 hank car
And this is your second dataframe:
df2_data = {
    1: ['mark', 'no', 'red', 'good'],
    2: ['green', 'phone', 'cherry', 'yes']
}
df2 = pd.DataFrame(df2_data)
print(df2)
Output:
1 2
0 mark green
1 no phone
2 red cherry
3 good yes
You can reach your requested dataframe like this:
df = pd.DataFrame()
df[1] = df1[1] + ' ' + df2[1]
df[2] = df1[2] + ' ' + df2[2]
print(df)
Output:
1 2
0 peter mark house green
1 carl no horse phone
2 mary red apple cherry
3 hank good car yes
Loop for csv files:
Now, if you have a lot of csv files with names like file1.csv and file2.csv and so on, you can save them all in a dict d like this:
d = {}
for i in range(1, N + 1):
    d[i] = pd.read_csv('.../file' + str(i) + '.csv')
# N is the number of csv files (because the loop starts from 1, the range has to stop at N + 1)
And build the dataframe you want like this:
concatenated_df = pd.DataFrame()
for i in range(1, N + 1):
    concatenated_df[i] = d[1].iloc[:, i] + ' ' + d[2].iloc[:, i] + ...
# here N is the number of columns; continue the + ' ' + chain for every dataframe in d
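The hand-written chain above can be generalized with functools.reduce; here is a minimal sketch (the helper name concat_cells and the space separator are my own choices, not from the answer above):

```python
from functools import reduce

import pandas as pd


def concat_cells(frames, sep=' '):
    # Keep the first (label) column of the first frame untouched and
    # join every other cell element-wise across the frames with `sep`.
    labels = frames[0].iloc[:, [0]]
    body = reduce(lambda a, b: a + sep + b,
                  (f.iloc[:, 1:] for f in frames))
    return labels.join(body)
```

With the dict d from above this would be called as concat_cells([d[i] for i in range(1, N + 1)]).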
CodePudding user response:
If performance is not an issue, you can use pandas.DataFrame.applymap with pandas.Series.add:
out = df1[[0]].join(df1.iloc[:, 1:].applymap(lambda v: f"{v} ").add(df2.iloc[:, 1:]))
Or, for a large dataset, you can use pandas.concat with a list comprehension:
out = (
    df1[[0]]
    .join(pd.concat([df1.merge(df2, on=0)
                        .filter(regex=rf"{p}_\w").agg(" ".join, axis=1)
                        .rename(idx)
                     for idx, p in enumerate(range(1, len(df1.columns)), start=1)],
                    axis=1))
)
Output :
print(out)
0 1 2 3 4
0 b1 peter mark house green ash hello plane band
1 b2 carl no horse phone paul spoon knife goodbye
2 b3 mary red apple cherry linda charly carrot hammer
3 b4 hank good car yes herb ok beer simon
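For exactly two frames there is also a simpler route (a sketch, assuming the first column holds shared row labels as in the question): move that column into the index and let pandas' element-wise string addition do the alignment.

```python
import pandas as pd

# Small stand-ins for file1.csv and file2.csv
df1 = pd.DataFrame({0: ['b1', 'b2'], 1: ['peter', 'carl'], 2: ['house', 'horse']})
df2 = pd.DataFrame({0: ['b1', 'b2'], 1: ['mark', 'no'], 2: ['green', 'phone']})

# Align on the label column, add the string cells, then restore the column.
out = (df1.set_index(0) + ' ' + df2.set_index(0)).reset_index()
```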
CodePudding user response:
Reading many csv files into a single DataFrame is a pretty common task, and it covers the first part of your question. You can find a suitable answer here.
After that, in an effort to allow you to perform this on all of the files at the same time, you can melt and pivot with a custom agg function like so:
import glob
import pandas as pd

# See the linked answer if you need help finding csv files in a different directory
all_files = glob.glob('*.csv')
df = pd.concat((pd.read_csv(f) for f in all_files))
output = (df.melt(id_vars='0')
            .pivot_table(index='0',
                         columns='variable',
                         values='value',
                         aggfunc=lambda x: ' '.join(x)))
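As a quick sanity check of the melt/pivot route, with two small in-memory frames standing in for the csv files:

```python
import pandas as pd

df1 = pd.DataFrame({'0': ['b1', 'b2'], '1': ['peter', 'carl']})
df2 = pd.DataFrame({'0': ['b1', 'b2'], '1': ['mark', 'no']})

df = pd.concat([df1, df2])
output = (df.melt(id_vars='0')
            .pivot_table(index='0',
                         columns='variable',
                         values='value',
                         aggfunc=lambda x: ' '.join(x)))
# The row order inside each group follows concatenation order,
# so values from df1 come before those from df2 in each cell.
```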