Home > Back-end >  Concatenate strings in multiple csv files into one datafram along x and y axis (in pandas)
Concatenate strings in multiple csv files into one datafram along x and y axis (in pandas)

Time:01-15

I have a folder with many csv files. They all look similar, they all have the same names for columns and rows. They all have strings as values in their cells. I want to concatenate them along columns AND rows so that all the strings are concatenated into their respective cells.

Example:

file1.csv

0 1 2 3 4
b1 peter house ash plane
b2 carl horse paul knife
b3 mary apple linda carrot
b4 hank car herb beer

file2.csv

0 1 2 3 4
b1 mark green hello band
b2 no phone spoon goodbye
b3 red cherry charly hammer
b4 good yes ok simon

What I want is this result with no delimiter between the string values:

concatenated.csv

0 1 2 3 4
b1 peter mark house green ash hello plane band
b2 carl no horse phone paul spoon knife goodbye
b3 mary red apple cherry linda charly carrot hammer
b4 hank good car yes herb ok beer simon

And I don't know how to do this in pandas in a jupyter notebook.

I have tried a couple of things but all of them either kept a seperate set of rows or of columns.

CodePudding user response:

If this is your first dataframe:

df1_data = {
    1 : ['peter', 'carl', 'mary', 'hank'],
    2 : ['house', 'horse','apple', 'car']
}
df1 = pd.DataFrame(df1_data)
print(df1)

Output:

       1      2
0  peter  house
1   carl  horse
2   mary  apple
3   hank    car

And this is your second dataframe:

df2_data = {
    1 : ['mark', 'no',   'red',   'good'],
    2 : ['green','phone','cherry','yes' ]
}
df2 = pd.DataFrame(df2_data)
print(df2)

Output:

      1       2
0  mark   green
1    no   phone
2   red  cherry
3  good     yes

You can reach your requested dataframe like this:

df = pd.DataFrame()
df[1] = df1[1]   ' '  df2[1]
df[2] = df1[2]   ' '  df2[2]
print(df)

Output:

            1             2
0  peter mark   house green
1     carl no   horse phone
2    mary red  apple cherry
3   hank good       car yes

Loop for csv files:

Now, if you have a lot of csv files with names like file1.csv and file2.csv and so on, you can save them all in d like this:

d = {}
for i in range(1,#N): 
  d[i] = pd.read_csv('.../file' str(i) '.csv')
#N is the number of csv files. (because I started from 1, you have to add 1 to N)

And build the dataframe you want like this:

concatenated_df = pd.DataFrame()

for i in range(1,#N):
  concatenated_df[i] = d[1].iloc[:,i]   ' '   d[2].iloc[:,i]   ...
#N is the number of columns here

CodePudding user response:

If performance is not an issue, you can use pandas.DataFrame.applymap with pandas.Series.add :

out = df1[[0]].join(df1.iloc[:, 1:].applymap(lambda v: f"{v} ").add(df2.iloc[:, 1:]))

Or, for a large dataset, you can use pandas.concat with a listcomp :

out = (
        df1[[0]]
            .join(pd.concat([df1.merge(df2, on=0)
                                 .filter(regex=f"{p}_\w").agg(" ".join, axis=1)
                                 .rename(idx) for idx, p in enumerate(range(1, len(df1.columns)), start=1)],
                            axis=1))
     )

Output : ​

print(out)

    0           1             2             3              4
0  b1  peter mark   house green     ash hello     plane band
1  b2     carl no   horse phone    paul spoon  knife goodbye
2  b3    mary red  apple cherry  linda charly  carrot hammer
3  b4   hank good       car yes       herb ok     beer simon

CodePudding user response:

Reading many csv files into a single DF is a pretty common answer, and is the first part of your question. You can find a suitable answer here.

After that, in an effort to allow you to perform this on all of the files at the same time, you can melt and pivot with a custom agg function like so:

import glob import pandas as pd

# See the linked answer if you need help finding csv files in a different directory
all_files = glob.glob('*.csv'))
df = pd.concat((pd.read_csv(f) for f in all_files))


output = df.melt(id_vars='0')
           .pivot_table(index='0', 
                        columns='variable',
                        values='value',
                        aggfunc=lambda x: ' '.join(x))
  • Related