Home > Back-end >  Join rows and concatenate attribute values in a csv file with pandas
Join rows and concatenate attribute values in a csv file with pandas

Time:01-31

I have a csv file structured like this:

enter image description here

As you can see, many lines are repeated (they represent the same entity) with the attribute 'category' being the only difference between each other. I would like to join those rows and include all the categories in a single value.

For example the attribute 'category' for Walmart should be: "Retail, Dowjones, SuperMarketChains".

Edit:

I would like the output table to be structured like this:

enter image description here

CodePudding user response:

Quick and Dirty

df2=df.groupby("Name")['Category'].apply(','.join)

subst=dict(df2)
df['category']=df['Name'].replace(subst)
df.drop_duplicates('Name')

if you prefer multiple categories to be stored as a list in pandas column category... change first line to

df2=df.groupby("Name")['Category'].apply(list)

CodePudding user response:

Not sure if you want a new table or just a list of the categories. Below is how you could make a table with the hashes if those are important

import pandas as pd
df = pd.DataFrame({
    'Name':['W','W','W','A','A','A'],
    'Category':['Retail','Dow','Chain','Ecom','Internet','Dow'],
    'Hash':[1,2,3,4,5,6],
})

# print(df)
#   Name  Category  Hash
# 0    W    Retail     1
# 1    W       Dow     2
# 2    W     Chain     3
# 3    A      Ecom     4
# 4    A  Internet     5
# 5    A       Dow     6

#Make a new df which has one row per company and one column per category, values are hashes
piv_df = df.pivot(
    index = 'Name',
    columns = 'Category',
    values = 'Hash',
)

# print(piv_df)
# Category  Chain  Dow  Ecom  Internet  Retail
# Name                                        
# A           NaN  6.0   4.0       5.0     NaN
# W           3.0  2.0   NaN       NaN     1.0
  • Related