Home > Enterprise >  pandas, transforming dataframe & combining duplicates with counts
pandas, transforming dataframe & combining duplicates with counts

Time:09-19

Current dataframe is as follows:

df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID      String        CaseType 

0        1           'hello1'      1.0
1        1           'hello2'      1.0
2        1           'goodbye1'    1.0
3        1           'goodbye2'    1.0
4        2           'hello1'      3.0
5        2           'hello2'      3.0
6        2           'hello3'      3.0
7        3           'goodbye1'    1.0
8        3           'goodbye2'    1.0
9        3           'goodbye3'    1.0
10       4           'hello'       2.0
11       4           'goodbye'     2.0

Expected Output: (Please note they are grouped based on uniqueID, and the case Type follows the last string of the uniqueID.)

idx   Source      Destination 

0        'hello1'     'hello2'
1        'hello2'     'goodbye1'
3        'goodbye1'   'goodbye2'
4        'goodbye2'   '1.0'
6        'hello1'     'hello2'
7        'hello2'     'hello3' 
8        'hello3'     '3.0'
10       'goodbye1'   'goodbye2'
11       'goodbye2'   'goodbye3'
12       'goodbye3'   '1.0'
13       'hello'      'goodbye'
14       'goodbye'    '2.0'

Question: How do I transform the pandas dataframe in this way?

Currently, I am iterating through every row in a for loop, and for each uniqueId, adding each string CaseType (at the end) to a list, then splitting up that list and adding it to a new dataframe. It is incredibly slow.

Following this, the next step is to get the total counts/occurences for each row of the output. Essentially, if there are duplicate rows of source:destination (ie, we have 3 rows of 'hello' 'goodbye', it would result in 1 row with 'hello' 'goodbye' with their count as the 3rd column)

Example:

Original: 
idx   Source      Destination  

0     'hello1'     'hello2'   
1     'hello2'     'hello3' 
2     'hello1'     'hello2' 
3     'hello4'     'goodbye'  

Expected Output:

idx   Source      Destination  Count

0     'hello1'     'hello2'    2
1     'hello2'     'hello3'    1
2     'hello4'     'goodbye'   1

I presume the first step is slightly more complex with pandas logic, and the next step is essentially just combining duplicates and getting their count, but I am new to pandas and not entirely sure how to do either. Thank you in advance.

CodePudding user response:

df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'String': ["'hello1'", "'hello2'", "'goodbye1'", "'goodbye2'", "'hello1'", "'hello2'", "'hello3'",
                              "'goodbye1'", "'goodbye2'", "'goodbye3'", "'hello'", "'goodbye'"],
                   'CaseType': [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 2.0, 2.0]}).set_index('idx')

df = df.groupby('uniqueID').apply(
    lambda x: pd.DataFrame({'Source': x.String, 'Destination': x.String.shift(-1, fill_value=f'\'{x.CaseType.iloc[0]}\'')}))
print(df)
         Source Destination
idx                        
0      'hello1'    'hello2'
1      'hello2'  'goodbye1'
2    'goodbye1'  'goodbye2'
3    'goodbye2'       '1.0'
4      'hello1'    'hello2'
5      'hello2'    'hello3'
6      'hello3'       '3.0'
7    'goodbye1'  'goodbye2'
8    'goodbye2'  'goodbye3'
9    'goodbye3'       '1.0'
10      'hello'   'goodbye'
11    'goodbye'       '2.0'

CodePudding user response:

You could try the following:

df_res = df[["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
    df.groupby("uniqueID")["String"].transform("shift", -1)
      .fillna(df["CaseType"])
      .astype("str")
)

Result for your sample is:

      Source Destination
0     hello1      hello2
1     hello2    goodbye1
2   goodbye1    goodbye2
3   goodbye2         1.0
4     hello1      hello2
5     hello2      hello3
6     hello3         3.0
7   goodbye1    goodbye2
8   goodbye2    goodbye3
9   goodbye3         1.0
10     hello     goodbye
11   goodbye         2.0

Regarding the second part: Try

counts = df_res.value_counts(["Source", "Destination"])

to get

Source    Destination
goodbye1  goodbye2       2
hello1    hello2         2
goodbye   2.0            1
goodbye2  1.0            1
          goodbye3       1
goodbye3  1.0            1
hello     goodbye        1
hello2    goodbye1       1
          hello3         1
hello3    3.0            1
dtype: int64
  • Related