Current dataframe is as follows:
df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx uniqueID String CaseType
0 1 'hello1' 1.0
1 1 'hello2' 1.0
2 1 'goodbye1' 1.0
3 1 'goodbye2' 1.0
4 2 'hello1' 3.0
5 2 'hello2' 3.0
6 2 'hello3' 3.0
7 3 'goodbye1' 1.0
8 3 'goodbye2' 1.0
9 3 'goodbye3' 1.0
10 4 'hello' 2.0
11 4 'goodbye' 2.0
Expected Output: (Please note they are grouped based on uniqueID, and the case Type follows the last string of the uniqueID.)
idx Source Destination
0 'hello1' 'hello2'
1 'hello2' 'goodbye1'
3 'goodbye1' 'goodbye2'
4 'goodbye2' '1.0'
6 'hello1' 'hello2'
7 'hello2' 'hello3'
8 'hello3' '3.0'
10 'goodbye1' 'goodbye2'
11 'goodbye2' 'goodbye3'
12 'goodbye3' '1.0'
13 'hello' 'goodbye'
14 'goodbye' '2.0'
Question: How do I transform the pandas dataframe in this way?
Currently, I am iterating through every row in a for loop, and for each uniqueId, adding each string CaseType (at the end) to a list, then splitting up that list and adding it to a new dataframe. It is incredibly slow.
Following this, the next step is to get the total counts/occurences for each row of the output. Essentially, if there are duplicate rows of source:destination (ie, we have 3 rows of 'hello' 'goodbye', it would result in 1 row with 'hello' 'goodbye' with their count as the 3rd column)
Example:
Original:
idx Source Destination
0 'hello1' 'hello2'
1 'hello2' 'hello3'
2 'hello1' 'hello2'
3 'hello4' 'goodbye'
Expected Output:
idx Source Destination Count
0 'hello1' 'hello2' 2
1 'hello2' 'hello3' 1
2 'hello4' 'goodbye' 1
I presume the first step is slightly more complex with pandas logic, and the next step is essentially just combining duplicates and getting their count, but I am new to pandas and not entirely sure how to do either. Thank you in advance.
CodePudding user response:
df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
'String': ["'hello1'", "'hello2'", "'goodbye1'", "'goodbye2'", "'hello1'", "'hello2'", "'hello3'",
"'goodbye1'", "'goodbye2'", "'goodbye3'", "'hello'", "'goodbye'"],
'CaseType': [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 2.0, 2.0]}).set_index('idx')
df = df.groupby('uniqueID').apply(
lambda x: pd.DataFrame({'Source': x.String, 'Destination': x.String.shift(-1, fill_value=f'\'{x.CaseType.iloc[0]}\'')}))
print(df)
Source Destination
idx
0 'hello1' 'hello2'
1 'hello2' 'goodbye1'
2 'goodbye1' 'goodbye2'
3 'goodbye2' '1.0'
4 'hello1' 'hello2'
5 'hello2' 'hello3'
6 'hello3' '3.0'
7 'goodbye1' 'goodbye2'
8 'goodbye2' 'goodbye3'
9 'goodbye3' '1.0'
10 'hello' 'goodbye'
11 'goodbye' '2.0'
CodePudding user response:
You could try the following:
df_res = df[["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
df.groupby("uniqueID")["String"].transform("shift", -1)
.fillna(df["CaseType"])
.astype("str")
)
Result for your sample is:
Source Destination
0 hello1 hello2
1 hello2 goodbye1
2 goodbye1 goodbye2
3 goodbye2 1.0
4 hello1 hello2
5 hello2 hello3
6 hello3 3.0
7 goodbye1 goodbye2
8 goodbye2 goodbye3
9 goodbye3 1.0
10 hello goodbye
11 goodbye 2.0
Regarding the second part: Try
counts = df_res.value_counts(["Source", "Destination"])
to get
Source Destination
goodbye1 goodbye2 2
hello1 hello2 2
goodbye 2.0 1
goodbye2 1.0 1
goodbye3 1
goodbye3 1.0 1
hello goodbye 1
hello2 goodbye1 1
hello3 1
hello3 3.0 1
dtype: int64