I have a data frame containing hyponym and hypernym pairs extracted from StackOverflow posts. You can see an excerpt from it in the following:
          0                 1    2   3                                                   4
 linq query  asmx web service  THH  10               a linq query as an asmx web service
application              bolt  THH   1  my application is a bolt on data visualization...
       area            r time  THH   1                   the area of the square is r times
        sql      query syntax  HTH   3                               sql like query syntax
...
7379596 rows × 5 columns
Column 0 and column 1 contain the hyponym and hypernym parts of the phrases stored in column 4. I would like to implement a filter based on statistical features, so I need to count all occurrences of the (0, 1) pairs together, as well as all occurrences of the hyponym and hypernym parts separately. Pandas has a method called value_counts(), so the counts can be obtained with:
df.value_counts([0])
df.value_counts([1])
df.value_counts([0, 1])
This is nice, but the method returns a Pandas Series that has far fewer records than the original DataFrame (one per unique value or pair), so adding a new column like df[5] = df.value_counts([0, 1]) does not work.
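For example, on a toy frame with the same column labels (a minimal illustration, not my real data) the counts come back indexed by the unique pairs, not by the original rows, so a plain column assignment cannot line up:
import pandas as pd

toy = pd.DataFrame({0: ["sql", "sql", "linq"], 1: ["query", "query", "query"]})
print(toy.value_counts([0, 1]))
# 0     1
# sql   query    2
# linq  query    1
# Only one row per unique pair, with a MultiIndex of (0, 1), so
# toy[5] = toy.value_counts([0, 1]) cannot align with the 3 original rows.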
I have found a workaround: I created three Pandas Series, one for each occurrence type (pair, hyponym, hypernym), and wrote a small loop to calculate a confidence score for every pair. But since the original dataset is huge (more than 7 million records), this is not an efficient approach (the calculation had still not finished after 30 hours). A feasible and hopefully efficient solution would be to use Pandas applymap() for this purpose, but that requires attaching the occurrence counts as columns to the original DataFrame. So I would like a DataFrame like this one:
    0      1         2         3              4     5    6    7
  sql  query  anything  anything       a phrase  1000  800  500
  sql  query  anything  anything  anotherphrase  1000  800  500
...
Column 5 holds the number of occurrences of the hyponym part (sql), column 6 the number of occurrences of the hypernym part (query), and column 7 the number of occurrences of the pair (sql, query). As you can see, the pairs are the same, but they were extracted from different phrases.
My question is: how can I attach these occurrence counts as new columns to an existing DataFrame?
CodePudding user response:
Here's a solution that maps the value counts of the combination of two columns to a new column:
import pandas as pd

# Create an example DataFrame
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})
# Count the paired occurrences in a new column: the size of each (0, 1) group,
# broadcast back to every row of that group
df["count"] = df.groupby([0, 1])[0].transform('size')
Before editing, I had answered this question with a solution using value_counts and a merge. This original solution is slower and more complicated than the groupby:
# Put the value_counts in a one-column DataFrame called "count"
vcdf = df[[0, 1]].value_counts().to_frame("count")
# Merge the counts back onto df, matching columns 0 and 1 against the MultiIndex of vcdf
merged = pd.merge(left=df, right=vcdf, left_on=[0, 1], right_index=True)
# Potentially sort to restore the original row order
merged = merged.sort_index()
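Because the merge is many-to-one (every row of df matches exactly one entry in vcdf), the result keeps one row per original row with the pair count repeated, which is exactly the alignment that a direct df[5] = df.value_counts([0, 1]) assignment lacks. A quick sanity check on the example data (assuming the snippet above was just run):
# Every original row is preserved; the count is simply broadcast to each row of its pair
assert len(merged) == len(df)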
The resulting DataFrame:
   0  1  count
0  a  c      1
1  a  d      2
2  a  d      2
3  b  d      1