Home > Software design >  How to attach a column containing the number of occurrences of values in other columns to an existin
How to attach a column containing the number of occurrences of values in other columns to an existin

Time:09-22

I have a data frame containing hyponym and hypernym pairs extracted from StackOverflow posts. You can see an excerpt from it in the following:

0           1                   2       3   4
linq query  asmx web service    THH     10  a linq query as an asmx web service
application bolt                THH     1   my application is a bolt on data visualization...
area        r time              THH     1   the area of the square is r times
sql         query syntax        HTH     3   sql like query syntax

...

7379596 rows × 5 columns

The column 0 and the column 1 contain the hyponym and hyperonym parts of the phrases contained by the column 4. I would like to implement a filter based on statistical features, therefore I have to count all occurrences of the pairs (0, 1) columns together, all occurrences of the hyponym and hyperonym parts respectively. Pandas has a method called value_counts(), so counting the occurrences can be obtained by:

df.value_counts([0])
df.value_counts([1])
df.value_counts([0, 1])

This is nice, but the method resulted in a Pandas Series which has much fewer records than the original DataFrame, therefore, adding a new column like df[5] = df.value_counts([0, 1]) does not work.

I have found a workaround: I have created 3 Pandas Series for every occurrence type (pair, hyponym, hyperonym) and I have written a small loop to calculate a confidence score for every pair but as the original dataset is huge (more than 7 million records) this calculation is not an efficient way to do that (the calculation has not finished after 30 hours). So, the feasible and hopefully efficient solution would be using the Pandas applymap() for this purpose, but it is needed to attach columns containing the occurrences to the original DataFrame. So I would like a DataFrame like this one:

0      1           2           3         4               5      6     7
sql    query       anything    anything  a phrase        1000   800   500
sql    query       anything    anything  anotherphrase   1000   800   500
...

The column 5 is the occurences of the hyponym part (sql), the column 6 is the number of occurrences of the hyperonym part (query) and the column 7 is the occurrences of the pair (sql, query). As you can see the pairs are the same but they are extracted from different phrases.

My question is how to do that? How can I attach occurrences as a new column to an existing DataFrame?

CodePudding user response:

Here's a solution on how to map the value counts of the combination of two columns to a new column:

# Create an example DataFrame
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})

# Count the paired occurrences in a new column
df["count"] = df.groupby([0,1])[0].transform('size')

Before editing, I had answered this question with a solution using value_counts and a merge. This original solution is slower and more complicated than the groupby:

# Put the value_counts in a new DataFrame, call them count
vcdf = pd.DataFrame(df[[0, 1]].value_counts(), columns=["count"])

# Merge the df with the vcs
merged = pd.merge(left=df, right=vcdf, left_on=[0, 1], right_index=True)

# Potentially sort index
merged = merged.sort_index()

The resulting DataFrame:

   0  1  count
0  a  c      1
1  a  d      2
2  a  d      2
3  b  d      1
  • Related