Home > Net >  mysql to get related tags of a tag and order by intersection rate?
mysql to get related tags of a tag and order by intersection rate?

Time:12-17

What I try to do is create a wp plugin to get related tags of a tag and order by intersection rate.

For example, in a WordPress site, data structure like below:

  • 40 posts has tags B,
  • 200 posts has tags C,
  • 20 posts has tags A, B,
  • 40 posts has tags A, C

Now, I want to get related tags of tag A.

The formula of intersection rate:

INTERSECT_RATE = INTERSECT_COUNT / ALL_POST_COUNT_IN_A_TAG

So,

  • Rate for B is 20 / 40 = 0.5, 50%
  • Rate for C is 40 / 200 = 0.2, 20%

The problem is I'm not very familiar with the complex query of mysql. What I have done so far:

function get_related_terms($args = []) {
    $base_term_id = $args['term_id'];
    $base_taxonomy = $args['taxonomy'];
    $related_taxonomy = $args['rel_taxonomy'];
    $post_type = $args['post_type'];
    $number = $args['number'] ?? 20;

    global $wpdb;

    $key = $base_taxonomy . ':' . $related_taxonomy . ':' . $base_term_id;

    if ($terms = wp_cache_get($key, 'related_terms')) {
        return $terms;
    }

    $results = $wpdb->get_results($wpdb->prepare(
        "SELECT 
        related.term_id, 
        COUNT(*) as rel_count
    FROM
        {$wpdb->prefix}terms related
        INNER JOIN {$wpdb->prefix}term_taxonomy related_tax ON (related_tax.term_id = related.term_id)
        INNER JOIN {$wpdb->prefix}term_relationships related_rel ON (related_tax.term_taxonomy_id = related_rel.term_taxonomy_id)
        INNER JOIN {$wpdb->prefix}posts posts ON (related_rel.object_id = posts.ID)
        INNER JOIN {$wpdb->prefix}term_relationships base_rel ON (posts.ID = base_rel.object_id)
        INNER JOIN {$wpdb->prefix}term_taxonomy base_tax ON (base_rel.term_taxonomy_id = base_tax.term_taxonomy_id)
        INNER JOIN {$wpdb->prefix}terms base ON (base.term_id = base_tax.term_id)
    WHERE
        related_tax.taxonomy = '%s'
        AND base_tax.taxonomy = '%s'
        AND posts.post_type = '%s'
        AND posts.post_status = 'publish'
        AND base.term_id = %d
        AND related.term_id != base.term_id
    GROUP BY related_tax.term_id
    ORDER BY rel_count/related_rel.object_id DESC
    LIMIT 0, %d",
        $related_taxonomy,
        $base_taxonomy,
        $post_type,
        $base_term_id,
        $number
    ));

    // var_dump($results);

    $terms = array();
    foreach ($results as $result) {
        $term = get_term((int)$result->term_id, $related_taxonomy);
        $term->rel_count = (int)$result->rel_count;
        $terms[] = $term;
    }

    wp_cache_set($key, $terms, 'related_terms');

    return $terms;
}

The query results for related tags are not bad, but I doubt that maybe I did something wrong due to my bad mysql skill level, or there is something should be improved for better relevance or performance.

Any help? Thanks first.

CodePudding user response:

Step 1: Create and populate a simple table with the tags. I'm going to insist on making this new table because I fear there will be a gnarly mess of joins considering how WP works. I have enough JOINs in what I am presenting to not have to worry about WP.)

CREATE TABLE tags (
    post_id INT UNSIGNED NOT NULL,
    tag VARCHAR(99) NOT NULL,
    PRIMARY KEY(post_id, tag)
) ENGINE=InnoDB;

INSERT INTO tags
    SELECT tr.post_id, tt.term
        FROM term_relationships AS tr
        JOIN term_taxonomy AS tt
        ...;  ???
    -- I don't think we need to touch wp_posts.

Step 2: Check that new table to see if its data makes sense:

-- Count number of posts for each tag:
SELECT  tag, COUNT(*)
    FROM tags
    GROUP BY tag
    ORDER BY ct DESC  LIMIT 11;

-- Count number of posts for each pair of tags:
SELECT  a.tag, b.tag, COUNT(DISTINCT post_id)
    FROM tags AS a
    JOIN tags AS b  USING(post_id)
    WHERE a.tag != b.tag
    GROUP BY a.tag, b.tag
    ORDER BY ct DESC  LIMIT 11;

-- Note: that will deliberately duplicate things:
-- A,B and B,A
-- I expect step 3 to be simplified because of this.

Step 3: Get the "intersect rates":

Do you want all combinations? Just the highest? Lowest? For a given tag?

-- intersect (assuming only A is 'flowers')
SELECT  pairs.b_tag, c.tag,
        pairs.ct, c.ct,
        ROUND(100*pairs.ct/c.ct, 1) AS pct
    FROM ( SELECT  tag, COUNT(*) AS ct
            FROM tags
            WHERE tag = 'flowers'
            GROUP BY tag
         ) AS c
    JOIN ( SELECT   a.tag AS a_tag,
                    b.tag AS b_tag,
                    COUNT(*) AS ct
            FROM tags AS a
            JOIN tags AS b  USING(post_id)
            WHERE a.tag != b.tag
              AND a.tag = 'flowers' 
            GROUP BY a.tag, b.tag
          ) AS pairs
        ON pairs.a_tag = c.tag
    ORDER BY pairs.ct DESC
    limit 4;

With some sample data that I had, I came up with

 -------- --------- ----- ------ ------ 
| b_tag  | tag     | ct  | ct   | pct  |
 -------- --------- ----- ------ ------ 
| white  | flowers | 124 | 1748 |  7.1 |
| purple | flowers | 114 | 1748 |  6.5 |
| red    | flowers | 101 | 1748 |  5.8 |
| yellow | flowers |  90 | 1748 |  5.1 |
 -------- --------- ----- ------ ------ 
  • Related