Home > Enterprise >  Star Schema - Unify data with varying structure from different sources
Star Schema - Unify data with varying structure from different sources

Time:04-21

I am currently designing a star schema for a reporting database where an online product's performance is measured. The challenge is, that I receive information which is in principle measuring the same facts (visits, purchases) and has the same dimensions (user gender, user age, day) but with varying granularity depending on the source, for example, given a total of 10 visits:

  • Source A returns a single line per day for the performance in the format:
    1. Visits, Purchases, Gender, Age Range, Day (total visits = 15)
  • Source B returns two lines for a single day as it does not allow the combination of gender and age:
    1. Visits, Purchases, Gender, Day (total visits = 10)
    2. Visits, Purchases, Age Range, Day (total visits = 10)

The issues is, if I store them in the same fact table, I will have incorrect values when applying aggregate functions:

Day Visits Age Gender Source
19/04/2022 5 18-24 Male A
19/04/2022 10 18-24 Female A
19/04/2022 2 NULL Male B
19/04/2022 8 NULL Female B
19/04/2022 10 18-24 NULL B

(The sum of the visits column would count 20 for source B even though we only have 10 visits for this source, they just appear double due to the different data structure)

Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?

CodePudding user response:

Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?

You typically can only present the combined data at a grain that's compatible with all the sources, so (Day), (Age,Day), or (Gender,Day).

Alternatively you could "allocate" the Source B data, say applying the gender split for the day to each age group. The totals would work, but the drilldown wouldn't be meaningful.

  • Related