Data design considerations

Data modeling & exceptions

Exception example: "...and sometimes Y"

Consider the case of the letter Y in the English alphabet. According to Wikipedia: “In the English writing system, it sometimes represents a vowel and sometimes a consonant.” Building a data model for the English alphabet letters presents the architect with options. To keep things simple, we will model 2 fields. Field 1 with hold the value of each letter (i.e. A, B, C, etc.). Field 2 will be an attribute that designates whether the letter is a consonant or a vowel.

  • Field 1 will be called “Letter”.
  • Field 2 will be called “Type”
  • Design Option 1 – fixed attribute values
    In design option 1, we allow only the values of “Consonant” or “Vowel” in the type field. Since the letter Y can have both of these values, we will need to enter the value of Y twice.

    Table 1: values from Design Option 1
    Letter Type
    A Vowel
    B Consonant
    C Consonant
    D Consonant
    E Vowel
    ... ...
    Y ???

    Consequences of Design Option 1.

    Count of Letters

    A basic SELECT COUNT(Letter) from Table 1 will result in 27. That is not correct since there are only 26 letters in the English alphabet. SELECT Count(distinct Letter) from Table 1 will result in the appropriate answer of 26.

    Count of Types
    SELECT Type, COUNT(Type) from Table 1 GROUP BY TYPE will result in the follow counts:

    Table 2: summary of values from Design Option 1
    Type Count(Type)
    Consonant 21
    Vowel 6

    While this is correct, if you add the numbers together you get 27. You cannot use the SUM of Type values to reconcile with the total number of distinct values for Letter.

    Join considerations Depending on the use case, you may want to filter out the extra value for Y. This means include the DISTINCT keyword in your queries.

    Design Option 2 – No duplicates in Letter Field, new attribute added

    In design option 2 we don’t allow for duplicate values in the Letter field. Each entry must be unique. Since Y can be both a Vowel and a Consonant, we add a new attribute value called “Both”

    Table 3: values from Design Option 2
    Letter Type
    A Vowel
    B Consonant
    ... ...
    Y Both

    Consequences of design option 2.

    Count of Letters

    A basic SELECT COUNT(Letter) from Table 3 will result in 26. This is correct. Distinct is not required. SELECT Count(distinct Letter) from Table 3 will also result in the appropriate answer of 26.

    Count of Types

    SELECT Type, COUNT(Type) from Table 3 GROUP BY TYPE will result in the follow counts:

    Table 4: summary of values from Design Option 2
    Type Count(Type)
    Consonant 20
    Vowel 5
    Both 1

    SELECT COUNT(TYPE) From Table 3 will provide the correct answer of 26. However, each individual value will have an inaccurate count. The query SELECT COUNT(TYPE) from Table 3 WHERE TYPE = “Consonant” will result in 20. The correct number of Consonants including Y would be 21. You would need to remember to add AND TYPE=”Both” to get the appropriate answer of 21.

    Join considerations

    Since the field letter is always unique, you can easily join this table by Letter. The complexity enters when joining by individual values of type. Any tables joined based on TYPE values would need to account for the value of “both”. Consider a third table where Y has an alternate type value.

    Table 5: alternative values for Design Option 2
    Letter Type
    A Vowel
    B Consonant
    ... ...
    Y Sometimes

    The joining of Table 3 and Table 5 would result in a mismatch and records could be lost if the join is handled incorrectly. An alternative mapping table may be required. SELECT LETTER, TYPE from Table 3 INNER JOIN Table 5 ON (Table2.TYPE=Table3.TYPE) would only return 25 values from Table. LETTER value Y would not be in the resulting records and there would be data loss.

    Conclusions

    Data Model considerations can have a large impact on the relative correctness of the data. Design Option 1 challenged the assumption that values of Letter must be unique. However, it introduced the potential to enable inaccurate counts. This could be handled with proper query technique. Design Option 2 forced a technical constraint to require Letter to be unique values. This caused our data architect to create a new value for Type. Essentially, this table will become the originating source for this value of “both”. If this value is not properly documented and communicated to the broader data ecosystem, it can cause confusion and potential disconnects when sharing or joining data. This will require additional data governance infrastructure and processes to ensure proper usage.

    Table 6: comparison of Design Option 1 and Design Option 2
    Considerations Design Option 1 Design Option 2
    Data Standards
  • LETTER values must be in the English alphabet
  • Type values limited to standard terms (Consonant or Vowel)
  • LETTER values must be in the English alphabet
  • LETTER values must be unique
  • How exceptions were handled
  • LETTER values not unique
  • Record for Y was duplicated
  • Values for field TYPE were extended as needed
  • Potential for errors / data loss
  • Distinct keyword needed to get correct count for LETTER values
  • Counts for Consonant and Vowel not accurate
  • Mismatch between potential new values in the TYPE field could result in data loss when selecting data
  • Free Web Hosting