starschemar: Obtaining Star Schemas from Flat Tables

Jose Samos (jsamos@ugr.es)

2020-09-24

Introduction

The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure.

Transformations can be carried out using professional ETL (extract, transform and load) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way.

Frequently, the operations to be performed aim to transform a flat table (with data that comes from operational systems) into a star schema (which implements a multidimensional system). With the tools mentioned above, this transformation can be carried out, but it requires a lot of work.

This is the main objective of starschemar package: Define transformations that allow obtaining star schemas from flat tables easily. In addition, it includes basic data cleaning operations, incremental data refresh operations and query operations, adapted to this context.

The rest of this document is structured as follows: First, basic concepts of dimensional modelling and star schemas are summarized. The following is an illustrative example of how the package works. Then, the operations available in the package are briefly presented. Finally, the document ends with conclusions and bibliography.

Dimensional modelling and star schemas

The content of this section is a summary based mainly on Adamson (2010) and Kimball and Ross (2013). It is intended to present the fundamental concepts of the area that are relevant to understand the functionality, naming and use of the starschemar package.

Dimensional modelling

Dimensional modelling aims to obtain simple data models. Simplicity is sought for two reasons: so that decision-makers can easily understand the data, and also so that they can be easily queried.

In dimensional modelling, the analysis of a business process is performed modelling how it is measured. The measures are called facts, and the descriptors of the context of the facts are dimensions. Facts are numerical data, and decision makers want to see them at various levels of detail, defined by dimensions.

Not all numerical data is a fact (some tools consider it that way). In dimensional modelling the designer has to differentiate between facts and dimensions. Some criteria are considered to distinguish between them, for example:

Sometimes there are no measures associated with the business process, it is simply recorded that the combination of dimensions has occurred. This situation is often called factless facts, Jensen, Pedersen, and Thomsen (2010) prefer to call it measureless facts. In any case, including when no other measures are available, a measure can be considered that represents the number of times the combination of dimension values occurs.

Dimensions and dimension attributes

Attributes considered by the designer as dimensions can be grouped taking into account the natural affinities between them. In particular, they can be grouped as they describe the “who, what, where, when, how and why” associated with the modelled business process. Two attributes share a natural affinity when they are only related in one context. When their relationships are determined by transactions or activities, they can occur in multiple contexts, if this occurs, they must be located in different dimensions.

In this way, a dimension is made up of a set of naturally related dimension attributes that describe the context of facts. Dimensions are used for two purposes: fact selection and fact grouping with the desired level of detail.

Additionally, in the dimensions hierarchies with levels and descriptors can be defined. More details can be found at Jensen, Pedersen, and Thomsen (2010). These concepts are not used in the current version of the package.

Facts and measures

A fact has a granularity, which is determined by the attributes of the dimensions that are considered at each moment. Thus, a measure in a fact has two components, the numerical property of the fact and an formula, frequently the SUM aggregation function, that allows combining several values of this measure to obtain a new value of the same measure with a coarser granularity (Jensen, Pedersen, and Thomsen 2010).

According to their behaviour to obtain a coarser granularity, three types of measures are distinguished: additive, semi-additive and non-additive. For additive measures, SUM is always a valid formula that maintains the meaning of the measure when the granularity changes. For semi-additive measures, there is no point in using SUM when changing the level of detail in any of the dimensions because the meaning of the measure changes, this frequently occurs in dimensions that represents time and measures representing inventory level. For non-additive measures, values cannot be combined across any dimension using SUM because the result obtained has a different meaning from the original measure (generally occurs with ratios, percentages or unit amounts such as unit cost or unit price).

The most useful measures are additive. If we have non-additive measures, they can generally be redefined from other additive measures.

Star schemas

Dimensional models implemented in RDBMS (Relational Database Management Systems) using a table for each dimension are called star schemas because of their resemblance to a star-like structure: A fact table in the centre and dimension tables around it. Thus, dimension attributes are columns of the respective dimension tables, and measures are columns of the fact table.

Other possible implementations on RDBMS normalize dimensions and are known as snowflake schema. More details can be found at Jensen, Pedersen, and Thomsen (2010). This is not considered in this package.

Dimension tables

Dimension tables contain the context associated with business process measures. Although they can contain any type of data, numerical data is generally not used for dimension attributes because some query tools consider any numeric data as a measure.

Dimension attributes with NULL value are a source of problems when querying since DBMS and query tools sometimes handle them inconsistently, the result depends on the product. It is recommended to avoid the use of NULL and replace them with a descriptive text. In the case of dates, it is recommended to replace the NULL values with an arbitrary date in the very far future.

Surrogate keys

A dimension table contains dimension attributes and also a surrogate key column. This column is a unique identifier that has no intrinsic meaning: It is generally an integer and is the primary key for the dimension table. In Adamson (2010) surrogate keys are easily identifiable by the suffix "_key" in the column name (and this criterion has also been applied in starschemar package).

Dimension tables also contain key columns that uniquely identify associated entities in an operational system. The separation of surrogate keys and natural keys allows the star schema to store changes in dimensions. Therefore, the use of surrogate keys in dimensions is a solution to the SCD (slowly changing dimensions) problem. This problem is not specifically addressed in this version of this package.

Special dimensions

In some cases, for the sake of simplicity, it is helpful to create a table that contains dimension attributes that have no natural affinities to each other, generally these are low-cardinality flags and indicators. The result is what is known as a junk dimension. They do not require any special support, only the designer’s will to define them.

Sometimes some dimension attributes are left in the fact table, usually transaction identifiers. It is considered as the primary key of a dimension that does not have an associated table, for this reason it is known as a degenerate dimension. Degenerate dimensions are not allowed in this package.

A single dimension can be referenced multiple times in a fact table, with each reference linked to a different logical role for each dimension. These separate dimension views, with unique attribute column names, are called role dimensions and the common dimension is called a role-playing dimension.

Associated with multiple star schemas we have the conformed dimensions that are presented in section Conformed dimensions.

Fact table

At the centre of a star schema is the fact table. In addition to containing measures, the fact table includes foreign keys that refer to each of the surrogate keys in the dimension tables.

Primary key

A subset of foreign keys, along with possibly degenerate dimensions, is considered to form the primary key of the fact table.

In starschemar package, since degenerate dimensions are not allowed, the primary key is made up of a subset of foreign keys.

Grain

The subset of dimensions that forms the primary key defines the level of detail stored in the fact table, which is known as the fact table’s grain. In the design process, it is very important for the designer to clearly define the grain of the fact table (it is usually defined by listing the dimensions whose surrogate keys form its primary key): it is a way to ensure that all the facts are stored at the same level of detail.

At the finest grain, a row in the fact table corresponds to the measures of an event and vice versa, it is not influenced by the possible reports that may be obtained. When two facts have different grains, they should be set on different fact tables.

Multiple fact tables

It is frequent the need to have several fact tables for various reasons:

  • We find measurements with different grain.

  • There are measures that do not occur simultaneously, for example, when one occurs, we have no value for others and vice versa.

In reality it is about different business processes, each one has to have its own fact table but they have dimensions in common. This is known as a fact constellation which corresponds to the Kimball enterprise data warehouse bus architecture.

Conformed dimensions

When star schemas share a set of common dimensions, these dimensions are called conformed dimensions.

There are several possibilities to have conformed dimensions, the most obvious form is that the dimension tables share structure and content, that is, they are identical dimensions. This is the one considered in this version of the starschemar package.

Additional operations

Cleaning and conforming data

When data is loaded into a star schema, errors or inconsistencies can be discovered in some of them. In some cases, it is best to make corrections at the source of the data, in operational systems. Sometimes this is not possible and there is no other option but to modify the data before loading it into the star schema or even when it is already loaded.

Inconsistencies are often found in dimensions, when dimensions are integrated into one, for example to generate a role-playing dimension or conformed dimensions. Support for modifying dimension data is provided in the package.

Dimension enrichment

The more attributes the dimensions have, the more possibilities they offer to filter or aggregate data. For this reason, it is convenient to enrich the dimensions with new attributes, whenever possible. New attributes are often defined based on existing ones. There must be a way to associate the new attributes with the dimensions.

The more attributes the dimensions have, the more possibilities they offer to filter or aggregate data. For this reason, it is convenient to enrich the dimensions with new attributes, whenever possible. New attributes are often defined based on existing ones. There must be a way to associate the new attributes with the dimensions.

Operations have been defined in the package to export dimension attributes so that their values are not repeated, and also to import them, once the new attributes have been added.

Incremental refresh

When a star schema is built, an initial load is performed with all available data from a moment in time onwards.

Operational systems continue to operate and produce data. If we want to incorporate these data into the star schema, we have two possibilities:

  • Carry out a new load with all the data available at that time.

  • Perform an incremental refresh with only the new data.

In order to carry out this second option, the CDC (change data capture) system allows to exclusively obtain the new data produced.

Sometimes it is also convenient to delete data that is considered to be no longer necessary, generally data from the more distant past.

In this package, it has been considered that we can obtain the new data, possibly mixed with updates to data already incorporated into the star schema, in order to carry out an incremental refresh of star schemas with them. Operations are also offered to select data and delete it if it is not considered necessary.

An illustrative example

Starting data sets

To illustrate how the package works we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System data set1. Specifically, only the data for the first 11 weeks of 1962 are considered.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths <1 year (all cause deaths) 1-24 years (all cause deaths) 25-44 years 45-64 years (all cause deaths) 65+ years (all cause deaths)
1962 1 1962-01-06 1 MA Boston 11 262 10 8 11 87 146
1962 2 1962-01-13 1 MA Boston 11 270 14 8 11 70 167
1962 3 1962-01-20 1 MA Boston 5 237 11 8 10 66 142
1962 4 1962-01-27 1 MA Boston 12 285 22 7 8 73 175
1962 5 1962-02-03 1 MA Boston 5 245 15 9 17 62 142
1962 6 1962-02-10 1 MA Boston 11 284 16 8 16 79 165
1962 7 1962-02-17 1 MA Boston 8 227 11 7 13 71 125
1962 8 1962-02-24 1 MA Boston 10 270 14 9 10 85 152
1962 9 1962-03-03 1 MA Boston 16 278 15 11 15 79 158
1962 10 1962-03-10 1 MA Boston 11 297 11 7 18 80 181
1962 11 1962-03-17 1 MA Boston 16 290 16 7 21 74 172
1962 1 1962-01-06 1 CT Bridgeport 3 46 3 0 0 16 27

In the table above, the first rows of the original data are shown. For each week and city, mortality figures by age group and cause, considered separately, are included (i.e., the combination of age group and cause is not included). In the cause, only a distinction is made between pneumonia or influenza and others.

It can be seen that there is only one measure, deaths, defined at two different granularities: week-city-cause (data in columns “Pneumonia and Influenza Deaths” and “All Deaths”) and week-city-age bracket (data from the 5 columns on the right). This means that we will need two fact tables to include all the data in star schemas.

As the data is partially in the form of a pivot table, from these data two tables have been generated2, one for each granularity. Additionally, the following operations have been carried out:

Year WEEK Week Ending Date REGION State City Age Range Deaths
1962 1 1962-01-06 1 MA Boston <1 year 10
1962 1 1962-01-06 1 MA Boston 1-24 years 8
1962 1 1962-01-06 1 MA Boston 25-44 years 11
1962 1 1962-01-06 1 MA Boston 45-64 years 87
1962 1 1962-01-06 1 MA Boston 65+ years 146
1962 2 1962-01-13 1 MA Boston <1 year 14

In the table above, the first rows of the flat table that contains the data according to the age bracket are shown.

The following table shows the first 18 rows of the flat table containing data based on cause of death. The calculated column Other Deaths has been added. Here you can see some missing data (columns Year and WEEK), errors (value “Bridgepor” in column City) and how there are only data from the first 9 weeks.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths Other Deaths
1962 1 1962-01-06 1 MA Boston 11 262 251
1962 2 1962-01-13 1 MA Boston 11 270 259
1962 3 1962-01-20 1 MA Boston 5 237 232
1962 4 1962-01-27 1 MA Boston 12 285 273
1962 5 1962-02-03 1 MA Boston 5 245 240
NA NA 1962-02-10 1 MA Boston 11 284 273
1962 7 1962-02-17 1 MA Boston 8 227 219
1962 8 1962-02-24 1 MA Boston 10 270 260
1962 9 1962-03-03 1 MA Boston 16 278 262
1962 1 1962-01-06 1 CT Bridgeport 3 46 43
1962 2 1962-01-13 1 CT Bridgeport 2 43 41
1962 3 1962-01-20 1 CT Bridgeport 2 40 38
1962 4 1962-01-27 1 CT Bridgeport 4 46 42
1962 5 1962-02-03 1 CT Bridgeport 5 46 41
1962 6 1962-02-10 1 CT Bridgeport 3 42 39
1962 7 1962-02-17 1 CT Bridgeport 1 35 34
1962 8 1962-02-24 1 CT Bridgeport 2 45 43
1962 9 1962-03-03 1 CT Bridgepor 3 46 43

To have more layout possibilities to display, new columns with dates have been generated. Next, in both tables, the new columns related to the date that have been added can be seen.

Reception Year Reception Week Reception Date Data Availability Year Data Availability Week Data Availability Date Year WEEK Week Ending Date REGION State City Age Range Deaths
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston <1 year 10
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 1-24 years 8
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 25-44 years 11
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 45-64 years 87
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 65+ years 146
1962 3 1962-01-16 1962 3 1962-01-18 1962 2 1962-01-13 1 MA Boston <1 year 14
Reception Year Reception Week Reception Date Data Availability Year Data Availability Week Data Availability Date Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths Other Deaths
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 11 262 251
1962 3 1962-01-16 1962 3 1962-01-18 1962 2 1962-01-13 1 MA Boston 11 270 259
1962 4 1962-01-23 1962 4 1962-01-25 1962 3 1962-01-20 1 MA Boston 5 237 232
1962 4 1962-01-27 1962 6 1962-02-07 1962 4 1962-01-27 1 MA Boston 12 285 273
1962 6 1962-02-08 1962 7 1962-02-14 1962 5 1962-02-03 1 MA Boston 5 245 240
1962 6 1962-02-10 1962 7 1962-02-12 NA NA 1962-02-10 1 MA Boston 11 284 273

These are the flat tables that will be considered as a starting point to obtain star schemas from them in this example. They are available in the package: mrs_age and mrs_cause respectively.

Dimensional modelling

For each flat table, the goal is to define the attributes that correspond to facts and those that are dimensions. For facts, measures and their aggregation functions have to be defined. For dimensions, attributes with natural affinity must be grouped. Each attribute can only appear once in the definition.

Dimensional modelling data according to age range

To avoid having to write the name of the attributes of the table, with the following function we can have them in the form of a string. Thus, we can copy and paste each name as needed.

dput(colnames(mrs_age))
#> c("Reception Year", "Reception Week", "Reception Date", "Data Availability Year", 
#> "Data Availability Week", "Data Availability Date", "Year", "WEEK", 
#> "Week Ending Date", "REGION", "State", "City", "Age Range", "Deaths"
#> )

The definition of the dimensional model for the data considered is shown below.

library(tidyr)
library(starschemar)

dm_mrs_age <- dimensional_model() %>%
  define_fact(
    name = "mrs_age",
    measures = c(
      "Deaths"
    ),
    agg_functions = c(
      "SUM"
    ),
    nrow_agg = "nrow_agg"
  ) %>%
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) %>%
  define_dimension(
    name = "when_available",
    attributes = c(
      "Data Availability Date",
      "Data Availability Week",
      "Data Availability Year"
    )
  ) %>%
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  ) %>%
  define_dimension(
    name = "who",
    attributes = c(
      "Age Range"
    )
  )

In this case, all the elements have been explicitly defined, including aggregation functions and the name of an additional measure representing the number of rows aggregated, which is always included. Only data from two of the three possible time-related dimensions have been considered.

Dimensional modelling data according to cause

In the case of data according to cause of death, the definition of the model is shown below.

dm_mrs_cause <- dimensional_model() %>%
  define_fact(
    name = "mrs_cause",
    measures = c(
      "Pneumonia and Influenza Deaths",
      "Other Deaths"
    ),
  ) %>%
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) %>%
  define_dimension(
    name = "when_received",
    attributes = c(
      "Reception Date",
      "Reception Week",
      "Reception Year"
    )
  ) %>%
  define_dimension(
    name = "when_available",
    attributes = c(
      "Data Availability Date",
      "Data Availability Week",
      "Data Availability Year"
    )
  ) %>%
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  )

If no aggregation function is indicated, by default, SUM is considered. Although not explicitly stated, it also includes by default the measure relative to the number of rows aggregated. In this case, the three dimensions related to the date have been defined.

Star schema definition and transformation: Constellations

To define a star schema, we need a flat table and a dimensional model defined from it. Once defined, we can apply format modification operations to it.

Star schema definition for data according to age range

The basic definition operation of a star schema is shown below3.

st_mrs_age <- star_schema(mrs_age, dm_mrs_age)

The first rows of the obtained dimension and fact tables are shown below.

when_key Week Ending Date WEEK Year
1 1962-01-06 1 1962
2 1962-01-13 2 1962
3 1962-01-20 3 1962
4 1962-01-27 4 1962
5 1962-02-03 5 1962
6 1962-02-10 6 1962
when_available_key Data Availability Date Data Availability Week Data Availability Year
1 1962-01-06 1 1962
2 1962-01-07 1 1962
3 1962-01-08 2 1962
4 1962-01-09 2 1962
5 1962-01-10 2 1962
6 1962-01-11 2 1962
where_key REGION State City
1 1 CT Bridgepor
2 1 CT Bridgeport
3 1 CT Hartford
4 1 CT New Haven
5 1 CT Waterbury
6 1 MA Boston
who_key Age Range
1 1-24 years
2 25-44 years
3 45-64 years
4 65+ years
5 <1 year
when_key when_available_key where_key who_key Deaths nrow_agg
1 1 7 2 3 1
1 1 7 3 13 1
1 1 7 4 20 1
1 1 7 5 2 1
1 1 21 1 2 1
1 1 21 2 3 1

The data from the original flat table has been structured in the form of dimension tables and fact tables. Data in the columns of the original table included in the dimensions is not repeated. A surrogate key has been added to each of the dimension tables that are foreign keys in the fact table.

Next, we will apply format modification operations to the original structure obtained.

st_mrs_age <- st_mrs_age %>%
  role_playing_dimension(
    dim_names = c("when", "when_available"),
    name = "When Common",
    attributes = c("date", "week", "year")
  ) %>%
  snake_case() %>%
  character_dimensions(NA_replacement_value = "Unknown",
                       length_integers = list(week = 2))

First, a role playing dimension has been defined based on the dimensions related to dates. Then, to work with databases, the names have been adapted to the snake case criterion. Finally, the data type of the attributes of the dimensions has been transformed so that all columns except the date columns are of the character data type, in the case of numerical data, it is allowed to indicate the length of the field to fill with leading zeros, and undefined values have been replaced by the indicated value.

The first rows of the new dimension and fact tables are shown below.

when_key week_ending_date week year
when_available_key data_availability_date data_availability_week data_availability_year
where_key region state city
1 1 CT Bridgepor
2 1 CT Bridgeport
3 1 CT Hartford
4 1 CT New Haven
5 1 CT Waterbury
6 1 MA Boston
who_key age_range
1 1-24 years
2 25-44 years
3 45-64 years
4 65+ years
5 <1 year
when_common_key date week year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
when_available_key when_key where_key who_key deaths nrow_agg
1 1 7 2 3 1
1 1 7 3 13 1
1 1 7 4 20 1
1 1 7 5 2 1
1 1 21 1 2 1
1 1 21 2 3 1

In the result, it can be seen that the dimensions related to date are now role dimensions and do not have their own data, a role playing dimension has been generated with the integrated data. The fact table continues to refer to role dimensions, the value of foreign keys has been adapted to the possible new values of the surrogate keys. Additionally it can be seen that the week field now has length 2 and has 0 on the left (this is useful to sort numbers in text format).

Star schema definition for data according to cause

We are going to define the star schema and apply similar transformations to the other flat table. The transformations can be applied in any order.

st_mrs_cause <- star_schema(mrs_cause, dm_mrs_cause) %>%
  snake_case() %>%
  character_dimensions(
    NA_replacement_value = "Unknown",
    length_integers = list(
      week = 2,
      data_availability_week = 2,
      reception_week = 2
    )
  ) %>%
  role_playing_dimension(
    dim_names = c("when", "when_received", "when_available"),
    name = "when_common",
    attributes = c("date", "week", "year")
  )

In this case, since the role playing dimension definition is the last transformation defined, the format of the week column had to be defined in the three date dimensions to obtain an equivalent result. The result obtained is shown below.

when_key week_ending_date week year
when_received_key reception_date reception_week reception_year
when_available_key data_availability_date data_availability_week data_availability_year
where_key region state city
1 1 CT Bridgepor
2 1 CT Bridgeport
3 1 CT Hartford
4 1 CT New Haven
5 1 CT Waterbury
6 1 MA Boston
when_common_key date week year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
when_available_key when_received_key when_key where_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 1 1 7 1 37 1
1 1 1 21 5 32 1
1 1 1 28 2 24 1
1 1 1 50 0 35 1
1 1 1 61 12 207 1
1 1 1 66 2 81 1

In this case we have three role dimensions defined on a role playing dimension.

Star schema transformation, cleaning and conforming data

Star schemas are defined from flat table fields. In some cases it may be interesting to rename elements of the schema, especially attributes of dimensions and measures. On the other hand, dimensions can be enriched by adding additional attributes, generally derived from the rest of the attributes. These are the transformations considered in this section.

We can perform data cleaning and conforming operations on star schema dimensions. Updates defined in a star schema can be applied on another with common dimensions.

Star schema rename

If necessary, the elements of a star schema can be renamed. These functions are especially useful for renaming attributes of dimensions and measures of facts because the definition operations only allowed us to select columns of a flat table, not modifying their names.

We are going to modify the attribute names of the when dimension for the two design cases. The values of the attribute called region actually correspond to the concept of division in the organization of the US territory, we are going to rename it this way for the two design cases.

Data according to age:

st_mrs_age <-
  st_mrs_age %>% rename_dimension_attributes(
    name = "when",
    attributes = c("week_ending_date", "week", "year"),
    new_names = c(
      "when_happened_date",
      "when_happened_week",
      "when_happened_year"
    )
  ) %>%
  rename_dimension_attributes(
    name = "where",
    attributes = c("region"),
    new_names = c("division")
  )

And data according to cause:

st_mrs_cause <-
  st_mrs_cause %>% rename_dimension_attributes(
    name = "when",
    attributes = c("week_ending_date", "week", "year"),
    new_names = c(
      "when_happened_date",
      "when_happened_week",
      "when_happened_year"
    )
  ) %>%
  rename_dimension_attributes(
    name = "where",
    attributes = c("region"),
    new_names = c("division")
  )

We are also going to modify the name of the measure for data according to age:

st_mrs_age <-
  st_mrs_age %>% rename_measures(measures = c("deaths"),
                                 new_names = c("n_deaths"))

Additionally, functions are available to modify the names of dimensions and facts. In this case it is not necessary to modify them because they do not depend on the names of the starting base data and we have defined them as we wanted.

Definition of updates

Using the following code, first of all, we get the names of the dimensions of a star schema, then, we get them by their name (we can see the rows using the utils::View function). If there are several role dimensions, it is enough to consult one of them, updates defined on it will be propagated to the rest.

dim_names <- st_mrs_age %>%
    get_dimension_names()

where <- st_mrs_age %>%
  get_dimension("where")

# View(where)
# where[where$where_key %in% c(1, 2, 62), ]

when <- st_mrs_age %>%
  get_dimension("when")

# View(when)
# when[when$when_key %in% c(36, 37, 73), ]

who <- st_mrs_age %>%
  get_dimension("who")

# View(who)

Reviewing the dimensions we can detect the need for updates. Errors are also sometimes detected by comparing the values with equivalent values obtained from other data sources.

Updates in the where dimension

In the where dimension we find the wrong value of “Bridgepor”, generated by ourselves, the correct value is “Bridgeport”, also included in the dimension. Additionally, trying to find additional data associated with cities, we found another error for the city of “Wilimington”, whose correct name is “Wilmington”. Below are the mentioned instances.

where_key division state city
1 1 CT Bridgepor
2 1 CT Bridgeport
62 5 DE Wilimington

Updates can be defined in several ways. One of them is referencing the surrogate key. Although in some of the functions the dimension surrogate key is referred to, updates only consider the values of the rest of the columns of the corresponding dimension. In this way, it is achieved that updates can be applied to equivalent dimensions of other star schemas, where the values of the surrogate key do not necessarily coincide with those of the dimension where updates were originally defined.

Next, we define the update that indicates that the values of record “1” (“Bridgepor”) of the where dimension must match those of record “2” (“Bridgeport”).

updates_st_mrs_age <- record_update_set() %>%
  match_records(dimension = where,
                old = 1,
                new = 2) 

For the case of the other city, we define an update that replaces the value in the indicated fields for the records that meet the defined condition, as shown below.

updates_st_mrs_age <- updates_st_mrs_age %>%
  update_selection_general(
    dimension = where,
    columns_old = c("state", "city"),
    old_values = c("DE", "Wilimington"),
    columns_new = c("city"),
    new_values = c("Wilmington")
  ) 
Updates in the when dimension

Below is a selection of rows involved in update operations that are referred by their surrogate key in the when dimension.

when_key when_happened_date when_happened_week when_happened_year
36 1962-02-10 06 1962
37 1962-02-10 Unknown Unknown
73 9999-12-31 07 1962

The corresponding updates are shown below.

updates_st_mrs_age <- updates_st_mrs_age %>%
  match_records(dimension = when,
                old = 37,
                new = 36) %>%
  update_record(
    dimension = when,
    old = 73,
    values = c("1962-02-17", "07", "1962")
  )

The values of record “37” must match those of record “36” (both registers will be unified). The values in register “73” will be replaced by those provided.

Updates in the who dimension

In the case of the who dimension, we want to include a code in each value of age_range attribute so that they are sorted increasingly based on their value.

updates_st_mrs_age <- updates_st_mrs_age %>%
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("<1 year"),
    new_values = c("1: <1 year")
  ) %>%
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("1-24 years"),
    new_values = c("2: 1-24 years")
  ) %>%
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("25-44 years"),
    new_values = c("3: 25-44 years")
  ) %>%
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("45-64 years"),
    new_values = c("4: 45-64 years")
  ) %>%
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("65+ years"),
    new_values = c("5: 65+ years")
  )

In each case, we substitute the old value for the new one in the indicated attribute.

Updates application

Once updates are defined, they can be applied on the star schema from which they have been defined, as shown below.

st_mrs_age <- st_mrs_age %>%
  modify_dimension_records(updates_st_mrs_age)

The result obtained for the first star schema is shown below.

when_key when_happened_date when_happened_week when_happened_year
when_available_key data_availability_date data_availability_week data_availability_year
where_key division state city
1 1 CT Bridgeport
2 1 CT Hartford
3 1 CT New Haven
4 1 CT Waterbury
5 1 MA Boston
6 1 MA Cambridge
who_key age_range
1 1: <1 year
2 2: 1-24 years
3 3: 25-44 years
4 4: 45-64 years
5 5: 65+ years
when_common_key date week year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
who_key where_key when_available_key when_key n_deaths nrow_agg
1 1 10 1 3 1
1 1 12 8 1 1
1 1 18 15 5 1
1 1 23 22 3 1
1 1 31 29 6 1
1 1 49 36 3 1

It can be seen that the row with the value “Bridgepor” in the city column has disappeared: it has been merged with the row with the correct value. This update has also been transmitted to the fact table. Although it is not seen in the tables, the same has happened with the dates that have been unified by the update.

The same updates can also be applied to other star schemas with dimensions in common with the original star schema, as shown below.

st_mrs_cause <- st_mrs_cause %>%
  modify_dimension_records(updates_st_mrs_age)

Updates defined on dimensions not included in the star schema are ignored. In this case those referring to the who dimension. Likewise, if there are no records that meet the conditions to get the old value, they are also ignored.

Dimension enrichment

From the original attributes included in the dimensions, we can obtain new attributes that facilitate queries or offer new query possibilities.

Enrich the who dimension

Suppose that we are interested in defining some broader age ranges than the existing ones. This operation can be done by enriching the corresponding dimension.

First, we export the attributes to consider in table form, in this case only the age range.

tb_who <-
  enrich_dimension_export(st_mrs_age,
                          name = "who",
                          attributes = c("age_range"))

Next, we can see the result of the export operation. It is a table with the selected attributes where duplicate values have been eliminated if there are any (in this case there are no repeated values).

age_range
1: <1 year
2: 1-24 years
3: 25-44 years
4: 45-64 years
5: 65+ years

In the table we add the columns that we want. In this case a new column to define the new broader age range.

v <-
  c("0-24 years", "0-24 years", "25+ years", "25+ years", "25+ years")
tb_who <-
  tibble::add_column(tb_who,
                     wide_age_range = v)

The new table can be seen below.

age_range wide_age_range
1: <1 year 0-24 years
2: 1-24 years 0-24 years
3: 25-44 years 25+ years
4: 45-64 years 25+ years
5: 65+ years 25+ years

We enrich the dimension considering the new data in the table.

st_mrs_age <-
  st_mrs_age %>%
  enrich_dimension_import(name = "who", tb_who)

We can see the result below, where the dimension has the new defined attribute.

who_key age_range wide_age_range
1 1: <1 year 0-24 years
2 2: 1-24 years 0-24 years
3 3: 25-44 years 25+ years
4 4: 45-64 years 25+ years
5 5: 65+ years 25+ years
Enrich the where dimension

For the where dimension we can proceed in the same way as we have done for the who dimension: Export the data, complete it manually and import it again, as shown below.

tb_where <-
  enrich_dimension_export(st_mrs_age,
                          name = "where",
                          attributes = c("division"))

The new table for division data can be seen below.

division
1
2
3
4
5
6
7
8
9

We look for the names of the divisions and add the data of the regions to which they belong.

tb_where <-
  tibble::add_column(
    tb_where,
    division_name = c(
      "New England",
      "Middle Atlantic",
      "East North Central",
      "West North Central",
      "South Atlantic",
      "East South Central",
      "West South Central",
      "Mountain",
      "Pacific"
    ),
    region = c('1',
               '1',
               '2',
               '2',
               '3',
               '3',
               '3',
               '4',
               '4'),
    region_name = c(
      "Northeast",
      "Northeast",
      "Midwest",
      "Midwest",
      "South",
      "South",
      "South",
      "West",
      "West"
    )
  )

st_mrs_age <-
  st_mrs_age %>%
  enrich_dimension_import(name = "where", tb_where)

st_mrs_cause <-
  st_mrs_cause %>%
  enrich_dimension_import(name = "where", tb_where)

To add the name of the states and the county to which each city belongs, we could proceed in the same way. However, it is easier if we try to locate this data and use it directly. These data are available in the ft_usa_states and ft_usa_city_county data sets, respectively.

However, if we operate in the same way, when importing the data an error occurs. The reason is that not all the data in the dimension matches the data in the imported table. We can determine the missing data using the following function.

tb_missing <-
  st_mrs_age %>%
  enrich_dimension_import_test(name = "where", ft_usa_states)

The result obtained is shown below.

where_key division state city division_name region region_name
48 3 Unknown Unknown East North Central 2 Midwest
78 6 Unknown Unknown East South Central 3 South
91 7 Unknown Unknown West South Central 3 South
111 9 Unknown Unknown Pacific 4 West

In all cases, the problem occurs for the value “Unknown” in the state attribute. We must add a row to the data before importing it.

tb_where_state <- ft_usa_states %>%
  tibble::add_row(state = "Unknown", state_name = "Unknown")

st_mrs_age <-
  st_mrs_age %>%
  enrich_dimension_import(name = "where", tb_where_state)

st_mrs_cause <-
  st_mrs_cause %>%
  enrich_dimension_import(name = "where", tb_where_state)

The same problem occurs and we apply the same solution to add the county data4.

tb_where_county <- ft_usa_city_county %>%
  tibble::add_row(city = "Unknown",
                  state = "Unknown",
                  county = "Unknown")

st_mrs_age <-
  st_mrs_age %>%
  enrich_dimension_import(name = "where", tb_where_county)

st_mrs_cause <-
  st_mrs_cause %>%
  enrich_dimension_import(name = "where", tb_where_county)

We can see the first rows of the final result below.

where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Boston New England 1 Northeast Massachusetts Suffolk
6 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex
7 1 MA Fall River New England 1 Northeast Massachusetts Bristol
8 1 MA Lowell New England 1 Northeast Massachusetts Middlesex
9 1 MA Lynn New England 1 Northeast Massachusetts Essex
10 1 MA New Bedford New England 1 Northeast Massachusetts Bristol

Constellation definition

A constellation is defined from a list of star schemas, as shown below.

ct_mrs <- constellation(list(st_mrs_age, st_mrs_cause), name = "mrs")

All dimensions of the same name in star schemas must be compatible in structure and type of columns, and are defined as conformed dimensions. The conformed dimensions share all the instances of the original dimensions, this implies possible modifications in the surrogate keys that are transmitted to foreign keys of the component fact tables.

The tables of the obtained conformed dimensions are shown below.

when_key when_happened_date when_happened_week when_happened_year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
when_available_key data_availability_date data_availability_week data_availability_year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Boston New England 1 Northeast Massachusetts Suffolk
6 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex

It can be seen that the conformed dimensions are considered regardless of the type of dimension in the star schema. Definition of conformed dimensions does not imply any change in the definition of dimensions in star schemas: Role and role playing dimensions remain the same, only their rows may have changed.

In this particular case there are no discrepancies in the values of the star schemas dimension instances, since the data source is the same for both.

If discrepancies are detected once the integration has been carried out, new modification operations can be defined on conformed dimensions (as it has been done for the dimensions of star schemas) that are applied at the constellation level and are automatically transmitted to the component star schemas (modify_conformed_dimension_records).

Incremental refresh

Once we have star schemas built with the data available at the moment, we may obtain additional data, with the same structure as the initial data but from a later time. Sometimes the new data also includes data from previous periods to operate on them.

Under these conditions, suppose we get the data sets mrs_age_w10, mrs_age_w11, mrs_cause_w10, and mrs_cause_w11, for weeks 10 and 11 (data in star schemas runs through week 9). In all cases, some data from previous periods are included.

To perform an incremental refresh of a star schema, we must have the new data in the same star schema format. Additionally, if we have done data cleaning, it is likely that we will have to correct part of the corrected errors again over the new data. For this reason, it is best to package all the transformations carried out on the original data in function form so that they can be easily applied to new data.

Refresh operations for data according to age range

Below you can see the function that groups the transformations defined for the data according to the age range.

mrs_age_definition <-
  function(ft,
           dm,
           updates,
           tb_who,
           tb_where,
           tb_where_state,
           tb_where_county) {
    star_schema(ft, dm) %>%
      role_playing_dimension(
        dim_names = c("when", "when_available"),
        name = "When Common",
        attributes = c("date", "week", "year")
      ) %>%
      snake_case() %>%
      character_dimensions(NA_replacement_value = "Unknown",
                           length_integers = list(week = 2)) %>%
      rename_dimension_attributes(
        name = "when",
        attributes = c("week_ending_date", "week", "year"),
        new_names = c(
          "when_happened_date",
          "when_happened_week",
          "when_happened_year"
        )
      ) %>%
      rename_dimension_attributes(
        name = "where",
        attributes = c("region"),
        new_names = c("division")
      ) %>%
      rename_measures(measures = c("deaths"),
                      new_names = c("n_deaths")) %>%
      modify_dimension_records(updates) %>%
      enrich_dimension_import(name = "who", tb_who) %>%
      enrich_dimension_import(name = "where", tb_where) %>%
      enrich_dimension_import(name = "where", tb_where_state) %>%
      enrich_dimension_import(name = "where", tb_where_county)
  }

We apply this function to new data sets, as shown below.

st_mrs_age_w10 <-
  mrs_age_definition(
    mrs_age_w10,
    dm_mrs_age,
    updates_st_mrs_age,
    tb_who,
    tb_where,
    tb_where_state,
    tb_where_county
  )

st_mrs_age_w11 <-
  mrs_age_definition(
    mrs_age_w11,
    dm_mrs_age,
    updates_st_mrs_age,
    tb_who,
    tb_where,
    tb_where_state,
    tb_where_county
  )

Errors may occur because the data is different from the original, especially in the dimension enrichment part, if data is missing. In this case, we must eliminate from the function the lines corresponding to enrichment and later try to enrich the dimension by checking the errors with enrich_dimension_import_test.

Once we have the data in the same format, we can apply the incremental refresh to the original star schema, as follows.

st_mrs_age <- st_mrs_age %>%
  incremental_refresh_star_schema(st_mrs_age_w10, existing = "replace") %>%
  incremental_refresh_star_schema(st_mrs_age_w11, existing = "replace")

In this case, it has been assumed that if data from previous periods appears among the new data, the new data has to replace the previous data (value “replace” in existing parameter).

If the star schema has been integrated into a constellation, the incremental refresh can be performed on it, as follows.

ct_mrs <- ct_mrs %>%
  incremental_refresh_constellation(st_mrs_age_w10, existing = "replace") %>%
  incremental_refresh_constellation(st_mrs_age_w11, existing = "replace")

In this case, the corresponding star schema, the conformed dimensions and all the star schemas that share them are updated.

Refresh operations for data according to cause

Similar to how it has been done for age data, it can be done for cause data, as shown below.

mrs_cause_definition <-
  function(ft,
           dm,
           updates,
           tb_where,
           tb_where_state,
           tb_where_county) {
    star_schema(ft, dm) %>%
      snake_case() %>%
      character_dimensions(
        NA_replacement_value = "Unknown",
        length_integers = list(
          week = 2,
          data_availability_week = 2,
          reception_week = 2
        )
      ) %>%
      role_playing_dimension(
        dim_names = c("when", "when_received", "when_available"),
        name = "when_common",
        attributes = c("date", "week", "year")
      ) %>%
      rename_dimension_attributes(
        name = "when",
        attributes = c("week_ending_date", "week", "year"),
        new_names = c(
          "when_happened_date",
          "when_happened_week",
          "when_happened_year"
        )
      ) %>%
      rename_dimension_attributes(
        name = "where",
        attributes = c("region"),
        new_names = c("division")
      ) %>%
      modify_dimension_records(updates) %>%
      enrich_dimension_import(name = "where", tb_where) %>%
      enrich_dimension_import(name = "where", tb_where_state) %>%
      enrich_dimension_import(name = "where", tb_where_county)
  }

st_mrs_cause_w10 <-
  mrs_cause_definition(
    mrs_cause_w10,
    dm_mrs_cause,
    updates_st_mrs_age,
    tb_where,
    tb_where_state,
    tb_where_county
  )

st_mrs_cause_w11 <-
  mrs_cause_definition(
    mrs_cause_w11,
    dm_mrs_cause,
    updates_st_mrs_age,
    tb_where,
    tb_where_state,
    tb_where_county
  )

st_mrs_cause <- st_mrs_cause %>%
  incremental_refresh_star_schema(st_mrs_cause_w10, existing = "group") %>%
  incremental_refresh_star_schema(st_mrs_cause_w11, existing = "group")

ct_mrs <- ct_mrs %>%
  incremental_refresh_constellation(st_mrs_cause_w10, existing = "group") %>%
  incremental_refresh_constellation(st_mrs_cause_w11, existing = "group")

In this case, the previously existing data is treated differently than it was in the previous case, now what is done is grouping it using the defined aggregation functions, assuming it is additional data that has not been entered before (value “group” in existing parameter).

Filter and purge operations

Sometimes the data refresh consists of eliminating data that is no longer necessary, generally because it corresponds to a period that has stopped being analysed but it can also be for other reasons. In general, they can be selected considering any combination of dimensions, not just the time dimension.

Suppose we want to delete the Boston data working at the constellation level. First, we select them considering the stars that make up the constellation.

st1 <- ct_mrs %>%
  get_star_schema("mrs_age") %>%
  filter_fact_rows(name = "where", city == "Boston")

st2 <- ct_mrs %>%
  get_star_schema("mrs_cause") %>%
  filter_fact_rows(name = "where", city == "Boston")

We can work both at the star or constellation level. In this example, we are going to do it at the constellation level (working on a temporary variable so as not to lose data), as we have been doing.

ct_tmp <- ct_mrs %>%
  incremental_refresh_constellation(st1, existing = "delete") %>%
  incremental_refresh_constellation(st2, existing = "delete")

These operations have only removed the fact data. The first records of the where dimension are shown below.

where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Boston New England 1 Northeast Massachusetts Suffolk
6 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex

Once the fact data is removed, we can remove the data for the dimensions that are no longer needed using the following function.

ct_tmp <- ct_tmp %>%
  purge_dimensions_constellation()

The result in the where dimension is shown below.

where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex
6 1 MA Fall River New England 1 Northeast Massachusetts Bristol

The Boston data has disappeared and the surrogate keys have been reassigned (on the temporary variable).

Exporting results

Once we have made the necessary definitions and transformations, we can export the data to work in a database or with a query tool.

Instead of exporting data in the specific format of a particular tool, it is exported as tibble-based structures that can be easily handled.

Star schema

Various export possibilities are offered. Specifically, for a star schema one of them is to export the data as a flat table. The main difference from the initial data is that we have cleaned and conformed it. This operation is offered for completeness. To work only with flat tables, this package is not suitable.

To work with databases, it is useful to be able to export a star schema as a list of tibble with dimension and fact tables, as shown below.

tl <- st_mrs_age %>%
  star_schema_as_tibble_list()

Optionally, the export function allows the role playing dimensions to be included.

Constellation

To export constellation data, as well as a tibble list, the multistar format may be interesting, where you have a list of tibble for fact tables and another for dimension tables.

ms_mrs <- ct_mrs %>%
  constellation_as_multistar()

multistar

We can obtain a flat table, implemented using a tibble, from a multistar (which can be the result of a query). If it only has one fact table, it is not necessary to provide its name.

ft <- ms_mrs %>%
  multistar_as_flat_table(fact = "mrs_age")

The first rows of the flat table obtained as a result are shown below.

age_range wide_age_range division state city division_name region region_name state_name county data_availability_date data_availability_week data_availability_year when_happened_date when_happened_week when_happened_year n_deaths nrow_agg
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-15 03 1962 1962-01-06 01 1962 3 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-17 03 1962 1962-01-13 02 1962 1 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-23 04 1962 1962-01-20 03 1962 5 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-28 04 1962 1962-01-27 04 1962 3 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-02-05 06 1962 1962-02-03 05 1962 6 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-02-23 08 1962 1962-02-10 06 1962 3 1

Query functions

The main motivation of the query functions of this package is to have the possibility to select subsets of data to be exported or presented through other packages.

We can define queries on data in multistar format. Data in this format can be obtained from a star or a constellation.

ms_mrs <- ct_mrs %>%
  constellation_as_multistar()

The query is created using dimensional_query and executed using run_query. We can refine it using select_dimension, select_fact and filter_dimension as much as we deem appropriate.

ms <- dimensional_query(ms_mrs) %>%
  select_dimension(name = "where",
                   attributes = c("city", "state")) %>%
  select_dimension(name = "when",
                   attributes = c("when_happened_year")) %>%
  select_fact(name = "mrs_age",
              measures = c("n_deaths")) %>%
  select_fact(
    name = "mrs_cause",
    measures = c("pneumonia_and_influenza_deaths", "other_deaths")
  ) %>%
  filter_dimension(name = "when", when_happened_week <= "03") %>%
  filter_dimension(name = "where", city == "Bridgeport") %>%
  run_query()

The result of a query is an multistar structure. In this way, we can define queries on the result of others. The result can also be transformed into a flat table.

ft <- ms %>%
  multistar_as_flat_table()

The content of the flat table is shown below.

when_happened_year city state n_deaths nrow_agg pneumonia_and_influenza_deaths other_deaths mrs_cause_nrow_agg
1962 Bridgeport CT 129 12 7 122 3

In the result it can be seen that, as the query has been defined at the year and city level, for a single city, with data available for a single year, there is only one row. Columns nrow_agg and mrs_cause_nrow_agg show the number of original rows that make up the result row.

From the result in the form of a flat table, pivottabler package can be used to present it in the form of a pivot table.

Available transformation operations

Package starschemar offers operations to transform flat tables into star schemas and also to export or exploit them through queries.

  1. From a flat table, we define a dimensional model classifying its attributes as facts or dimensions (dimensional modelling).

  2. From a flat table and a dimensional model we obtain a star schema that we can transform; from various star schemas we can define a constellation (star schema and constellation definition).

  3. Dimensions contain rows without duplicates, we can apply operations to perform data cleaning and to conform them (cleaning and conforming data).

  4. When new data is obtained, it is necessary to refresh the existing data with them by means of incremental refresh operations (incremental refresh).

  5. The results obtained can be exported to be consulted with other tools (exporting results).

  6. Finally, basic queries can be performed from R (query functions), especially to select the data to export.

Dimensional modelling

Starting from a flat table, a dimensional model is defined specifying the attributes that make up each of the dimensions and the measurements in the facts. The result is a dimensional_model object. It is carried out through the following functions:

dm <- dimensional_model()
dm <- dimensional_model() %>%
  define_dimension(name = "When",
                   attributes = c("Week Ending Date",
                                  "WEEK",
                                  "Year"))
dm <- dimensional_model() %>%
  define_fact(
    name = "mrs_age",
    measures = c("Deaths"),
    agg_functions = c("SUM"),
    nrow_agg = "nrow_agg"
  )

dm <- dimensional_model() %>%
  define_fact(name = "Factless fact")

Star schema and constellation definition

A dimensional model is implemented using a star schema. We can have several related star schemas through common dimensions that together form a fact constellation.

Star schema definition

A star schema is defined from a flat table and a dimensional model definition. Once defined, a star schema can be transformed by defining role playing dimensions, changing the writing style of element names or the type of dimension attributes. These operations are carried out through the following functions:

  • star_schema(): Creates a star_schema object from a flat table (implemented by a tibble) and a dimensional_model object. Example:
st <- star_schema(mrs_age, dm_mrs_age)
  • role_playing_dimension(): Given a list of star_schema dimension names, all with the same structure, a role playing dimension with the indicated name and attributes is generated. The original dimensions become role dimensions defined from the new role playing dimension. Example:
st <- star_schema(mrs_age, dm_mrs_age) %>%
  role_playing_dimension(
    dim_names = c("when", "when_available"),
    name = "When Common",
    attributes = c("Date", "Week", "Year")
  )
  • snake_case(): Transform fact, dimension, measurement, and attribute names according to the snake case style. Example:
st <- star_schema(mrs_age, dm_mrs_age) %>%
  snake_case()
  • character_dimensions(): Transforms numeric type attributes of dimensions into character type. In a star_schema numerical data are measurements that are situated in the facts. Numerical data in dimensions are usually codes, day, week, month or year numbers. There are tools that consider any numerical data to be a measurement, for this reason it is appropriate to transform the numerical data of dimensions into character data. It also allows indicating the literal to be used in case the numerical value is not defined. Example:
st <- star_schema(mrs_age, dm_mrs_age) %>%
  character_dimensions()

Star schema rename

Once a star schema is defined, we can rename its elements. It is necessary to be able to rename attributes of dimensions and measures of facts because the definition operations only allowed us to select columns of a flat table. For completeness also dimensions and facts can be renamed. To carry out these operations, the following functions are available:

  • rename_dimension(): Set new name for a dimension. Example:
st <- st_mrs_age %>%
  rename_dimension(name = "when", new_name = "when_happened")
  • get_dimension_attribute_names(): Get the name of attributes in a dimension, so that it is easier to modify them if necessary. Example:
attribute_names <- 
  st_mrs_age %>% get_dimension_attribute_names("when")
  • rename_dimension_attributes(): Set new names of some attributes in a dimension. Example:
st <-
  st_mrs_age %>% rename_dimension_attributes(
    name = "when",
    attributes = c("when_happened_week", "when_happened_year"),
    new_names = c("week", "year")
  )
  • rename_fact(): Set new name for facts. Example:
st <- st_mrs_age %>% rename_fact("age") 
  • get_measure_names(): Get the name of the measures in fact, so that it is easier to modify them if necessary. Example:
measure_names <- 
  st_mrs_age %>% get_measure_names()
  • rename_measures(): Set new names of some measures in facts. Example:
st <-
  st_mrs_age %>% rename_measures(measures = c("n_deaths"),
                                 new_names = c("num_deaths"))

Constellation definition

Based on various star schemas, a constellation can be defined in which star schemas share common dimensions. Dimensions with the same name must be shared. It is defined by the following function:

  • constellation(): Creates a constellation object from a list of star_schema objects. All dimensions with the same name in the star schemas have to be conformable. Example:
ct <- constellation(list(st_mrs_age, st_mrs_cause), name = "mrs")

Cleaning and conforming data

Once star schemas and fact constellations are defined, data cleaning operations can be carried out on dimensions. There are three groups of functions:

  1. One to obtain components of star schemas and constellations.

  2. Another to define data cleaning operations over dimensions.

  3. One more to apply operations to star schemas or constellations.

Obtaining components

We can obtain dimensions from a star schema or conformed dimensions from a fact constellation. Available functions in both cases are similar.

Star schema

  • get_dimension_names(): Get the names of the dimensions of a star schema. Role playing dimensions are not considered. Example:
dn <- st_mrs_age %>%
  get_dimension_names()
  • get_dimension(): Get a dimension of a star schema given its name. Role dimensions can be obtained but not role playing dimensions. Example:
where <- st_mrs_age %>%
  get_dimension("where")

Constellation

  • get_conformed_dimension_names(): Get the names of the conformed dimensions of a constellation. Example:
dn <- ct_mrs %>%
  get_conformed_dimension_names()
  • get_conformed_dimension(): Get a conformed dimension of a constellation given its name. Example:
when <- ct_mrs %>%
  get_conformed_dimension("when")
  • get_star_schema_names(): Get the names of the star schemas of a constellation. Example:
stn <- ct_mrs %>%
  get_star_schema_names()
  • get_star_schema(): Get a star schema of a constellation given its name. Example:
age <- ct_mrs %>%
  get_star_schema("mrs_age")

Definition of updates

Modifications are defined on dimension rows in various ways based exclusively on the values of the dimension fields. Although the surrogate key intervenes in the definition, the result, internally, does not depend on it so that it can be applied more generally in other star schemas.

  • record_update_set(): A record_update_set object is created. Stores updates on dimension records. Each update is made up of a dimension name, an old value set, and a new value set. Example:
updates <- record_update_set()
  • match_records(): For a dimension, given the primary key of two records, it adds an update to the set of updates that modifies the combination of values of the rest of attributes of the first record so that they become the same as those of the second. Example:
updates <- record_update_set() %>%
  match_records(dimension = where,
                old = 1,
                new = 2)
  • update_record(): For a dimension, given the primary key of one record, it adds an update to the set of updates that modifies the combination of values of the rest of attributes of the selected record so that they become those given. Example:
updates <- record_update_set() %>%
  update_record(
    dimension = who,
    old = 1,
    values = c("1: <1 year")
  )
  • update_selection(): For a dimension, given a vector of column names, a vector of old values and a vector of new values, it adds an update to the set of updates that modifies all the records that have the combination of old values in the columns with the new values in those same columns. Example:
updates <- record_update_set() %>%
  update_selection(
    dimension = where,
    columns = c("city"),
    old_values = c("Bridgepor"),
    new_values = c("Bridgeport")
  )
  • update_selection_general(): For a dimension, given a vector of column names, a vector of old values for those columns, another vector column names, and a vector of new values for those columns, it adds an update to the set of updates that modifies all the records that have the combination of old values in the first column vector with the new values in the second column vector. Example:
updates <- record_update_set() %>%
  update_selection_general(
    dimension = where,
    columns_old = c("state", "city"),
    old_values = c("CT", "Bridgepor"),
    columns_new = c("city"),
    new_values = c("Bridgeport")
  )

Updates application

Defined updates can be applied on a star schema or on the conformed dimension of a fact constellation.

Star schema

  • modify_dimension_records(): Given a list of dimension record update operations, they are applied on the dimensions of the star_schema object. Update operations must be defined with the set of functions available for that purpose. Example:
st <- st_mrs_age %>%
  modify_dimension_records(updates_st_mrs_age)

Constellation

  • modify_conformed_dimension_records(): Given a list of dimension record update operations, they are applied on the conformed dimensions of the constellation object. Update operations must be defined with the set of functions available for that purpose. Example:
ct <- ct_mrs %>%
  modify_conformed_dimension_records(updates_st_mrs_age)

Dimension enrichment

To enrich a dimension with new attributes related to others already included in it, first, we export the attributes on which the new ones depend, then we define the new attributes, and import the table with all the attributes to be added to the dimension.

tb <-
  enrich_dimension_export(st_mrs_age,
                          name = "when_common",
                          attributes = c("week", "year"))
tb <-
  enrich_dimension_export(st_mrs_age,
                          name = "when_common",
                          attributes = c("week", "year"))

# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")

st <- enrich_dimension_import(st_mrs_age, name = "when_common", tb)
tb <-
  enrich_dimension_export(st_mrs_age,
                          name = "when_common",
                          attributes = c("week", "year"))

# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")[-1, ]

tb2 <- enrich_dimension_import_test(st_mrs_age, name = "when_common", tb)

Incremental refresh

When new data is obtained, an incremental refresh of the data can be carried out, both of the dimensions and of the facts. Incremental refresh can be applied to both star schema and fact constellation, using the following functions.

Star schema

  • incremental_refresh_star_schema(): Incrementally refresh a star schema with the content of a new one that is integrated into the first. Once the dimensions are integrated, if there are records in the fact table whose keys match the new ones, new ones can be ignored, they can be replaced by new ones, all of them can be grouped using the aggregation functions, or they can be deleted. Therefore, the possible values of the existing parameter are: “ignore”, “replace”, “group” or “delete”. Example:
st <- st_mrs_age %>%
  incremental_refresh_star_schema(st_mrs_age_w10, existing = "replace")

Sometimes the data refresh consists of eliminating data that is no longer necessary, generally because it corresponds to a period that has stopped being analysed but it can also be for other reasons. This data can be selected using the following function:

  • filter_fact_rows(): Filter fact rows based on dimension conditions in a star schema. Dimensions remain unchanged. Filtered rows can be deleted using the incremental_refresh_star_schema function. Example:
st <- st_mrs_age %>%
  filter_fact_rows(name = "when", when_happened_week <= "03") %>%
  filter_fact_rows(name = "where", city == "Bridgeport")

st2 <- st_mrs_age %>%
  incremental_refresh_star_schema(st, existing = "delete")

Once the fact data is removed (using the other incremental refresh functions), we can remove the data for the dimensions that are no longer needed using the following function:

  • purge_dimensions_star_schema(): Delete instances of dimensions not related to facts in a star schema. Example:
st3 <- st2 %>%
  purge_dimensions_star_schema()

Constellation

  • incremental_refresh_constellation(): Incrementally refresh a star schema in a constellation with the content of a new star schema that is integrated into the first. Example:
ct <- ct_mrs %>%
  incremental_refresh_constellation(st_mrs_age_w10, existing = "replace")
  • purge_dimensions_constellation(): Delete instances of dimensions not related to facts in a star schema. It performs the operation for each of the component star_schemas and also for the conformed dimensions. Example:
ct <- ct_mrs %>%
  purge_dimensions_constellation()

Exporting results

Once the data has been properly structured and transformed, it can be exported to be consulted with other tools or with R. Various export formats have been defined, both for star schemas and for constellations, using the following functions.

Star schema

  • star_schema_as_flat_table(): We can again obtain a flat table, implemented using a tibble, from a star schema. Example:
ft <- st_mrs_age %>%
  star_schema_as_flat_table()
  • star_schema_as_multistar(): We can obtain a multistar. A multistar only distinguishes between general and conformed dimensions, each dimension has its own data. It can contain multiple fact tables. Example:
ms <- st_mrs_age %>%
  star_schema_as_multistar()
  • star_schema_as_tibble_list(): We can obtain a tibble list with them. Role playing dimensions can be optionally included. Example:
tl <- st_mrs_age %>%
  star_schema_as_tibble_list(include_role_playing = TRUE)

Constellation

  • constellation_as_multistar(): We can obtain a multistar. A multistar only distinguishes between general and conformed dimensions, each dimension has its own data. It can contain multiple fact tables. Example:
ms <- ct_mrs %>%
  constellation_as_multistar()
  • constellation_as_tibble_list(): We can obtain a tibble list with them. Role playing dimensions can be optionally included. Example:
tl <- ct_mrs %>%
  constellation_as_tibble_list(include_role_playing = TRUE)

multistar

  • multistar_as_flat_table(): We can obtain a flat table, implemented using a tibble, from a multistar (which can be the result of a query). If it only has one fact table, it is not necessary to provide its name. Example:
ft <- ms_mrs %>%
  multistar_as_flat_table(fact = "mrs_age")

Query functions

There are many multidimensional query tools available. The exported data, once stored in files, can be used directly from them. You can also perform basic queries from R on data in the multistar format, mainly for selecting the data to export, using the following functions:

ms_mrs <- ct_mrs %>%
  constellation_as_multistar()

dq <- dimensional_query(ms_mrs)
dq <- dimensional_query(ms_mrs) %>%
  select_fact(
    name = "mrs_age",
    measures = c("n_deaths"),
    agg_functions = c("MAX")
  )

dq <- dimensional_query(ms_mrs) %>%
  select_fact(name = "mrs_age",
              measures = c("n_deaths"))

dq <- dimensional_query(ms_mrs) %>%
  select_fact(name = "mrs_age")
dq <- dimensional_query(ms_mrs) %>%
  select_dimension(name = "where",
                   attributes = c("city", "state")) %>%
  select_dimension(name = "when")
dq <- dimensional_query(ms_mrs) %>%
  filter_dimension(name = "when", when_happened_week <= "03") %>%
  filter_dimension(name = "where", city == "Boston")
ms <- dimensional_query(ms_mrs) %>%
  select_dimension(name = "where",
                   attributes = c("city", "state")) %>%
  select_dimension(name = "when",
                   attributes = c("when_happened_year")) %>%
  select_fact(
    name = "mrs_age",
    measures = c("n_deaths")
  ) %>%
  select_fact(
    name = "mrs_cause",
    measures = c("pneumonia_and_influenza_deaths", "other_deaths")
  ) %>%
  filter_dimension(name = "when", when_happened_week <= "03") %>%
  filter_dimension(name = "where", city == "Boston") %>%
  run_query()

Conclusions

starschemar package offers a set of operations that allow us to transform flat tables into star schemas. Star schemas support the definition of role playing and role dimensions. Additional transformation operations can be applied to each star schema to adapt the format of the data. From several star schemas you can define fact constellation with conformed dimensions.

Cleaning and conforming data operations can be defined on the star schemas and fact constellation. To update the data, incremental refresh operations are offered, also applicable on said structures. In addition, there are several possibilities to export the results obtained in the form of easily treatable tibble-based structures. Operations are also provided to query the multidimensional structure, mainly for selecting the data to export.

Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation.

References

Adamson, Christopher. 2010. Star Schema: The Complete Reference. McGraw Hill Professional.

Jensen, Christian S., Torben Bach Pedersen, and Christian Thomsen. 2010. Multidimensional Databases and Data Warehousing. Morgan & Claypool Publishers.

Kimball, Ralph, and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons.


  1. I know this data set thanks to Alberto J. Duran, student of Multidimensional Systems during the 2019-2020 academic year, a subject I teach at the University of Granada (Spain), for a work he developed based on it, tutored by me.↩︎

  2. The transformation has been carried out with tidyverse and flattabler packages.↩︎

  3. SaveRDS and readRDS functions can be used to save and retrieve star schemas or any other defined data structure.↩︎

  4. This is how I have located the problem with the value “Wilimington” in the city attribute, which we have already corrected, so it no longer appears.↩︎