<% df = load_pkg_stat_snapshot() %>
In this report, we study two aspects of the package dependencies:
For each package, we first look at the maximal heaviness from its parents. The following plots show the relation between the number of parents and the max heaviness from parents. Generally, on the border of the point cloud, the max heaviness from parents tends to drop as the number of parents increases. This is because when a package has more parents, the dependency packages brought in by each parent overlap more (i.e. dependencies from parent A overlap with the dependencies from parent B). Heaviness measures the number of unique dependencies that a single parent brings in, in other words, the number of dependencies that are not brought in by any other parent. Thus, with more parents, the max heaviness from parents decreases.
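The idea of "unique dependencies per parent" can be illustrated with a minimal Python sketch. The input format (a dict mapping each parent to the set of packages it brings in, the parent itself included) and the package names are assumptions made for this illustration, not the format used to build this report.

```python
def heaviness_from_parents(parent_deps):
    """Map each parent to the number of dependencies that only it brings in.

    parent_deps: dict mapping a parent name to the set of packages it
    brings in (hypothetical input format; the parent itself is included).
    """
    result = {}
    for parent, deps in parent_deps.items():
        # Union of everything brought in by all the other parents.
        others = set()
        for other, other_deps in parent_deps.items():
            if other != parent:
                others |= other_deps
        # Heaviness counts only the dependencies unique to this parent.
        result[parent] = len(deps - others)
    return result


# Toy example with made-up package names.
parent_deps = {
    "A": {"A", "x", "y", "z"},
    "B": {"B", "y", "z"},
    "C": {"C", "z", "w"},
}
heaviness = heaviness_from_parents(parent_deps)
max_heaviness = max(heaviness.values())
total_heaviness = sum(heaviness.values())
```

In this toy example, the shared packages `y` and `z` do not count towards any parent, which is why adding more parents tends to lower the heaviness of each single parent.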
In the plot, several packages lie far away from the cloud (highlighted in red and orange). These packages can be thought of as having extremely heavy parents compared to most of the others. To capture these packages with heavy parents, we define "adjusted max heaviness on parent packages" as follows.
For a package P, denote $h$ as the max heaviness from its parent packages. The adjusted heaviness is calculated as $h^{adj} = h \cdot a$, where $a$ is a zooming factor calculated as $a = n/n_{max}$. Here $n$ is the number of parents of package P and $n_{max}$ is the maximal number of parents over all packages (i.e. all CRAN/Bioconductor packages).
The zooming factor $a$ decreases the heaviness more for packages with a small number of parents. It therefore flattens the point cloud, so that it is easy to set a cutoff to mark extreme points. The plot of adjusted heaviness versus number of parents can be seen by clicking the radio button "Adjusted max heaviness verse number of parent packages" below. We simply mark a package as having highly heavy parents if the adjusted heaviness is larger than <%=CUTOFF$adjusted_max_heaviness_from_parents[2]%>, and as having moderately heavy parents if the adjusted heaviness is between <%=CUTOFF$adjusted_max_heaviness_from_parents[1]%> and <%=CUTOFF$adjusted_max_heaviness_from_parents[2]%>. The packages with highly heavy parents are listed on the right of the following figure.
<%= img(paste0(env$figure_dir, "/plot-parent-max-heaviness.png"), style="height:500px")%>
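As a minimal sketch of the adjustment and the cutoff rule described above, the following Python snippet computes $h^{adj} = h \cdot n/n_{max}$ and assigns a label. The function names, the cutoff values and the handling of values exactly on a cutoff are assumptions made for illustration. The real cutoffs are the ones inserted into the text above.

```python
def adjusted_max_heaviness(h, n, n_max):
    """Scale the max heaviness h by the zooming factor a = n / n_max."""
    return h * n / n_max


def classify(h_adj, cutoffs):
    """Label an adjusted heaviness given a (lower, upper) cutoff pair.

    Values exactly on a cutoff are treated as "medium" here, which is an
    assumption; the text only says "larger than" and "between".
    """
    lower, upper = cutoffs
    if h_adj > upper:
        return "high"
    if h_adj >= lower:
        return "medium"
    return "low"


# Hypothetical numbers: a package with 5 parents whose heaviest parent
# has heaviness 60, with n_max = 100 and made-up cutoffs (2.0, 4.0).
h_adj = adjusted_max_heaviness(h=60, n=5, n_max=100)
label = classify(h_adj, cutoffs=(2.0, 4.0))
```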
A package may have more than one heavy parent, so we next look at the total heaviness from all parents of a package. Note that the heaviness from a parent measures the number of additional unique packages it brings in that are not brought in by any of the other parents. Therefore, the total heaviness from parents is the number of dependency packages that are brought in by only one parent package. Generally, for a package, the majority of its parents contribute very small heaviness, while only a few parents (mostly 1 to 3) contribute high heaviness. Thus, the "total heaviness from all parents" can be approximately treated as the "total heaviness from heavy parents".
Similarly, we define an "adjusted total heaviness from parents" to flatten the point distribution. It is defined as:
$h^{adj} = h \cdot a$, where $a = \sqrt{n}/\sqrt{n_{max}}$. Note here $h$ is the total heaviness from parents for package P.
The plot of adjusted heaviness versus number of parents can be seen by clicking the radio button "Adjusted total heaviness verse number of parent packages". We simply mark a package as having highly heavy parents if the adjusted heaviness is larger than <%=CUTOFF$adjusted_total_heaviness_from_parents[2]%>, and as having moderately heavy parents if the adjusted heaviness is between <%=CUTOFF$adjusted_total_heaviness_from_parents[1]%> and <%=CUTOFF$adjusted_total_heaviness_from_parents[2]%>.
<%= img(paste0(env$figure_dir, "/plot-parent-total-heaviness.png"), style="height:500px")%>
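The two zooming factors used in this section, $n/n_{max}$ for the max heaviness and $\sqrt{n}/\sqrt{n_{max}}$ for the total heaviness, can be compared with a short Python sketch. The function names and the value $n_{max} = 100$ are assumptions chosen only to keep the numbers easy to read.

```python
import math


def zoom_linear(n, n_max):
    """Zooming factor used for the adjusted max heaviness."""
    return n / n_max


def zoom_sqrt(n, n_max):
    """Zooming factor used for the adjusted total heaviness."""
    return math.sqrt(n) / math.sqrt(n_max)


n_max = 100  # hypothetical maximal number of parents
for n in (1, 25, 100):
    print(n, zoom_linear(n, n_max), zoom_sqrt(n, n_max))

# Adjusted total heaviness for a hypothetical package with
# total heaviness 40 and 25 parents.
h_adj_total = 40 * zoom_sqrt(25, n_max)
```

For any $n < n_{max}$ the square-root factor is larger than the linear one, so the total heaviness of packages with few parents is shrunk less strongly than their max heaviness.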
According to both figures in this section, CRAN packages show very similar trends for the max and total heaviness from parents. Bioconductor packages, however, generally have more heavy parents, e.g. musicatk and singleCellTK.
Generally, the heaviness on child packages tends to decrease as the number of child packages increases, since it is averaged over all children. To highlight the packages that heavily affect large numbers of children, the original definition of heaviness is adjusted. The original heaviness on child packages is defined as:
For a package P,
assume it has $K$ child packages and the $k^{th}$ child is denoted
as $A_k$. Denote $n_{1k}$ as the number of strong dependencies of package
$A_k$ and $n_{2k}$ as the number of strong dependencies of $A_k$ if P is moved
to its Suggests. The heaviness of P on its child packages, denoted as $h$, is
calculated as $h = \frac{1}{K} \sum_{k=1}^{K}(n_{1k} - n_{2k})$, which is the average heaviness over all its child packages.
Since the original heaviness is averaged over the number of children, a package with large $K$ can have a small heaviness even though it affects many children. The heaviness on child packages is therefore adjusted by adding a small constant $a$ to $K$, so that the heaviness decreases more for small $K$ than for large $K$.
$h^{adj} = \frac{1}{K + a} \sum_{k=1}^{K}(n_{1k} - n_{2k})$
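The following Python sketch evaluates both formulas on made-up numbers to show how the constant $a$ affects packages with few and with many children differently. The function name and the input lists are hypothetical. `n1` and `n2` hold $n_{1k}$ and $n_{2k}$ for each child.

```python
def heaviness_on_children(n1, n2, a=0):
    """Heaviness on children; a = 0 gives the original (unadjusted) value.

    n1[k]: number of strong dependencies of child k.
    n2[k]: the same number after moving P to the child's Suggests.
    """
    K = len(n1)
    if K == 0:
        return 0.0
    total = sum(x - y for x, y in zip(n1, n2))
    return total / (K + a)


# A package with 3 children versus a package with 100 children,
# each child gaining 20 dependencies from P (made-up numbers).
few_orig = heaviness_on_children([30, 30, 30], [10, 10, 10])
few_adj = heaviness_on_children([30, 30, 30], [10, 10, 10], a=10)
many_orig = heaviness_on_children([30] * 100, [10] * 100)
many_adj = heaviness_on_children([30] * 100, [10] * 100, a=10)
```

Both packages have an original heaviness of 20, but with $a = 10$ the package with 3 children drops to about 4.6 while the one with 100 children only drops to about 18.2.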
We empirically select 10 for $a$. Click on the following title to see how $a$ is selected.
If we rank packages by heaviness, the value of $a$ affects the ranking more for the packages with small $K$. We take $a$ as integers in the list [0, 1, ..., 29, 30], and for each $a$ and a specific package P, we calculate the adjusted heaviness on its children, denoted as $h^{adj}_{a, p}$. Note $h^{adj}_{0, p} = h_p$. $a$ is selected as the value at which the ranking of all packages becomes stable. To measure the stability of the ranking of $h^{adj}_{a}$ compared to the previous $a$, we calculate a score denoted as $v$: $v = \sum_{p=1}^{N} I(|rank(h^{adj}_{a,p}) - rank(h^{adj}_{a-1,p})| > 50)$, where $N$ is the total number of packages and $I()$ is an indicator function. $v$ measures the number of packages whose rank differs by more than 50 between two neighbouring values of $a$ (50 is a very small number compared to the total number of R packages in this analysis, which is 21741). When $v$ becomes stable with $a$, we say increasing $a$ won't change the ranking of the packages too much.
The following plot shows the relation of score $v$ and $a$. A value of 10 is taken as the optimized value of $a$, since it is the elbow of the curve.
<%= img(paste0(env$figure_dir, "/plot-select-a-adjusted-heaviness-children.png"), style="height:500px")%>
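The selection procedure for $a$ can be sketched in Python as follows. This is only an illustration under stated assumptions: ranks are assigned in decreasing order of heaviness, ties are broken by input order, every package has at least one child, and the function names and toy data are made up. The toy example uses a threshold of 0 instead of 50 because it contains only three packages.

```python
def ranks(values):
    """Rank values in decreasing order (1 = largest); ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def stability_scores(sums, counts, a_values, threshold=50):
    """Compute v for each a, compared with the previous value of a.

    sums[p]: sum over children of (n1 - n2) for package p.
    counts[p]: number of children K of package p (assumed >= 1).
    """
    scores = {}
    previous = None
    for a in a_values:
        h_adj = [s / (k + a) for s, k in zip(sums, counts)]
        current = ranks(h_adj)
        if previous is not None:
            # Count packages whose rank moved by more than the threshold.
            scores[a] = sum(
                abs(c - p) > threshold for c, p in zip(current, previous)
            )
        previous = current
    return scores


# Toy data: three packages with 1, 10 and 100 children.
scores = stability_scores(
    sums=[10, 60, 300], counts=[1, 10, 100], a_values=[0, 1, 2], threshold=0
)
```

In the toy data, going from $a = 0$ to $a = 1$ swaps the first two packages ($v = 2$), and going from $a = 1$ to $a = 2$ changes nothing ($v = 0$).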
The plot of adjusted heaviness versus number of children can be seen by clicking the radio button "Adjusted heaviness verse number of child packages". We simply mark a package as having a highly heavy impact on its children if the adjusted heaviness is larger than <%=CUTOFF$adjusted_heaviness_on_children[2]%>, and as having a moderately heavy impact if the adjusted heaviness is between <%=CUTOFF$adjusted_heaviness_on_children[1]%> and <%=CUTOFF$adjusted_heaviness_on_children[2]%>.
<%= img(paste0(env$figure_dir, "/plot-child-heaviness.png"), style="height:500px")%>
The analysis of heaviness on child packages is more useful for developers, because it tells you the expected number of additional
dependency packages that are brought in when you add a new direct dependency to your package. E.g. if you add
lumi to the Imports of your package, your package will likely
have 111 extra dependency packages.
We next look at the indirect effect on the dependencies of downstream packages. Note that here we only look at the downstream packages excluding the child packages. A comparison with the case that includes child packages can be found in the next section of this report.
Similar to the heaviness on child packages, the heaviness on indirect downstream packages also decreases as the number of downstream packages increases. We therefore also define an "adjusted heaviness on indirect downstream packages". The original definition of heaviness on indirect downstream packages is as follows:
For a package P, assume it has $K$ downstream packages (child packages included) and the $k^{th}$ downstream package is denoted
as $B_k$. Denote $n_{1k}$ as the number of strong dependencies of package
$B_k$. Since P can affect its downstream packages in an indirect manner,
we recalculate the global dependency relations for all packages after moving
P to the Suggests of all its child packages. Then we denote
$n_{2k}$ as the number of strong dependencies of $B_k$ in the modified dependency graph.
Next we denote $S_c$ as the set of child packages of P and $K_c$ as the number of its child packages, thus $K \geq K_c$.
The heaviness of P on its indirect downstream packages (excluding child packages), denoted as $h$, is
calculated as: $h = \frac{1}{K-K_c} \sum_{k=1}^{K}(n_{1k} - n_{2k}) \cdot I(B_{k} \notin S_c)$, where $I()$ is an indicator function. $h$ is set to 0 when $K = K_c$.
Then a small constant $a$ is added to $K - K_c$ to adjust the original heaviness:
$h^{adj} = \frac{1}{K-K_c + a} \sum_{k=1}^{K}(n_{1k} - n_{2k}) \cdot I(B_{k} \notin S_c)$
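A minimal Python sketch of the two formulas for indirect downstream packages is given below. The function name, the dict-based input format and the package names are assumptions made for illustration. The sketch also assumes that every child package appears among the downstream packages, so that the number of remaining packages equals $K - K_c$.

```python
def heaviness_on_indirect_downstream(n1, n2, children, a=0):
    """Heaviness on downstream packages that are not direct children.

    n1, n2: dicts mapping each downstream package to its number of strong
            dependencies before and after moving P to its children's Suggests.
    children: set of direct child packages of P (the set S_c).
    a: adjustment constant; a = 0 gives the original heaviness.
    """
    indirect = [pkg for pkg in n1 if pkg not in children]
    if not indirect:
        # Corresponds to K = K_c, where the heaviness is set to 0.
        return 0.0
    total = sum(n1[pkg] - n2[pkg] for pkg in indirect)
    return total / (len(indirect) + a)


# Made-up example: two children (c1, c2) and two indirect downstream
# packages (d1, d2).
n1 = {"c1": 20, "c2": 15, "d1": 30, "d2": 12}
n2 = {"c1": 5, "c2": 10, "d1": 18, "d2": 12}
children = {"c1", "c2"}
h = heaviness_on_indirect_downstream(n1, n2, children)
h_adj = heaviness_on_indirect_downstream(n1, n2, children, a=6)
```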
We empirically select 6 for $a$. Click on the following title to see how $a$ is selected.
If we rank packages by heaviness, the value of $a$ affects the ranking more for the packages with a small number of indirect downstream packages. We take $a$ as integers in the list [0, 1, ..., 29, 30], and for each $a$ and a specific package P, we calculate the adjusted heaviness on its indirect downstream packages, denoted as $h^{adj}_{a, p}$. Note $h^{adj}_{0, p} = h_p$. $a$ is selected as the value at which the ranking of all packages becomes stable. To measure the stability of the ranking of $h^{adj}_{a}$ compared to the previous $a$, we calculate a score denoted as $v$: $v = \sum_{p=1}^{N} I(|rank(h^{adj}_{a,p}) - rank(h^{adj}_{a-1,p})| > 50)$, where $N$ is the total number of packages and $I()$ is an indicator function. $v$ measures the number of packages whose rank differs by more than 50 between two neighbouring values of $a$ (50 is a very small number compared to the total number of R packages in this analysis, which is 21741). When $v$ becomes stable with $a$, we say increasing $a$ won't change the ranking of the packages too much.
The following plot shows how score $v$ changes with $a$. A value of 6 is taken as the optimized value of $a$, since it is the elbow of the curve.
<%= img(paste0(env$figure_dir, "/plot-select-a-adjusted-heaviness-downstream-no-children.png"), style="height:500px")%>
The plot of adjusted heaviness versus number of downstream packages can be seen by clicking the radio button "Adjusted heaviness verse number of indirect downstream packages". We simply mark a package as having a highly heavy impact on its indirect downstream packages if the adjusted heaviness is larger than <%=CUTOFF$adjusted_heaviness_on_indirect_downstream[2]%>, and as having a moderately heavy impact if the adjusted heaviness is between <%=CUTOFF$adjusted_heaviness_on_indirect_downstream[1]%> and <%=CUTOFF$adjusted_heaviness_on_indirect_downstream[2]%>.
<%= img(paste0(env$figure_dir, "/plot-downstream-no-children-heaviness.png"), style="height:500px")%>
The figure shows that CRAN packages have more effect on the dependencies of indirect downstream packages.
Each of the following two plots visualizes the ranking of all packages based on their heaviness on child packages and on downstream packages. In each plot, the left and right panels contain the sorted heaviness on children and on downstream packages respectively. The middle panel contains lines connecting the same package in the two rankings. The two ends of a line are assigned the same color. The "Venn diagram" in the bottom panel shows the overlap between the top 500 packages with the highest heaviness on children and the top 500 with the highest heaviness on downstream.
The left plot shows that almost all of the top 500 packages with the highest heaviness on children also have the highest heaviness on downstream (474 out of 500). The right plot shows that, if only the indirect downstream packages are considered, the overlap with the packages with the top heaviness on children is very small.
<%= img(paste0(env$figure_dir, "/plot-compare-downstream-and-downstream2.png"), style="height:500px")%>
We think the reason for such a large overlap between the top packages with the highest heaviness on children and on downstream is that the downstream packages are mainly composed of child packages. To demonstrate this, for the 474 packages that are in both the list of top 500 packages with the highest heaviness on children and the list of top 500 packages with the highest heaviness on downstream, we plot the fraction of child packages among their downstream packages. The following plot clearly shows that for these top packages, the downstream packages are mostly child packages. For 76.4% of them, the downstream packages are entirely child packages, and for 91.1% of them, more than 60% of the downstream packages are child packages.
<%= img(paste0(env$figure_dir, "/plot-top-500-children-downstream-pct.png"), style="height:500px")%>
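The comparison in this section comes down to two simple computations: the overlap of two top-$k$ lists, and the fraction of child packages among the downstream packages. A minimal Python sketch is shown below. The function names and all numbers are hypothetical, and the toy example uses $k = 2$ rather than 500.

```python
def top_k(heaviness, k):
    """Return the set of the k packages with the highest heaviness."""
    ranked = sorted(heaviness, key=heaviness.get, reverse=True)
    return set(ranked[:k])


def child_fraction(n_children, n_downstream):
    """Fraction of a package's downstream packages that are direct children."""
    return n_children / n_downstream if n_downstream else 0.0


# Made-up heaviness values for four packages.
on_children = {"p1": 9.0, "p2": 7.0, "p3": 1.0, "p4": 5.0}
on_downstream = {"p1": 8.0, "p2": 2.0, "p3": 6.0, "p4": 7.0}

# Packages that appear in both top-2 lists.
shared = top_k(on_children, 2) & top_k(on_downstream, 2)

# A package with 3 children among 4 downstream packages.
fraction = child_fraction(3, 4)
```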