Find the intersection of overlapping histograms in R

Here, I demonstrate how to find the point where two overlapping histograms intersect, that is, where their density curves cross. The result is an approximation whose precision is set by the spacing of the grid on which the densities are evaluated.

Prepare simulated data

I created two data sets, gamma_dist and norm_dist, made up of different numbers of values sampled randomly from a gamma distribution and a normal distribution, respectively. I specifically made the data sets different sizes to show that the method still applies when the sample sizes differ.

library(tibble)

set.seed(0)

gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

df <- tibble(
  x = c(gamma_dist, norm_dist),
  original_dataset = c(rep("gamma_dist", 1e5), rep("norm_dist", 5e5))
)
df
#> # A tibble: 600,000 x 2
#>        x original_dataset
#>    <dbl> <chr>
#>  1  6.89 gamma_dist
#>  2  2.25 gamma_dist
#>  3  1.30 gamma_dist
#>  4  4.10 gamma_dist
#>  5  7.77 gamma_dist
#>  6  5.08 gamma_dist
#>  7  4.58 gamma_dist
#>  8  2.30 gamma_dist
#>  9  1.36 gamma_dist
#> 10  1.67 gamma_dist
#> # … with 599,990 more rows

I used ‘ggplot2’ to plot the densities of the two data sets. The gamma distribution is in red and the normal distribution is in blue. I broke the creation of the plot into two steps: the essential step to create the density curves, and the styling step to make the plot look nice. Of course, these could be combined into a single long ggplot statement.

library(ggplot2)

p <- ggplot(df) +
  geom_density(aes(x = x, color = original_dataset))

p <- p +
  # expansion() replaces expand_scale(), which is deprecated since ggplot2 3.3.0
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  scale_color_manual(values = c("tomato", "dodgerblue")) +
  theme_minimal() +
  theme(
    legend.title = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(x = "values",
       title = "Two density curves")
p

Finding the point of intersection

To find the point of intersection, I first computed a kernel density estimate for each data set using density. It is essential to use the same from and to values for each data set. By default, density evaluates the estimate at 512 equally spaced points between from and to, so providing the same starting and ending parameters makes density use the same grid of x-values for both data sets.

from <- 0
to <- 40
gamma_density <- density(gamma_dist, from = from, to = to)
norm_density <- density(norm_dist, from = from, to = to)
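As a quick sanity check, the sketch below confirms that both calls return the same x grid and reports its spacing, which limits how finely a crossing can be located. The names same_grid, n_points, grid_step, and bandwidths are introduced here for illustration only.

```r
set.seed(0)
gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

gamma_density <- density(gamma_dist, from = 0, to = 40)
norm_density <- density(norm_dist, from = 0, to = 40)

# TRUE when both estimates were evaluated at the same x-values.
same_grid <- isTRUE(all.equal(gamma_density$x, norm_density$x))

# Number of evaluation points (density() defaults to n = 512).
n_points <- length(gamma_density$x)

# Distance between neighbouring grid points: (to - from) / (n - 1).
grid_step <- diff(gamma_density$x[1:2])

# Each call picks its own bandwidth by default, so the two curves
# are smoothed by different amounts.
bandwidths <- c(gamma = gamma_density$bw, norm = norm_density$bw)
```

With from = 0, to = 40, and the default of 512 points, the grid step is 40 / 511, or roughly 0.078.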

The final step was to find where the density of the gamma distribution first drops below that of the normal distribution. I applied this logic to create the boolean vector idx and then took the smallest x-value that satisfies it. I also included two other filters to restrict the result to between 5 and 20 because, from the plot above, I can see that the intersection falls within this range.

idx <- (gamma_density$y < norm_density$y) &
       (gamma_density$x > 5) &
       (gamma_density$x < 20)
poi <- min(gamma_density$x[idx])
poi
#> [1] 10.64579
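The value above is always one of the grid points. If finer resolution is needed, one option is to interpolate linearly between grid points and solve for the crossing with uniroot. This is a sketch of a possible refinement rather than part of the method described here, and density_diff and refined_poi are names introduced for illustration.

```r
set.seed(0)
gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

gamma_density <- density(gamma_dist, from = 0, to = 40)
norm_density <- density(norm_dist, from = 0, to = 40)

# Piecewise-linear interpolation of the difference between the two estimates.
density_diff <- approxfun(gamma_density$x, gamma_density$y - norm_density$y)

# uniroot() requires the difference to have opposite signs at the two ends
# of the interval; 5 and 20 are chosen from the plot, as in the text.
refined_poi <- uniroot(density_diff, interval = c(5, 20), tol = 1e-9)$root
refined_poi
```

The refined value should fall between the grid point returned by the boolean filter and the grid point just before it.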

That’s it: the crossing of the two estimated density curves has been located to within one step of the evaluation grid (40 / 511, or about 0.08). A vertical line was added to the plot below at poi.

p <- p +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_vline(xintercept = poi, linetype = 2, linewidth = 0.3, color = "black") +
  annotate(geom = "text", label = round(poi, 3),
           x = poi - 1, y = 0.1, size = 4, angle = 90)
p
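Because the data here were simulated from known distributions, the estimate can be compared against the crossing point of the theoretical density functions. This check is only possible with simulated data, and true_diff and true_poi are names introduced here for illustration. A minimal sketch:

```r
# Difference between the theoretical densities used in the simulation:
# gamma(shape = 2, scale = 2) and normal(mean = 20, sd = 5).
true_diff <- function(x) {
  dgamma(x, shape = 2, scale = 2) - dnorm(x, mean = 20, sd = 5)
}

# Solve for the crossing inside the same 5 to 20 window used above.
true_poi <- uniroot(true_diff, interval = c(5, 20), tol = 1e-9)$root
true_poi
```

Any gap between true_poi and poi reflects the grid spacing, sampling noise, and the smoothing applied by the kernel density estimate.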
