70

I have managed to find online how to overlay a normal curve to a histogram in R, but I would like to retain the normal "frequency" y-axis of a histogram. See two code segments below, and notice how in the second, the y-axis is replaced with "density". How can I keep that y-axis as "frequency", as it is in the first plot.

AS A BONUS: I'd like to mark the SD regions (up to 3 SD) on the density curve as well. How can I do this? I tried abline, but the line extends to the top of the graph and looks ugly.

g = d$mydata
hist(g)

enter image description here

g = d$mydata
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=20, breaks=20, prob=TRUE, 
     xlab="x-variable", ylim=c(0, 2), 
     main="normal curve over histogram")
curve(dnorm(x, mean=m, sd=std), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

enter image description here

See how in the image above, the y-axis is "density". I'd like to get that to be "frequency".

qwr
  • 6,786
  • 3
  • 42
  • 72
StanLe
  • 4,607
  • 8
  • 34
  • 41
  • 2
    You could accomplish this by applying the strategy laid out in [this answer](http://stackoverflow.com/questions/13960896/add-density-lines-to-histogram-and-cumulative-histogram/13961565#13961565) – Josh O'Brien Nov 19 '13 at 17:37
  • Although I should add that the interpretation of "Frequency" for the continuous density curve will be really unclear. – Josh O'Brien Nov 19 '13 at 17:53
  • I understand, and am fine with that. The link you gave me works great, except it doesn't give a normal distribution but rather a density curve that has multiple inflection points. I'd like to get a normal like in the plot above. Any ideas? – StanLe Nov 19 '13 at 17:56
  • 1
    @StanLe just commenting to make sure you see my edit, which both apply my method to a normal density instead of an arbitrary density and add lines at the standard deviations. – Gregor Thomas Nov 19 '13 at 20:39
  • 1
    See [here](http://stackoverflow.com/a/36344354/4241780) for a ggplot2 option. – JWilliman Jun 01 '16 at 01:15
  • Hi Everyone, Has anyone done the above ggplot if so would it be possible to update the answer or direct me to one with a descriptive data set. Thanks in advance ! – BPDESILVA Apr 21 '19 at 20:49

4 Answers4

61

Here's a nice easy way I found:

h <- hist(g, breaks = 10, density = 10,
          col = "lightgray", xlab = "Accuracy", main = "Overall") 
xfit <- seq(min(g), max(g), length = 40) 
yfit <- dnorm(xfit, mean = mean(g), sd = sd(g)) 
yfit <- yfit * diff(h$mids[1:2]) * length(g) 

lines(xfit, yfit, col = "black", lwd = 2)
Gregor Thomas
  • 104,719
  • 16
  • 140
  • 257
StanLe
  • 4,607
  • 8
  • 34
  • 41
  • 1
    Nice! You can also use `freq = FALSE` in `hist` to get rid of the scaling of `yfit`. – Mikael Call Aug 21 '15 at 12:43
  • 5
    What is the significant of using h$mids[1:2] instead of the entire vector? – Zach Apr 08 '16 at 18:49
  • 1
    I believe the significance of h$mids[1:2] is just that it is used to calculate the size of the bins. As they are all the same size, finding the difference between just the first two gives us this. This wouldn't be necessary to do at all if the range of each bin was 1. – dpwr Sep 03 '17 at 17:54
  • 1
    It would nice if this code sample could be run by others. – baxx Oct 21 '18 at 11:47
  • @baxx See below answer for an implementation. It wraps around the existing `hist()` function. – MS Berends Mar 04 '19 at 10:05
  • The `yfit * diff(h$mids[1:2]) * length(g)` must be omitted when using densities (`hist(..., density = TRUE)`). Also, for dates you must use the numeric transformation of `yfit`: `as.double(yfit) * diff(h$mids[1:2]) * length(g)`. Just found out today. I implemented the fixes in my answer below (which is an implementation of yours). – MS Berends Mar 25 '19 at 13:19
31

You just need to find the right multiplier, which can be easily calculated from the hist object.

myhist <- hist(mtcars$mpg)
multiplier <- myhist$counts / myhist$density
mydensity <- density(mtcars$mpg)
mydensity$y <- mydensity$y * multiplier[1]

plot(myhist)
lines(mydensity)

enter image description here

A more complete version, with a normal density and lines at each standard deviation away from the mean (including the mean):

myhist <- hist(mtcars$mpg)
multiplier <- myhist$counts / myhist$density
mydensity <- density(mtcars$mpg)
mydensity$y <- mydensity$y * multiplier[1]

plot(myhist)
lines(mydensity)

myx <- seq(min(mtcars$mpg), max(mtcars$mpg), length.out= 100)
mymean <- mean(mtcars$mpg)
mysd <- sd(mtcars$mpg)

normal <- dnorm(x = myx, mean = mymean, sd = mysd)
lines(myx, normal * multiplier[1], col = "blue", lwd = 2)

sd_x <- seq(mymean - 3 * mysd, mymean + 3 * mysd, by = mysd)
sd_y <- dnorm(x = sd_x, mean = mymean, sd = mysd) * multiplier[1]

segments(x0 = sd_x, y0= 0, x1 = sd_x, y1 = sd_y, col = "firebrick4", lwd = 2)
Gregor Thomas
  • 104,719
  • 16
  • 140
  • 257
  • great! I was always looking for this solution. Now I realized that the problem was in the y scale of the density. – Darwin PC Feb 23 '20 at 21:49
4

This is an implementation of aforementioned StanLe's anwer, also fixing the case where his answer would produce no curve when using densities.

This replaces the existing but hidden hist.default() function, to only add the normalcurve parameter (which defaults to TRUE).

The first three lines are to support roxygen2 for package building.

#' @noRd
#' @exportMethod hist.default
#' @export
hist.default <- function(x,
                         breaks = "Sturges",
                         freq = NULL,
                         include.lowest = TRUE,
                         normalcurve = TRUE,
                         right = TRUE,
                         density = NULL,
                         angle = 45,
                         col = NULL,
                         border = NULL,
                         main = paste("Histogram of", xname),
                         ylim = NULL,
                         xlab = xname,
                         ylab = NULL,
                         axes = TRUE,
                         plot = TRUE,
                         labels = FALSE,
                         warn.unused = TRUE,
                         ...)  {

  # https://stackoverflow.com/a/20078645/4575331
  xname <- paste(deparse(substitute(x), 500), collapse = "\n")

  suppressWarnings(
    h <- graphics::hist.default(
      x = x,
      breaks = breaks,
      freq = freq,
      include.lowest = include.lowest,
      right = right,
      density = density,
      angle = angle,
      col = col,
      border = border,
      main = main,
      ylim = ylim,
      xlab = xlab,
      ylab = ylab,
      axes = axes,
      plot = plot,
      labels = labels,
      warn.unused = warn.unused,
      ...
    )
  )

  if (normalcurve == TRUE & plot == TRUE) {
    x <- x[!is.na(x)]
    xfit <- seq(min(x), max(x), length = 40)
    yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
    if (isTRUE(freq) | (is.null(freq) & is.null(density))) {
      yfit <- yfit * diff(h$mids[1:2]) * length(x)
    }
    lines(xfit, yfit, col = "black", lwd = 2)
  }

  if (plot == TRUE) {
    invisible(h)
  } else {
    h
  }
}

Quick example:

hist(g)

enter image description here

For dates it's bit different. For reference:

#' @noRd
#' @exportMethod hist.Date
#' @export
hist.Date <- function(x,
                      breaks = "months",
                      format = "%b",
                      normalcurve = TRUE,
                      xlab = xname,
                      plot = TRUE,
                      freq = NULL,
                      density = NULL,
                      start.on.monday = TRUE,
                      right = TRUE,
                      ...)  {

  # https://stackoverflow.com/a/20078645/4575331
  xname <- paste(deparse(substitute(x), 500), collapse = "\n")

  suppressWarnings(
    h <- graphics:::hist.Date(
      x = x,
      breaks = breaks,
      format = format,
      freq = freq,
      density = density,
      start.on.monday = start.on.monday,
      right = right,
      xlab = xlab,
      plot = plot,
      ...
    )
  )

  if (normalcurve == TRUE & plot == TRUE) {
    x <- x[!is.na(x)]
    xfit <- seq(min(x), max(x), length = 40)
    yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
    if (isTRUE(freq) | (is.null(freq) & is.null(density))) {
      yfit <- as.double(yfit) * diff(h$mids[1:2]) * length(x)
    }
    lines(xfit, yfit, col = "black", lwd = 2)
  }

  if (plot == TRUE) {
    invisible(h)
  } else {
    h
  }
}
MS Berends
  • 2,677
  • 1
  • 21
  • 35
0

Just remove the prob = T, and let it stay at default ie F