Wednesday, December 14, 2011

Calculating the prediction band of a linear regression model


The method below calculates the prediction band of an arbitrary linear regression model at a given confidence level in Python.

If you use it, let me know if you find any bugs.




  1. def predband(xd,yd,a,b,conf=0.95,x=None):
  2.     """
  3. Calculates the prediction band of the linear regression model at the desired confidence
  4. level.
  5. Clarification of the difference between confidence and prediction bands:
  6. "The 2sigma confidence interval is 95% sure to contain the best-fit regression line.
  7. This is not the same as saying it will contain 95% of the data points. The prediction bands are
  8. further from the best-fit line than the confidence bands, a lot further if you have many data
  9. points. The 95% prediction interval is the area in which you expect 95% of all data points to fall."
  10. (from http://graphpad.com/curvefit/linear_regression.htm)
  11. Arguments:
  12. - conf: desired confidence level, by default 0.95 (2 sigma)
  13. - xd,yd: data arrays
  14. - a,b: linear fit parameters as in y=ax+b
  15. - x: (optional) array with x values to calculate the confidence band. If none is provided, will
  16.  by default generate 100 points in the original x-range of the data.
  17.  
  18. Usage:
  19. >>> lpb,upb,x=nemmen.predband(all.kp,all.lg,a,b,conf=0.95)
  20. calculates the prediction bands for the given input arrays
  21. >>> pylab.fill_between(x, lpb, upb, alpha=0.3, facecolor='gray')
  22. plots a shaded area containing the prediction band  
  23. Returns:
  24. Sequence (lpb,upb,x) with the arrays holding the lower and upper confidence bands
  25. corresponding to the [input] x array.
  26. References:
  27. 1. http://www.JerryDallal.com/LHSP/slr.htm, Introduction to Simple Linear Regression, Gerard
  28. E. Dallal, Ph.D.
  29. Rodrigo Nemmen
  30. v1 Dec. 2011
  31. v2 Jun. 2012: corrected bug in dy.
  32.     """
  33.     alpha=1.-conf   # significance
  34.     n=xd.size   # data sample size
  35.     if x==None: x=numpy.linspace(xd.min(),xd.max(),100)
  36.     # Predicted values (best-fit model)
  37.     y=a*x+b
  38.     # Auxiliary definitions
  39.     sd=scatterfit(xd,yd,a,b)    # Scatter of data about the model
  40.     sxd=numpy.sum((xd-xd.mean())**2)
  41.     sx=(x-xd.mean())**2 # array
  42.     # Quantile of Student's t distribution for p=1-alpha/2
  43.     q=scipy.stats.t.ppf(1.-alpha/2.,n-2)
  44.     # Prediction band
  45.     dy=q*sd*numpy.sqrt( 1.+1./n + sx/sxd )
  46.     upb=y+dy    # Upper prediction band
  47.     lpb=y-dy    # Lower prediction band
  48.     return lpb,upb,x

2 comments:

  1. What is scatterfit(xd,yd,a,b)?

    ReplyDelete
  2. It is a method that computes the mean deviation of the data about a given linear model.

    ReplyDelete