Reasons to not use Excel:

http://www.jstatsoft.org/v34/i04
http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf
http://www.divms.uiowa.edu/~jcryer/JSMTalk2001.pdf
http://www.daheiser.info/excel/frontpage.html

-----------------

http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/excel/

A Note to End On, by Jeremy Miles:

I have written this file which shows you how to do regression in Excel, but this does not mean that I think that you should be doing regression in Excel. Regression in Excel has a number of shortcomings, which include:

* No standardised coefficients. It can be very difficult to interpret unstandardised coefficients. You could calculate the standardised coefficients using the unstandardised coefficients, if you really wanted to. But you could have done the regression on your own, if you really wanted to.

* Lack of diagnostic graphs. The standard diagnostic graphs are not available in Excel, e.g. the normality plot of the residuals, or the scatterplot of residuals against predicted values. Again, you can work them out, but it ain't easy.

* Lack of diagnostic statistics. There are no collinearity diagnostics, which would help you to understand what was happening in the data that we analysed above (highly significant R^2, but no significant parameters in the model). You could calculate the skew statistics, but it's a pain.

* Lack of features. There is no hierarchical regression, no weighting of cases, etc., etc., etc...

* Inflexibility. If you want to run a slightly different analysis, it is hard work, because you have to move your data around, a process which is prone to errors.

-----------------

Here is a web page that will do multiple regression for you:
http://www.wessa.net/rwasp_multipleregression.wasp

-----------------

Here are some other things to consider for regression:

There are no hard-and-fast rules for how big an R value must be to be important. Here is a general guideline, appropriate for the social sciences (psychology, political science/public policy, etc.). If R (not R^2) is around:
  0.1, it's a small correlation
  0.3, it's a medium correlation
  0.5, it's a large correlation
And the same values hold if R is negative (-0.1, -0.3, -0.5). Sometimes the words "weak" and "strong" are used instead of small and large. The technical sciences (physics, engineering, etc.) typically have stricter requirements.

Remember, even if there's a large correlation, that doesn't mean that one thing is causing the other. And, in particular, if you get a correlation of exactly 1.0, there's a good chance you accidentally graphed some data against itself.

(Scale obtained from: Cohen, Jacob: Statistical Power Analysis for the Behavioral Sciences, 2nd ed., 1988.)

And here is another researcher's take on what makes a correlation large or small, for social-science/sports-science purposes:
http://sportsci.org/resource/stats/effectmag.html#ps

-----------------------------

Why did we take the time to learn how to do polynomial regression via multiple-variable regression, when Excel already does it for us with an easy graphical interface? Because if you want to include polynomial terms in a problem with more than one X variable, you have to do it the way we did, since there's no nice graphical interface for multi-variable regression. Also, you might want to add a term like (% free lunch)*(teacher salary) to see if there is any interaction between them. This is essentially fitting the first few terms of the multivariable Taylor series; a sketch of how that looks in code follows below.
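Here is a minimal sketch of that idea in Python with numpy. The variable names (free_lunch, salary, test_score) and the data are made up purely for illustration; the point is just that squared and interaction terms become extra columns of the design matrix, handled exactly like additional X variables.

    # Sketch: polynomial + interaction terms as extra X columns,
    # fit by ordinary least squares.  Fake data, illustrative names.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    free_lunch = rng.uniform(0, 100, n)      # % free lunch (made up)
    salary     = rng.uniform(30, 80, n)      # teacher salary, $1000s (made up)
    test_score = 60 - 0.2*free_lunch + 0.3*salary + rng.normal(0, 5, n)

    # Design matrix: intercept, the two X variables, a squared term,
    # and an interaction term -- the first few terms of a Taylor series.
    X = np.column_stack([
        np.ones(n),
        free_lunch,
        salary,
        free_lunch**2,
        free_lunch * salary,
    ])

    coeffs, rss, rank, sv = np.linalg.lstsq(X, test_score, rcond=None)
    print(coeffs)   # one coefficient per column of X

The fitting routine doesn't know or care that two of the columns were built from the others; that is why "polynomial regression" is really just multiple-variable regression.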
-----------------------------

How much data should you have when doing a multiple-variable regression?

I'll start with a simple rule of thumb: aim to have the number of data points be at least 6 times, and ideally at least 10 times, the number of X variables.
http://www.basic.northwestern.edu/statguidefiles/mulreg_ass_viol.html#Few%20data%20points

I'll add further: if your X variables are discrete (like ACT scores), then it's good to have multiple data points for each discrete value, so you can see how much variation there is.

A paper cited farther below in this e-mail argues that the "10 times" rule is actually not good enough. Instead, the paper offers this suggestion for those who want a rule of thumb (while arguing against using any such rule of thumb):

  # variables    minimum sample size
       2                 141
       3                 218
       4                 311
       5                 419
       6                 543
       7                 682
       8                 838
       9                1009
      10                1196

[Can you find a reasonable function to fit that data set?]

In the past, I have forgotten to tell students about guidelines for sample size (# of points) in a regression, and I ended up with a student whose project was a regression with 3 variables and only 4 data points. Not surprisingly, we got an exact fit of the data, R^2 = 1! So, don't do that.

When we have n data points and m variables, we're essentially trying to solve a system of n equations (one per data point) in m unknowns (the coefficients, counting the intercept), and the system is linear if we're doing linear regression. We know that if n = m, the system should have exactly one solution--that is, you get a perfect fit. If n > m (as it should be in any regression study), then there is generally no perfect solution to the system, and we take the solution that minimizes the sum of squared errors. This is the basic intuition for why we want the number of data points to be well above the number of variables: we know the fit shouldn't be perfect, and we want to see some of the variability in the data. If the fit is perfect, it will look like there's no variability.

An interesting article is "Sample Size and Multiple Regression Analysis" by Scott E. Maxwell, Psychological Methods, vol. 5, no. 4, 2000, pp. 434-458. Let me know if you want a copy.

----------------------------------------

Over-Fitting

Don't do it! Overfitting is trusting your current data too much: trying to get a curve that comes close to the current data at the expense of being able to make good predictions for new data.

In particular, don't use high-order polynomials--I would be hesitant to use a cubic in most cases, and only rarely would I use anything of higher order than a cubic. Some of the only times I've seen a cubic as a good model are:
* approximating the "rocket equation" (a Calc II integration-by-parts problem),
* as a model for total cost as a function of # of items produced. This is because a parabola might be a good model for marginal cost as a function of # of items produced: decreasing to start with, then increasing.
* as part of "cubic splines", a different topic altogether.
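To see concretely why high-order polynomials are dangerous, here is a small sketch in Python with numpy (the data are fake and made up for illustration): a 9th-degree polynomial passes through ten noisy points almost exactly, but its predictions away from those points are usually far worse than a straight line's.

    # Overfitting sketch: a high-order polynomial chases the noise in a
    # small data set, while a straight line captures the real trend.
    # Fake data, for illustration only.
    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 10)
    y = 2*x + 1 + rng.normal(0, 2, x.size)   # truly linear, plus noise

    line  = np.polyfit(x, y, deg=1)   # sensible model
    nonic = np.polyfit(x, y, deg=9)   # 10 points, 10 coefficients: near-"perfect" fit

    x_new = np.array([12.0])          # a point outside the data
    print(np.polyval(line,  x_new))   # a reasonable extrapolation
    print(np.polyval(nonic, x_new))   # usually wildly off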
----------------------------------------

Cross-validation

Cross-validation is the process of dividing our data into training and test sets, then using the test set to see how well various models can predict new data that they weren't fitted to.

Imagine a data set of some economic variable from 1950 to 2010. We might divide it into training (1950-2000) and test (2001-2010) sets, or we might divide it at 1990. Where to divide it is mostly art rather than science. We would then:
* fit various models to the training set,
* use those fitted models to predict values for the test set,
* evaluate the accuracy of each fitted model, perhaps via the sum of squared errors on the test set,
* pick the fitted model with the lowest sum of squared errors on the test set,
* re-fit that model to the entire data set (1950-2010),
* make predictions for 2020, etc.

We are essentially pretending that it's 10 or 20 years earlier than the present day (2010, as I write this), fitting models to the data available from that pretend date, then magically revealing the future and seeing which model did a good job predicting it.

If the data aren't ordered in time, there are two popular options (see the code sketch at the end of this note):
* 50-percent hold-out: randomly choose 50% of the data points to be the training set, and the others to be the test set. Fit models, make predictions for the test set, and compute the sum of squared errors for each fitted model. THEN re-randomize which 50% are in the test/training sets and repeat. Repeat the randomization many, many times.
* Leave-one-out cross-validation (LOOCV): each data point gets its own turn being the test data set, while all the other data points are the training set. If you have a data set with 10,000 points, you will need to fit the models 10,000 times.

Training/test sets are sometimes called estimation/validation sets. And sometimes you need to divide the data into 3 sets: train/tune/test, or training/validation/test.

http://matlabdatamining.blogspot.com/2008/03/validating-predictive-models.html
"Validating Predictive Models" by Will Dwinnell

----------------------------------------

Aggregating (avoid it!)

If you have discrete data for your X variable (like ACT scores by person), it is tempting to find the average Y value at each possible level of the X variable, and then fit something to those averaged (aggregated) Y values. Don't do it! It is better to fit to the original, unaggregated data. This is for a few different reasons:
* If you have more data at some X values than at others, your fitting procedure should know about it, rather than treating all X values equally.
* You want to keep some idea of the amount of variability at each X value. If you just use the averages, it looks like you have no variability!

----------------------------------------

Ecological Fallacy

If you discover a correlation between two things at the level of individual people, will that correlation still be there if you aggregate people by region (or any other aggregation method)? Maybe not! Or vice versa--a correlation at the region-by-region level might disappear, or even reverse (!), at the individual level. The assumption that the correlations will be the same is the "ecological fallacy".
http://en.wikipedia.org/wiki/Ecological_fallacy

----------------------------------------

Residual Plots

Here are six recommended plots for examining the residuals:
http://www.itl.nist.gov/div898/handbook/eda/section3/6plot.htm
However, that page neglects another type that the handbook mentions elsewhere: a run-order (run-sequence) plot. If your data points were collected one after another in time, then it's possible that some kind of warm-up or learning-curve effect is influencing the values, so you'd want to look for that.

Actually, that web site overall is very interesting: the NIST/SEMATECH Engineering Statistics Handbook:
http://www.itl.nist.gov/div898/handbook/

A classic paper on the subject is:
"The Examination and Analysis of Residuals" by F. J. Anscombe and John W. Tukey, Technometrics, Vol. 5, No. 2 (May 1963), pp. 141-160.
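If you are working outside Excel, here is a minimal sketch in Python (assuming numpy and matplotlib are available; the data are fake and for illustration only) of two of the residual plots discussed above: residuals against predicted values, and a run-order plot.

    # Residual-plot sketch: residuals vs. predicted values, and a
    # run-order (run-sequence) plot.  Fake data, for illustration only.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 80)
    y = 3*x + 5 + rng.normal(0, 2, x.size)

    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope*x + intercept
    residuals = y - predicted

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    ax1.scatter(predicted, residuals)
    ax1.axhline(0, color="gray")
    ax1.set_xlabel("predicted value")
    ax1.set_ylabel("residual")
    ax1.set_title("Residuals vs. predicted")

    # Run-order plot: residuals in the order the points were collected.
    ax2.plot(residuals, marker="o")
    ax2.axhline(0, color="gray")
    ax2.set_xlabel("collection order")
    ax2.set_ylabel("residual")
    ax2.set_title("Run-order plot")

    plt.tight_layout()
    plt.show()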
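And, as promised in the cross-validation section above, here is a minimal sketch of leave-one-out cross-validation in Python with numpy, comparing a linear fit to a quadratic fit on fake, made-up data. (Libraries such as scikit-learn have ready-made cross-validation tools; the explicit loop below is just to show the idea.)

    # Leave-one-out cross-validation sketch: compare a linear and a
    # quadratic polynomial by summed squared prediction error on the
    # held-out points.  Fake data, for illustration only.
    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 30)
    y = 1 + 2*x + rng.normal(0, 3, x.size)   # truly linear, plus noise

    def loocv_sse(x, y, degree):
        """Sum of squared errors when each point in turn is held out."""
        sse = 0.0
        for i in range(x.size):
            train = np.arange(x.size) != i           # every point except point i
            coeffs = np.polyfit(x[train], y[train], deg=degree)
            prediction = np.polyval(coeffs, x[i])
            sse += (y[i] - prediction) ** 2
        return sse

    print("linear   :", loocv_sse(x, y, 1))
    print("quadratic:", loocv_sse(x, y, 2))
    # The model with the lower held-out error is the one to re-fit
    # to the full data set and use for predictions.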