Logistic Regression

by John C. Pezzullo, instruction modifications by Kevin M. Sullivan

Version 05.07.20

NOT BY ANDREW ROSS, even though he is hosting a mirror of it on his own web page. Original URL is http://statpages.org/logistic.html

This page performs logistic regression, in which a dichotomous outcome is predicted by one or more variables. The program generates the coefficients of a prediction formula (and standard errors of estimate and significance levels), and odds ratios (with 95% confidence intervals).



Instructions:

  1. Enter the number of data points: (or, if summary data, the number of lines of data).

  2. Enter the number of predictor variables:

  3. If summary data, check here

  4. Type or paste data in the Data Window below (see lower section on page concerning issues on data formatting)

Data Window


 

  5. Click the button; results will appear in the Results Window below:

Results Window


  6. To print out results, copy (Ctrl-C) and paste (Ctrl-V) the contents of the Results Window to a word processor or text editor, then print the results from that program. For best appearance, specify a fixed-width font like Courier.


Data Examples

A number of examples are provided on the format to enter data.  All examples are based on the Evans County data set described in Kleinbaum, Kupper, and Morgenstern, Epidemiologic Research: Principles and Quantitative Methods, New York: Van Nostrand Reinhold, 1982.  The Evans County study was a cohort study of men followed for 7 years.  The files are also available as text files to allow the user to cut and paste the example data into the Data Window.

Data can be in two formats: records at the individual level (one record for each individual, or whatever the unit of analysis is), or summary information, such as the number of individuals at an exposure level without disease and the number with disease.  The data on one line must be separated by a tab or a comma; the examples below use the comma to separate data points.  These examples first describe data at the individual level, and then describe summary data.
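As an illustration of the tab-or-comma rule, here is a minimal Python sketch of how such lines could be split into numbers. It is not the page's own parsing code; the function names parse_data_line and parse_data are hypothetical, and the sketch assumes every field is numeric.

```python
import re

def parse_data_line(line):
    """Split one line on commas or tabs and convert each field to a number."""
    fields = [f.strip() for f in re.split(r"[,\t]", line.strip())]
    return [float(f) for f in fields if f != ""]

def parse_data(text):
    """Parse every non-blank line of pasted text into a list of numeric rows."""
    return [parse_data_line(ln) for ln in text.splitlines() if ln.strip()]

# Two comma-separated records and one tab-separated record.
sample = "0, 0\n1, 1\n1\t0\n"
rows = parse_data(sample)
```

Each row keeps the order of the fields on the line, so the outcome (or the two counts, for summary data) is found at the end of the row.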

Data at the individual level, one exposure variable

Enter or paste into the Data Window a dichotomous exposure variable (coded as 1 for exposed and 0 for unexposed) and the outcome variable (coded as 1 for with the outcome and 0 for without the outcome), with the two variables separated by a comma or a tab.  For example, in assessing the relationship between an elevated catecholamine level (the exposure of interest, 1 = elevated and 0 = normal) and coronary heart disease (CHD, the outcome of interest), the records would be formatted as numeric values for:

exposure variable value, outcome variable value

For this example data the number of data points is 609 and the number of predictor variables is 1.  The first 10 records from the example data are shown below:

0, 0
0, 0
1, 1
1, 0
0, 0
0, 0
0, 1
0, 0
0, 0
0, 0 

... (plus 599 additional lines)

 

The full data file as a text file can be found here.  The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable  O.R.    Low -- High
1         2.8615 1.6878 4.8514

 

The interpretation would be that individuals with elevated catecholamine levels have 2.8615 times the odds of developing CHD compared to individuals with normal catecholamine levels.

[A note on coding the exposure variable:  The above example coded the exposed as 1 and unexposed as 0, and the odds ratio was calculated  comparing the odds of being coded as 1 to being coded as 0 - note that those coded as 0 are the referent group.  If you code the exposure as 1 and 2, the smaller number will be treated as the referent group, which in this example is 1.  The odds ratio for a 2/1 coding scheme would be the odds of disease for those coded as 2 compared to the odds in those coded as 1.]

[A note on coding the outcome variable: The outcome variable must be coded as 1 for those with the outcome and 0 for those without the outcome.]

If the exposure variable is continuous, you can use the numeric value (which assumes the relationship is linear on a logit scale).  For example, in assessing the relationship between age and CHD, the number of data points is 609 and the number of predictor variables is 1, and the first ten records would look as shown below (data as a text file can be found here):

56, 0
43, 0
56, 1
64, 0
49, 0
46, 0
52, 1
63, 0
42, 0
55, 0

... (plus 599 additional lines)

The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable  O.R.    Low -- High
1         1.0454 1.0189 1.0727

The interpretation would be that for every one year increase in age, the odds of CHD increased by a factor of 1.0454 (or by about 4.5%).
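Because the model is linear on the logit scale, the odds ratio for a larger age difference follows from the one-year odds ratio by exponentiation. The Python sketch below shows the arithmetic for a 10-year difference; the choice of 10 years and the function name odds_ratio_for_increase are illustrative assumptions, not output of this page.

```python
import math

or_per_year = 1.0454                     # odds ratio for a one-year increase (from the output above)
coef_per_year = math.log(or_per_year)    # the corresponding logistic coefficient

def odds_ratio_for_increase(years):
    """Odds ratio for an age difference of `years`, assuming linearity on the logit scale."""
    return math.exp(coef_per_year * years)

or_10_years = odds_ratio_for_increase(10)   # roughly 1.56
```

The result is simply 1.0454 raised to the 10th power, so it is only as reliable as the linearity assumption itself.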

Data at the individual level, two exposure variables - no interaction model

If there is more than one exposure variable, list the exposure variables first and the outcome variable last.  For example, if the investigator wants to determine the simultaneous effect of catecholamine and cigarette smoking (1 = smoker, 0 = nonsmoker) on CHD, the data would be:

first exposure variable value, second exposure variable value, outcome variable value

For this example data the number of data points is 609 and the number of predictor variables is 2.  The first 10 records from the example data are shown below, with the variables being catecholamine, smoking, and CHD; the data in a text file is here:

0, 0, 0
0, 1, 0
1, 1, 1
1, 1, 0
0, 1, 0
0, 1, 0
0, 1, 1
0, 0, 0
0, 1, 0
0, 0, 0

... (plus 599 additional lines)

The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable   O.R.   Low -- High
1         2.9074 1.7079 4.9492
2         2.0000 1.1206 3.5695

The interpretation would be that individuals with an elevated catecholamine level ("Variable 1" in the above output) have about 2.9 times the odds of CHD of those with normal catecholamine levels, controlling for cigarette smoking.  Cigarette smokers ("Variable 2" in the above output) have twice the odds (2.0) of CHD compared to nonsmokers, controlling for catecholamine (elevated vs. normal).

Data at the individual level, two exposure variables - interaction model

If you would like to assess the interaction between two variables, there will need to be an interaction term.  Using the data from the previous example, the question might be whether cigarette smoking modifies the catecholamine->CHD relationship.  The interaction term is simply the value for catecholamine multiplied by the value for smoking, for which there are only four possibilities with these two variables:

Catecholamine     Smoking     Interaction
      1        x     1     =      1
      1        x     0     =      0
      0        x     1     =      0
      0        x     0     =      0

The data would be in the following format:

first exposure variable value, second exposure variable value, interaction value, outcome variable value

For this example data the number of data points is 609 and the number of predictor variables is 3.  The first 10 records from the example data are shown below with the variables being catecholamine, smoking, the catecholamine-smoking interaction, and CHD and the data file as text can be found here:

0, 0, 0, 0
0, 1, 0, 0
1, 1, 1, 1
1, 1, 1, 0
0, 1, 0, 0
0, 1, 0, 0
0, 1, 0, 1
0, 0, 0, 0
0, 1, 0, 0
0, 0, 0, 0

... (plus 599 additional lines)
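If your records contain only the two exposure variables and the outcome, the interaction column can be generated rather than typed by hand. The following Python sketch shows one way to do that; the function name add_interaction is hypothetical, and the four records are taken from the listing above (records 1 to 4 without their interaction column).

```python
def add_interaction(rows):
    """Insert the product of the two exposure values just before the outcome."""
    return [(a, b, a * b, y) for (a, b, y) in rows]

# catecholamine, smoking, CHD
records = [(0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 1, 0)]
with_interaction = add_interaction(records)

# Format each record as a comma-separated line ready to paste into the Data Window.
lines = [", ".join(str(v) for v in row) for row in with_interaction]
```

The resulting lines match the first four records shown above.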

The results of the analysis would be:

Coefficients and Standard Errors...
Variable   Coeff. StdErr    p
1          1.3953 0.5187 0.0072
2          0.8653 0.3864 0.0251
3         -0.4498 0.6092 0.4603
Intercept -2.9267

Odds Ratios and 95% Confidence Intervals...
Variable    O.R.   Low -- High
1          4.0360 1.4601 11.1562
2          2.3758 1.1141 5.0661
3          0.6377 0.1932 2.1049

The interpretation would be that the interaction is not statistically significant (p-value for variable 3 = 0.4603) and could be removed from the model.  Another way to tell that the interaction is not significant is based on the odds ratio confidence interval for the interaction term; the null value (when there is no interaction) for an interaction term is 1; the 95% confidence interval for the odds ratio around the interaction term goes from 0.1932 to 2.1049 which includes the "null value" of 1.
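The odds ratio and its confidence limits can be reproduced from the printed coefficient and standard error by exponentiating the coefficient and the coefficient plus or minus 1.96 standard errors. The Python sketch below does this for the interaction term; the function name odds_ratio_with_ci is hypothetical, and because the printed coefficient and standard error are rounded to four decimals, the results agree with the output above only to about three decimals.

```python
import math

def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Return (odds ratio, lower limit, upper limit) from a coefficient and its standard error."""
    return (math.exp(coef),
            math.exp(coef - z * std_err),
            math.exp(coef + z * std_err))

# Variable 3 (the interaction term) from the output above.
or3, low3, high3 = odds_ratio_with_ci(-0.4498, 0.6092)

# The interaction is "not significant" at the 5% level when the interval contains 1.
includes_null = low3 <= 1.0 <= high3
```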

Summary data, one exposure variable

This program can also analyze summary data.  For example, the table below summarizes information on 609 individuals by exposure (catecholamine) and disease (CHD):
                             CHD (Disease variable)
Elevated Catecholamine?
(Exposure variable)          Yes (1)    No (0)
     Yes (1)                   27         95
     No (0)                    44        443

The data can be entered as summary data in two lines in the format:

exposure variable level, number without disease at this exposure level, number with disease at this exposure level

For this example data the number of data points is 2, the number of predictor variables is 1, and the summary data box should be checked.  The complete example data are shown below, with the variables being exposure category, number without CHD in exposure category, and number with CHD in exposure category.  You could copy these data and paste them in the Data Window.

1, 95, 27
0, 443, 44

The results of the analysis would be as follows, exactly the same as in the individual-level example with one exposure variable shown previously, which is based on the same data.

Odds Ratios and 95% Confidence Intervals...
Variable  O.R.    Low -- High
1         2.8615 1.6878 4.8514
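With a single dichotomous exposure, these figures can be checked by hand from the 2x2 table: the odds ratio is the cross-product ratio, and approximate 95% limits come from the standard error of its logarithm. The Python sketch below is such a hand check, not the iterative method this page uses; the function name odds_ratio_2x2 is hypothetical.

```python
import math

def odds_ratio_2x2(exp_with, exp_without, unexp_with, unexp_without, z=1.96):
    """Cross-product odds ratio with limits from the standard error of ln(OR)."""
    odds_ratio = (exp_with * unexp_without) / (exp_without * unexp_with)
    se_log_or = math.sqrt(1 / exp_with + 1 / exp_without
                          + 1 / unexp_with + 1 / unexp_without)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high

# Elevated catecholamine: 27 with CHD, 95 without; normal: 44 with CHD, 443 without.
or_chd, low_chd, high_chd = odds_ratio_2x2(27, 95, 44, 443)
```

Rounded to four decimals, the three values agree with the output above.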

Summary data, two exposure variables

This example describes a situation where there are two exposure variables, one considered as the primary exposure of interest and another as potentially an effect modifier, confounder, significant independent exposure, or none of these.  As an example, investigators are interested in the relationship between an elevated catecholamine and CHD, but want to determine if this relationship is affected by the smoking status of the individual.  The data are as follows:

Smoke = Yes (1)
                             CHD (Disease variable)
Elevated Catecholamine?
(Exposure variable)          Yes (1)    No (0)
     Yes (1)                   19         58
     No (0)                    35        275

Smoke = No (0)
                             CHD (Disease variable)
Elevated Catecholamine?
(Exposure variable)          Yes (1)    No (0)
     Yes (1)                    8         37
     No (0)                     9        168

First, to see if smoking modifies the catecholamine->CHD relationship, enter data to determine if the interaction between catecholamine and smoking is statistically significant.  The interaction level would be determined similarly to that described previously.

exposure variable 1 level, exposure variable 2 level, interaction level, number without disease at this level, number with disease at this level.

For this example data the number of data points is 4, the number of predictor variables is 3, and the summary data box should be checked.  The complete example data are shown below, with the variables being catecholamine category, smoking category, interaction category, number without CHD at these levels, and number with CHD at these levels.  You could copy these data and paste them in the Data Window.

1, 1, 1, 58, 19
0, 1, 0, 275, 35
1, 0, 0, 37, 8
0, 0, 0, 168, 9

The results of the analysis would be:

Coefficients and Standard Errors...
Variable   Coeff. StdErr    p
1          1.3953 0.5187 0.0072
2          0.8653 0.3864 0.0251
3         -0.4498 0.6092 0.4603
Intercept -2.9267

Odds Ratios and 95% Confidence Intervals...
Variable    O.R.   Low -- High
1          4.0360 1.4601 11.1562
2          2.3758 1.1141 5.0661
3          0.6377 0.1932 2.1049

The interpretation would be that the interaction is not statistically significant (p-value for variable 3 = 0.4603) and could be removed from the model. 
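Summary lines like the four entered above can be produced from individual-level records by counting, for each combination of exposure values, how many records are without and with the outcome. The Python sketch below shows one way to do the counting; the function name summarize and the five sample records are illustrative assumptions, not the Evans County data.

```python
from collections import Counter

def summarize(records):
    """Collapse (exposure values..., outcome) records into
    (exposure values..., number without outcome, number with outcome) lines."""
    counts = Counter()
    for *exposures, outcome in records:
        counts[(tuple(exposures), outcome)] += 1
    levels = sorted({key for key, _ in counts}, reverse=True)
    return [(*level, counts[(level, 0)], counts[(level, 1)]) for level in levels]

# catecholamine, smoking, interaction, CHD (made-up records for illustration)
records = [(1, 1, 1, 1), (1, 1, 1, 0), (0, 1, 0, 0), (0, 1, 0, 1), (0, 0, 0, 0)]
summary = summarize(records)
```

Note the column order: the count without the outcome comes before the count with the outcome, as the summary format requires.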

To determine whether smoking confounds the catecholamine->CHD association, two odds ratios are needed: a "crude" odds ratio from a logistic regression model with just catecholamine as a predictor of CHD, which was 2.8615, and an adjusted odds ratio from a logistic regression model with two predictors, catecholamine and smoking.  The general format for the summary data is:

exposure variable 1 level, exposure variable 2 level, number without disease at this level, number with disease at this level

For this example data the number of data points is 4, the number of predictor variables is 2, and the summary data box should be checked.  The complete example data are shown below, with the variables being catecholamine category, smoking category, number without CHD at these levels, and number with CHD at these levels.  You could copy these data and paste them in the Data Window.

1, 1, 58, 19
0, 1, 275, 35
1, 0, 37, 8
0, 0, 168, 9

The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable   O.R.   Low -- High
1         2.9074 1.7079 4.9492
2         2.0000 1.1206 3.5695

The interpretation would be that individuals with an elevated catecholamine level ("Variable 1" in the above output) have 2.9074 times the odds of CHD of those with normal catecholamine levels, controlling for cigarette smoking.  Cigarette smokers ("Variable 2" in the above output) have twice the odds (2.0000) of CHD compared to nonsmokers, controlling for catecholamine (elevated vs. normal).  For the question of whether or not smoking confounds the catecholamine->CHD association, compare the crude odds ratio (2.8615) with the odds ratio adjusted for smoking (2.9074).  As a general rule, if these two differ by 10% or more, then confounding is present; if less than 10%, there is not an important amount of confounding.  (Note that some investigators may choose to define confounding differently, perhaps at a 5% difference.)  In this example the two odds ratios differ by less than 2%, so there is little evidence of confounding.  However, smoking does seem to be an important independent predictor of CHD when controlling for catecholamine.
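The comparison of crude and adjusted odds ratios is a one-line calculation. The Python sketch below expresses the difference as a percentage of the crude odds ratio; using the crude value as the denominator is an assumption of this sketch (some investigators divide by the adjusted value instead), and the function name percent_change is hypothetical.

```python
crude_or = 2.8615       # catecholamine only
adjusted_or = 2.9074    # catecholamine, controlling for smoking

def percent_change(crude, adjusted):
    """Absolute difference between the two odds ratios, as a percentage of the crude one."""
    return abs(adjusted - crude) / crude * 100.0

change = percent_change(crude_or, adjusted_or)   # about 1.6
confounding_present = change >= 10.0             # the 10% rule of thumb
```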


Questions or Problems?

*** Not getting correct results or blank results?

If you are not getting numeric results, or are getting an error message, please check the following:

*** One (or more) of my coefficients came out very large (and the standard error is even larger!). Why did this happen?

This is probably due to what is called "the perfect predictor problem". This occurs when one of the predictor variables is perfectly divided into two distinct ranges for the two outcomes. For example, if you had an independent variable like Age, and everyone above age 50 had the outcome event, and everyone 50 and below did not have the event, then the logistic algorithm will not converge (the regression coefficient for Age will take off toward infinity). The same thing can happen with categorical predictors. And it gets even more insidious when there's more than one independent variable. None of the variables by themselves may look like "perfect predictors", but some subset of them taken together might form a pattern in n-dimensional space that can be sliced into two regions where everyone in one region had outcome=1 and everyone in the other region had outcome=0. This isn't a flaw in the web page; it's actually a situation where the logistic model is simply not appropriate for the data. The true relationship is a "step function", not the smooth "S-shaped" function of the logistic model.
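For a single numeric predictor, the simplest form of this problem can be detected before running the analysis. The Python sketch below checks whether one cut point separates the two outcomes completely; the function name perfectly_separates and the sample ages are illustrative assumptions, and the check does not detect the multi-variable patterns described above.

```python
def perfectly_separates(values, outcomes):
    """True if all outcome=1 cases lie strictly above, or strictly below, all outcome=0 cases."""
    with_event = [v for v, y in zip(values, outcomes) if y == 1]
    without_event = [v for v, y in zip(values, outcomes) if y == 0]
    if not with_event or not without_event:
        return True   # only one outcome value is present, so there is nothing to fit
    return (min(with_event) > max(without_event)
            or max(with_event) < min(without_event))

# Everyone above age 50 had the event, everyone 50 and below did not.
ages = [42, 47, 50, 53, 58, 64]
events = [0, 0, 0, 1, 1, 1]
separated = perfectly_separates(ages, events)
```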

*** How do I copy and paste data?

Copy data:  In most programs, you identify the data you want to copy then go to Edit->Copy

Paste data: Open this logistic regression program; place the cursor in the Data Window and highlight the example data, then, in Windows, simultaneously press the Ctrl and V keys; Mac users press the Command and V keys.

*** Can I copy and paste from Excel?

Yes, highlight the columns with the data, Edit->Copy the data, and paste into the Logistic Data Window.  Note that when you paste data from Excel into the Data Window, the different columns of data will be separated by a tab.  You cannot see the tab in the Data Window, but you can usually tell the difference between a tab and blank spaces by placing the cursor in a line of data, then moving the cursor to the right one space at a time - a tab will make the cursor move many spaces.


Background Info (just what is logistic regression, anyway?):

Ordinary regression deals with finding a function that relates a continuous outcome variable (dependent variable y) to one or more predictors (independent variables x1, x2, etc.). Linear regression assumes a function of the form:
y = c0 + c1 * x1 + c2 * x2 +...
and finds the values of c0, c1, c2, etc. (c0 is called the "intercept" or "constant term").

Logistic regression is a variation of ordinary regression, useful when the observed outcome is restricted to two values, which usually represent the occurrence or non-occurrence of some outcome event (usually coded as 1 or 0, respectively). It produces a formula that predicts the probability of the occurrence as a function of the independent variables.

Logistic regression fits a special s-shaped curve by taking the linear regression (above), which could produce any y-value between minus infinity and plus infinity, and transforming it with the function:
p = Exp(y) / ( 1 + Exp(y) )
which produces values of p between 0 (as y approaches minus infinity) and 1 (as y approaches plus infinity). This now becomes a special kind of non-linear regression, which is what this page performs.
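The transformation can be written as a small function. The Python sketch below computes p = Exp(y) / (1 + Exp(y)) in a rearranged but algebraically equivalent form, so that very large positive or negative y does not overflow; the rearrangement is a choice of this sketch, not something taken from the page's own code.

```python
import math

def logistic(y):
    """Return Exp(y) / (1 + Exp(y)), a probability between 0 and 1."""
    if y >= 0:
        # Equivalent form that avoids computing Exp of a large positive number.
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)

p_mid = logistic(0.0)      # y = 0 corresponds to a probability of one half
p_low = logistic(-800.0)   # approaches 0 as y approaches minus infinity
p_high = logistic(800.0)   # approaches 1 as y approaches plus infinity
```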

Logistic regression also produces Odds Ratios (O.R.) associated with each predictor variable. The odds of an event is defined as the probability of the outcome event occurring divided by the probability of the event not occurring. The odds ratio for a predictor tells the relative amount by which the odds of the outcome increase (O.R. greater than 1.0) or decrease (O.R. less than 1.0) when the value of the predictor is increased by 1.0 unit.
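The link between a coefficient and its odds ratio can be verified numerically: raising a predictor by one unit multiplies the odds by Exp(coefficient). The Python sketch below checks this for one predictor; the coefficient values c0 and c1 are made up for illustration.

```python
import math

def logistic(y):
    """Probability predicted by the logistic model for linear predictor y."""
    return 1.0 / (1.0 + math.exp(-y))

def odds(p):
    """Probability of the event divided by the probability of no event."""
    return p / (1.0 - p)

c0, c1 = -2.0, 0.7              # hypothetical intercept and coefficient
p_at_0 = logistic(c0)           # predictor value 0
p_at_1 = logistic(c0 + c1)      # predictor value 1

odds_ratio = odds(p_at_1) / odds(p_at_0)   # equals Exp(c1)
```

The same ratio results for any one-unit step in the predictor, not just from 0 to 1.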


Techie-stuff (for those who might be interested):

This page contains a straightforward JavaScript implementation of a standard iterative method to maximize the Log Likelihood Function (LLF), defined as the sum of the logarithms of the predicted probabilities of occurrence for those cases where the event occurred and the logarithms of the predicted probabilities of non-occurrence for those cases where the event did not occur.
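The definition of the LLF translates directly into code. The following is a Python sketch of that definition (the page itself is written in JavaScript, and its actual code is not reproduced here); the function name log_likelihood and the four sample cases are illustrative.

```python
import math

def log_likelihood(probabilities, outcomes):
    """Sum ln(p) over cases where the event occurred and ln(1 - p) where it did not."""
    total = 0.0
    for p, y in zip(probabilities, outcomes):
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

outcomes = [1, 0, 0, 1]
predicted = [0.5, 0.5, 0.5, 0.5]      # a model that predicts one half for every case
llf = log_likelihood(predicted, outcomes)
minus_two_ln_likelihood = -2.0 * llf
```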

Maximization is by Newton's method, with a very simple elimination algorithm to invert and solve the simultaneous equations. Central-limit estimates of parameter standard errors are obtained from the diagonal terms of the inverse matrix. Odds Ratios and their confidence limits are obtained by exponentiating the parameters and their lower and upper confidence limits (approximated by +/- 1.96 standard errors).
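To make the procedure concrete, here is a compact Python sketch of Newton's method for summary-format data, with a Gauss-Jordan elimination routine for the matrix inverse. It is an independent illustration under stated assumptions, not a port of this page's JavaScript: the names invert and fit_logistic are hypothetical, predictors are not converted to standard scores, and the fixed iteration limit and tolerance are arbitrary choices.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_logistic(rows, iterations=25, tolerance=1e-10):
    """rows are (x1, ..., xk, number without outcome, number with outcome).
    Returns (coefficients, standard errors), with the intercept first."""
    k = len(rows[0]) - 2
    n_with = sum(r[-1] for r in rows)
    n_without = sum(r[-2] for r in rows)
    beta = [math.log(n_with / n_without)] + [0.0] * k    # null model as starting guess
    for _ in range(iterations):
        grad = [0.0] * (k + 1)
        info = [[0.0] * (k + 1) for _ in range(k + 1)]
        for r in rows:
            x = [1.0] + list(r[:k])
            total = r[-2] + r[-1]
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(k + 1):
                grad[i] += (r[-1] - total * p) * x[i]        # first derivative of the LLF
                for j in range(k + 1):
                    info[i][j] += total * p * (1.0 - p) * x[i] * x[j]
        cov = invert(info)
        step = [sum(cov[i][j] * grad[j] for j in range(k + 1)) for i in range(k + 1)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tolerance:
            break
    std_err = [math.sqrt(cov[i][i]) for i in range(k + 1)]   # from the inverse's diagonal
    return beta, std_err

# Summary data: exposure level, number without CHD, number with CHD.
beta, std_err = fit_logistic([(1, 95, 27), (0, 443, 44)])
odds_ratio = math.exp(beta[1])
low = math.exp(beta[1] - 1.96 * std_err[1])
high = math.exp(beta[1] + 1.96 * std_err[1])
```

On the catecholamine summary data, odds_ratio, low, and high should come out close to 2.8615, 1.6878, and 4.8514.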

No special convergence-acceleration techniques are used. For improved precision, the independent variables are temporarily converted to "standard scores" ( value - Mean ) / StdDev. The Null Model is used as the starting guess for the iterations -- all parameter coefficients are zero, and the intercept is the logarithm of the ratio of the number of cases with y=1 to the number with y=0. The quantity -2*Ln(Likelihood) is displayed for the null model, for each step of the iteration, and for the final (converged) model. Convergence is not guaranteed, but this page should work properly with most practical problems that arise in real-world situations.
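The two preparatory steps can be sketched in Python as follows. The function names standard_scores and null_model are hypothetical, and the use of the n - 1 divisor for StdDev is an assumption (the text does not say which divisor is used). The ages are the first ten records of the age example, and the counts 71 and 538 are the totals with and without CHD in the 609-person catecholamine table.

```python
import math

def standard_scores(values):
    """Convert values to ( value - Mean ) / StdDev, using the n - 1 divisor (an assumption)."""
    mean = sum(values) / len(values)
    std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / std_dev for v in values]

def null_model(n_with, n_without):
    """Starting intercept and -2*Ln(Likelihood) when all other coefficients are zero."""
    intercept = math.log(n_with / n_without)
    p = n_with / (n_with + n_without)
    minus_two_ln_l = -2.0 * (n_with * math.log(p) + n_without * math.log(1.0 - p))
    return intercept, minus_two_ln_l

ages = [56, 43, 56, 64, 49, 46, 52, 63, 42, 55]
z_scores = standard_scores(ages)

intercept0, minus_two_ln_l0 = null_model(71, 538)
```

With this starting intercept, the predicted probability for every case equals the overall proportion with the outcome, 71 / 609.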

This implementation has no predefined limits for the number of independent variables or cases. The actual limits are probably dependent on your web browser's available memory and other browser-specific restrictions.

The fields below are pre-loaded with a very simple example.

Notes: John Pezzullo wrote the program and the Instructions, Background Info, and Techie-Stuff sections; Kevin Sullivan modified the Instructions slightly and wrote the Data Examples sections.

Reference: Applied Logistic Regression, by D.W. Hosmer and S. Lemeshow. 1989, John Wiley & Sons, New York


