Points of Significance: Logistic regression

Nature Methods 13, 541–542 (2016). doi:10.1038/nmeth.3904

Regression can be used on categorical responses to estimate probabilities and to classify.

Figures

  1. Figure 1: Classification of data requires thresholding, which defines probability intervals for each class.

    Shown are observations of a categorical variable positioned by their predicted probability of belonging to one of two classes (open and solid circles, respectively). Top row: when class membership is perfectly separable, a threshold (e.g., 0.5) can be chosen that makes classification perfectly accurate. Bottom row: when separation between the classes is ambiguous, as shown here with the same predictor values as in the row above, perfect classification accuracy with a single-value threshold is not possible; instead, the threshold is tuned to control false positives (e.g., 0.75) or false negatives (e.g., 0.25). A minimal numerical sketch of this trade-off appears after the figure list.

  2. Figure 2: Robustness of classification to outliers depends on the type of regression used to establish thresholds.

    (a) The effect of outliers on classification based on linear regression. The plot shows classification using a linear regression fit (solid black line) to the training set: those who play professional basketball (solid circles; class 1) and those who do not (open circles; class 0). When a probability cutoff of 0.5 is used (horizontal dotted line), the fit yields a threshold of 192 cm (dashed black line) as well as one false negative (FN) and one false positive (FP). Including the outlier at H = 100 cm (orange circle) in the fit (solid orange line) increases the threshold to 197 cm (dashed orange line). (b) The effect of outliers on classification based on step and logistic regression. Step and logistic fits yield thresholds of 185 cm (solid vertical blue line) and 194 cm (dashed blue line), respectively; the outlier from (a) does not substantially affect either fit. A toy code version of this comparison appears after the figure list.

  3. Figure 3: Optimal estimates in logistic regression are found iteratively via minimization of the negative log likelihood.

    The slope parameter of each logistic curve (upper plot) is indicated by a correspondingly colored point, shown with its associated negative log likelihood, in the lower plot. (a) A non-separable data set fit with logistic curves that vary in a single slope parameter; a minimum is found for the ideal curve (blue). (b) A perfectly separable data set for which no minimum exists: attempts at a solution produce increasingly steep curves, the negative log likelihood decreases asymptotically toward zero, and the estimated slope tends toward infinity. A code sketch of this behavior appears after the figure list.
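
The thresholding trade-off in Figure 1 is easy to reproduce numerically. Below is a minimal Python sketch using hypothetical predicted probabilities and labels (illustrative values, not the figure's data); it counts false positives and false negatives at the cutoffs 0.25, 0.50 and 0.75 discussed above.

    import numpy as np

    # Hypothetical predicted probabilities for ten observations and their
    # true classes (0 = open circle, 1 = solid circle); not the figure's data.
    p_hat = np.array([0.05, 0.12, 0.30, 0.42, 0.48, 0.55, 0.61, 0.70, 0.88, 0.95])
    y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])

    def classify(p, threshold):
        """Assign class 1 when the predicted probability exceeds the threshold."""
        return (p > threshold).astype(int)

    # Raising the threshold trades false positives for false negatives.
    for t in (0.25, 0.50, 0.75):
        y_hat = classify(p_hat, t)
        fp = int(np.sum((y_hat == 1) & (y == 0)))  # false positives
        fn = int(np.sum((y_hat == 0) & (y == 1)))  # false negatives
        print(f"threshold={t:.2f}: FP={fp}, FN={fn}")

With these made-up values, the low cutoff yields no false negatives but three false positives, and the high cutoff the reverse, matching the tuning described in the caption.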
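Figure 2's comparison of linear and logistic fits can be sketched the same way. The data set below and the resulting thresholds are assumptions for illustration; they will not reproduce the figure's 192, 197 and 194 cm values, and the step-regression fit is omitted.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical training set: heights in cm; 1 = plays professional basketball.
    H = np.array([170, 175, 180, 183, 186, 188, 190, 193, 196, 200, 205, 210], float)
    y = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1], float)

    def linear_threshold(H, y):
        """Height at which a least-squares line through the 0/1 labels crosses 0.5."""
        b1, b0 = np.polyfit(H, y, 1)
        return (0.5 - b0) / b1

    def logistic_threshold(H, y):
        """Height at which a maximum-likelihood logistic fit gives p = 0.5."""
        Hc = H - H.mean()  # centering the predictor improves conditioning
        def nll(beta):
            z = beta[0] + beta[1] * Hc
            # per-point negative log likelihood: log(1 + e^z) - y*z
            return np.sum(np.logaddexp(0.0, z) - y * z)
        b0, b1 = minimize(nll, x0=np.zeros(2), method="BFGS").x
        return H.mean() - b0 / b1  # solve b0 + b1*(H - mean) = 0

    print("no outlier   linear:", linear_threshold(H, y),
          " logistic:", logistic_threshold(H, y))

    # An extreme non-player (H = 100 cm) moves the linear threshold noticeably
    # but leaves the logistic threshold nearly unchanged.
    H2, y2 = np.append(H, 100.0), np.append(y, 0.0)
    print("with outlier linear:", linear_threshold(H2, y2),
          " logistic:", logistic_threshold(H2, y2))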
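Finally, the behavior illustrated in Figure 3 can be checked by evaluating the negative log likelihood over a grid of slopes. A minimal sketch, assuming a one-parameter logistic model p = 1/(1 + exp(-b*x)) on a centered predictor with made-up labels:

    import numpy as np

    def nll(b, x, y):
        """Negative log likelihood of the logistic model p = 1/(1 + exp(-b*x))."""
        z = b * x
        return np.sum(np.logaddexp(0.0, z) - y * z)

    x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])  # centered predictor
    y_mixed = np.array([0, 1, 0, 1, 0, 1])      # non-separable labels
    y_separable = np.array([0, 0, 0, 1, 1, 1])  # perfectly separable labels

    for b in (0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0):
        print(f"b={b:5.2f}  NLL(non-separable)={nll(b, x, y_mixed):7.3f}"
              f"  NLL(separable)={nll(b, x, y_separable):7.3f}")

For the non-separable labels the printed values dip to a finite minimum and then rise again, whereas for the separable labels they decrease monotonically toward zero as the slope grows, mirroring the divergence described in the caption.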


Author information

Affiliations

  1. Jake Lever is a PhD candidate at Canada's Michael Smith Genome Sciences Centre.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

  3. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

Competing financial interests

The authors declare no competing financial interests.
