PCR (Principal Component Regression)¶
Description¶
The PCR metric evaluates how well continuous biological variation is preserved after data integration in single-cell genomics. It is particularly useful for assessing whether continuous structures—such as cell cycle progression or developmental trajectories—remain intact after batch effect correction. The method involves performing a linear regression between the principal components of the integrated data and a known continuous biological variable (i.e. cell cycle score).
Formulas¶
Principal Component Regression (PCR) combines Principal Component Analysis (PCA) with linear regression:
-
PCA step : Apply PCA to the integrated data matrix \(X\) (cells × genes) to obtain a set of orthogonal principal components (PCs), denoted \(Z = XW\), where \(W\) is the matrix of eigenvectors.
-
Component selection: Select the top \(k\) PCs that capture most of the variance.
-
Regression step : Fit a linear regression model using the selected PCs \(Z_k\) as predictors for a continuous biological variable \(y\) (e.g., cell cycle score).
-
Metric score : Compute the coefficient of determination \(R^2\) from the regression:
Where:
-
\(y_i\) is the true value of the biological variable for cell \(i\),
-
\(\hat{y}_i\) is the predicted value from the regression,
-
\(\bar{y}\) is the mean of all \(y_i\) values.
A score close to 1 indicates strong preservation of the biological signal in the integrated space.