Quality Control Before Building a Polygenic Risk Score (PRS)

Appropriate quality control (QC) is essential when constructing a polygenic risk score (PRS). After obtaining the base data and target data, several QC steps should be completed before running PRS software. This article summarizes those steps based on the 2020 Nature Protocols paper by Choi, Mak, and O'Reilly, "Tutorial: a guide to performing polygenic risk score analyses". Please refer to the original article for full details.

Glossary

Polygenic Risk Score (PRS)

A polygenic risk score (PRS) is a quantitative measure that integrates genetic risk across multiple loci. It is calculated by combining an individual's genotype data with the effect sizes estimated for each locus in a genome-wide association study (GWAS).
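
As a minimal sketch of this calculation, assuming a simple additive model in which the score is the sum of effect-allele counts weighted by the GWAS effect sizes. The function name and the example values are hypothetical, and real PRS tools additionally deal with missing genotypes, linkage disequilibrium, and shrinkage of effect sizes.

```python
def polygenic_score(effect_sizes, allele_counts):
    """Weighted sum of effect-allele counts for one individual.

    effect_sizes:  dict of SNP ID -> GWAS effect size (e.g. beta or log odds ratio)
    allele_counts: dict of SNP ID -> number of effect alleles carried (0, 1, or 2)
    SNPs present in only one of the two inputs are skipped.
    """
    shared = effect_sizes.keys() & allele_counts.keys()
    return sum(effect_sizes[snp] * allele_counts[snp] for snp in shared)


# Hypothetical values, for illustration only.
betas = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}
genotype = {"rs1": 2, "rs2": 1, "rs3": 0}
print(polygenic_score(betas, genotype))
```

The sketch makes clear why the QC steps matter: a wrong effect allele or a mislabelled strand changes the sign of a term in the sum.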

Quality Control (QC)

Quality control (QC) refers to procedures used to ensure data reliability. Before calculating a PRS, both the base data and target data should be checked for missingness, outliers, allele inconsistencies, and other potential issues.

Base Data

Base data are summary statistics obtained from a GWAS. Because these data provide the weights used in PRS construction, proper QC is critical.

Target Data

Target data are the genotype data of the individuals to whom the PRS will be applied. These are individual-level data, often managed in PLINK format, and they also require their own QC procedures.

QC for Base Data (GWAS Summary Statistics)

Check SNP Heritability

It is generally recommended that the SNP heritability (h²SNP) of the GWAS summary statistics be at least 0.05. If it is below 0.05, careful consideration is needed before proceeding with PRS construction.

Review the original GWAS paper and check the reported SNP heritability.
If it is not reported, estimate it using methods such as LD Score Regression or SumHer.

In large consortium-based GWAS, SNP heritability below 5% is relatively uncommon. In contrast, caution is needed when using smaller in-house GWAS datasets.

Confirm the Effect Allele

The effect allele in the base data must be clearly identified.

Identify the effect allele in the base data.
If it is not explicitly stated, contact the GWAS authors.

In many cases, the column names in the GWAS summary statistics provide enough information to infer this. If a README file is provided with the summary statistics, that should also be checked.

QC for Target Data (Genotype Data of the Analysis Cohort)

Sample Size

A minimum of 100 individuals, or an effective sample size of at least 100 in case-control studies, is generally recommended. With very small samples, QC becomes less reliable and PRS results may be unstable.

Check the size of the target dataset.
Assess whether it is sufficient for both QC and downstream statistical analyses.

PRS-based association analyses need an adequate sample size for statistical power, but a minimum sample size is just as important at the QC stage, because QC metrics estimated from very few individuals are themselves unreliable.
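
The article does not define the effective sample size. A minimal sketch, assuming the commonly used definition for case-control data, N_eff = 4 / (1/N_cases + 1/N_controls); the function name and the example counts are hypothetical.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective sample size of a case-control sample.

    Assumes the common definition N_eff = 4 / (1/N_cases + 1/N_controls),
    which equals the total sample size when cases and controls are balanced.
    """
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("both groups must contain at least one individual")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Hypothetical cohort: many individuals in total, but few cases.
n_eff = effective_sample_size(n_cases=25, n_controls=1000)
print(f"effective sample size: {n_eff:.1f}, meets minimum of 100: {n_eff >= 100}")
```

Under this definition, a cohort with many controls but few cases can fall short of the recommended minimum even when the total head count looks comfortable.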

QC for Both Base and Target Data

Check File Integrity After Transfer

Care should be taken to ensure that files have not been corrupted during download or transfer.

Check for file corruption after downloading or copying.
Tools such as md5sum can be useful.
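
Where md5sum is not available, the same check can be sketched with the Python standard library. This is an illustrative helper, not part of any PRS tool; the function names are hypothetical, and the expected checksum is assumed to come from the data provider.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_is_intact(path, expected_md5):
    """Compare a file's checksum with the value published by the data provider."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat for multi-gigabyte summary statistics or genotype files.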

Harmonize Genome Build

Confirm that the genome build is consistent between the base and target data, for example hg19 or hg38.

Check whether the base and target datasets use the same genome build.
If not, convert one dataset using LiftOver.
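
When the build is not documented, one rough heuristic (an assumption of this sketch, not a step from the original paper) is to compare the positions of SNPs shared by both datasets: if most shared SNP IDs sit at different coordinates, the builds probably differ. The function names and input layout below are hypothetical.

```python
def _normalise_chromosome(chrom):
    """Treat '1' and 'chr1' as the same chromosome label."""
    chrom = str(chrom)
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def position_concordance(base_positions, target_positions):
    """Fraction of shared SNP IDs with identical chromosome and position.

    Both arguments are dicts of SNP ID -> (chromosome, position).
    Returns None when the two datasets share no SNP IDs.
    """
    shared = base_positions.keys() & target_positions.keys()
    if not shared:
        return None
    matching = 0
    for snp_id in shared:
        base_chrom, base_pos = base_positions[snp_id]
        target_chrom, target_pos = target_positions[snp_id]
        same_chrom = _normalise_chromosome(base_chrom) == _normalise_chromosome(target_chrom)
        if same_chrom and int(base_pos) == int(target_pos):
            matching += 1
    return matching / len(shared)
```

A concordance close to 1 is consistent with a shared build; a value near 0 suggests that one dataset needs to be converted before SNPs are matched by position.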

Perform Standard GWAS QC

Recommended filtering thresholds include:

genotyping rate > 0.99
sample missingness < 0.02
Hardy-Weinberg equilibrium test P value > 1×10⁻⁶
heterozygosity within ±3 SD of the mean
minor allele frequency (MAF) > 1%
- use 5% when the target sample is small (fewer than 1,000 individuals)
imputation quality (info score) > 0.8
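
The variant-level thresholds in this list can be sketched as a single filter. This is an illustration only: the dictionary keys are a hypothetical layout, the genotyping rate is interpreted here as a per-variant call rate, and the sample-level criteria (missingness and heterozygosity) need a separate per-sample check.

```python
def passes_variant_qc(variant, maf_min=0.01, hwe_p_min=1e-6,
                      call_rate_min=0.99, info_min=0.8):
    """Return True if a variant clears all variant-level thresholds.

    variant: dict with the hypothetical keys 'maf', 'hwe_p', 'call_rate', 'info'.
    Raise maf_min to 0.05 for small target samples.
    """
    return (
        variant["maf"] > maf_min
        and variant["hwe_p"] > hwe_p_min
        and variant["call_rate"] > call_rate_min
        and variant["info"] > info_min
    )


# Hypothetical variants, for illustration only.
variants = [
    {"id": "rs1", "maf": 0.20, "hwe_p": 0.40, "call_rate": 0.999, "info": 0.95},
    {"id": "rs2", "maf": 0.004, "hwe_p": 0.70, "call_rate": 0.999, "info": 0.99},
    {"id": "rs3", "maf": 0.30, "hwe_p": 1e-9, "call_rate": 0.995, "info": 0.90},
]
kept = [v["id"] for v in variants if passes_variant_qc(v)]
print(kept)
```

Keeping the thresholds as parameters makes it easy to apply the stricter MAF cutoff or to mirror the criteria used in the original GWAS.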

In practice:

Check how QC was performed for the base data, and apply additional QC if necessary.
Perform QC on the target data using tools such as PLINK.

For downloaded GWAS summary statistics, the necessary QC has often already been completed as part of the original GWAS. The cutoffs and QC criteria used in that study may differ from those recommended here. If the original GWAS applied sufficiently rigorous QC, the data can usually be used as they are.

In contrast, more attention is often needed for the target data. As with a standard GWAS, appropriate QC should generally be completed in advance. Genotype data obtained from a biobank may already have undergone QC for missingness, heterozygosity, and related metrics, so it is important to confirm which QC steps have already been performed.
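
As one example of a sample-level check on the target data, the heterozygosity criterion (within ±3 SD of the mean) can be sketched as follows. The input is assumed to be one heterozygosity measure per sample, for instance the F coefficient reported by PLINK --het; the function name is hypothetical.

```python
from statistics import mean, stdev


def heterozygosity_outliers(het_by_sample, n_sd=3.0):
    """Return the IDs of samples whose heterozygosity measure lies more than
    n_sd standard deviations from the cohort mean.

    het_by_sample: dict of sample ID -> heterozygosity measure.
    """
    values = list(het_by_sample.values())
    if len(values) < 2:
        return set()  # a standard deviation cannot be estimated
    centre = mean(values)
    spread = stdev(values)
    return {
        sample_id
        for sample_id, value in het_by_sample.items()
        if abs(value - centre) > n_sd * spread
    }
```

Because the mean and SD are estimated from the cohort itself, this check is one reason a minimum sample size matters at the QC stage: in a very small cohort a single extreme sample inflates the SD enough to hide itself.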

PRS-Specific QC

Remove Ambiguous SNPs

Because DNA is double-stranded, the same allele can be reported as its complement depending on which strand is read (A↔T, C↔G). For SNPs with A/T or C/G alleles, the complement of each allele is the other allele of the same SNP, so it is impossible to tell from the allele labels alone whether the base and target data refer to the same strand. These SNPs are therefore called ambiguous. To avoid reversing the risk direction during PRS calculation, they should be removed from the base data.

Exclude A/T and C/G SNPs from the base data.
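
A minimal sketch of this filter, assuming each base-data record stores its two alleles under the hypothetical keys 'a1' and 'a2':

```python
# Allele pairs that read the same on both strands.
AMBIGUOUS_PAIRS = {frozenset(("A", "T")), frozenset(("C", "G"))}


def is_ambiguous(allele1, allele2):
    """True for A/T, T/A, C/G, and G/C SNPs."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def drop_ambiguous(records):
    """Keep only the records whose allele pair is not strand-ambiguous."""
    return [r for r in records if not is_ambiguous(r["a1"], r["a2"])]


# Hypothetical base-data records, for illustration only.
base = [
    {"snp": "rs1", "a1": "A", "a2": "G"},
    {"snp": "rs2", "a1": "T", "a2": "A"},
    {"snp": "rs3", "a1": "c", "a2": "g"},
]
print([r["snp"] for r in drop_ambiguous(base)])
```

Comparing unordered pairs means the order in which the two alleles are listed does not matter.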

Handle Mismatching SNPs

If allele coding differs between the base and target data, the SNP should be handled as follows:

If the alleles match after strand flipping, for example A/C vs G/T, flip and align them.
If they still do not match after flipping, for example C/G vs C/T, remove the SNP.

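Some PRS tools, such as PRSice, automatically flip strands or remove unresolvable SNPs when needed. Others may not: PLINK's --score, for example, does not flip strands on its own, so alleles should be aligned beforehand (PLINK provides --flip for this purpose). In either case, it remains important to confirm that mismatches have been handled correctly.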
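
The decision rule for mismatching SNPs can be sketched as below. This is an illustration under simplifying assumptions, not the logic of any particular tool: the function name is hypothetical, alleles are compared as unordered pairs, and ambiguous SNPs are assumed to have been removed already.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(base_alleles, target_alleles):
    """Classify a SNP as 'match', 'flip', or 'remove'.

    base_alleles, target_alleles: the two alleles of the SNP in each dataset.
    'flip' means the pairs agree once the target alleles are complemented.
    """
    base = frozenset(a.upper() for a in base_alleles)
    target = frozenset(a.upper() for a in target_alleles)
    if not all(a in COMPLEMENT for a in base | target):
        return "remove"  # indels or unexpected codes are not handled here
    if base == target:
        return "match"
    if base == frozenset(COMPLEMENT[a] for a in target):
        return "flip"
    return "remove"


print(classify_alleles(("A", "C"), ("G", "T")))  # resolvable by strand flipping
print(classify_alleles(("C", "G"), ("C", "T")))  # not resolvable
```

Because the pairs are unordered, this sketch does not track which allele is the effect allele; after alignment, the effect allele still has to be matched to the allele counted in the target genotypes.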

Practical Note

QC for ambiguous SNPs and mismatching SNPs is not usually part of standard GWAS QC, so special attention is needed. These are PRS-specific QC procedures and should be understood and handled appropriately.

Removing ambiguous SNPs only requires the base data, because the check depends on the allele pair alone and not on the genotype values in the target dataset. In practice, it is often enough to remove SNPs in the base data whose allele pairs are A/T, T/A, C/G, or G/C. SNPs that are absent from the base data are automatically excluded from the target data during PRS construction, so no separate handling is usually needed on the target side.

Mismatching SNPs, in contrast, can only be identified by comparing the allele codes in the base data with those in the target data (for example, the PLINK .bim file), although the genotype values themselves are still not needed.

Other Important QC Items

Remove Duplicate SNPs

The same SNP appearing multiple times can cause errors during PRS calculation.

Remove duplicates from the base data using shell scripts or R.
Remove duplicates from the target data using PLINK or similar tools.
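
For the base data, a minimal sketch that keeps the first occurrence of each SNP ID and drops later ones. Keeping the first occurrence is a choice made for this example; dropping every copy of a duplicated SNP is an equally defensible policy. The column name 'SNP' is a hypothetical default.

```python
def drop_duplicate_snps(rows, id_key="SNP"):
    """Return rows with repeated SNP IDs removed, keeping the first occurrence.

    rows: iterable of dicts, one per line of the summary statistics.
    """
    seen = set()
    kept = []
    for row in rows:
        snp_id = row[id_key]
        if snp_id in seen:
            continue  # a later copy of an SNP that was already kept
        seen.add(snp_id)
        kept.append(row)
    return kept


# Hypothetical summary statistics, for illustration only.
sumstats = [
    {"SNP": "rs1", "BETA": 0.10},
    {"SNP": "rs2", "BETA": -0.02},
    {"SNP": "rs1", "BETA": 0.11},
]
print([row["SNP"] for row in drop_duplicate_snps(sumstats)])
```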

QC for Sex Chromosomes

If self-reported sex does not match sex inferred from the X chromosome, the sample should be excluded. In most cases, SNPs on the sex chromosomes (X and Y) are excluded from PRS calculation.

Use PLINK --check-sex and exclude mismatched samples if necessary.

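Sex checks require individual-level genotypes, so in practice they are run on the target data. For base data that are available only as summary statistics, confirm that the original GWAS performed this check. Sex checks can also help detect sample mix-ups.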
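
After running PLINK --check-sex, the flagged samples can be collected with a short script. This sketch assumes the whitespace-delimited .sexcheck layout written by PLINK 1.9, with a header containing FID, IID, and STATUS columns and the value PROBLEM for flagged samples; the function name is hypothetical.

```python
def sex_mismatches(sexcheck_path):
    """Return (FID, IID) pairs flagged as PROBLEM in a PLINK .sexcheck file."""
    flagged = []
    with open(sexcheck_path) as handle:
        header = handle.readline().split()
        fid_col = header.index("FID")
        iid_col = header.index("IID")
        status_col = header.index("STATUS")
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if fields[status_col] == "PROBLEM":
                flagged.append((fields[fid_col], fields[iid_col]))
    return flagged
```

The resulting list can be written to a file and passed to PLINK --remove. Flagged samples are worth reviewing before exclusion, since a sample with a missing recorded sex or an inconclusive inferred sex is flagged in the same way as a true mismatch.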

Sample Overlap

If the same individuals are included in both the base and target datasets, the estimated PRS effect can be inflated.

If sample overlap is suspected, exclude the target participants from the base data and recalculate the base GWAS without them.

Remove Related Individuals

If related individuals are present across the base and target datasets, PRS effect estimates may be inflated. This is particularly important for first- and second-degree relatives, because they are genetically and environmentally similar, making it more difficult to distinguish genetic effects from shared environmental effects.

Estimate relatedness between samples in the base and target datasets and exclude related individuals.
If that is not feasible, use samples that are as independent as possible.

Final Thoughts

The level of PRS quality required depends on the purpose of the study. If the goal is to advance PRS methodology or statistical genetics itself, the QC steps described here should be regarded as the minimum standard. In contrast, for studies focused on clinical application, this level of QC may often be sufficient.

Naturally, higher-quality QC is always preferable. The key is to strike an appropriate balance based on the study objective and prior literature.

Finally, required QC steps may differ depending on the PRS algorithm or software being used. Be sure to review the documentation and recommendations for the specific tool you plan to use.

Genomic Research for Clinical Researchers ─ PRS QC Explained