Why have we sequenced half a million genomes?

What is whole genome sequencing?

DNA sequencing means reading DNA to determine the order of the four 'letters', called bases, which make up DNA (A, T, G and C).

Whole genome sequencing means reading a person's entire genetic code: the unique sequence of three billion building blocks that makes each individual who they are.

Both the DNA which forms instructions for producing proteins (the exome, around 2% of the genome) and the 'non-coding' region (the remaining 98%) have been sequenced for each participant.

UK Biobank's whole genome dataset is twice as big as any other

How will the sequencing data help human health?

The data will equip researchers with the tools to make previously impossible discoveries about how diseases develop, and how we can diagnose, prevent and treat them.

Whole genome sequencing data on this scale will allow researchers to explore rare and hidden genes involved in health and disease. They will be able to find patterns in the data that might be missed in a smaller dataset.

Combining this genetic information with the other data in UK Biobank – from health records, to lifestyle information and imaging data – provides researchers with unprecedented insight into how our genes, behaviours and environments play a role in our health.

Deciphering the genetic causes of disease & improving diagnoses 

Around 98% of our genome does not produce proteins. However, many regions of this non-coding DNA regulate whether coding genes (genes which do produce proteins) are turned on or off.

Genetic variation in the non-coding DNA can lead to too little or too much of particular proteins in the blood, causing disease.

This means that changes to blood protein levels can indicate the presence of disease. Scientists can use these proteins as 'biomarkers' to help diagnose diseases, understand their causes and monitor their progression.

Achieving a better understanding of genetic diseases

Many serious and rare diseases are caused by single, rare genetic changes which are difficult to detect in smaller populations. However, the scale of the whole genome sequencing data makes these rarer variants more likely to be found.

Diagnosing rare diseases can be challenging where there is a shortage of information for individuals and their families.

Take Huntington’s disease, which cannot be cured or slowed down. With previous sequencing data (exome and genotyping) it was impossible to study the genomic region involved in causing this disease in large populations of people prior to disease development. With these new data, this is now possible and may permit both earlier intervention and more timely diagnoses for at-risk individuals. 

Developing more effective medicines 

UK Biobank’s whole genome sequencing data will help to develop new medicines for a spectrum of diseases including heart disease, type 2 diabetes, rare genetic diseases and cancers.

Researchers can discover potential drug targets by exploring genetic variation in individuals who are at high risk of a particular disease. A drug target might be a specific gene which usually writes the instructions to make a protein involved in a key biological process.

However, a change in this gene (a mutation) could result in the production of a faulty protein which leads to disease. Medications can be developed to target and silence this mutated gene - this is known as precision medicine.

Personalising risk prediction of major diseases

The scale of the whole genome sequencing data allows researchers to understand disease risk in greater detail than ever before. Scientists will be able to look at the cumulative risk from lots of different genetic variants which together can have dramatic effects on disease risk. 

In doing so, they can more easily identify individuals who have a genetically higher risk for a particular disease or trait. The early identification of at-risk individuals for certain diseases can guide preventative measures (e.g. lifestyle modifications or medication).
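The idea of adding up many small genetic effects can be sketched in a few lines of Python. This is only an illustration of a simple weighted sum, often called a polygenic score; it is not UK Biobank's or any researcher's actual method. The function name, the three variants and their weights are all invented for the example.

```python
def polygenic_score(allele_counts, weights):
    """Add up the effects of many variants for one person.

    allele_counts: copies of the risk 'letter' a person carries at each
                   variant (0, 1 or 2).
    weights:       the assumed effect size of each variant.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per variant")
    return sum(count * weight for count, weight in zip(allele_counts, weights))


# Hypothetical effect sizes for three variants (made-up numbers).
weights = [0.10, 0.05, 0.20]

# Two hypothetical people and the number of risk letters each one carries.
person_a = [0, 1, 0]
person_b = [2, 2, 1]

score_a = polygenic_score(person_a, weights)
score_b = polygenic_score(person_b, weights)

# A higher score means a higher genetic risk under these assumed weights.
print(f"Person A: {score_a:.2f}")
print(f"Person B: {score_b:.2f}")
```

Each variant contributes only a little on its own, but person B carries more risk letters overall and so ends up with a noticeably higher score. Real studies estimate the weights from very large datasets and combine far more variants than three.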

Great potential lies in using the data to predict the risk of many diseases, such as breast cancer, colorectal cancer, asthma, rheumatoid arthritis and osteoporosis. 

Sequencing half a million genomes took over 350,000 hours

How is the data made available?

The sequencing data is made available to researchers in a de-identified form. Information that identifies our participants – such as names, addresses and NHS numbers – is never shared with researchers.

Where is the data stored?

The data is accessible via the UK Biobank Research Analysis Platform, a powerful and secure online environment where approved researchers can analyse the data without downloading it.


Who can access the data?

UK Biobank data is accessible only to approved researchers who have undergone a stringent vetting process, regardless of whether they work for a university, charity or company. This process also ensures that they are conducting studies for the public benefit and to advance health research. 

What genetic data has UK Biobank released before?

July 2017: genotyping data

A genotype describes which DNA 'letter' is found in a specific place in the genome.

More than 800,000 carefully selected genotypes were measured in the DNA of all 500,000 UK Biobank participants, and a further 90 million other genotypes were estimated for each participant. These data enabled researchers to see how specific regions of the genome differed between participants.

July 2022: whole exome sequencing data

The final set of whole exome sequencing data was released for 470,000 participants in 2022.

The exome is the coding portion of the genome (around 2% in total). Whole exome sequencing involves deciphering the order of DNA letters which make up the genes in this region. 

Understanding the interactions between genes and proteins is important because genetic mutations in the exome can produce faulty proteins which may lead to disease.

The more researchers learn about how and why these faulty proteins disrupt biological processes, the more information they have to develop drugs to target mutated genes.

However, it is important to sequence the whole genome, not just the exome, because variants implicated in disease are also found in the non-coding region. These include many rarer variants which may not be detected by whole exome sequencing.

If whole exome sequencing is reading one chapter of a book, whole genome sequencing is reading the whole book to discover where the story will go!

How did the project work?

Since the completion of the Human Genome Project in 2003, technological advancements have made whole genome sequencing much faster, easier and more cost-effective.

However, UK Biobank's whole genome sequencing project still took five years from start to finish.

Around half of the participants' blood samples, which contain their DNA, travelled to Iceland to be sequenced by deCODE Genetics, whilst the other half were sent to Cambridge to be sequenced by the Wellcome Sanger Institute.

The £230 million project was funded by a consortium of government, charity and industry partners, including Wellcome, UK Research and Innovation and four biopharmaceutical companies: Amgen, AstraZeneca, GSK and Johnson & Johnson.

On average, the genome of every participant in UK Biobank was sequenced 30 times to limit errors

"Sequenced data was a vital piece of the health jigsaw that scientists had hoped for, but never imagined would come so quickly."

UK Biobank's CEO, Professor Sir Rory Collins 
