Author + information
- Received December 1, 2014
- Accepted December 19, 2014
- Published online March 3, 2015.
- Johan L.M. Björkegren, MD, PhD∗,†,‡,§∗
- Jason C. Kovacic, MD, PhD†,
- Joel T. Dudley, PhD∗ and
- Eric E. Schadt, PhD∗
- ∗Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York
- †Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, New York
- ‡Division of Vascular Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
- §Department of Pathological Anatomy and Forensic Medicine, University of Tartu, Tartu, Estonia
- ∗Reprint requests and correspondence:
Dr. Johan L.M. Björkegren, Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1498, New York, New York 10029-6574.
Genome-wide association studies (GWAS) have been extensively used to study common complex diseases such as coronary artery disease (CAD), revealing 153 suggestive CAD loci, of which at least 46 have been validated as having genome-wide significance. However, these loci collectively explain <10% of the genetic variance in CAD. Thus, we must address the key question of what factors constitute the remaining 90% of CAD heritability. We review possible limitations of GWAS, and contextually consider some candidate CAD loci identified by this method. Looking ahead, we propose systems genetics as a complementary approach to unlocking the CAD heritability and etiology. Systems genetics builds network models of relevant molecular processes by combining genetic and genomic datasets to ultimately identify key “drivers” of disease. By leveraging systems-based genetic approaches, we can help reveal the full genetic basis of common complex disorders, enabling novel diagnostic and therapeutic opportunities.
- atherosclerotic plaque
- genome-wide association study
- myocardial infarction
- primary prevention
- regulatory gene networks
Heritability is the proportion of observed differences in a trait among individuals that is due to genetic differences, chiefly those believed to be identifiable in DNA. Environmental factors, including random chance, combine with genetic factors to shape traits such as disease phenotypes. Traditionally, genetic and environmental factors have been viewed and studied largely as 2 independent entities contributing to disease. In reality, the interaction of environmental and genetic factors in causing a trait or a disease is significant. In fact, the presence or absence of an environmental factor may determine whether or not a genetic factor will contribute to disease. Genome-wide association studies (GWAS), which arose from the success of family and linkage studies in identifying causative variants for rare disorders, have defined several hundred genome-wide significant loci for complex diseases. However, despite meta-analyses of many GWAS, the overall contribution of identified loci to disease variation in the population is frequently <10%. In this review, we critically appraise GWAS, particularly from the perspective of their possible limitations in identifying genetic factors that are environmentally dependent. Using coronary artery disease (CAD) as the example, we then propose that integrating existing GWA datasets with systems genetics approaches may offer a path toward a more complete understanding of complex diseases, including their heritability.
Genetics of rare single-gene disorders—the foundations of the search for heritability of common disorders
The heritability of traits between generations is principally carried in deoxyribonucleic acid (DNA), mainly as single nucleotide polymorphisms (SNPs), insertions, deletions, or copy number variants. Another type of heritability, believed to be independent of DNA (1,2), is carried by epigenetic mechanisms, which are credited with inducing major shifts in DNA transcription (3). Epigenetics is not a topic of this paper, but has been reviewed extensively elsewhere (4). In the case of rare disorders, most carriers of 1 or more highly penetrant risk variants develop the disease at some point (Figure 1A). However, it is also common that some persons carrying a potentially causative variant will not develop that disorder; thus, the penetrance is typically <100%. Penetrance is a time-dependent aspect of phenotypic disease expression. For example, both Huntington disease and cystic fibrosis are virtually 100% penetrant: cystic fibrosis usually manifests soon after birth and Huntington disease by about 70 years of age. In contrast, familial breast cancer–associated mutations in the BRCA1 gene have a lifetime penetrance of 60% to 85%. That risk variants do not necessarily lead to manifest disease in all carriers has been highlighted as a possible path to identifying genetic and environmental mechanisms that confer resistance to certain rare diseases (5). Mechanisms that buffer against disease could be potential therapeutic targets.
Despite the high penetrance of disease-causing variants, rare diseases are, by definition, rare. As a result of differences in genetic ancestry, the spectrum and relative frequencies of disease-associated alleles vary among different populations. For example, cystic fibrosis is most common in populations originating from northern or western Europe (∼1 in 2,000), whereas sickle-cell anemia is more common in African or Afro-Caribbean populations (∼1 in 3,000). Through careful characterization of symptoms and disease phenotypes in families carrying rare disorders, the causes of nearly one-half of the estimated 7,000 single-gene disorders have been identified, mainly by using markers found at increased density across the genome and by linkage analysis in pedigrees of disease-carrying families (6). Although these were key techniques in the discovery of highly-heritable single-gene disorders that typically only affect a limited number of biological pathways and tissues, their scope is inadequate for genetic studies of complex disorders such as atherosclerosis and CAD, where disease inheritance is blended with environmental risk factors, and causative genes are likely to be operative across several tissues (7).
Genetics of common complex disorders—genome-wide association studies
The successful identification of the genetic causes and mechanisms of many rare single-gene disorders inspired scientists to use a similar approach to study common complex disorders. Because such diseases are widespread, linkage analysis within families was not believed to be appropriate. Furthermore, it was expected (and now confirmed) that many genetic signals underlie common complex disorders, each with a relatively weak effect (e.g., odds ratio: <1.5). A study design was instead chosen that analyzed increasingly dense genomic markers in thousands of mostly unrelated individuals in case-control association studies. From these lines of argument, the GWAS design was born (8).
One of the first GWAS of CAD came from the Wellcome Trust Case Control Consortium, which discovered that the chromosome 9 locus was associated with CAD (9–11). Since then, 153 suggestive DNA variants associated with CAD have been discovered by GWAS, of which 46 were replicated in meta-analyses of genome-wide association (GWA) datasets (12). These CAD-associated loci are strikingly pervasive across the population, but generally have weak effects. As recently reviewed (8), 50% of the CAD-associated variants occur in over one-half of the population, and at least 25% occur in over 75% of the population. However, each variant usually confers a minimal to modest increase in relative risk, averaging only 18% (corresponding to an odds ratio of 1.18). A common theme of recent GWAS reviews is the success of this approach, both for CAD and for other, more or less complex disorders, for which more than 1,000 loci have been identified (13). Certainly, the sheer number of loci discovered, which has also led to the discovery of many previously unknown disease-causing genes, is a major success. Nevertheless, for most common complex disorders, the combined contribution of these loci to disease variation in the population is frequently <10%. Indeed, the 153 known CAD-associated variants explain <10.6% of the likely genetic variation across the population (12). Thus, ∼90% of the heritability of CAD, and of most other common complex disorders, remains unexplained by identified GWA loci, despite the inclusion of remarkably large numbers of subjects: a recent meta-analysis of several GWAS for CAD comprised 63,746 cases and 130,681 controls (12). 
Although a GWAS-based approach will clearly not reveal the full extent of the heritability of CAD and other common complex disorders, more complete sequencing techniques, particularly expanded whole-exome/whole-genome sequencing (WES/WGS), typically applied to the same case-control cohorts previously used in GWAS, promise to reveal additional rare risk variants, perhaps with larger effects on heritability (14). These results and additional refinements of the analysis of existing GWAS (14) will contribute to reducing the large fraction of missing heritability, but to what extent?
When addressing the “missing heritability,” it is also reasonable to question the reliability of CAD heritability estimates. If the overall fraction of genetic variance in CAD is less than we believe, is the missing heritability (∼90%) largely exaggerated? From traditional analyses of family pedigrees and twins, the range of genetic variance in CAD is between 40% and 60% (15). Assuming 40% heritability, the 153 CAD-associated SNPs explain 10.6% of CAD variability. Another way of assessing heritability using GWA datasets is to consider all measured SNPs (16). When this “polygenic” model was applied to the complex trait of height, 294,831 SNPs were found to explain as much as 45% of height variance (16). The authors concluded that because individual effects are too small to pass the stringent significance tests (p < 10⁻⁸) traditionally used in GWAS, most height heritability is not “missing,” but has simply not been detected in the GWA data. Thus, the range of 40% to 60% CAD heritability is probably reasonable. However, the notion of completely independent genetic and environmental risk factors needs reconsideration, as most genetic risk factors increasingly appear to be dependent on environmental influences (17).
Genetics of common complex disorders—the search for missing heritability
Some of the missing heritability of CAD and other common complex disorders is likely carried by rare variants that may be identified in ongoing WES/WGS projects (as opposed to the relatively common variants discovered to date in GWAS). In addition, epigenetic mechanisms will likely carry an as yet unknown fraction of complex disease heritability (4). Although defining the role of epigenetic mechanisms requires other techniques, and despite ambiguity regarding how epigenetic modifications remain conserved across generations (1), these sources will undoubtedly reduce the fraction of missing heritability. Will they provide the full picture? Or are there fundamental problems with how we have been trying to understand and therefore seek heritability for complex traits? We believe this may be the case and that it is timely to critically review our understanding of complex disease inheritance by more carefully investigating how genetic risk variants interact with environmental factors to cause disease.
In this review, we question the notion that the heritability of common complex disorders is best revealed by traditional analyses of DNA sequence data in isolation, as thus far performed in GWAS and WES/WGS projects. Testing DNA variants one at a time in these studies results in an overwhelming multiple-testing problem, and it can be questioned whether simply making the case-control cohorts larger to enable smaller differences to become genome-wide significant is the only reasonable path to follow. We also critically appraise the potential importance of identified risk loci by examining the suggested roles of their putative candidate genes in disease development. Using CAD development as an example, we suggest that genome-wide significant risk loci most likely underlie an early and protracted phase of CAD development, but are unlikely to regulate the rapid or late phases of CAD development that culminate in clinical events (Figure 1B). We instead believe that there may be a more important subpopulation of risk variants, which exert their effects on CAD only in certain environmental contexts. Because of this context-dependence, such variants are likely to appear nominally significant (p < 0.05) (Figure 1C) in traditional analyses of GWA datasets that consider DNA variants one at a time, but would fail to reach genome-wide significance (p < 10⁻⁸) (Figure 2) (17). To grasp these assumptions, it is vital to understand how complex diseases develop. In the following section, we evaluate atherosclerosis and plaque development from the perspective of the likely timing and pathogenesis of genetic effects in CAD.
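The scale of the multiple-testing burden can be illustrated with a short calculation. The sketch below assumes a Bonferroni-style correction and a hypothetical count of 1 million independent tests; neither assumption is taken from the studies discussed here, and the function names are illustrative only.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def expected_false_positives(p_cutoff: float, n_tests: int) -> float:
    """Expected number of truly null variants that pass p_cutoff by chance."""
    return p_cutoff * n_tests


# Assumption: 1 million independent tests (an illustrative round number).
N_VARIANTS = 1_000_000

genome_wide_cutoff = bonferroni_threshold(0.05, N_VARIANTS)
nominal_false_hits = expected_false_positives(0.05, N_VARIANTS)
strict_false_hits = expected_false_positives(1e-8, N_VARIANTS)

print(f"Bonferroni cutoff for alpha=0.05: {genome_wide_cutoff:.1e}")
print(f"Expected chance hits at p < 0.05: {nominal_false_hits:.0f}")
print(f"Expected chance hits at p < 1e-8: {strict_false_hits:.2f}")
```

Under these assumptions, tens of thousands of null variants would be expected to pass the nominal threshold by chance alone, which is why nominal significance cannot by itself separate context-dependent risk variants from noise.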
Atherosclerosis development culminating in clinical manifestations—myocardial infarction and stroke
Generally, diseases are believed to develop according to a sigmoidal (S-shaped) curve (19): commencing slowly with a positive acceleration phase; then increasing rapidly, approaching an exponential growth rate as in a J-shaped curve; and finally, saturating, stabilizing at a near-zero growth rate.
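As a rough illustration of the 3 phases of an S-shaped curve, the sketch below uses the standard logistic function. The choice of the logistic form, the parameter names (`plateau`, `rate`, `midpoint`), and the time points are assumptions made for illustration and are not fitted to any disease data.

```python
import math


def logistic(t: float, plateau: float = 1.0, rate: float = 1.0,
             midpoint: float = 0.0) -> float:
    """Sigmoidal (logistic) disease burden at time t, saturating at `plateau`."""
    return plateau / (1.0 + math.exp(-rate * (t - midpoint)))


def growth_rate(t: float, plateau: float = 1.0, rate: float = 1.0,
                midpoint: float = 0.0) -> float:
    """Instantaneous growth rate, i.e. the derivative of the logistic curve."""
    y = logistic(t, plateau, rate, midpoint)
    return rate * y * (1.0 - y / plateau)


# Illustrative time points (arbitrary units) for the three phases.
early, middle, late = -6.0, 0.0, 6.0
for label, t in (("slow start", early), ("rapid growth", middle),
                 ("saturation", late)):
    print(f"{label:>12}: burden={logistic(t):.3f}, rate={growth_rate(t):.3f}")
```

The growth rate is small at both ends of the curve and peaks at the midpoint, which mirrors the slow, rapid, and saturating phases described above.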
The development of atherosclerotic lesions in the coronary tree generally agrees with this S-shaped model, although the final phase of plaque development and progression may be variable and can include further progression (Figure 1B) (20). The notion that atherosclerosis develops slowly over a very long period, followed by more rapid progression, is supported by studies in mice (21,22) and humans (23,24). Briefly, early atherosclerotic lesions develop slowly, over 20 to 30 weeks in mice and probably over 30 to 40 years in humans, starting in adolescence (Figure 1B). Atherogenesis involves retention of circulating plasma lipoproteins, mainly low-density lipoprotein (LDL), at sites of turbulent blood flow. Some LDL particles remain and are modified by redox processes in the subendothelial space. Oxidized LDL subtly activates the endothelium, primarily by inducing the expression of adhesion molecules, which promote transendothelial migration of leukocytes, predominantly monocytes. Upon entering the subendothelial space and the intima, monocytes differentiate into macrophages, which take up oxidized LDL particles, initiating a key process of atherosclerosis: foam cell formation. As lipid-laden foam cells accumulate in the intima, fatty streaks appear as the first histologically visible manifestation of atherosclerosis. This early phase of atherosclerosis development ends when foam cells within the fatty streaks start to aggregate, forming small atherosclerotic plaques with well-defined borders (21).
In the second phase, the small plaques expand rapidly, both across the arterial wall and, importantly, into the lumen of the artery, where they can compromise blood flow. The expansion phase is rapid (∼10 weeks) in mice, and evidence from 14C dating of human atherosclerotic plaques (23) suggests that it is also rapid (relative to lifespan) in humans (<10 years before clinical symptoms).
In the third and final phase, plaque biology can be quite variable, with approximately 30% of lesions showing rapid progression over 12 months to become fibroatheromas, with a lipid-rich core encapsulated by either a thin (10%) or thick cap (20%). Thin-cap fibroatheromas are considered the most unstable lesions and the most likely to lead to acute myocardial infarction. Over the subsequent 12 months, 75% of thin-cap fibroatheromas stabilize, whereas 5% of thick-cap fibroatheromas develop high-risk features (20,25). Furthermore, depending on its location, a mature plaque can have dramatic or minimal effects on blood flow, possibly leading to angina caused by ischemia of the heart muscle subtended by the plaque-narrowed artery.
A complex interplay of factors influences whether or not an advanced plaque ruptures, including: the extent and degree of lipid-core necrosis sustained by proliferating macrophages within the plaque; de novo monocyte migration/emigration; the degree of luminal stenosis; plaque burden; positive (outward) vessel remodeling; and the thickness of the fibrous cap (26). Rupture of a coronary plaque leads to intracoronary thrombus formation, which may or may not occlude the vessel. Although a nonocclusive thrombus may increase luminal narrowing and hasten lesion progression, an occlusive thrombus is associated with reduced myocardial perfusion that typically causes an acute coronary syndrome (myocardial infarction or unstable angina) or sudden death.
GWA Loci for Common Complex Diseases—How Important Are They?
Genome-wide significant loci identified by GWAS conducted for complex diseases may fail to identify central pathological processes and corresponding key DNA variants that contribute to heritability. This notion rests primarily on 3 conventions about the development and phenotypic expression of complex diseases, as opposed to rare diseases (Figures 1A and B).
1 The case/control overlap
In comparing cases with population controls in studies of complex diseases (as opposed to rare diseases) (Figures 1A and B), overlap in central disease processes is inevitable, as these processes may be active without having caused clinical symptoms in the control subjects. Consequently, many DNA variants regulating genes active in these processes will not surface at a level of genome-wide significance in GWAS. However, due to a likely over-representation of these processes within cases (albeit not unique), DNA variants regulating these processes will likely instead present in GWAS with nominally significant p values.
2 The context of shifting environments
DNA variants that regulate genes active in central disease processes, and which do so independently of changing environmental contexts, are likely to surface with genome-wide significance in GWAS. In contrast, DNA variants that depend on pre-existing environmental contexts to regulate genes in central disease processes are unlikely to surface with genome-wide significance in GWAS. A principal reason is that many of these contexts are variably present (active), probably resulting in context-dependent risk variants presenting with nominally-significant disease associations in GWAS (Figure 1C). Contexts presenting at the macroenvironmental level in CAD are mainly lifestyle factors (smoking, diet, and sedentary lifestyle) or other major disease risk factors such as obesity, diabetes, hypertension, and certain inflammatory diseases. Such factors can be considered in GWAS (27), but most have not yet been accounted for. Other macroenvironmental factors, like highly stressful events (such as death of a spouse or natural disaster) (28,29) also predispose to myocardial infarction, but are harder to define (and thus to consider) in the individual patient. Macroenvironmental factors inevitably result in alteration of a wide array of microenvironments specific to tissues and cell types. The microenvironment in a given cell or tissue is the final determinant of whether a context-dependent DNA variant will be active (affecting gene activity) or not. Take inflammation as an example of a variable microenvironmental context. When early “fatty streak” lesions develop into plaques with intact borders, intralesional foam cells are believed to initiate inflammatory gene activation, which subsequently causes the rapid expansion phase of lesion growth (21,22). 
Previously silent DNA variants that specifically affect the activation of these inflammatory genes will now be suddenly relevant to disease progression, only to again become silent (or less active) in the late phase of lesion development. There are many reasons why a DNA variant can have changing effects on the genes it regulates—the most obvious being that the genes regulated by a given DNA variant may, during early phases of disease development, be largely silent (not expressed, as exemplified with inflammatory genes). A more complex reason is that a microenvironmental context (e.g., inflammatory stimulation) activates a specific cotranscription factor whose binding and effect depend on a given allele of a DNA variant. In this scenario, the microenvironmental perturbation (that is, inflammation) is needed for the regulatory effect of the DNA variant. Context-dependency of gene expression in this fashion has experimentally been shown to be both common and strong (30–33).
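The dilution of a context-dependent effect in a marginal case-control comparison can be illustrated numerically. The sketch below is a deliberately simple model and not a method from the studies cited: it assumes that the context is active in a fixed fraction of individuals, that context and genotype are independent, and that the variant has no effect when the context is inactive. All names and input values are hypothetical.

```python
def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def prob(o: float) -> float:
    """Convert odds back into a probability."""
    return o / (1.0 + o)


def marginal_odds_ratio(or_in_context: float, context_fraction: float,
                        baseline_risk: float) -> float:
    """Odds ratio seen when the context is ignored (assumed simple model).

    Carriers have an odds ratio of `or_in_context` only in the fraction of
    individuals whose context is active; elsewhere the odds ratio is 1.
    """
    if not 0.0 <= context_fraction <= 1.0:
        raise ValueError("context_fraction must lie in [0, 1]")
    if not 0.0 < baseline_risk < 1.0:
        raise ValueError("baseline_risk must lie strictly between 0 and 1")
    risk_active = prob(odds(baseline_risk) * or_in_context)
    carrier_risk = (context_fraction * risk_active
                    + (1.0 - context_fraction) * baseline_risk)
    return odds(carrier_risk) / odds(baseline_risk)


# Hypothetical inputs: odds ratio 1.5 in context, context active in 20% of
# individuals, 10% baseline disease risk.
diluted = marginal_odds_ratio(1.5, 0.2, 0.1)
print(f"Marginal odds ratio: {diluted:.3f}")
```

With these hypothetical inputs, an odds ratio of 1.5 within the active context shrinks to roughly 1.10 when the context is ignored, which is the kind of attenuation that would push a variant from genome-wide toward nominal significance.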
3 The context of time
DNA variants that regulate genes in central disease processes that are active (relevant) over a long period are more likely to surface with genome-wide significance than those that regulate disease processes over a short period (e.g., late in disease development). This is in part related to point 2, because DNA variants that depend on shifting environments for their effect on disease processes are also likely to affect the disease over a shorter time period.
Taking these 3 conventions together, we suspect that DNA variants identified as having genome-wide significance by GWAS are likely to regulate early disease processes in the slow initial growth phase, rather than those in the rapid growth or late phases (Figure 1B). In CAD development, for example, this assumption is partly based on the following developmental characteristics. First, the pathobiology of early CAD development is likely more genetically driven than the later phases and is less affected by environmental exposures linked to later life stages, like diabetes, hypertension, obesity, sedentary lifestyle, and inflammatory states (all less prevalent in adolescence and the early adult years). Also, because the slow phase is the first and longest phase of CAD development, the slope of its curve will largely determine how case and control subjects are defined in GWAS (Figure 1C). Consequently, the most significant disease-associated DNA variants, defined by comparing case and control subjects (genome-wide significant variants), will likely point toward genes active in the early phase. In contrast, the late and rapid phases are driven more by disease processes that are often shared between case and control subjects (e.g., obesity, hypertension, dyslipidemia, and diabetes) and that are more influenced by environmental factors compared with the early phase. Thus, DNA variants affecting later CAD phases are more likely to be context-dependent. Adding to the complexity of the later phases is that they commonly involve many other parallel disease processes acting across several organs. For CAD, these organs are primarily metabolic (e.g., the liver and pancreas with diabetes, adipose tissue with obesity, or systemic immune activation).
Thus, although late processes likely involve additional systemic contributions linked to context-dependent DNA variants that are unlikely to be explained by genome-wide significant loci detected by GWAS, early processes are more likely to be genetically driven and exposed to fewer confounding factors, and therefore, are more likely to contain regulatory DNA variants of genome-wide significance.
Multiple ongoing studies seek to better understand the mechanisms of the 46 genome-wide significant loci thus far identified by GWAS for CAD, and whether they relate to early or late events in CAD development (7). If (as we suspect) most of these CAD loci (and loci for other complex diseases that develop in a similar fashion) are related to early events in disease development, their clinical usefulness may be limited primarily to guiding preventive measures and, possibly, to developing therapies against early disease development (primary prevention). Conversely, the usefulness of these findings for secondary prevention, to prevent the rapid and late phase of CAD development, would be restricted.
To test the assumption that DNA variants with genome-wide significance mainly regulate early CAD development, we examined candidate genes assigned to CAD GWA loci (12). Of the 50 candidate genes proposed for 46 loci confirmed in a meta-analysis of GWA datasets, 10 are involved in regulating plasma lipid levels (7 for LDL, 1 for high-density lipoprotein, and 2 for triglycerides) (8). Plasma lipids are primarily of importance for driving early atherosclerosis development, consistent with the notion that loci identified by GWAS will be more useful for primary prevention and with the experimental finding that atherosclerosis regression in response to LDL lowering is much greater for early lesions than for mature and advanced lesions (22). An additional 4 candidate genes are involved in hypertension, which is clearly important for early endothelial activation; however, its importance for later phases of CAD is less clear. In fact, the guidelines for hypertension treatment in the elderly (>60 years of age) were recently altered; the blood pressure goal has now been eased to <150/90 mm Hg (34) because the risk for stroke and CAD in this group was not increased at the previous lower blood pressure limit (<140/85 mm Hg), as it is in younger people (35,36). These insights again highlight that the variants that drive early atherogenesis are likely to be over-represented among DNA variants with genome-wide significance in GWAS. Interestingly, the only GWA locus (rs579459) involving blood groups (and the gene ABO) is not linked to CAD, but is linked to myocardial infarction (37). Adding weight to our argument that SNPs identified by GWAS are generally associated with early CAD development, this ABO-associated locus is the only SNP thus far identified by GWAS that is associated with myocardial infarction (obviously a very late event in CAD development) (37).
For 35 of the 50 candidate genes, their role in atherosclerosis is unknown. Many, like the suggested mechanism for the 9p21 locus, appear to involve the vascular wall, which indicates that they are likely to be primarily involved in early atherosclerosis development. However, perhaps the most compelling evidence for the hypothesis that GWA loci identified for CAD primarily reflect early atherosclerotic development is the lack of hits for inflammatory or immune responses, which are thought to be a central and causal disease mechanism, particularly for late stages of atherosclerosis and CAD (38,39). Further supporting the notion of late activation, microarray studies of atherosclerotic lesions during their progression show activation of inflammatory genes predominantly in the late stages (21). We, therefore, suggest that the conspicuous absence of inflammatory regulators among currently-identified genome-wide significant variants for CAD strongly signals that the GWA approach does not capture the full spectrum of genetically-driven events of coronary atherosclerosis.
In summary, considerable evidence supports the notion that genes so far identified for GWAS loci predominantly regulate early CAD development. We expect that the extensive ongoing studies into the molecular mechanisms of the 46 confirmed GWA-defined CAD loci will shed light on this issue, as these mechanisms will likely be traceable to early versus late events in the pathogenesis of atherosclerosis.
Systems Genetics—Identifying Disease-Driving Networks and Their Genetic Regulation
Taken together, the proposed conventions imply that early CAD development is governed by genetic variants of molecular disease processes that are persistent and less affected by environmental contexts than those governing later phases of CAD (Figure 1C). Variants acting in the early phase are consequently more likely to have surfaced as genome-wide significant in GWAS, at least given how these datasets have been analyzed to date.
A pertinent question then arises: how might one identify context-dependent DNA variants regulating disease processes active over a limited time, as in the rapid expansion and late phases of CAD development, that are not recognized as genome-wide significant variants in a traditional DNA analysis? As these late processes are believed to involve many (as opposed to isolated) causative DNA variants with varying context-dependence (shifting with microenvironments and time), we believe that the key lies in first defining the molecular processes driving these later phases in complex diseases, and then identifying the DNA variants that causally regulate them, thus allowing their contribution to heritability to be weighed. By first addressing the molecular underpinnings of variable complex disease processes, we may be able to unmask a substantial portion of the missing 90% of the heritability of CAD and other complex diseases.
How can this be best achieved? The increasingly frequent answer is systems genetics (40–43). To us, the aim of systems genetics can be summarized as using genomic activity measures (e.g., ribonucleic acid [RNA], proteins, metabolites, and DNA modifications) to define disease-driving molecular processes and integrate them with GWA datasets, thereby permitting their contribution to complex disease heritability to be understood. However, the ultimate goal must be to enable diagnosis and treatment of patients on the basis of the status of these complex disease processes and to modulate pathological activity toward a nonpathological state.
It is increasingly understood that individual genetic variants, individual genes, or even linear pathways will never explain the intrinsic complexity of molecular processes underlying common diseases like CAD. Instead, these processes have polygenic regulation and consist of multiple genes interacting in highly complex, fluid, and dynamic biologic networks reminiscent of intricate wiring diagrams. Fortunately, biological networks are sparse, with most genes (nodes) having only a small number of interactions (edges) with other genes, and with only a few highly interconnected nodes acting as hubs with many edges (44). These features can be identified from measures of genome activity (45). Furthermore, biological networks are well-conserved throughout evolution and, because of built-in redundancy, are biologically robust to an individual node’s loss (46). In parallel, technological advances in screening genomes and genome activity with ever-greater reliability and lower cost, together with increasing capacities for computational analysis of large datasets, have set the stage for more widespread use of systems genetics in biology, medicine, and health care (17).
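The distinction between sparsely connected nodes and hubs can be made concrete by counting edges per node. The sketch below assumes an undirected network supplied as a list of gene pairs with no duplicate edges; the gene labels, the degree cutoff, and the function names are placeholders chosen for illustration.

```python
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    degrees = Counter()
    for gene_a, gene_b in edges:
        degrees[gene_a] += 1
        degrees[gene_b] += 1
    return degrees


def find_hubs(edges, min_degree):
    """Return, in sorted order, the nodes with at least `min_degree` edges."""
    return sorted(node for node, degree in node_degrees(edges).items()
                  if degree >= min_degree)


# Hypothetical toy network; G1..G7 are placeholder labels, not real genes.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
             ("G1", "G5"), ("G5", "G6"), ("G6", "G7")]

hubs = find_hubs(toy_edges, min_degree=3)
print("Degrees:", dict(node_degrees(toy_edges)))
print("Hubs:", hubs)
```

In this toy network most nodes have 1 or 2 edges and a single node qualifies as a hub, which is the sparse, hub-dominated pattern described above.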
Presently, causal disease networks are mostly inferred from the combination of genotype (DNA) and gene expression data in genetics of gene expression studies (GGES) (Figure 3). Although the details are beyond the scope of this review, this is achieved using network inference algorithms for coexpression (i.e., weighted coexpression network analysis), Bayesian probabilistic network models (47–50), and direct statistical tests for causality (51,52). To date, most algorithms are designed to infer disease networks from gene expression data generated by microarrays. More recently, modified algorithms that also infer biological networks from heterogeneous next-generation sequence datasets (e.g., RNA sequence) are emerging (53).
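The core idea behind weighted coexpression networks can be sketched in a few lines: pairwise correlations between gene expression profiles are raised to a power so that weak correlations are suppressed. The code below is a heavily simplified illustration, not the algorithm used in the cited studies; the expression values, gene labels, and the power of 6 are arbitrary assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant profiles have no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expression, power=6):
    """Unsigned weighted adjacency |r|**power for every ordered gene pair."""
    genes = list(expression)
    return {(g, h): abs(pearson(expression[g], expression[h])) ** power
            for g in genes for h in genes if g != h}


# Hypothetical expression profiles across 5 samples (placeholder labels).
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}

adjacency = soft_threshold_adjacency(toy_expression)
for pair, weight in sorted(adjacency.items()):
    print(pair, f"{weight:.4f}")
```

Strongly coexpressed pairs keep a weight near 1, whereas weakly correlated pairs are pushed toward 0, which yields the sparse structure that downstream module detection relies on.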
In our CAD research, we have focused on GGES of multiple tissues (Central Illustration), namely the STAGE (Stockholm Atherosclerosis Gene Expression) (7) and STARNET (Stockholm Tartu Atherosclerosis Reverse Network Engineering Task) studies. STAGE was a pilot study for STARNET, with 100 and 900 cases, respectively. Subjects were recruited from patients undergoing open thorax surgery; those having coronary artery bypass grafting served as cases, and those without atherosclerosis or CAD (confirmed by pre-operative angiography) undergoing other forms of open thorax surgery (e.g., isolated mitral valve repair) served as controls.
Parallel sampling of up to 9 CAD-relevant tissues from each patient is a key aspect of the STAGE and STARNET studies (7). RNA samples from case and control subjects were obtained from the arterial wall, liver, visceral abdominal fat, skeletal muscle, subcutaneous fat, primary monocytes, and monocytes that were differentiated in vitro into macrophages and foam cells. The 9 RNA samples were then converted into microarray data (STAGE, custom-made HuRSTA-2a520709 arrays [Affymetrix, Santa Clara, California]) and, more recently, RNA sequence data (STARNET). These RNA expression datasets are now used: 1) to infer causal regulatory disease-driving molecular processes, as reflected in gene networks operating both within and across tissues to cause CAD; and 2) to identify DNA variants that modulate these networks (7). We believe that the STAGE/STARNET datasets are unique in allowing us, for the first time to our knowledge, to study the inherent complexity of the molecular process underlying the late, possibly rapid phases of CAD development across the 9 collected tissues.
Even before inferring gene networks, the STAGE/STARNET dataset can be used to identify expression quantitative trait loci (eQTLs) (54–58) in CAD, especially as related to established GWA loci. An eQTL is a DNA variant (frequently an SNP) that regulates gene expression levels. eQTLs are determined by linking alleles of the SNPs found by genotyping the patient’s DNA (e.g., in STAGE and STARNET from Affymetrix GenomeWideSNP_6 arrays) with gene expression levels from the various tissues. SNP alleles associated with different levels of gene expression, and therefore acting as eQTLs, can then be identified.
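As a minimal sketch of the linkage step described above, the following example regresses the expression level of one gene on the allele dosage of one SNP. It assumes a simple additive model with dosage coded as 0, 1, or 2 copies of the counted allele; the function name and the toy values are invented for illustration and are not taken from STAGE or STARNET. In a genome-wide scan, the same test would be repeated for every SNP-gene pair and corrected for multiple testing.

```python
import math

def eqtl_association(dosage, expression):
    """Regress expression on allele dosage (0, 1, or 2 copies) for one SNP-gene pair.

    Returns (slope, r, t): the per-allele effect, the Pearson correlation,
    and the t statistic of the slope with n - 2 degrees of freedom.
    """
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    slope = sxy / sxx                   # expression change per allele copy
    r = sxy / math.sqrt(sxx * syy)      # strength of the linear association
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return slope, r, t

# Toy data (invented): 12 subjects, one SNP, one gene in one tissue.
dosage = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
expression = [5.1, 4.8, 5.3, 5.0, 5.9, 6.2, 5.7, 6.1, 7.0, 6.8, 7.3, 6.9]
slope, r, t = eqtl_association(dosage, expression)
print(f"per-allele effect = {slope:.2f}, r = {r:.2f}, t = {t:.1f}")
```

A large absolute t statistic marks the SNP as a candidate eQTL for that gene in that tissue; running the test separately per tissue shows whether the variant acts in one tissue or in several.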
Using the STAGE data, we recently identified 8,156 eQTLs for 6,450 unique genes across 7 CAD-relevant tissues (59). By integrating the analysis with 2 independent GWA datasets for CAD, the Myocardial Infarction Genetics Consortium (60) and the Wellcome Trust Case Control Cohort (61), to assess the enrichment of these eQTLs in inherited risk, we discovered that those eQTLs regulating gene activity across greater numbers of tissues lead to increased CAD risk. Furthermore, eQTLs that were operative across several tissues resided at regulatory genomic “hot spots” (62). In contrast, most of the 22 eQTLs identified in the STAGE study that were established as “CAD GWA hits” affect gene expression in a single tissue, or, at most, in 2 tissues. In our view, the multitissue involvement of the risk-enriched eQTLs suggests that they regulate molecular processes acting across several of these tissues (63) in late CAD development and also may be important contributors to the inherited CAD risk. By contrast, the fact that the 22 STAGE eQTLs at established GWA loci for CAD regulated genes mostly in 1 or 2 tissues is consistent with their involvement in early CAD development.
The STAGE/STARNET datasets were next used to define groups of genes acting together in modules and networks, primarily on the basis of similar coexpression (17,52,64,65) across the 9 tissues. After identifying these modules, an important next step is to link them to relevant patient phenotypes. For example, we calculated the eigengene value representing the sum of all gene expression values in a given module/network (66), which can subsequently be used to correlate modules with phenotypic characteristics of STAGE and STARNET patients. Gene modules associated with key CAD phenotypes, such as the angiographic SYNTAX score (from pre-operative angiograms) and plasma LDL cholesterol levels, can thus be identified. For modules with strong phenotypic associations, Bayesian network algorithms or other statistical causal inference techniques are applied, incorporating information on eQTLs to determine modules that are causally related to CAD, as opposed to those that are reactive or independent. The Bayesian networks can also be employed to infer key driver genes important for regulating the state of the module (50,67,68), where they may serve as diagnostic markers, therapeutic targets, or both (see “Drug Repurposing” section).
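The module-to-phenotype step can be sketched as follows. This example assumes the definition commonly used in weighted coexpression analysis, in which the module eigengene is the first principal component of the standardized expression values of the module's genes (a weighted summary rather than a plain sum); here it is obtained by power iteration so that the sketch needs only the standard library. All data are simulated, and the function names are hypothetical.

```python
import math
import random

def standardize(column):
    """Scale one gene's expression values to mean 0 and unit variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def module_eigengene(expr, n_iter=100):
    """expr: one row per sample, one column per gene in the module.

    Returns one eigengene value per sample: the scores on the first
    principal component, found here by power iteration.
    """
    genes = [standardize(col) for col in zip(*expr)]  # genes x samples
    z = list(zip(*genes))                              # samples x genes
    n_genes = len(genes)
    weights = [1.0 / math.sqrt(n_genes)] * n_genes     # start from the module mean
    for _ in range(n_iter):
        scores = [sum(w * v for w, v in zip(weights, row)) for row in z]
        # New weight of each gene is proportional to its covariance with the scores.
        weights = [sum(s * g for s, g in zip(scores, gene)) for gene in genes]
        norm = math.sqrt(sum(w * w for w in weights))
        weights = [w / norm for w in weights]
    return [sum(w * v for w, v in zip(weights, row)) for row in z]

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Simulated module: 60 subjects, 8 genes sharing one latent activity level.
random.seed(0)
latent = [random.gauss(0, 1) for _ in range(60)]
expr = [[v + random.gauss(0, 0.5) for _ in range(8)] for v in latent]
phenotype = [v + random.gauss(0, 0.5) for v in latent]  # stand-in clinical trait
eigengene = module_eigengene(expr)
print(round(pearson(eigengene, phenotype), 2))
```

The resulting correlation is the kind of module-trait association that would then be tested against a permutation-based null before any causal modeling is attempted.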
Applying a systems genetics approach in clinical medicine (46) is certainly not limited to DNA genotype and RNA expression studies (Central Illustration), and can include more than 1 targeted tissue (Figure 3) (69). The more clinical information obtained from the study subjects the better. Genotype and RNA expression data can be linked to clinical images and histology that indicate stages of severity in cancer development (70,71) and diseases of the central nervous system (72). Genome-wide analyses of proteins, metabolites, and lipids may also be considered. Because of their rapid turnover, tissue protein profiles (as opposed to RNA expression) are more variable, and it is now appreciated that protein, metabolite, and lipid analyses are not as well-suited for systems analyses at the genome-wide level in individual human tissues, given the current state of the technology to measure these different dimensions. However, patterns of protein and metabolite expression in plasma may be an exception (73,74), and integrating RNA expression with protein and metabolite profiles can greatly enhance the predictive power of disease gene networks (75).
Taking all of the previously-mentioned arguments into consideration, we advocate a systems biology-type clinical study design with the basic features outlined for STAGE and STARNET, which we have termed genome-wide network studies (GWNS). In addition to DNA and careful clinical phenotypes, this would also include multiple intermediate phenotypes, such as RNA, proteins, metabolites, and screening of molecules that modify the structures of DNA/RNA/proteins (that is, epigenetics) in all tissues relevant for the disease in question. We believe that GWNS will help us to understand the molecular mechanisms underlying genome-wide significant loci identified by GWAS and, eventually, by WES/WGS. However, perhaps more importantly, GWNS will help us to identify the variety and full spectrum of molecular processes driving complex diseases. We think the way forward is to establish network models of these processes, both to uncover the fraction of missing heritability of complex diseases, and also to eventually establish a new paradigm of healthcare on the basis of molecular diagnostics and individually-tailored therapy.
In applying GWNS, or any high-dimensional data analysis approach to complex traits, it is important to apply strict statistical thresholds that correct for multiple testing. For example, false discovery rate control (a form of statistical correction for multiple comparisons) should be used when detecting eQTLs (76) and when assessing gene network associations (66). Of note, as opposed to testing individual genes/SNPs in GWAS, GWNS reduce the problem with multiple testing by 1 to 3 orders of magnitude. Nonetheless, the issue of multiple testing remains relevant for GWNS.
For network associations (e.g., eigengenes) with phenotypes, the family-wise false positive rate is rigorously controlled by empirically estimating the null distribution of those associations via permutation testing and then controlling the false discovery rate by setting an appropriate p value threshold on the basis of that distribution. The discovery of eQTLs is similarly assessed by controlling for the false discovery rate. The network itself is a stable structure in that if we randomize the data so that the correlation among molecular features is destroyed, no credible network structure results (the resulting network is not scale-free, no subnetworks/modules are identified, and so on). Regarding pathway enrichment in subnetworks, their significance in the context of multiple testing is again assessed by empirically estimating the distribution of the enrichments in the context of the network topology, shuffling gene names in the network, but maintaining network topology. On this basis, p value thresholds are set to control the false discovery rate.
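The two ingredients described above, an empirical null distribution from permutations and a threshold that controls the false discovery rate, can be sketched as follows. The text does not state which false discovery rate procedure is used, so the Benjamini-Hochberg step-up rule shown here is an assumption chosen because it is widely used; function names and defaults are likewise illustrative.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Two-sided empirical p value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # break the pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)     # +1 keeps the estimate above zero

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank                # keep the largest rank that passes
    return sorted(order[:cutoff])

# Example: p values for 10 hypothetical module-phenotype associations.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

In practice, `permutation_p_value` would be applied to each eigengene-phenotype pair, and the resulting p values passed together to `benjamini_hochberg`.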
The STAGE and STARNET studies focus on the transcriptome (as opposed to the epigenome/proteome/metabolome) for several reasons. Besides the limited amount of tissue sample and superior technology development for RNA screening, the transcriptome also appears sufficiently stable to capture meaningful variations relating to disease development. However, protein turnover is far more rapid, which introduces additional biological variability such that patient-to-patient comparisons of the proteome are more challenging. Nonetheless, we are strong advocates of integrating all types of “-omics” data in GWN analyses, and have previously shown that integrating proteome data with the analysis of DNA and RNA improves the predictive power of the ensuing networks (67). Therefore, although prone to greater variability, we believe that the assessment of plasma proteins remains particularly important for GWNS of CAD and other complex traits. Systemic integration of plasma proteins with established roles in CVD (Online Table 1) (77) and unbiased mass-spectrometry analysis (78) are currently being performed in plasma from the STARNET study primarily to identify novel markers for risk of clinical CAD events and, potentially, for CAD therapy (79). Similarly, we are working on computational strategies to integrate genome-wide epigenetic measures in GWN analyses (80).
Another pertinent question for GWNS is that of ethnicity. The STAGE and STARNET study participants are predominantly of northern European ancestry, which is similar to that of European Americans (EAs). But what is the relevance of the STARNET GWNS for CAD in African Americans (AAs) and Hispanic Americans (HAs)? CAD risk factors (and thus, with a high degree of certainty, the networks/pathways in which they operate) are generally believed to be similar across ethnicities. This would suggest that the CAD networks to be inferred from the STARNET datasets should also be relevant to CAD in AAs and HAs. The frequent, successful use of animal models, predominantly mouse, to study CAD/atherosclerosis suggests that, at least for the early stages of CAD, many disease pathways are similar, even across species. Importantly, however, even if the main CAD risk factors are operative across most or all ethnicities, this is different from stating that every risk factor is equally important across ethnicities. In fact, the relative importance of CAD risk factors in EAs, AAs, and HAs differs. For example, insulin resistance, hypertension, and obesity are much more prevalent causes of CAD in AAs than in EAs (81). Thus, STARNET should improve our general understanding of causal CAD networks and their key drivers. To decipher the relative roles of these CAD networks in individual ethnicities, 1 strategy is to examine inherited risk profiles of CAD networks for AAs, EAs, and HAs. For this, it is necessary to compute CAD network eQTLs for associations with CAD (risk) using GWA datasets specific to these ethnicities (12,82,83) (see part 2 in “The Role of GWAS in the Era of Systems Genetics”). Nonetheless, additional GWNS are certainly required on the basis of study designs similar to those of STAGE and STARNET, but on non-Caucasians, and preferably across the entire spectrum of complex disorders.
In this review, we suggest that we are on the verge of a new era of discovery of the genetics of CAD and other complex disorders, primarily on the basis of GWNS and GGES. We therefore strongly advocate that additional GWNS should be encouraged by funding bodies, as they hold great promise to decipher complex disease etiologies and represent an alternative route to extract further meaningful information from existing GWA datasets.
The Role of GWAS in the Era of Systems Genetics
Although GWAS, followed by WES and WGS (“GWA datasets”), will remain fundamentally important in the search for the genetic causes of disease, we believe that integrating these datasets in GWNS provides a parallel approach for clinical studies that should help to define additional genetic regulators of CAD and other complex diseases. We anticipate that GWNS may uncover a significant portion of the missing heritability. We suspect that GWA datasets contain untapped information about the heritability of complex diseases and that, in the era of systems genetics, by integrating the analysis of GWNS with GWA datasets, we can prioritize risk variants that fail to reach genome-wide significance. With this perspective, we foresee that GWA datasets will be reutilized in at least the following 3 ways.
1. Reanalysis of GWA datasets based on common risk factors for CAD
Reanalyzing GWA datasets subcategorized on the basis of common risk factors for CAD is a straightforward, but remarkably seldom-used, strategy to identify DNA variants for heritability of complex diseases. For example, a recent study focusing on the chromosome 9 locus for CAD (27) concluded that risk variants identified by GWAS can help explain the risk of CAD in particular subgroups of patients defined by traditional risk factors. To extend such analyses beyond risk variants identified by GWAS, we suggest that GWA datasets be reanalyzed after subjects are sorted into groups with and without given risk factors. The underlying objective is to increase the likelihood of identifying DNA variants of genome-wide significance in CAD cases with a given risk factor (e.g., diabetes, hyperlipidemia, or hypertension). Such a strategy would not merely help to reidentify established CAD risk variants discovered by traditional case-control GWA comparisons, but would also help to identify risk variants previously found to be suggestive or to have nominally-significant associations with CAD (Figure 2). The results might point toward additional molecular mechanisms important for risk assessment and, possibly, for therapies for CAD patients with certain risk factors.
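To illustrate the idea of sorting subjects by a risk factor before testing a variant, the following sketch computes an allelic odds ratio for CAD separately in subjects with and without the factor. It is a simplified, single-variant illustration under stated assumptions: the record fields (`cad`, `diabetes`, `alt_alleles`), the Woolf confidence interval, and all counts are invented for this example and do not come from any GWA dataset.

```python
import math

def allele_counts(subjects):
    """Count alternate and reference alleles among diploid subjects."""
    alt = sum(s["alt_alleles"] for s in subjects)
    return alt, 2 * len(subjects) - alt

def stratified_odds_ratios(subjects, risk_factor):
    """Allelic odds ratio for CAD, computed separately with and without a risk factor."""
    results = {}
    for has_factor in (True, False):
        stratum = [s for s in subjects if s[risk_factor] == has_factor]
        case_alt, case_ref = allele_counts([s for s in stratum if s["cad"]])
        ctrl_alt, ctrl_ref = allele_counts([s for s in stratum if not s["cad"]])
        odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
        # Woolf standard error of log(OR), used for a 95% confidence interval.
        se = math.sqrt(1 / case_alt + 1 / case_ref + 1 / ctrl_alt + 1 / ctrl_ref)
        ci = (math.exp(math.log(odds_ratio) - 1.96 * se),
              math.exp(math.log(odds_ratio) + 1.96 * se))
        results[has_factor] = (odds_ratio, ci)
    return results

def make(n, cad, diabetes, alt_alleles):
    """Build n identical toy subject records."""
    return [{"cad": cad, "diabetes": diabetes, "alt_alleles": alt_alleles}] * n

# Invented cohort in which the variant matters only among subjects with diabetes.
subjects = (
    make(60, True, True, 1) + make(40, True, True, 0)
    + make(30, False, True, 1) + make(70, False, True, 0)
    + make(35, True, False, 1) + make(65, True, False, 0)
    + make(33, False, False, 1) + make(67, False, False, 0)
)
for has_factor, (odds_ratio, ci) in stratified_odds_ratios(subjects, "diabetes").items():
    print(has_factor, round(odds_ratio, 2), tuple(round(v, 2) for v in ci))
```

A variant whose effect is diluted in the pooled comparison can stand out within one stratum, which is the rationale for the reanalysis; the price is a smaller sample per stratum and an additional layer of multiple testing.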
2. Reusing GWA datasets to define inherited risk-enrichment of genes believed to be involved in complex disease development
Another strategy for reusing GWA data is to consider inherited risk-enrichment analysis of groups of genes suspected to be associated with disease (84,85) (Central Illustration). These genes can either be differentially expressed between disease and control samples (22) or active in disease-related modules (7), networks, or pathways (86–88). Regardless of the gene list origin, the principal concept is that DNA variants regulating the genes of interest (eQTLs) should carry increased association for complex disease (i.e., be risk-enriched) if they are causally related to the disease. In brief, SNPs corresponding to the eQTLs, and highly-correlated SNPs in their immediate vicinity (“the experimental set”), are examined for disease association in a relevant GWA dataset. Typically, nominal significance (p < 0.05) is chosen as the threshold for “disease association,” but this may vary between studies. Next, control sets (n > 5,000) containing the same number of SNPs as the experimental set are randomly selected (located on the same chromosome and in areas with similar gene density). The fold risk-enrichment is determined by comparing the number of disease-associated SNPs in the experimental set with the average number of disease-associated SNPs in the control sets. We believe this is an especially promising technique to evaluate a set of genes for their relevance to a given disease. In contrast to gene ontology or pathway gene enrichment analysis, the method is data-driven and unbiased. In addition, a gene set regulated by risk-enriched eQTLs, according to the GWA dataset, is not merely involved in disease; the risk-enrichment also indicates that the genes are causally related to disease, as opposed to being reactively related. This strategy’s power has been demonstrated in studies of type 2 diabetes (89), CAD (59,87,88), and early-onset Alzheimer’s disease (72).
3. Reanalysis of GWA datasets by subcategories defined by the status of disease-driving networks
As we suggested earlier for established macroenvironmental risk factors, the status of disease-driving molecular processes at the microenvironmental level represented by gene networks (defined by their gene connectivity and activity) can be used to assign patients to well-defined subgroups. These subgroups can then be used to reanalyze GWA datasets to identify genome-wide significant risk variants associated with the status of these complex disease processes.
Therapeutic Targeting of Candidate Genes for GWA Loci and Drug Repurposing of Key Drivers in Disease-Driving Molecular Networks
Many research programs are working to understand the biological mechanisms underlying the genome-wide significant risk variants identified by GWAS (90), with the goal of targeting these novel mechanisms therapeutically. Although we believe this approach will primarily target early CAD processes, it may also point toward pathways that can be modulated to affect the disease in its later stages. For instance, targeting PCSK9 to improve plasma cholesterol lowering may indeed help reduce risk in patients with more advanced forms of CAD (91).
Nevertheless, we believe that targeting key drivers of disease networks will be an equally, or even more, successful strategy. For this purpose, we developed an integrated informatics approach to systematically screen genome-wide transcriptional signatures of drug perturbation (treated vs. untreated) from the public domain (e.g., Connectivity Map) against genome-wide transcriptional signatures of disease states (on up to networks) to identify repurposing candidates (93,94). Drug screening approaches typically require knowledge of drug target profiles or mechanisms of action, which is often lacking. In contrast, our approach requires no knowledge of the mechanism of action and can consider the system-wide properties of drug-induced molecular perturbations (e.g., genome-wide transcriptional changes) to enable discovery of novel connections between drugs and disease states through a data-driven approach (94). This approach also allows for rational repurposing of multitarget compounds (polypharmacology) that might exhibit therapeutic effects against a complex disease like CAD through modulation of multiple network driver nodes (90).
The broad biological relevance of this approach is supported by experimental validation of several novel drug indications identified by a computational drug-repurposing pipeline. This system was recently used to transcriptionally profile intestinal samples obtained from inflammatory bowel disease (IBD) patients to experimentally validate a novel IBD indication predicted for the anticonvulsant agent, topiramate (94). Topiramate has no history of efficacious use for IBD or other inflammatory diseases. There are no established therapeutic targets for IBD or other inflammatory diseases among the canonical targets of topiramate, which enhances GABA-A receptor activity, antagonizes AMPA/kainate glutamate receptor subtypes, and weakly inhibits carbonic anhydrase isozymes II and IV (93). In a rodent model of IBD induced with 2,4,6-trinitrobenzenesulfonic acid, topiramate significantly reduced the severity of IBD, as judged by gross pathophysiological and histopathological measures (94).
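The core idea of screening drug signatures against a disease signature can be illustrated with a deliberately simple score. The sketch below is not the pipeline cited above, whose scoring method is not specified in this text; it substitutes a Spearman rank correlation between the disease signature and each drug signature, so that a strongly negative value flags a drug whose transcriptional effect opposes the disease state. Gene and drug names and all values are invented.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

def rank_repurposing_candidates(disease_signature, drug_signatures):
    """Sort drugs from most disease-reversing (negative score) to most disease-mimicking."""
    genes = sorted(disease_signature)
    disease = [disease_signature[g] for g in genes]
    scores = {
        drug: spearman(disease, [signature[g] for g in genes])
        for drug, signature in drug_signatures.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Invented log fold changes (disease vs. control, treated vs. untreated).
disease = {"G1": 2.1, "G2": 1.4, "G3": 0.3, "G4": -0.2, "G5": -1.1, "G6": -1.9}
drugs = {
    "drug_A": {g: -v for g, v in disease.items()},  # reverses the disease pattern
    "drug_B": {"G1": 1.8, "G2": 1.0, "G3": 0.2, "G4": -0.1, "G5": -0.9, "G6": -1.5},
    "drug_C": {"G1": 0.1, "G2": -0.8, "G3": 0.9, "G4": -0.3, "G5": 0.5, "G6": -0.1},
}
for drug, score in rank_repurposing_candidates(disease, drugs):
    print(drug, round(score, 2))
```

Because the score uses only expression signatures, it needs no knowledge of a drug's targets, which is the property emphasized in the text; restricting the gene set to key drivers of a disease network would be one way to focus the comparison.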
In a separate study (95), this computational drug-repurposing approach was applied to transcriptional profiles of tumor versus adjacent normal tissue to identify novel drug-repurposing candidates for small-cell lung cancer (SCLC). Several pharmacologically-diverse compounds were identified as novel drug-repurposing candidates for SCLC, including: the tricyclic antidepressant, imipramine; the calcium-channel blocker, bepridil; and the phenothiazine antihistamine, promethazine. Anti-SCLC or other antineoplastic effects were not previously established for these drugs or for other drugs of the same pharmacological class. The anti-SCLC activity of these compounds was validated in experiments in human and animal model systems in vitro and in vivo (95).
In summary, given the shortage of new drugs reaching the market for CAD and many other complex diseases, drug repurposing, using a systems genetics approach to define new agents and novel indications for existing therapies, will be an essential path toward personalized and preventive drug therapies.
Summary and Future Directions
To answer our own question posed in the title of this paper, there is no doubt that GWAS are important in providing datasets to reveal risk variants that explain the heritability of complex diseases. In particular, genome-wide significant loci point to potentially important genes and molecular mechanisms responsible for the pathobiology of these diseases. However, we believe that traditional analyses of GWA datasets overlook the potential role of context-dependent risk variants that exert risk only when certain environmental influences are operative. These influences typically arise over a shorter period and, increasingly, at later stages of disease development. We believe these context-dependent risk variants can have a large effect on key disease processes active during limited windows of complex disease development. As such, they are unlikely to reach genome-wide significance in traditional analyses of GWA datasets, but are likely to be detected as suggestive or nominally-significant risk variants. If this is the case, much information on the heritability of complex disease remains hidden in GWA datasets. We believe that, by applying GWNS to identify the disease-driving molecular processes reflected in molecular networks and their genetic regulators, this information can be revealed. We propose that these disease-driving gene networks can distinguish true- from false-positive risk variants with nominal disease associations in GWAS. As more network biology underlying complex disorders is revealed, the relevant activity and type of disease network can be used as a diagnostic marker to subcategorize patients into groups requiring different therapeutic measures.
To reach this goal, we must enter a post-GWAS era, in which priority is given to clinical studies that include intermediate phenotypes (with the general design described earlier for GWNS, but also considering other genome-wide measures besides RNA) and where the screening of patients ranges from gross disease phenotypes (i.e., CAD), to histological and image-based patient characteristics, and ultimately to clinical outcomes. Such studies will be essential for solving the puzzle of interconnected molecules (e.g., genes) in disease-driving networks, perhaps allowing fulfillment of the long-sought goal of the “genomic revolution”: the preventive and individual care of patients in molecularly-defined subcategories.
For a supplemental table, please see the online version of this article.
Drs. Björkegren, Kovacic, and Schadt are supported by the American Heart Association (14SFRN20490315; 14SFRN20840000) and are members of the Consortium “CAD Genomics” (Drs. Björkegren and Schadt) and the Consortium “Cellular and Molecular Targets to Promote Therapeutic Cardiac Regeneration” (Dr. Kovacic), both of which are funded by the Leducq Foundation (Transatlantic Network of Excellence Awards). Dr. Björkegren is also supported by the Swedish Heart-Lung Foundation, the Swedish Research Council, the University of Tartu (SP1GVARENG), the Estonian Research Council, and by a grant from AstraZeneca Translational Science Centre-Karolinska Institutet (joint research program in translational science); and is the founder, a main shareholder, and chairman of the board of Clinical Gene Networks AB (CGN), which has invested interests in the STARNET and STAGE cohorts. Dr. Kovacic is also supported by the National Institutes of Health (K08HL111330) and by a research grant from AstraZeneca. Dr. Dudley is supported in part by funding from the National Institutes of Health (R01 DK098242 and U54 CA189201); and by the PhRMA Foundation. Dr. Schadt is a CGN board member and shareholder. Robert Roberts, MD, served as Guest Editor for this paper.
- Abbreviations and Acronyms
- CAD = coronary artery disease
- eQTL = expression quantitative trait locus
- GGES = genetics of gene expression studies
- GWA = genome-wide association
- GWAS = genome-wide association studies
- GWNS = genome-wide network studies
- LDL = low-density lipoprotein
- SNP = single-nucleotide polymorphism
- WES/WGS = whole-exome/whole-genome sequencing
- American College of Cardiology Foundation
- Cortijo S., Wardenaar R., Colome-Tatche M., et al.
- Felsenfeld G.
- Friend S.H., Schadt E.E.
- Hägg S., Skogsberg J., Lundström J., et al.
- Roberts R.
- Helgadottir A., Thorleifsson G., Manolescu A., et al.
- McPherson R., Pertsemlidis A., Kavaslar N., et al.
- Hindorff L.A., MacArthur J., Morales J., et al. A catalog of published genome-wide association studies. Available at: http://www.genome.gov/gwastudies. Accessed December 21, 2014.
- Schadt E.E., Bjorkegren J.L.
- Allaby M.
- Narula J., Kovacic J.C.
- Kubo T., Maehara A., Mintz G.S., et al.
- Bentzon J.F., Otsuka F., Virmani R., et al.
- Ogihara T., Saruta T., Rakugi H., et al., for the Valsartan in Elderly Isolated Systolic Hypertension Study Group
- Libby P., Ridker P.M., Hansson G.K., Leducq Transatlantic Network on Atherothrombosis
- Lusis A.J., Weiss J.N.
- Weiss J.N., Karma A., MacLellan W.R., et al.
- Zhang B., Horvath S.
- Kang H.P., Morgan A.A., Chen R., et al.
- Foroughi Asl H., Talukdar A.H., Kindt A.S., et al.
- Yang X., Zhang B., Molony C., et al.
- Lin B., White J.T., Lu W., et al.
- Chan J.C., Piper D.E., Cao Q., et al.
- Shang M.M., Talukdar H.A., Hofmann J.J., et al.
- Rashid S., Curtis D.E., Garuti R., et al.
- Lamb J., Crawford E.D., Peck D., et al.
- Knox C., Law V., Jewison T., et al.
- Dudley J.T., Sirota M., Shenoy M., et al.
- Jahchan N.S., Dudley J.T., Mazur P.K., et al.