Segmentation of the S. cerevisiae genome based on sparse genomic, expression and epigenetic data

Georgoulopoulos, Michail


Title in other language (translated from Greek) Methods for the segmentation of eukaryotic genomes based on sparse genomic, expression (RNASeq) and epigenetic data

Entity type MSc thesis
Author Georgoulopoulos, Michail
Department Bioinformatics and Neuroinformatics (ΒΝΠ)
Date of work 31 July 2021 [2021-07-31]
Work language English
Number of Pages 25
Supervisor Nikolaou, Christoforos
Committee members Exarchos, Themis | Vlamos, Panagiotis
Keywords Saccharomyces | cerevisiae | genome | segmentation | Hi-C
Number of Annexes 2
Number of Greek bibliographic references 3
Number of international bibliographic references 15
Description
Figure 1: Gene duplication signal mapped to three-dimensional space.
Figure 2: Resampling of the Duan et al. model by linear interpolation. The center base pair (C) of each gene is found between two successive control points (A and B) of the source model. A weighted average of these two points is assigned as the position of the gene in space.
Figure 3: Sphere-test data flow.
Figure 4: Distribution of histone-profile average in-group Euclidean distance (top) and distribution of histone profile class Shannon entropy (bottom) of local gene sets (clusters and segments, respectively), in contrast to the distribution of the same quantity in random samples of the same gene count. Both are characterized by a shift towards smaller values, which signifies a tendency of similar histone modification profiles to co-localize in space.
Figure 5: Within-cluster variance as a function of cluster count. No abrupt drop in variance can be found, which makes it impractical to select a cluster count via the “elbow” method. We resort to x-means, which optimizes the Bayesian information criterion. Under this criterion, a model of 7 clusters best explains the data.
Figure 6: Rough features of the S. cerevisiae genome.
Figure 7: Continent-inspired compartmentalization of the yeast genome. The primary distinction is between areas of high (land) and low (sea) coexpression scores.
Figure 8: Box plot of the average coexpression score between all ordered gene pairs of each continent compartment. The low coexpression score compartment (Tethys) is below average (Random, which is the global average of all possible gene pairs), while “land” compartments (Laurasia, Gondwana, Antarctica) are above average.
Figure 9: g:Profiler ontology enrichment results for the Tethys compartment.
Figure 10: Histone profile clusters. Collections of genes whose variance in histone modification profile is lower than average (adjusted p-value less than 1%). Each of these clusters has a unique histone modification profile.
Figure 11: Heatmap visualization and hierarchical clustering of shifts in histone variant abundance for each cluster. Over-representation of a particular histone variant is colored in green, under-representation in red. Notice the gap between over- and under-representation. All locally similar clusters are shifted in all 9 dimensions with a quantized (step-wise) shift.
Figure 12: g:Profiler ontology results for Motif cluster A.
Figure 13: Heatmap visualization of histone profile space Euclidean distance between all pairs of genes in chromosome III. Small communities of very similar histone profiles can be seen along the diagonal, as black rectangles.
Figure 14: Heatmap visualization and hierarchical clustering of histone variant abundance for each of the 7 histone profile classes. Differences from average are more pronounced in the case of communities (in contrast to histone profile clusters), which allows us to plot the original signal instead of the shifts from average.
Figure 16: Gene pair distance distribution. Comparison between gene pairs picked from the histone class entropy-filtered segments and randomly picked gene pairs.
Table 1: Signals used and respective statistics computed.
Table 2: Statistical significance of the difference in coexpression between each of the compartments and the entire genome. A combination of all high-coexpression compartments, named Pangaea, maintains statistical significance against the whole genome.
Table 3: Enrichment in “whole genome” (WGD) and “small scale” (SSD) duplicated genes in each of the continent compartments.
Table 4: Overlap of histone profile clusters with the continent compartments. Clusters of local similarity regarding histone modification profiles have a strong preference towards “land” compartments, which in turn are characterized by increased coexpression and enrichment in WGD genes.
Table 5: Enriched gene ontology terms found in histone profile clusters.
Table 6: Enrichments in SSD and WGD per histone profile cluster.
Table 7: Gene ontology enrichments per motif cluster.
Table 8: Gene ontology enrichments per histone profile community.
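The resampling step in the Figure 2 caption (placing a gene's center base pair C between two successive control points A and B) can be sketched as follows. This is a minimal illustration rather than the thesis code: the function name, the argument layout, and the assumption that control points are given as at least two strictly increasing base-pair positions with matching 3D coordinates are all hypothetical.

```python
from bisect import bisect_right


def interpolate_gene_position(control_bp, control_xyz, gene_center_bp):
    """Place a gene in 3D by linear interpolation between the two control
    points (A and B) that bracket its center base pair (C).

    control_bp  : strictly increasing base-pair positions (assumed layout)
    control_xyz : one (x, y, z) tuple per entry of control_bp
    """
    if not control_bp[0] <= gene_center_bp <= control_bp[-1]:
        raise ValueError("gene center lies outside the modelled range")
    # Index of the first control point strictly to the right of C,
    # clamped so that C == last control point still has a valid bracket.
    right = min(bisect_right(control_bp, gene_center_bp), len(control_bp) - 1)
    left = right - 1
    a_bp, b_bp = control_bp[left], control_bp[right]
    # Weight of B grows linearly as C moves from A towards B.
    t = (gene_center_bp - a_bp) / (b_bp - a_bp)
    a, b = control_xyz[left], control_xyz[right]
    return tuple((1 - t) * pa + t * pb for pa, pb in zip(a, b))


# Toy model with three control points; the gene center sits halfway
# between the second and third one.
position = interpolate_gene_position(
    [0, 1000, 2000],
    [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 20.0, 0.0)],
    1500,
)
```

With these toy inputs the gene lands at the midpoint of the segment between the second and third control points.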
Abstract
- (Translated from Greek) In this work we investigate the yeast genome in order to identify regions of local similarity with respect to certain characteristics. We use characteristics (collectively called “signals”) such as histone modification profiles, transcription factor binding motifs, gene coexpression scores and conservation. These signals are sparsely sampled over the genome, at a resolution of one sample per gene. We consider signal locality both on the linear chromosome and in three-dimensional space, and uncover a number of localized sets of similar genes. The gene coexpression signal yields a complete compartmentalization of the three-dimensional genome. The largest of these compartments is of particular interest; it is characterized by lower than average coexpression and is enriched in genes related to metabolic pathways. Complementing this natural compartmentalization, we define a second, artificial one, based on the particular geometry of the yeast genome. Certain histone modification and transcription factor profiles were found to be localized in compact regions of three-dimensional space. We evaluate the corresponding gene sets with respect to enrichment in duplicated (SSD and WGD) genes, gene ontology enrichment, and their topology relative to the aforementioned compartmentalization schemes. We further examine the histone profile signal in finer detail, considering small groups of successive genes. The tendency of local convergence is also prevalent at this scale. Many groups of genes with strong histone profile affinity are scattered throughout the genome. We use their specific histone profiles to group all genes into histone profile classes and then extract a number of segments with low entropy in these classes.
In conclusion, the yeast genome appears to abound in regions that exhibit faint, yet statistically significant, similarity. Furthermore, locally similar gene sets extracted from one signal are often linked to sets extracted from another. It is not uncommon to find that one is a subset of the other. These results suggest the possibility of an underlying structure. Further study, and the investigation of different types of signals, is therefore encouraged. To this end, we propose a general-purpose algorithm that performs such searches. The algorithm makes no assumption about the distribution of the underlying signal and requires minimal configuration to adapt to a different kind of signal, thus facilitating future searches.
- In this work, we explore the budding yeast genome with the intention of identifying areas of local similarity with respect to certain characteristics. We use characteristics (collectively referred to as “signals”) such as histone modification profiles, transcription factor binding motifs, gene coexpression patterns and sequence conservation. These signals are sparsely sampled over the genome, at a resolution of one sample per gene. We consider signal locality in both one- and three-dimensional space and uncover a number of locally similar gene sets. The pairwise gene coexpression score signal yields a complete compartmentalization of the three-dimensional genome. The largest compartment is of particular interest; it is characterized by a lower than average coexpression score and is enriched in genes related to metabolic pathways. In addition to this natural compartmentalization, we define a second, artificial one, based on the particular geometry of the yeast genome. Certain histone modification and transcription factor motif profiles are found to be localized in compact areas in three-dimensional space. For histone modification profiles in particular, this tendency of local convergence is also prevalent at the finer scale of a few successive genes. Numerous small gene communities of strong histone profile affinity are scattered throughout the genome, generally having one of a few distinct histone modification profiles. We evaluate the corresponding gene sets with respect to enrichment in gene duplicates (separately for small-scale and whole-genome duplicates), ontology enrichment and their topology in terms of the aforementioned compartmentalization schemes. In conclusion, the yeast genome seems to be organized in extended regions of subtle, yet significant, similarity.
Furthermore, this organization spans several levels, including regulation, epigenetic state and conservation, as locally similar sets of genes extracted from one signal are frequently associated with sets extracted from another. These results hint at the possibility of an overarching, underlying genome structure. Further study, utilizing different types of signals, is therefore encouraged. To this end, we propose a general-purpose algorithm that carries out such locality searches. The algorithm makes no assumption about the underlying signal distribution and requires only minimal configuration to adapt to a different kind of signal, thus facilitating future searches.
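One way to test a gene set for local similarity without assuming a signal distribution is to compare its statistic with that of random gene samples of the same size, as the Figure 4 caption describes for average in-group Euclidean distance. The sketch below illustrates that idea only; it is not the algorithm proposed in the thesis, and the function names, the add-one p-value estimate and the toy data are assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def mean_pairwise_distance(profiles):
    """Average Euclidean distance over all unordered pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def local_similarity_p_value(all_profiles, group_indices, n_samples=1000, seed=0):
    """Empirical p-value that a random gene set of the same size is at
    least as tight (in profile space) as the candidate group.

    Needs at least two genes in the group. No distribution is assumed:
    the null is built by resampling genes from the whole genome.
    """
    rng = random.Random(seed)
    observed = mean_pairwise_distance([all_profiles[i] for i in group_indices])
    size = len(group_indices)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(all_profiles, size)
        if mean_pairwise_distance(sample) <= observed:
            hits += 1
    # Add-one estimate keeps the p-value strictly positive.
    return (hits + 1) / (n_samples + 1)


# Toy data: five nearly identical profiles followed by forty spread-out ones.
profiles = [(i * 0.001, 0.0) for i in range(5)]
profiles += [(float(i * 10), float(i * 7 % 13)) for i in range(5, 45)]
p_tight = local_similarity_p_value(profiles, [0, 1, 2, 3, 4])
```

Adapting the sketch to another signal would mean swapping `mean_pairwise_distance` for a statistic suited to that signal (for example, Shannon entropy over profile classes), while the resampling loop stays the same.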
Licence Attribution 4.0 International (CC BY 4.0)

Segmentation of the S. cerevisiae genome based on sparse genomic, expression and epigenetic data - Identifier: 170951
