(Pro-)Seminar:
Comparison and Clustering of Biological Molecules

Description

In bioinformatics, comparing and clustering biological molecules are essential methods for studying biological systems. They help reveal evolutionary relationships, functional similarities, and structural motifs across organisms and biomolecules. Using sequence alignment, structural superposition, and clustering techniques, researchers can identify patterns, similarities, and differences in large datasets of DNA, RNA, proteins, and small molecules. Such analyses aid in understanding the mechanisms that govern cellular processes and support advances in drug discovery, protein engineering, and personalized medicine. Moreover, organizing biological data into clusters based on shared characteristics simplifies interpretation and helps researchers draw conclusions from large datasets.

In this seminar, we will look at two groups of tools. The first group computes clusters directly, while the second group computes pairwise inter-sample similarities or distances that can then be used by classical clustering algorithms such as spectral clustering or agglomerative clustering. The goal of this seminar is to give participants a solid understanding of the principles, methods, and practical applications of comparing and clustering biological molecules with current tools such as FoldSeek, MASH, and CD-HIT. By the end of the seminar, participants should be able to use these tools to perform sequence alignment, structural comparison, and clustering analyses on various biological datasets, and should understand how these analyses contribute to advances in genomics, proteomics, and drug discovery.
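To make the second group of tools more concrete, the following minimal sketch shows how a matrix of pairwise distances could be turned into clusters by agglomerative clustering with average linkage. The sample names, the distance values, and the threshold are hypothetical and chosen only for illustration; they do not come from any of the tools covered in this seminar. In practice, one would probably rely on an established library implementation instead of this hand-written version.

```python
from itertools import combinations


def agglomerative_clusters(dist, threshold):
    """Average-linkage agglomerative clustering on a square distance matrix.

    Repeatedly merges the two closest clusters and stops once the smallest
    average inter-cluster distance exceeds `threshold`.
    Returns the clusters as sorted lists of sample indices.
    """
    # Start with every sample in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def avg_distance(a, b):
        # Mean of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_distance(clusters[p[0]], clusters[p[1]]),
        )
        if avg_distance(clusters[x], clusters[y]) > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical pairwise distances between five samples (0 = identical).
names = ["seqA", "seqB", "seqC", "seqD", "seqE"]
distances = [
    [0.00, 0.10, 0.15, 0.80, 0.90],
    [0.10, 0.00, 0.12, 0.85, 0.88],
    [0.15, 0.12, 0.00, 0.82, 0.91],
    [0.80, 0.85, 0.82, 0.00, 0.20],
    [0.90, 0.88, 0.91, 0.20, 0.00],
]

clusters = agglomerative_clusters(distances, threshold=0.5)
for members in clusters:
    print([names[i] for i in members])
```

With these made-up values, the first three samples end up in one cluster and the last two in another. The threshold controls the granularity: a lower value yields more, smaller clusters, and a higher value yields fewer, larger ones.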

Requirements

This (pro-)seminar has no formal requirements.

What Do You Need To Do In The Seminar?

How Are The Grades Computed?

Plagiarism

We will check every submission for plagiarism with TurnItIn, an online tool that automatically checks submissions for plagiarism. You are free (and encouraged) to use it before submitting your final report. Following the link above, you can log in with your UdS credentials (the same ones you use for your student email) and use TurnItIn for free. By attending this seminar, you agree that we upload your report to TurnItIn.
If we detect plagiarism in your work, you will have the chance to explain yourself. Ultimately, you will fail this seminar if your explanation is not convincing.

Registration

Please register for this seminar by writing an email to Roman Joeres before 19.04.2024 23:59. Please also attach your transcript of records, which can be downloaded from LSF/HISPOS. We will distribute the topics among students in the mandatory kickoff meeting on 23.04.2024 at 12 PM in E2.1 SR 007.

Other Organizational Things

Important Dates

Topics

  1. DIAMOND
    Anastasia Lesnikov | Supervisor: Roman Joeres
    Comparison of amino-acid sequences | DOI | GitHub
  2. MMseqs2
    Johanna Becher | Supervisor: Amay Agrawal
    Comparison & Clustering of amino-acid sequences | DOI | GitHub
  3. TM-align
    Johanna Straub | Supervisor: Guangyi Chen
    Comparison of protein structures | DOI | GitHub
  4. CD-HIT - Proseminar
    Maximilian Bähr | Supervisor: Amay Agrawal
    Clustering of amino-acid sequences & DNA/RNA | DOI | GitHub
  5. FoldSeek
    Zyad Ahmed | Supervisor: Roman Joeres
    Comparison of protein structures | DOI | GitHub
  6. Weisfeiler-Lehman Graph Kernels - Proseminar
    Varvara Kotelnikova | Supervisor: Roman Joeres
    Comparison of protein structures & chemical molecules | DOI | GitHub
  7. MCES (Maximum Common Edge Subgraph) - Proseminar
    Katja Räde | Supervisor: Roman Joeres
    Comparison of protein structures & chemical molecules | DOI (we're primarily interested in the myopic MCES, but the whole paper is interesting and worth reading) | GitHub
  8. MashMap
    Pranjali Jain | Supervisor: Guangyi Chen
    Comparison of long DNA strands | DOI | GitHub
  9. MCL
    Misbah Sayeeda Musheer | Supervisor: Amay Agrawal
    Clustering of genomes & amino acid sequences | DOI | GitHub
  10. MASH
    Max Asenow | Supervisor: Amay Agrawal
    Comparison of genomes | DOI | GitHub
  11. GClust
    Saransh Shiva Nair | Supervisor: Guangyi Chen
    Clustering of genomes | DOI | GitHub
  12. SANS
    Vishak Kadamalithaya | Supervisor: Guangyi Chen
    Clustering of genomes & amino acid sequences | DOI (algorithm update), DOI (main paper) | GitHub

Comparison means a tool computes a matrix of pairwise distances or similarities.
Clustering means a tool computes clusters of samples without reporting how similar or distant the samples are.

Schedule

Tuesday, Oct. 8
09:00 - Welcome & Opening words
09:05 - TM-align (Johanna Straub)
09:50 - CD-HIT (Maximilian Bähr, Proseminar)
10:25 - MMseqs2 (Johanna Becher)
11:10 - Break
11:15 - FoldSeek (Zyad Ahmed)
12:00 - Weisfeiler-Lehman Graph Kernels (Varvara Kotelnikova, Proseminar)
12:35 - MCES (Katja Räde, Proseminar)

13:05 - projected end

Wednesday, Oct. 9
09:00 - DIAMOND (Anastasia Lesnikov)
09:45 - MashMap (Pranjali Jain)
10:30 - MCL (Misbah Sayeeda Musheer)
11:15 - Break
11:20 - MASH (Max Asenow)
12:05 - GClust (Saransh Shiva Nair)
12:50 - SANS (Vishak Kadamalithaya)

13:30 - projected end