(Pro-)Seminar:
Comparison and Clustering of Biological Molecules

Description

In bioinformatics, the comparison and clustering of biological molecules are essential methods for understanding complex biological systems. They help reveal evolutionary relationships, functional similarities, and structural motifs across diverse organisms and biomolecules. Through sequence alignment, structural superposition, and clustering techniques, researchers can discern patterns, similarities, and differences within vast datasets of DNA, RNA, proteins, and small molecules. Such analyses not only aid in understanding the fundamental mechanisms governing cellular processes but also support advances in drug discovery, protein engineering, and personalized medicine. Moreover, by organizing biological data into meaningful clusters based on shared characteristics, these methods simplify data interpretation and help researchers extract insights from molecular data.

In this seminar, we will look at two groups of tools. The first group computes clusters directly while the second group computes pairwise inter-sample similarities or distances that can then be used by classical clustering algorithms such as spectral clustering or agglomerative clustering. The goal for this seminar is to equip participants with a comprehensive understanding of the principles, methodologies, and practical applications of comparing and clustering biological molecules using cutting-edge tools such as FoldSeek, MASH, and CD-HIT. By the end of the seminar, participants should be able to proficiently utilize these tools to perform sequence alignment, structural comparison, and clustering analyses on various biological datasets. Moreover, they should gain insights into how these analyses contribute to advancements in genomics, proteomics, and drug discovery.
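
To illustrate the second group of tools, here is a minimal sketch of how a precomputed pairwise distance matrix can be fed into an agglomerative clustering step. It is a toy single-linkage implementation in plain Python, not the algorithm of any seminar tool; the function name `single_linkage`, the `threshold` parameter, and the example distances are assumptions made for illustration only.

```python
def single_linkage(dist, threshold):
    """Agglomerative single-linkage clustering on a distance matrix.

    Repeatedly merges the two closest clusters until the closest
    pair is further apart than `threshold`.
    """
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] |= clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical distances between four samples (symmetric, zero diagonal),
# as a comparison tool might report them.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
clusters = single_linkage(dist, threshold=0.3)
print(clusters)  # [[0, 1], [2, 3]]
```

In practice one would typically use an existing library implementation of agglomerative or spectral clustering rather than this cubic-time sketch.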

Requirements

This (pro-)seminar has no formal requirements.

Team

Prof. Dr. Olga Kalinina

M.Sc. Roman Joeres

M.Sc. Guangyi Chen

M.Sc. Amay Agrawal

What Do You Need To Do In The Seminar?

Read and present one of the seminar papers (as assigned in the kick-off meeting).
Bachelor students in their proseminar have to
- give a 20-minute presentation followed by a 10-minute discussion and
- write a 5-page report about their method
Students in their seminar have to
- give a 30-minute presentation followed by a 10-minute discussion and
- write a 7-page report about their method
You need to attend all talks and participate in discussions.

How Are The Grades Computed?

40% for the presentation and the following discussion

40% for the report. The report should include
- A short explanation of how the algorithms in your paper work
- Its (dis-)advantages over other tools presented in the seminar
- Your own thoughts on this method

20% for your participation in the discussion of other presentations. Therefore: Ask questions!

Plagiarism

We will check every submission for plagiarism with TurnItIn, an online tool that automatically checks submissions for plagiarism. You are free (and encouraged) to use it before submitting your final report. Following the link above, you can log in with your UdS credentials (the same ones you use for your student email) and use TurnItIn for free. By attending this seminar, you agree that we may upload your report to TurnItIn.
If we detect plagiarism in your work, you will have the chance to explain yourself. Ultimately, you will fail this seminar if your explanation is not convincing.

Registration

Please register for this seminar by writing an email to Roman Joeres before 19.04.2024 23:59. Please also attach your transcript of records, which can be downloaded from LSF/HISPOS. We will distribute the topics among students in the mandatory kick-off meeting on 23.04.2024 at 12:00 in E2.1 SR 007.

Other Organizational Things

The seminar will be held as a block seminar at the end of September or the beginning of October 2024.
Bachelor students in their proseminar will earn 5 CP.
Students doing this as a seminar will earn 7 CP.
The number of participants is limited to a maximum of 12.
We guarantee 4 proseminar seats and 8 seminar seats. If one quota is not reached, the seats will be filled from the other group.
The (pro-)seminar language is English.
Bioinformatics students and students with no prior (pro-)seminar will be preferred.
Registration in LSF/HISPOS will open after the kick-off meeting on 23.04.2024.
In case of questions, feel free to contact one of the team members listed above; non-professors usually respond faster ;-).

Important Dates

19.04.2024 17:59 - Registration deadline
23.04.2024 12:00 - Kick-Off Meeting in room 007 in E2.1 (CBI building)
27.09.2024 23:59 - Submission deadline for the final draft of the slide set.
08./09.10.2024 - Presentation days
25.10.2024 23:59 - Submission of report

Topics

DIAMOND
Anastasia Lesnikov | Supervisor: Roman Joeres
Comparison of amino-acid sequences | DOI | GitHub
MMseqs2
Johanna Becher | Supervisor: Amay Agrawal
Comparison & Clustering of amino-acid sequences | DOI | GitHub
TM-align
Johanna Straub | Supervisor: Guangyi Chen
Comparison of protein structures | DOI | GitHub
CD-HIT - Proseminar
Maximilian Bähr | Supervisor: Amay Agrawal
Clustering of amino-acid sequences & DNA/RNA | DOI | GitHub
FoldSeek
Zyad Ahmed | Supervisor: Roman Joeres
Comparison of protein structures | DOI | GitHub
Weisfeiler-Lehman Graph Kernels - Proseminar
Varvara Kotelnikova | Supervisor: Roman Joeres
Comparison of protein structures & chemical molecules | DOI | GitHub
MCES (Maximum Common Edge Subgraph) - Proseminar
Katja Räde | Supervisor: Roman Joeres
Comparison of protein structures & chemical molecules | DOI (we're primarily interested in the myopic MCES, but the whole paper is interesting and worth reading) | GitHub
MashMap
Pranjali Jain | Supervisor: Guangyi Chen
Comparison of long DNA strands | DOI | GitHub
MCL
Misbah Sayeeda Musheer | Supervisor: Amay Agrawal
Clustering of genomes & amino-acid sequences | DOI | GitHub
MASH
Max Asenow | Supervisor: Amay Agrawal
Comparison of genomes | DOI | GitHub
GClust
Saransh Shiva Nair | Supervisor: Guangyi Chen
Clustering of genomes | DOI | GitHub
SANS
Vishak Kadamalithaya | Supervisor: Guangyi Chen
Clustering of genomes & amino-acid sequences | DOI (algorithm update), DOI (main paper) | GitHub

Comparison means a tool computes a matrix of pairwise distances or similarities.
Clustering means a tool computes clusters of samples without reporting how similar or distant the samples are.
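
The difference between the two output types can be sketched with a toy example. The code below is an illustration under simplifying assumptions, not the algorithm of any tool listed above: `compare` returns a full matrix of pairwise k-mer Jaccard distances, while `cluster` greedily assigns each sequence to the first representative within `max_dist` and returns only cluster labels. All function names, the parameters `k` and `max_dist`, and the example sequences are hypothetical, and every sequence is assumed to be at least `k` characters long.

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """1 - |A ∩ B| / |A ∪ B| over the k-mer sets of a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)


def compare(seqs, k=3):
    """Comparison: return the full pairwise distance matrix."""
    return [[jaccard_distance(a, b, k) for b in seqs] for a in seqs]


def cluster(seqs, max_dist=0.5, k=3):
    """Clustering: return one cluster label per sequence, no distances."""
    representatives, labels = [], []
    for s in seqs:
        for idx, rep in enumerate(representatives):
            if jaccard_distance(s, rep, k) <= max_dist:
                labels.append(idx)
                break
        else:
            # No representative is close enough: s starts a new cluster.
            representatives.append(s)
            labels.append(len(representatives) - 1)
    return labels


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]  # made-up example sequences
matrix = compare(seqs)
labels = cluster(seqs)
print(matrix)
print(labels)  # [0, 0, 1]
```

The matrix can be passed on to any classical clustering algorithm, whereas the labels are a final result that no longer says how far apart two samples are.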

Schedule

Tuesday, Oct. 8
09:00 - Welcome & Opening words
09:05 - TM-align (Johanna Straub)
09:50 - CD-HIT (Maximilian Bähr, Proseminar)
10:25 - MMseqs2 (Johanna Becher)
11:10 - Break
11:15 - FoldSeek (Zyad Ahmed)
12:00 - Weisfeiler-Lehman Graph Kernels (Varvara Kotelnikova, Proseminar)
12:35 - MCES (Katja Räde, Proseminar)
13:05 - projected end

Wednesday, Oct. 9
09:00 - DIAMOND (Anastasia Lesnikov)
09:45 - MashMap (Pranjali Jain)
10:30 - MCL (Misbah Sayeeda Musheer)
11:15 - Break
11:20 - MASH (Max Asenow)
12:05 - GClust (Saransh Shiva Nair)
12:50 - SANS (Vishak Kadamalithaya)
13:30 - projected end
