Description

In bioinformatics, comparing and clustering biological molecules is essential for understanding biological systems. These methods help reveal evolutionary relationships, functional similarities, and structural motifs across organisms and biomolecules. Through sequence alignment, structural superposition, and clustering, researchers can identify patterns, similarities, and differences in large datasets of DNA, RNA, proteins, and small molecules. Such analyses aid in understanding the mechanisms that govern cellular processes and support advances in drug discovery, protein engineering, and personalized medicine. Organizing biological data into clusters based on shared characteristics also simplifies interpretation and helps researchers draw well-founded conclusions.

In this seminar, we will look at two groups of tools. The first group computes clusters directly, while the second group computes pairwise inter-sample similarities or distances that classical clustering algorithms, such as spectral clustering or agglomerative clustering, can then use. The goal of this seminar is to give participants a thorough understanding of the principles, methods, and practical applications of comparing and clustering biological molecules with tools such as FoldSeek, MASH, and CD-HIT. By the end of the seminar, participants should be able to use these tools to perform sequence alignment, structural comparison, and clustering analyses on various biological datasets, and should understand how these analyses contribute to advances in genomics, proteomics, and drug discovery.
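To make the second group of tools more concrete, the following is a minimal sketch of how a precomputed pairwise distance matrix could be handed to an agglomerative clustering step. It is an illustration only: the function name `single_linkage`, the toy matrix `toy_dist`, and the stopping threshold are assumptions made for this example and do not come from any of the tools discussed in the seminar. Single linkage is used here simply because it is the shortest variant to write down.

```python
from itertools import combinations


def single_linkage(dist, threshold):
    """Agglomerative clustering on a precomputed distance matrix.

    Starts with one cluster per sample and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold`. The distance between two clusters is the minimum
    distance between any of their members (single linkage).
    """
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] |= clusters[b]
        del clusters[b]  # safe: a < b, so index a is unaffected
    return sorted(sorted(c) for c in clusters)


# Hypothetical distances between four samples (symmetric, zero diagonal).
toy_dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]

clusters = single_linkage(toy_dist, threshold=0.5)
print(clusters)
```

In practice one would more likely pass such a matrix to an established library implementation; the point of the sketch is only that the clustering step needs nothing beyond the pairwise distances.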

✦ ✦ ✦

Requirements

This (pro-)seminar has no formal requirements.

Team

What Do You Need To Do?

How Are the Grades Computed?

Presentation: 40%
Report: 40%
Participation: 20%

Presentation (40%)

Assessed on clarity, depth of understanding, and the quality of your answers to audience questions.

Report (40%)

The report should include:

Participation (20%)

Active engagement during other students' presentations. Ask questions — it counts toward your grade and improves the seminar for everyone.

Plagiarism

We will check every submission for plagiarism with TurnItIn, an online tool that automatically checks submissions for plagiarism. You are free (and encouraged) to use it before submitting your final report. Following the link above, you can log in with your UdS credentials (the same ones you use for your student email) and use TurnItIn for free. By attending this seminar, you agree that we upload your report to TurnItIn.

If we detect plagiarism in your work, you will have the chance to explain yourself. Ultimately, you will fail this seminar if your explanation is not convincing.

Registration

Please register for this seminar by writing an email to Roman Joeres before 19.04.2024 23:59. Please also attach your transcript of records, which can be downloaded from LSF/HISPOS. We will distribute the topics among students in the mandatory kickoff meeting on 23.04.2024 at 12:00 in E2.1 SR 007.

Organisational Details

📅 The seminar will be held as a block seminar at the end of September or beginning of October 2024.

🎓 Bachelor students in their proseminar earn 5 credit points (CP).

🎓 Students in their seminar earn 7 credit points (CP).

👥 Maximum number of participants: 12.
We guarantee 4 proseminar seats and 8 seminar seats. If one quota is not reached, the seats will be filled from the other group.

🌐 (Pro-)Seminar language: English.

⭐ Bioinformatics students and students with no prior (pro-)seminar will be preferred.

📋 Registration in LSF/HISPOS will be announced in due course.

If you have questions, feel free to contact one of the team members listed above; non-professors usually respond faster ;-).

Important Dates

[19.04.2024, 17:59] Registration deadline
[23.04.2024, 12:00] Kick-off meeting — room 007 in E2.1 (CBI building)
[27.09.2024, 23:59] Submission deadline for final draft of slide set
[08./09.10.2024] Presentation days
[25.10.2024, 23:59] Submission of final report

Topics

  1. DIAMOND - Comparison of amino-acid sequences Assigned to: Anastasia Lesnikov, Supervisor: Roman Joeres
  2. MMseqs2 - Comparison & Clustering of amino-acid sequences Assigned to: Johanna Becher, Supervisor: Amay Agrawal
  3. TM-align - Comparison of protein structures Assigned to: Johanna Straub, Supervisor: Guangyi Chen
  4. CD-HIT - Clustering of amino-acid sequences & DNA/RNA Assigned to: Maximilian Bähr, Supervisor: Amay Agrawal
  5. FoldSeek - Comparison of protein structures Assigned to: Zyad Ahmed, Supervisor: Roman Joeres
  6. Weisfeiler-Lehman Graph Kernels - Comparison of protein structures & chemical molecules Assigned to: Varvara Kotelnikova, Supervisor: Roman Joeres
  7. MCES (Maximum Common Edge Subgraph) - Comparison of protein structures & chemical molecules Assigned to: Katja Räde, Supervisor: Roman Joeres
  8. MashMap - Comparison of long DNA strands Assigned to: Pranjali Jain, Supervisor: Guangyi Chen
  9. MCL - Clustering of genomes & amino acid sequences Assigned to: Misbah Sayeeda Musheer, Supervisor: Amay Agrawal
  10. MASH - Comparison of genomes Assigned to: Max Asenow, Supervisor: Amay Agrawal
  11. GClust - Comparison of genomes Assigned to: Saransh Shiva Nair, Supervisor: Guangyi Chen
  12. SANS - Clustering of genomes & amino acid sequences Assigned to: Vishak Kadamalithaya, Supervisor: Guangyi Chen

Comparison means that a tool computes a matrix of pairwise distances or similarities. Clustering means that a tool computes clusters of samples without reporting how similar or distant the samples are.
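As an illustration of what "comparison" means in this sense, the sketch below builds a pairwise distance matrix for a few toy sequences. The distance used, a Jaccard distance over sets of k-mers, is chosen only because it is easy to write down; it is an assumption for this example and is not the method implemented by any of the tools listed above. The names `kmers`, `jaccard_distance`, `distance_matrix`, and the sequences in `toy_seqs` are likewise hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """1 minus the Jaccard similarity of the k-mer sets of a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0  # both sequences shorter than k: treat as identical
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k=3):
    """Pairwise distances between all sequences (symmetric, zero diagonal)."""
    return [[jaccard_distance(a, b, k) for b in seqs] for a in seqs]


# Hypothetical toy sequences, not real biological data.
toy_seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
matrix = distance_matrix(toy_seqs)

for row in matrix:
    print(" ".join(f"{d:.2f}" for d in row))
```

The output of a comparison tool is a matrix of this shape. A clustering tool, by contrast, would return only a group assignment for each sample.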

Schedule

[08.10., 09:00] Welcome & Opening Remarks
[08.10., 09:05] TM-align (Johanna Straub)
[08.10., 09:50] CD-HIT (Maximilian Bähr, Proseminar)
[08.10., 10:25] MMseqs2 (Johanna Becher)
[08.10., 11:10] Break
[08.10., 11:15] FoldSeek (Zyad Ahmed)
[08.10., 12:00] Weisfeiler-Lehman Graph Kernels (Varvara Kotelnikova, Proseminar)
[08.10., 12:35] MCES (Katja Räde, Proseminar)
[09.10., 09:00] DIAMOND (Anastasia Lesnikov)
[09.10., 09:45] MashMap (Pranjali Jain)
[09.10., 10:30] MCL (Misbah Sayeeda Musheer)
[09.10., 11:15] Break
[09.10., 11:20] MASH (Max Asenow)
[09.10., 12:05] GClust (Saransh Shiva Nair)
[09.10., 12:50] SANS (Vishak Kadamalithaya)