(Pro-)Seminar: Comparison and Clustering of Biological Molecules

Description

In bioinformatics, the comparison and clustering of biological molecules serve as essential methodologies for deciphering the intricate complexities of biological systems. These processes play a pivotal role in elucidating evolutionary relationships, functional similarities, and structural motifs across diverse organisms and biomolecules. Through sequence alignment, structural superposition, and clustering techniques, researchers can discern patterns, similarities, and differences within vast datasets of DNA, RNA, proteins, and small molecules. Such analyses not only aid in understanding the fundamental mechanisms governing cellular processes but also pave the way for advancements in drug discovery, protein engineering, and personalized medicine. Moreover, by organizing biological data into meaningful clusters based on shared characteristics, these methodologies streamline data interpretation, enabling researchers to extract valuable insights and make informed decisions in their quest to unravel the mysteries of life at the molecular level.

In this seminar, we will look at two groups of tools. The first group computes clusters directly while the second group computes pairwise inter-sample similarities or distances that can then be used by classical clustering algorithms such as spectral clustering or agglomerative clustering. The goal for this seminar is to equip participants with a comprehensive understanding of the principles, methodologies, and practical applications of comparing and clustering biological molecules using cutting-edge tools such as FoldSeek, MASH, and CD-HIT. By the end of the seminar, participants should be able to proficiently utilize these tools to perform sequence alignment, structural comparison, and clustering analyses on various biological datasets. Moreover, they should gain insights into how these analyses contribute to advancements in genomics, proteomics, and drug discovery.

✦ ✦ ✦

Requirements

This (pro-)seminar has no formal requirements.

Team

What Do You Need To Do?

Read and present one of the assigned seminar papers (assignment made at the kick-off meeting).
Bachelor students in their proseminar have to

Give a 20-minute presentation with an additional 10 minutes of discussion and
Write a 5-page report about their method.

Students in their seminar have to

Give a 30-minute presentation with an additional 10 minutes of discussion and
Write a 7-page report about their method.

You need to attend all talks and participate in discussions.

How Are the Grades Computed?

Presentation

40%

Report

40%

Participation

20%

Presentation (40%)

Assessed on clarity, depth of understanding, and the quality of your answers to audience questions.

Report (40%)

The report should include:

A short explanation of how the algorithm in your paper works.
What are its (dis-)advantages over other tools presented in the seminar?
Your own thoughts on this method.

Participation (20%)

Active engagement during other students' presentations. Ask questions — it counts toward your grade and improves the seminar for everyone.

Plagiarism

We will check every submission for plagiarism with TurnItIn, an online tool that automatically checks submissions for plagiarism. You are free (and encouraged) to use it before submitting your final report. Following the link above, you can log in with your UdS credentials (the ones you use for your student email) and use TurnItIn for free. By attending this seminar, you agree that we may upload your report to TurnItIn.

If we detect plagiarism in your work, you will have the chance to explain yourself. Ultimately, you will fail this seminar if your explanation is not convincing.

Registration

Please register for this seminar by writing an email to Roman Joeres before 19.04.2024 23:59. Please also attach your transcript of records, which can be downloaded from LSF/HISPOS. We will distribute the topics among students at the mandatory kick-off meeting on 23.04.2024 at 12:00 in E2.1 SR 007.

Organisational Details

📅 The seminar will be held as a block seminar at the end of September or beginning of October 2024.

🎓 Bachelor students in their proseminar earn 5 credit points (CP).

🎓 Students in their seminar earn 7 credit points (CP).

👥 Maximum number of participants: 12.
We guarantee 4 proseminar seats and 8 seminar seats. If one quota is not reached, the seats will be filled from the other group.

🌐 (Pro-)Seminar language: English.

⭐ Bioinformatics students and students with no prior (pro-)seminar will be preferred.

📋 Registration in LSF/HISPOS will be announced in due course.

In case of questions, feel free to contact one of the team members listed above. Non-professors usually respond faster ;-).

Important Dates

[19.04.2024, 17:59]	Registration deadline
[23.04.2024, 12:00]	Kick-off meeting — room 007 in E2.1 (CBI building)
[27.09.2024, 23:59]	Submission deadline for final draft of slide set
[08./09.10.2024]	Presentation days
[25.10.2024, 23:59]	Submission of final report

Topics

DIAMOND - Comparison of amino-acid sequences Assigned to: Anastasia Lesnikov, Supervisor: Roman Joeres
DOI GitHub
MMseqs2 - Comparison & Clustering of amino-acid sequences Assigned to: Johanna Becher, Supervisor: Amay Agrawal
DOI GitHub
TM-align - Comparison of protein structures Assigned to: Johanna Straub, Supervisor: Guangyi Chen
DOI GitHub
CD-HIT - Clustering of amino-acid sequences & DNA/RNA Assigned to: Maximilian Bähr, Supervisor: Amay Agrawal
DOI GitHub
FoldSeek - Comparison of protein structures Assigned to: Zyad Ahmed, Supervisor: Roman Joeres
DOI GitHub
Weisfeiler-Lehman Graph Kernels - Comparison of protein structures & chemical molecules Assigned to: Varvara Kotelnikova, Supervisor: Roman Joeres
PDF GitHub
MCES (Maximum Common Edge Subgraph) - Comparison of protein structures & chemical molecules Assigned to: Katja Räde, Supervisor: Roman Joeres
DOI GitHub
MashMap - Comparison of long DNA strands Assigned to: Pranjali Jain, Supervisor: Guangyi Chen
DOI GitHub
MCL - Clustering of genomes & amino acid sequences Assigned to: Misbah Sayeeda Musheer, Supervisor: Amay Agrawal
DOI GitHub
MASH - Comparison of genomes Assigned to: Max Asenow, Supervisor: Amay Agrawal
DOI GitHub
GClust - Comparison of genomes Assigned to: Saransh Shiva Nair, Supervisor: Guangyi Chen
DOI GitHub
SANS - Clustering of genomes & amino acid sequences Assigned to: Vishak Kadamalithaya, Supervisor: Guangyi Chen
DOI (update) DOI (main paper) GitHub

Comparison means a tool computes a matrix of pairwise distances or similarities. Clustering means a tool computes clusters of samples without reporting how similar or distant the samples are.
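
To make the distinction concrete, the following is a minimal Python sketch of the two-step route: a comparison step that produces a pairwise distance matrix, followed by a classical clustering step that consumes it. Everything in it is an illustrative assumption and not how any of the tools listed above works internally: the toy sequences, the k-mer Jaccard distance, the single-linkage merging rule, the threshold of 0.5, and the function names kmer_set, jaccard_distance, distance_matrix, and single_linkage are all hypothetical choices made for this example.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Jaccard distance between the k-mer sets of two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0  # both sequences are shorter than k
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k=3):
    """'Comparison' step: symmetric matrix of pairwise distances."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = jaccard_distance(seqs[i], seqs[j], k)
    return dist


def single_linkage(dist, threshold):
    """'Clustering' step: repeatedly merge the two closest clusters
    until the smallest inter-cluster distance exceeds the threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            # Single linkage: distance of the closest pair of members.
            d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        if best[0] > threshold:
            break
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # y > x, so index x stays valid
    return sorted(clusters)


# Toy input (assumption): two pairs of near-identical DNA strings.
toy_sequences = ["ACGTACGTAC", "ACGTACGTTC", "TTGGCCAATT", "TTGGCCAATG"]
dist = distance_matrix(toy_sequences)
clusters = single_linkage(dist, threshold=0.5)
print(clusters)  # lists of sequence indices, one list per cluster
```

A tool from the first group would return only the final clusters, whereas a tool from the second group would return something like dist and leave the choice of clustering algorithm and threshold to the user.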

Schedule

[08.10., 09:00]	Welcome & Opening Remarks
[08.10., 09:05]	TM-align (Johanna Straub)
[08.10., 09:50]	CD-HIT (Maximilian Bähr, Proseminar)
[08.10., 10:25]	MMseqs2 (Johanna Becher)
[08.10., 11:10]	Break
[08.10., 11:15]	FoldSeek (Zyad Ahmed)
[08.10., 12:00]	Weisfeiler-Lehman Graph Kernels (Varvara Kotelnikova, Proseminar)
[08.10., 12:35]	MCES (Katja Räde, Proseminar)
[09.10., 09:00]	DIAMOND (Anastasia Lesnikov)
[09.10., 09:45]	MashMap (Pranjali Jain)
[09.10., 10:30]	MCL (Misbah Sayeeda Musheer)
[09.10., 11:15]	Break
[09.10., 11:20]	MASH (Max Asenow)
[09.10., 12:05]	GClust (Saransh Shiva Nair)
[09.10., 12:50]	SANS (Vishak Kadamalithaya)