Summary
In this work I investigated the suitability of diffusion language models for protein design by tackling the so-called inverse folding problem: I trained a diffusion model (DiffuSeq) to generate protein sequences that fold into a specific 3D conformation.
Some of the data analysis also became part of the ProstT5 paper. In addition, I created a poster summarizing the main findings of this thesis for the ISMB/ECCB 2023 conference; it can be found here.
Abstract of the thesis:
Generative diffusion models, like DALL-E and Imagen, have demonstrated impressive proficiency in controlled data generation, particularly for high-resolution imagery guided by natural-language text prompts. On a parallel track, AlphaFold2 has approached the threshold of resolving one of biology's most challenging tasks, the protein folding problem, by predicting the 3D structure of proteins with unprecedented accuracy. While the structural prediction of proteins is a captivating domain in its own right, the inverse problem, namely the controlled generation of sequences that fold into a predefined three-dimensional structure, constitutes an equally crucial scientific frontier across multiple fields.
This thesis serves as a proof of concept for deploying diffusion language models (dLMs) towards the goal of controlled protein design, thereby addressing the inverse folding problem. Our workflow hinged on the application of DiffuSeq, a novel, classifier-free, sequence-to-sequence text generation method introduced by Gong et al. (2022). The structural information was first encoded using the 3Di structural alphabet introduced by Foldseek.
This format reduces a protein's spatial conformation to a one-dimensional representation by projecting the three-dimensional characteristics of each amino acid onto one of 20 discrete 3Di tokens.
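To make this concrete, here is a minimal sketch of how a structure, once reduced to its 3Di string, can be paired with its amino-acid sequence as one source/target example for a sequence-to-sequence model such as DiffuSeq. The toy strings, the `src`/`trg` field names, and the helper function are illustrative assumptions, not the actual preprocessing code of the thesis:

```python
# Minimal sketch: pairing a 3Di structure string (source) with an
# amino-acid sequence (target) as one conditional training example
# for a sequence-to-sequence diffusion model such as DiffuSeq.
# In practice the 3Di string is produced by Foldseek from the 3D
# structure, one token per residue.

# The 20 discrete 3Di states, written here with amino-acid letters
# as in Foldseek's output.
THREE_DI_TOKENS = set("ACDEFGHIKLMNPQRSTVWY")

def make_pair(three_di: str, aa_seq: str) -> dict:
    """Build one structure-conditioned example: 3Di in, sequence out."""
    assert len(three_di) == len(aa_seq), "one 3Di token per residue"
    assert set(three_di) <= THREE_DI_TOKENS, "unknown 3Di token"
    return {"src": " ".join(three_di), "trg": " ".join(aa_seq)}

# Hypothetical toy example (not a real protein):
print(make_pair("DVVLCPQD", "MKTAYIAK"))
# {'src': 'D V V L C P Q D', 'trg': 'M K T A Y I A K'}
```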
In our investigation, we first examined the dependency of the 3Di alphabet on three-state secondary structure, establishing a clear prevalence (>50%) of a specific secondary structure for each 3Di letter. Thereafter, we contrasted the generative capabilities of two Transformer architectures: Bidirectional Encoder Representations from Transformers (BERT) and the Enhanced Transformer with Rotary Position Embedding (RoFormer). Next, we demonstrated the superior performance of dLMs over a semi-random baseline, which they consistently exceeded across all employed structure-similarity metrics. However, it is important to note that, despite these promising results, we fell short of the performance levels exemplified by the state-of-the-art method, ProteinMPNN. Nonetheless, given the discrete text nature of our conditional information, our findings underscore the potential of dLMs as powerful tools for generating sequences that effectively capture and leverage biological context in protein design.
![Thesis Results](images/ma-thesis/Final_values_with_errs.png)
Main results of the preliminary best model. The RMSD values in the plot are scaled; the actual values are 4.08 Å, 3.49 Å, 2.6 Å, and 2.63 Å, respectively.
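For context, the RMSD (root-mean-square deviation) reported above measures the average distance between corresponding atoms of two optimally superimposed structures. Below is a minimal sketch of how such a value can be computed for Cα coordinates, using the Kabsch algorithm for the superposition; this is an illustration assuming numpy, not the evaluation code used in the thesis:

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """C-alpha RMSD (in Å) between two (N, 3) coordinate arrays
    after optimal superposition via the Kabsch algorithm."""
    # Center both structures on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))      # correct for reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Toy usage: identical structures give an RMSD of ~0.
coords = np.random.rand(50, 3) * 10
print(kabsch_rmsd(coords, coords.copy()))
```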