Thesis Topics
Prerequisites
Thesises in our group usually need good knowledge in:
- mathematics and statistics, in particular in probability theory
- data mining and artificial intelligence
- strong programming skills in Java, Rust, Python
In a bachelor thesis you will usually:
- read primary literature on novel methods
- summarize foundations and reproduce the derivation of the new methods in your own words
- implement methods from the studied literature yourself
- experimentally compare the new method with reference methods
In a master thesis, we typically expect that you go beyond the state-of-the-art, for example by designing a novel extension of existing methods and comparing its properties in detail.
We typically do not offer literature research topics, and rarely offer purely theoretical topics!
We have a new thesis template available.
Open topics
There are currently few announced topics due to limited capacity. Individual topics are also possible.
- Gaussian Mixture Modeling in Rust
In this Bachelor thesis topic, the objective is to implement GMM in Rust, with emphasis on a modular and extensible code structure that supports different cluster models (elliptical, spherical) and MAP-regularization. This can be modeled following our existing Java implementation, but we expect to improve the run-time with a Rust implementation. As extension it is possible to also implement BETULA for accelerating GMM in Rust. Knowledge of and interest in Rust is necessary for this assignment! - Frequent Sequence Mining
In bachelor thesises in this area, a series of algorithms from Frequent Sequence Mining literature should be implemented and evaluated, based on the implementations of FIM algorithms in the open-source framework ELKI.- Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. Proc. 11th int. Conf. On data engineering (1995), 3–14.
- Srikant, R. and Agrawal, R. 1996. Mining sequential patterns: Generalizations and performance improvements. Proc. 5th international conference on extending database technology (EDBT) (1996), 3–17.
- Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U. and Hsu, M. 2000. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD (2000), 355–359.
- Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U. and Hsu, M. 2001. PrefixSpan: Mining sequential patterns by prefix-projected growth. ICDE (2001), 215–224.
- Zaki, M.J. 2001. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning. 42, 1/2 (2001), 31–60. DOI:10.1023/A:1007652502315
- Ayres, J., Flannick, J., Gehrke, J. and Yiu, T. 2002. Sequential PAttern mining using a bitmap representation. KDD (2002), 429–435.
- Fournier-Viger, P., Gomariz, A., Campos, M. and Thomas, R. 2014. Fast vertical mining of sequential patterns using co-occurrence information. Proc. 18th pacific-asia conference on advances in knowledge discovery and data mining PAKDD (2014), 40–52.
- DenClue clustering
In this bachelor thesis, the DenClue algorithm is to be implemented in Java and compared to other density-based clustering algorithms in the open-source framework ELKI.- Hinneburg, A., Keim D. A.: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD 1998: 58-65
- Hinneburg, A. Keim, D. A.: A General Approach to Clustering in Large Databases with Noise. Knowl. Inf. Syst. 5(4): 387-415 (2003)
- Hinneburg, A., & Gabriel, H. H. (2007, September). Denclue 2.0: Fast clustering based on kernel density estimation. In International symposium on intelligent data analysis (pp. 70-80). Berlin, Heidelberg: Springer Berlin Heidelberg.
- Metric Index (M-Index)
The goal of this bachelor thesis topic is to implement the M-Index for similarity search in Java, and compare it to other search indexes (e.g., iDistance, M-tree) within the open-source framework ELKI.- Novak, D., & Batko, M. (2009, August). Metric index: An efficient and scalable solution for similarity search. In 2009 Second international workshop on similarity search and applications (pp. 65-73). IEEE.
- Python wrapper für ELKI
To make it easier to compare the implementations of algorithms from ELKI and Python, the goal of this bachelor thesis is to implement a wrapper to call the Java implementations from within Python. For this python-javabridge shall be used, to avoid unnecessary copies, serialization, and deserialization of the data. By using introspection, the python APIs should be automatically generated so that it is not necessary to modify the adapter for every new algorithm. - R wrapper für ELKI
To make it easier to access ELKI algorithms from R, a wrapper based on rJava is to be implemented and evaluated. By means of introspection, the wrapper is to be automatically generated so it does not need to be manually updated for each version. Configuration and syntax should follow R conventions. (bachelor thesis topic) - Incremental Nearest Neighbor Search in Rust
In this bachelor thesis, efficient incremental nearest neighbor search is to be implemented in Rust, based on the existing Java-code in ELKI. To evaluate the performance, an algorithm based on incremental nearest neighbor search shall be implemented. The use of Rust is expected to further improve the performance over the Java version.- Schubert, E. (2022, September). Automatic indexing for similarity search in ELKI. In International Conference on Similarity Search and Applications (pp. 205-213).
- Schubert, E. (2024, October). Hierarchical Clustering Without Pairwise Distances by Incremental Similarity Search. In International Conference on Similarity Search and Applications (pp. 238-252).
- Development of a parallel HNSW construction scheme
In this Master thesis topic, you are to improve vector search. Hierarchical Navigable Small World (HNSW) graphs are one of the best performing indices for approximate nearest neighbor search. Due to finding nearest neighbors using the navigation over a hierarchy of graphs of increasing size, they achieve logarithmic query times even on very high-dimensional data. Their main bottleneck to this day remains the very costly build time, since they are typically built incrementally by inserting points in sequence. This thesis topic is about developing an efficiently parallel construction method for the hierarchy of HNSW graphs, based on a black box generator for individual hierarchy layers and a selection heuristic for more abstract hierarchy layers. The NNDescent algorithm is supposed to act as a placeholder for arbitrary black box layer creation methods. The thesis will require skills in efficient programming, primarily in the Rust programming language.
- Malkov, and Yashunin: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (https://doi.org/10.1109/TPAMI.2018.2889473)
- Dong, Charikar, and Li: Efficient k-nearest neighbor graph construction for generic similarity measures (https://doi.org/10.1145/1963405.1963487)
Previous Topics:
As examples, some earlier topics:
Bachelor Informatik
- Clustering with von-Mises-Fischer Distribution
- Choosing the number of Gaussian clusters
- Clustering with k-nearest neighbor consistency
- Accelerating Spherical k-means in Rust
- Textbasierte Empfehlungssysteme
- Erkennung von beleidigender Sprache in Twitter unter Verwendung von Maschinellem Lernen
- Accelerated Spherical k-means Clustering
- Analyse Alternativer Algorithmen für k-Means-Clustering
- Analysis and Evaluation of Approximate Nearest Neighbor Search using Hierarchical Navigable Small World Graphs
- Analysis and Evaluation of AnyDBC Clustering with Alternative Heuristics
- Compact Circular Fingerprints for Chemical Structures using Bloom Filters
- Entwicklung eines Greedy-Algorithmus zur Generierung elektrischer Niederspannungsnetze auf Basis öffentlich verfügbarer Daten
- Erstellung und Auswertung von Song-Embeddings mittels Word2Vec
- k-Means-Clustering mit k-d-Bäumen
Master Informatik, Master Data Science
- Vantage-Point Tree to HNSW: Fast Index Construction for Approximate Nearest Neighbor Search in High-Dimensional Data
- SQL Code Embeddings - Integrating Syntactical Information into Transformer Models
- Entwicklung und Evaluation schrankenbasierter DBSCAN Varianten
- On-Device Robot Localization Based on Representation Learning and a Feature-Rich Industrial Floor
- Beschleunigung von DBSCAN mithilfe der Dreiecksungleichung
- Fast Latent Dirichlet Allocation with Background Topics
- Evaluation of Feature Selection from Small Molecule Fingerprints by Means of Machine Learning
- Abstractive Text Summarization using a Reinforced Variational Autoencoder
- Efficient Approximate k-Nearest-Neighbor-Search in Large Graph Databases
- Automated Flexibility Coordination in Smart Grids Based on Neural Networks
- Leave-One-Out Optimierung von Support Vector Machines zur Ausreißererkennung
Bachelor Wirtschaftsinformatik
- no examples yet
Industry topics
I generally require the industry partner first establishes contact before promising topics to students, as many industry topic ideas do not satisfy the exam regulations. In particular, a non-disclosure agreement must be checked by the legal department, and this may take several weeks. Furthermore, I have negative prior experience with poor mentoring by industry partners. Ideally, your supervisor there is an alumni, who has previously supervised thesises – please ask your industry supervisor to first establish contact, resolve any legal contracts necessary, and reach an agreement on the topic supervision.