Data Science Projects

Data science is an exciting and rapidly growing field that combines statistics, computer science, and domain knowledge to extract insights from data. It has gained significant attention in recent years because of the explosion of data generated by digital technologies and the growing importance of data-driven decision-making across industries.

This is a collection of data science projects that use machine learning and/or deep learning. I've grouped them by their most relevant topic for your convenience.

Bioinformatics Projects

Bioinformatics is a fascinating field that allows us to explore the intricate workings of living organisms and make groundbreaking discoveries. By combining biology, computer science, and statistics, bioinformatics offers a unique perspective on the natural world and provides us with tools to solve complex problems. Naturally, it is a field that interests me.

  • Biopython | Bioinformatics Basics

    Open Notebook

    In this project I explore the basics of the Python module Biopython. We look at how to define biological sequences using Seq, which lets us work with basic DNA and protein sequence information. The library also supports richer sequence information through SeqRecord, which lets us include the annotations and features found in a sequence. The notebook serves as an introduction to the various bioinformatics operations that can be done with Biopython.

    locus tag: ['YP_pPCP01'], database ref: ['GeneID:2767718'], strand: 1, location: [86:1109](+)
    locus tag: ['YP_pPCP02'], database ref: ['GeneID:2767716'], strand: 1, location: [1105:1888](+)
    locus tag: ['YP_pPCP03'], database ref: ['GeneID:2767717'], strand: 1, location: [2924:3119](+)
    locus tag: ['YP_pPCP04'], database ref: ['GeneID:2767720'], strand: 1, location: [3485:3857](+)
    locus tag: ['YP_pPCP05'], database ref: ['GeneID:2767712'], strand: 1, location: [4342:4780](+)
    locus tag: ['YP_pPCP06'], database ref: ['GeneID:2767721'], strand: -1, location: [4814:5888](-)
    locus tag: ['YP_pPCP07'], database ref: ['GeneID:2767719'], strand: 1, location: [6004:6421](+)
    locus tag: ['YP_pPCP08'], database ref: ['GeneID:2767715'], strand: 1, location: [6663:7602](+)
    locus tag: ['YP_pPCP09'], database ref: ['GeneID:2767713'], strand: -1, location: [7788:8088](-)
    locus tag: ['YP_pPCP10'], database ref: ['GeneID:2767714'], strand: -1, location: [8087:8360](-)
  • Bioconductor | Bioinformatics Basics

    Open Notebook

    In this project we explore the basics of bioinformatics using two Bioconductor packages: Biostrings, which allows us to work with biological sequences, and msa, which can be used for sequence alignment.

    AAStringSet object of length 10:
         width seq                                              names               
  • Biological Sequence Operations

    Open Notebook

    In this project we create Python classes that allow us to work with biological sequences, similar to the Seq and SeqRecord classes in Biopython but with various additional operations. The implemented classes form the basis of future library additions. The library can read and work with both FASTA and GenBank formats, perform basic exploratory data analysis of sequences, and annotate different parts of a sequence. The created classes have been implemented in the biopylib library.


  • Biological Sequence Alignment

    Open Notebook

    Biological sequence alignment is an important problem in bioinformatics for a number of reasons. One is understanding genetic variation: by aligning biological sequences, such as DNA or protein sequences, researchers can identify similarities and differences between different organisms or within the same organism, which helps in understanding genetic variation, evolution, and relationships between species. In this project, we create a class for pairwise and multiple sequence alignment (global and local), in a similar fashion to how we created the sequence operation classes in biological-sequence-operations. The created classes have been implemented in the biopylib library.


  • Gene Classification

    Open Notebook

    In this project, we look at how to work with biological sequence data by venturing into a machine learning classification problem, in which we classify seven different gene groups, such as Ion Channels and Transcription Factors, that are common to three species (human, chimpanzee and dog). Each DNA segment has already been labelled for us, so all we need to do is preprocess the data, much as we would in an NLP problem, except that we use a bioinformatics-specific encoding that works well for DNA-based data. Using classical machine learning methods, we train our model on human data and see how well it generalises to dog and chimpanzee data.
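
For the Biological Sequence Operations project listed above, the following is a minimal sketch of what a sequence class with FASTA reading could look like. It is not the biopylib API: the names `SimpleSeq`, `parse_fasta` and `_make_record`, and the toy FASTA text, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Translation table for DNA complements (upper and lower case)
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass
class SimpleSeq:
    """Hypothetical minimal sequence record: identifier, description, residues."""
    id: str
    description: str
    seq: str

    def reverse_complement(self) -> str:
        """Complement every base, then reverse the string."""
        return self.seq.translate(_COMPLEMENT)[::-1]

    def gc_content(self) -> float:
        """Fraction of G and C bases; 0.0 for an empty sequence."""
        if not self.seq:
            return 0.0
        upper = self.seq.upper()
        return (upper.count("G") + upper.count("C")) / len(upper)


def _make_record(header: str, chunks: list[str]) -> SimpleSeq:
    # The first whitespace-delimited token is treated as the identifier
    ident, _, description = header.partition(" ")
    return SimpleSeq(ident, description, "".join(chunks))


def parse_fasta(text: str) -> list[SimpleSeq]:
    """Parse FASTA-formatted text into a list of SimpleSeq records."""
    records: list[SimpleSeq] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


# Toy input, not taken from the project's data
toy_fasta = ">seq1 toy example\nATGC\nGGTA\n>seq2\nAACC\n"
toy_records = parse_fasta(toy_fasta)
```

Here `toy_records[0].seq` joins the two sequence lines of `seq1` into a single string, and `toy_records[0].gc_content()` gives the GC fraction of that joined sequence.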
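
For the Biological Sequence Alignment project listed above, pairwise global alignment is commonly done with the Needleman-Wunsch dynamic programming algorithm. The sketch below assumes that algorithm with a linear gap penalty. The scoring values and the function name are assumptions for illustration, and this is not the class implemented in biopylib.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global pairwise alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences, chosen only for illustration
best_score, top, bottom = needleman_wunsch("ACGT", "ACT")
print(top)     # ACGT
print(bottom)  # AC-T
```

Local alignment (Smith-Waterman) follows the same table-filling idea, but clamps cell scores at zero and starts the traceback from the highest-scoring cell rather than the corner.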
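
The Gene Classification description does not say which DNA encoding the notebook uses. One common NLP-like choice is to split each sequence into overlapping k-mers ("words") and count them, and the sketch below assumes that approach. The function names, the value of `k`, and the toy sequence are hypothetical.

```python
from collections import Counter


def kmers(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into overlapping, lowercased k-mers."""
    seq = sequence.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence: str, k: int = 6) -> Counter:
    """Bag-of-words style representation: k-mer -> number of occurrences."""
    return Counter(kmers(sequence, k))


# Toy sequence, not taken from the project's dataset
example = "ATGCATGCA"
words = kmers(example, k=4)
print(" ".join(words))  # atgc tgca gcat catg atgc tgca
counts = kmer_counts(example, k=4)
```

Joining the k-mers with spaces turns each DNA segment into a "sentence", so the same bag-of-words vectorisers and classical classifiers used for text can then be applied to it.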

Any questions or comments about the above post can be addressed on the mldsai-info channel or to me directly shtrauss2, on shtrausslearning or shtrausslearning