How the Inverted Index and Scoring Work in Elasticsearch

24 June 2020

Searching through full-text fields with regexes in relational database systems like PostgreSQL or MySQL is painful: the query latency is high and your results are unordered, so you have no idea how relevant each result is. Elasticsearch is often the storage engine of choice for storing and querying full-text data. In Elasticsearch, querying full-text fields is among the least resource-intensive tasks, and your query results are ordered, with the most relevant results on top.
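To make the idea of an inverted index with ranked results concrete, here is a minimal sketch in plain Python. It is not how Elasticsearch is implemented: the whitespace tokenizer, the function names (`tokenize`, `build_inverted_index`, `search`) and the scoring rule (summed term frequency) are simplifying assumptions chosen for illustration. Elasticsearch's default scoring is BM25, which is considerably more elaborate.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Rank documents by the summed term frequency of the query terms.

    This is a toy relevance score, assumed here for illustration only.
    """
    scores = defaultdict(int)
    for term in tokenize(query):
        # Only documents that contain the term are touched at all.
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical example documents.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick fox jumps",
}
index = build_inverted_index(docs)
results = search(index, "quick fox")
print(results)
```

The lookup only visits documents that actually contain a query term, which is why this structure is so much cheaper than scanning every row with a regex, and the score gives the ordering that a plain regex match lacks.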

Working with Complex Datatypes in Hive

7 June 2020 in Databases, Hive, SQL

The basic idea of complex datatypes is to store multiple values in a single column. So if you are working with a Hive database and you query a column, but then you notice “This value I need is trapped in a column among other values…”, you have just come across a complex, a.k.a. nested, datatype. There are three types: arrays, maps and structs. First, you have to understand which types are present.

Plotting with Seaborn

10 September 2019 in Python, Data Visualization

Seaborn is a Python library for creating plots. It is based on matplotlib and provides a high-level interface for drawing statistical graphics.

Seaborn integrates nicely with pandas: it operates on DataFrames and arrays and does aggregations and semantic mapping automatically, which makes it a quick, convenient option for data visualization in your data projects. Once you understand the basic concepts, you can create plots really easily without consulting Stack Overflow too much.

Mastering Data Preparation with Pandas: Subsetting, Filtering and Joining DataFrames

18 August 2019 in Python, Data Preparation

When I started working with pandas, I noticed that there were so many ways to subset, filter and join data. But I was lacking a systematic overview. How do the different approaches differ, and when should you use which? In this blog post we’ll look at different ways of subsetting, filtering and combining DataFrames. Subsetting Data: Selecting subsets of rows and columns by labels and positions.
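As a small illustration of selecting by labels versus positions, the sketch below uses `DataFrame.loc` (label-based) and `DataFrame.iloc` (position-based). The DataFrame, its column names and its index labels are made up for this example and are not taken from the post.

```python
import pandas as pd

# Hypothetical example data with string row labels.
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "score": [10, 20, 30]},
    index=["x", "y", "z"],
)

# Label-based: rows "x" and "z", column "name".
by_label = df.loc[["x", "z"], ["name"]]

# Position-based: first two rows, first column.
# Note that iloc slices exclude the end position, like Python lists.
by_position = df.iloc[0:2, 0:1]

print(by_label)
print(by_position)
```

The main thing to keep apart is that `loc` interprets its arguments as index and column labels, while `iloc` interprets them as integer positions, regardless of what the labels are.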

Everything You Need to Know to Use Git for Version Control

17 December 2018

So many people have recommended Git to me as a version control system. I had a look at it, but I was pretty overwhelmed. Since I did not have a technical background, everything seemed so complex! Many tutorials had me copy and paste code without giving me a deeper understanding of what I was actually doing and why. This copy-pasting felt like success at first, but when I tried working with it, I could not.

Automatically changing the R working directory on Mac OS to source file location

7 December 2018

This post is about how to change your R working directory. You might be wondering: Why would I want to do that? You need this as soon as your script interacts with folders on your computer, for example to import or export data or figures. So probably almost always. Let's say you have a script that creates plots and saves them in the folder "Plots", which is located in your source file directory.

Formatting tables in R Markdown with kableExtra

5 November 2018

In this post, I will show you some of my best practices for formatting tables in R Markdown. We will cover how to generally format tables (font, size, color, ...) and how to create tables with conditional formatting (e.g. coloring values < 0 red). The basics: the R package kableExtra. kableExtra is an awesome package that allows you to format and style your tables. It works similarly to ggplot2: you create a base table and then add formatting layers with the pipe operator %>%.

R Markdown for Novices: All you need to know to get started

3 November 2018 in R

I wrote this blog post for anyone who has never worked with R Markdown. After reading this post, you will understand why R Markdown may be useful for your daily work as a student, researcher, analyst or data scientist, and you will understand the basic structure of an R Markdown document and how you can get started. I strongly encourage everybody working with R to use R Markdown. I promise, it will make your life so much easier.

Heike Maria (PhD)