How you can structure your data sheets

After having talked about how to structure your data files in a recent newsletter edition we take a closer look at data sheets themselves. In this newsletter you will learn how to structure your data to help others  to understand your data (and therefore your project, too) and at the same time you will learn how to prepare your data for its publication.

A common way to save and organize data is done using Excel or LibreCalc etc. where you put all the collected data into various rows and columns.

As long as you or your staff are working on the files you will probably remember the meaning of the columns for some time. But in a couple of years the memory might fade and you have trouble understanding the data. This might also be the case for colleagues who want to use the data.

Using an example project, I will show how to record the important information and make the data structure more transparent in just a few steps. This is especially helpful if you want to publish the data and still understand the data in years to come.

CABS
(Status quo)

Let’s call our project “Continuum Analysis of Blood Samples” (CABS), in which blood of patient is taken on various days to document their health process. The blood samples are then analysed using different parameters.

The project is already running for some time and the project team has gathered data. Here is a look at their data sheet where they document the project.

data sheet of CABS

The first row is the table heading and each column contains various information.

There are some issues with this data sheet I would like to point out to and discuss possible ways to improve them:

  • Unification
  • Standardisation
  • Documentation

Unification

You may have noticed that the data sheet is written using German and English. Especially in the header row.

Data sheets are best written in English and the header row should be human and machine readable. That means, make it precise and omit spaces (or any non-ASCII-characters e.g. symbols).

The language locale not only affects the description but also the data itself: in English a floating point number is written with a period (.) instead of a comma (,). You see this in the column “Zellzahl” (maybe better “cell_count”). And speaking of this column one can omit the basis and the exponent (10^6) since it part of every data cell: simply write 2.247 instead of 2,247*10^6.

The columns “Aufnahme”, “Geburtstag”, “Tag 1”, “Tag 5-7”, “Tag 12-14” contain a date. But when you look closely you see the way of writing the date changes after some rows. First it is a German-like description then some sort of an English way. Be consistent and use an uniform way of writing a date.

There is actually one standard way of writing dates you will learn about in the next paragraph.

Standardisation

First, let’s talk about dates as a very important topic regarding standardisation. Secondly we will take a look at ways to get standardised values.

Dates

Writing dates can be confusing but actually there is only one way writing a date properly (ISO 8601) and visualized in this comic:


CC-NC 2.5, xkcd, https://xkcd.com/1179/

Controlled vocabulary

You can easily make sure your data is uniform and standardised (throughout your document) when you use a controlled vocabulary.

Let’s look at column “Sample handling” of the CABS-data sheet. There are actually only two accepted values for this column: Frozen and Fresh. But due to spelling errors or different people entering data and having a different way of describing the sample handling there are values like FRESH, liquid, Froen, frsh, deep frozen.

This is an error to your data and might lead to problems when you examine or do some calculations with it.
Introducing a controlled vocabulary will make sure you can only select the value Frozen or Fresh.

Setting up a controlled vocabulary (dropdown list) can be easily done in Excel or LibreCalc.

Let us take another look at column “Group”. Since there are also different values stored ARDS, non-ards, non ARDS, ards one might think of using a controlled vocabulary, too. But the value of this column can actually be described as a Boolean expression: true or false, either it belongs to “ARDS” or not. So, every ARDS or ards will be true and
non-ards or non ARDS will be replaced by false.

By replacing an individual value with a common expression you have made another standardisation.

Documentation

CABS (improved)

Great. Now we have improved our data sheet a lot. Here is how it looks like after some changes.

data sheet revised

What did I do? First of all, I changed the font to a monospaced font which improves the readability of the values in the columns. You can use the font e.g. Courier.

Second, I also introduced a new column called UUID (Universally Unique Identifier) which contains an unique value for each row (it is not a proper UUID but works for our purpose here). This is an important feature especially for processing your data since each row has an unique identifier. This local UUID consists of a string with five elements (a-z0-9) and can be created using a web service such as random.org or by your system (e.g. with bash).

Even with such a short UUID consisting of 5 elements you can create up 33,390,720 permutations without any duplicate. Using 4 elements will still make 1,113,024 permutations, with 2 elements you can get 1,122 permutations (credits: stattrek.com).

And third, I renamed all columns (all in English now) and rearranged them, depending on their area of origin (global, patient or sample).

Ok, so now we improved the data organisation but we still need to document the data for making them understandable. Here is its documentation.

In the first column you find the fields of the header. In the second column I added the former field names for the sake of completeness. It is very important to have a column for the description of the data. Just a few words to help understanding the data. The column “category” is also reflected in the field name (e.g. patient_ or sample_ ).

With column “controlled vocabulary” we document with true or false whether the value comes from a dropdown list. The possible values are listed in the column “values”. This column contains the character of all the values (e.g. kind of number – float or integer – or the schema for dates – YYYY-MM-DD – and so on).

In the column “origin” there is the information stored where the value comes from and/or how it was generated.

This is still just an exemplary project and you can add more columns to the documentation sheet if you wish and when you think you can provide more information to certain fields.

Summary

In this newsletter you have learned about different possibilities to improve your data. When you take a look at your data the next time think about these three steps (unification, standardisation and documentation). With some effort and committment your data will profit a lot.

If you like some assistance and discuss your data, contact your data steward.

This is the newsletter of the CRC 1382 in which regularly topics regarding (good) research data managements are discussed.

The information provided are selected by the data steward Dr. Lukas C. Bossert.
They are tailored to meet the standards and requirements of the UKA and RWTH.

If you think that the tips and tricks provided do not fit to your data I would be happy to discuss it and take a look at your data and its organization.