How you can structure your data sheets
A common way to save and organize data is done using Excel or LibreCalc etc. where you put all the collected data into various rows and columns.
As long as you or your staff are working on the files you will probably remember the meaning of the columns for some time. But in a couple of years the memory might fade and you have trouble understanding the data. This might also be the case for colleagues who want to use the data.
Using an example project, I will show how to record the important information and make the data structure more transparent in just a few steps. This is especially helpful if you want to publish the data and still understand the data in years to come.
Let’s call our project “Continuum Analysis of Blood Samples” (CABS), in which blood of patient is taken on various days to document their health process. The blood samples are then analysed using different parameters.
The project is already running for some time and the project team has gathered data. Here is a look at their data sheet where they document the project.
You may have noticed that the data sheet is written using German and English. Especially in the header row.
Data sheets are best written in English and the header row should be human and machine readable. That means, make it precise and omit spaces (or any non-ASCII-characters e.g. symbols).
The language locale not only affects the description but also the data itself: in English a floating point number is written with a period (
.) instead of a comma (
,). You see this in the column “Zellzahl” (maybe better “cell_count”). And speaking of this column one can omit the basis and the exponent (
10^6) since it part of every data cell: simply write
2.247 instead of
The columns “Aufnahme”, “Geburtstag”, “Tag 1”, “Tag 5-7”, “Tag 12-14” contain a date. But when you look closely you see the way of writing the date changes after some rows. First it is a German-like description then some sort of an English way. Be consistent and use an uniform way of writing a date.
There is actually one standard way of writing dates you will learn about in the next paragraph.
First, let’s talk about dates as a very important topic regarding standardisation. Secondly we will take a look at ways to get standardised values.
Writing dates can be confusing but actually there is only one way writing a date properly (ISO 8601) and visualized in this comic:
CC-NC 2.5, xkcd, https://xkcd.com/1179/
You can easily make sure your data is uniform and standardised (throughout your document) when you use a controlled vocabulary.
Let’s look at column “Sample handling” of the CABS-data sheet. There are actually only two accepted values for this column:
Fresh. But due to spelling errors or different people entering data and having a different way of describing the sample handling there are values like
This is an error to your data and might lead to problems when you examine or do some calculations with it.
Introducing a controlled vocabulary will make sure you can only select the value
Setting up a controlled vocabulary (dropdown list) can be easily done in Excel or LibreCalc.
Let us take another look at column “Group”. Since there are also different values stored
ards one might think of using a controlled vocabulary, too. But the value of this column can actually be described as a Boolean expression:
false, either it belongs to “ARDS” or not. So, every
ards will be
non ARDS will be replaced by
By replacing an individual value with a common expression you have made another standardisation.
Great. Now we have improved our data sheet a lot. Here is how it looks like after some changes.
What did I do? First of all, I changed the font to a monospaced font which improves the readability of the values in the columns. You can use the font e.g. Courier.
Second, I also introduced a new column called
UUID (Universally Unique Identifier) which contains an unique value for each row (it is not a proper UUID but works for our purpose here). This is an important feature especially for processing your data since each row has an unique identifier. This local UUID consists of a string with five elements (
a-z0-9) and can be created using a web service such as random.org or by your system (e.g. with bash).
Even with such a short UUID consisting of 5 elements you can create up
33,390,720permutations without any duplicate. Using 4 elements will still make
1,113,024permutations, with 2 elements you can get
1,122permutations (credits: stattrek.com).
And third, I renamed all columns (all in English now) and rearranged them, depending on their area of origin (global, patient or sample).
Ok, so now we improved the data organisation but we still need to document the data for making them understandable. Here is its documentation.
In the first column you find the fields of the header. In the second column I added the former field names for the sake of completeness. It is very important to have a column for the description of the data. Just a few words to help understanding the data. The column “category” is also reflected in the field name (e.g.
With column “controlled vocabulary” we document with
false whether the value comes from a dropdown list. The possible values are listed in the column “values”. This column contains the character of all the values (e.g. kind of number –
integer – or the schema for dates –
YYYY-MM-DD – and so on).
In the column “origin” there is the information stored where the value comes from and/or how it was generated.
This is still just an exemplary project and you can add more columns to the documentation sheet if you wish and when you think you can provide more information to certain fields.
In this newsletter you have learned about different possibilities to improve your data. When you take a look at your data the next time think about these three steps (unification, standardisation and documentation). With some effort and committment your data will profit a lot.
If you like some assistance and discuss your data, contact your data steward.