Data Analytics - Data Preparation using R
Data preparation is an important step in data science. Before we can start the analysis, we need to ingest the required data and then clean and transform it so that the analytics become easier to perform.
R is powerful open source software with reliable packages and libraries for data manipulation tasks.
I downloaded the clinical trials data from the ACCT website and used it to demonstrate the data preparation steps. You can refer to the complete code, published here in an R Markdown document:
Let's walk through the code and understand how to do it. You may want to open the link mentioned above in a separate tab since I will not mention the code here again and will simply refer to the document.
First, download the data files from the download tab of the ACCT website and unzip them into a folder on your local machine.
Installing the required packages is the first step in the code. 'dplyr' is an important package for data manipulation, and 'lubridate' provides useful functions for working with date columns.
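A minimal sketch of this setup step is shown below. It assumes a CRAN mirror is configured and an internet connection is available; the check that installs only the missing packages is my own convenience and may differ from what the linked document does.

```r
# Packages used in this walkthrough
pkgs <- c("dplyr", "lubridate")

# Find the packages that are not installed yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]

# Install only what is missing (needs a configured CRAN mirror)
if (length(to_install) > 0) install.packages(to_install)

# Attach the packages for the current session
library(dplyr)
library(lubridate)
```

Installation is needed only once per machine, while the library() calls are needed in every new R session.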
Save the directory paths in variables. Notice that each '\' in the Windows path is replaced with '/', because R treats a backslash inside a string as an escape character.
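The path handling could look roughly like the sketch below. The folder and file names are hypothetical placeholders, so substitute the location where you unzipped the files.

```r
# Hypothetical folder; replace with the location of your unzipped files
data_dir <- "C:/Users/me/Documents/clinical_trials"

# file.path() joins the pieces with "/" on every platform
# ("studies.txt" is an assumed file name for illustration)
studies_file <- file.path(data_dir, "studies.txt")
print(studies_file)

# A path copied from Windows Explorer needs its backslashes escaped ("\\")
# or converted to forward slashes
win_path <- "C:\\Users\\me\\Documents\\clinical_trials"
fixed_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(fixed_path)
```

Keeping the folder in one variable means that only a single line changes if the data is moved.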
The read.csv function is used to read the pipe-delimited data files into data frames. The read.table function is another option, but I encountered errors reading some records with it, whereas read.csv read the same data set without any issues.
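Here is a small self-contained sketch of reading a pipe-delimited file. It writes a tiny temporary file first so that it runs anywhere; the column names and values are made up for illustration and are not taken from the real data set.

```r
# Write a tiny pipe-delimited file so the example is self-contained
# (column names and values are made up for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "nct_id|study_type|brief_title",
  "NCT00000001|Interventional|Trial of drug #1",
  "NCT00000002|Observational|Patient's registry"
), tmp)

# sep = "|" tells read.csv that the fields are pipe delimited
studies <- read.csv(tmp, header = TRUE, sep = "|", stringsAsFactors = FALSE)

str(studies)
```

One possible reason for the difference between the two functions is their defaults: read.table treats both ' and " as quote characters and '#' as the start of a comment, while read.csv uses only " as a quote, disables comments, and sets fill = TRUE. Free-text fields containing apostrophes or '#' can therefore be truncated or misparsed by read.table unless those arguments are overridden.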
The studies table has many columns that I did not need for the analysis, so I created a subset with only the required columns.
The subset.data.frame function lets you select the required columns with the 'select' parameter. If you also want only a subset of rows, you can specify a filter condition with the 'subset' parameter.
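As an illustration, the sketch below applies both parameters to a toy data frame. The columns and values are invented stand-ins for the studies table, not the actual columns I kept.

```r
# Toy stand-in for the studies table (made-up columns and values)
studies <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002", "NCT00000003"),
  study_type = c("Interventional", "Observational", "Interventional"),
  enrollment = c(120, 45, 300),
  source = c("Site A", "Site B", "Site C"),
  stringsAsFactors = FALSE
)

# Keep only the required columns
# (subset() dispatches to subset.data.frame for a data frame)
studies_sub <- subset(studies, select = c(nct_id, study_type, enrollment))

# Keep the required columns and only the interventional studies
interventional <- subset(
  studies,
  subset = study_type == "Interventional",
  select = c(nct_id, study_type, enrollment)
)

print(studies_sub)
print(interventional)
```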
I needed to add new calculated columns to the studies subset. The mutate function helps with that.
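A rough sketch of adding calculated columns with mutate follows. The date columns and the derived fields (start year, duration in days) are assumptions chosen for illustration; the calculations in the linked document may be different.

```r
library(dplyr)
library(lubridate)

# Toy subset with dates stored as text (made-up values)
studies_sub <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002"),
  start_date = c("2015-01-15", "2016-06-01"),
  completion_date = c("2016-01-15", "2016-12-01"),
  stringsAsFactors = FALSE
)

studies_sub <- studies_sub %>%
  mutate(
    # Convert the text columns to Date with lubridate's ymd()
    start_date = ymd(start_date),
    completion_date = ymd(completion_date),
    # Derived columns can use the columns converted just above
    start_year = year(start_date),
    duration_days = as.numeric(
      difftime(completion_date, start_date, units = "days")
    )
  )

print(studies_sub)
```

mutate evaluates its arguments in order, which is why the later expressions can rely on the converted date columns.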
Another function I found helpful for wildcard searches is 'grepl'.
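For example, grepl returns TRUE or FALSE for each element depending on whether the pattern matches, so the result can be used directly as a row filter. The titles below are made up, and the case-insensitive search is just one possible choice.

```r
# Made-up study titles for illustration
titles <- c(
  "Aspirin for Heart Disease",
  "Diabetes registry",
  "Study of heart failure"
)

# TRUE where the title contains "heart", regardless of case
has_heart <- grepl("heart", titles, ignore.case = TRUE)
print(has_heart)

# Use the logical vector to keep only the matching titles
print(titles[has_heart])
```

The pattern is a regular expression by default; pass fixed = TRUE to match the text literally.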
The summarise function is helpful for creating aggregations.
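A minimal sketch of an aggregation is shown below. Pairing summarise with group_by is my assumption about a typical use, and the data and column names are invented for illustration.

```r
library(dplyr)

# Toy data (made-up values)
studies_sub <- data.frame(
  study_type = c("Interventional", "Observational", "Interventional"),
  enrollment = c(120, 45, 300),
  stringsAsFactors = FALSE
)

# One row per study type with a count and a total
by_type <- studies_sub %>%
  group_by(study_type) %>%
  summarise(
    n_studies = n(),
    total_enrollment = sum(enrollment, na.rm = TRUE)
  )

print(by_type)
```

Without group_by, summarise collapses the whole data frame into a single row.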
I hope you find this useful.