Aggregating Data With Uniform structures

Introduction

Multiple times I have found myself needing to append data from different source files with the same structure. For example, in a previous blog post I did some analysis on temperature probes we left out in Lordsburg New Mexico. These probes had the ability to generate and dump CSV files, and since they’re all the same model the structures are identical. Tools like octoparse make scaping and dumping data into csv or table files easy and intuitive, and if you’re obtaining data from the same website odds are your structure will be uniform.

I’ll guide you through the process of appending multiple data files with identical formats into one master data file in lieu of having to spend your time copying and pasting cell ranges in excel. With a little more work, you can even write an R script that automates this task for you! Let’s get going.

Probe Data Sources

In our use case, we currently have 17 separate xls files in a directory.

Their structures are identical, and are as follows:

With some rudimentary knowledge of programming, writing an R script to join these files is pretty trivial. However, we do have to be careful with a couple of aspects when we’re writing the script:

If you plan on moving the R script relative to the files you’ll be joining, it’s critical to include the absolute path.
If you plan on only moving the folder, but not displacing any of the files relative to each other you can include the relative path.
If we expect future data files to be of the same extension, or delimited in the exact same manner, we can source the data by pulling only file names in a directory that possess said extension (.csv, .xls, .xlsx). If the file extensions vary, but the structures remain the same you can still join the files with ease, but they might require different functions to read the separate extensions.

We can obtain the filenames in the target directory with base R, or use the stringr package:

### Ignore next line, I'm knitting this markdown in a separate directory
setwd("/Users/jessecaro/Desktop/AppendingData/")
##Base R: 
filevec <- list.files() %>% grepl(".xls", .) %>% list.files()[.]
print("---")

## [1] "---"

##Stringr:
library(stringr)
filevec <- list.files() %>% str_subset(".xls")

Before we begin joining, it’s worth taking a look at the structure of the data files. Let’s initialize the relative path to the folders where files reside as a string called abspath. Additionally, we must filter out the first 27 lines (recall the source file image above).

abspath = "/Users/jessecaro/Desktop/AppendingData/"
df <- read_xls(path = paste0(abspath,filevec[1]))[-1 * 1:27,]

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6

## Let's take a glance
head(df)

## # A tibble: 6 x 6
##   `Data Report` ...2             ...3            ...4           ...5  ...6 
##   <chr>         <chr>            <chr>           <chr>          <chr> <chr>
## 1 No.           Time             Temperature°C   Humidity%      <NA>  <NA> 
## 2 1             2019-03-19 12:0… 21.39999999999… 31.3999999999… <NA>  <NA> 
## 3 2             2019-03-19 12:1… 30.30000000000… 19.1999999999… <NA>  <NA> 
## 4 3             2019-03-19 12:3… 31.39999999999… 16.8999999999… <NA>  <NA> 
## 5 4             2019-03-19 12:4… 36              16.3000000000… <NA>  <NA> 
## 6 5             2019-03-19 13:0… 34.5            16             <NA>  <NA>

Next, we have to promote the first row to headers discard the row, and trim the latter two columns.

names(df) <- df[1,] %>% as.vector()
df <- df[-1,]
df <- df[,-c(5,6)]
head(df)

## # A tibble: 6 x 4
##   No.   Time                `Temperature°C`    `Humidity%`       
##   <chr> <chr>               <chr>              <chr>             
## 1 1     2019-03-19 12:00:29 21.399999999999999 31.399999999999999
## 2 2     2019-03-19 12:15:29 30.300000000000001 19.199999999999999
## 3 3     2019-03-19 12:30:29 31.399999999999999 16.899999999999999
## 4 4     2019-03-19 12:45:29 36                 16.300000000000001
## 5 5     2019-03-19 13:00:29 34.5               16                
## 6 6     2019-03-19 13:15:29 33                 15.699999999999999

Let’s put it all together! Using rbind you can repeatedly append each new file onto the end of the “master” data file. We’ll use a loop to do this, looping through the name of each data file in our directory with an .xls extension. First, we need to declare an empty dataframe that will take in data from each file.

masterdataframe <- data.frame()
for(filenm in filevec){
  abspath = "/Users/jessecaro/Desktop/AppendingData/"
  df <- read_xls(path = paste0(abspath,filenm))[-1 * 1:27,]
  names(df) <- df[1,] %>% as.vector()
  df <- df[-1,]
  df <- df[,-c(5,6)]
  masterdataframe <- masterdataframe %>% rbind(df)
}

That’s pretty much it! We can go ahead and add an additional column that states which source file each row came from. In this particular case, each data file comes from an individual probe that is numbered. We can also go a step further, and dump the file as a csv into a subdirectory.

masterdataframe <- data.frame()
for(filenm in filevec){
  abspath = "/Users/jessecaro/Desktop/AppendingData/"
  df <- read_xls(path = paste0(abspath,filenm))[-1 * 1:27,]
  names(df) <- df[1,] %>% as.vector()
  df <- df[-1,]
  df <- df[,-c(5,6)]
  ### Include filename of source
  df$sourcenm <- filenm
  masterdataframe <- masterdataframe %>% rbind(df)
}
### Write joined file as a CSV, without row names (row numbers).
write.csv(masterdataframe, file = "/Users/jessecaro/Desktop/AppendingData/Joined/Master.csv",row.names = FALSE)

J Caro's Blog

J Caro's Blog

Aggregating Data With Uniform structures

Introduction

Probe Data Sources