Using R to Download and Unzip Files for Visualization

Automation & R – Part 1

This is the first blog post in a series of posts where we will focus on dynamically downloading and unzipping files from a website. We will then explore the cleaning and visualization of the data, as well as model and interpret it.

Introduction

Manually downloading and unzipping files to update your data is tedious. Throughout the next few blogs, we will explore how to automatically download and unzip files to speed up the process of obtaining your data, so you can get to the fun part: creating visualizations! Data science follows a pipeline that we refer to using the acronym O.S.E.M.N., given to us by Hilary Mason and Chris Wiggins. The steps are as followed: 

  • O - Obtaining the data

  • S - Scrubbing/Cleaning the data

  • E - Exploring/Visualizing the data

  • M - Modeling the data

  • N - Interpreting the data

This post deals with the first step of this pipeline: obtaining the data.

Obtaining the Data

With most data sources, data is presented in some sort of compressed file format such as a .zip or .gzip. Inside of those compressed files are the datasets containing the data you want to work with, which have to be unzipped before using. This can become a very redundant and laborious task depending on how many files you have to repeat these steps for. A solution to help protect against this repetitive process is by introducing some automation into your life by writing a R script to perform these actions for you. Once the script is written, you can run it, sit back, and watch the magic happen.

Invoking this automation will then allow you to dynamically download, as well as update, your data by just a click of a button, and yes, you will feel like a wizard.

The Data

In this post, we will be exploring the financial statement datasets found on the US Securities and Exchange’s website. A detailed description of the data can be found there, but in short, it is data from filed corporate financial reports. The data’s presented in .zip formats organized by the year and quarter, starting with 2009 Q1, each containing four tables.

SECdatasets.jpg

These tables are named sub (Submissions), tag (Tags), num (Numbers), and pre (Presentation of Statements). Since we aren’t going to concern ourselves with scrubbing or exploration of the data yet, there is no need to go into detail with them just quite. (For you go-getters out there, though, here is the link to detailed information about them.)

To download the zip files, you would just click on which one you want, but since we are introducing a new method here, you don’t have to waste your time doing that. What you can do, though, is take note of the format of the links for the files.

The Code

The next steps are to finally write the code!

Setup

Now, before we get started, if you don’t have RStudio already downloaded onto your computer, you can do so by following this link. We will be using RStudio to write and run our R code, so if you want to follow along, I recommend downloading it. Once you open up the IDE, create a new R script and install the package lubridate. This package makes dealing with dates easier, and we will be using it to obtain the current year.

RStudio2-1.jpg

We will also need a parent folder you are going to save the folders to that we will be downloading. I named mine “Financial Statement Dataset,” but you can name yours whatever you would like. Do take note of the file path, though.

Writing the Code

Once you have completed the initial steps, we can start the fun part: coding! (For someone with a more computer science background like myself, this part is particularly enjoyable.) So, in a nutshell, what this code does is it will grab all of the zip files from the specified website, download them, and unzip them into their own individual folders. Cool, right?

The first piece of code we need is an import statement to import lubridate and the code to obtain the current date and the last two digits of the year:

install.packages("lubridate")
year <- lubridate::year(Sys.Date())
yearShort <- year %% 100

Note that we are explicitly forcing R to use the lubridate version of the year function by calling it as lubridate::year() in the second line. This is best practice if you have another package installed with a function of the same name.

We then need to write a nested for loop. This is needed to automate the process of obtaining these files. The reason why it has to be a nested one is because the zip files are named according to both year and quarter. We have to go year by year as well as quarter by quarter for downloading the files.

for(i in 8:(yearShort-1))
{
  i = i+1
  for(j in 0:3)
  {
    j = j+1
  }
}

i represents the last two digits of the year whereas j represents the quarter number. The way this for loop works is that i will start at 9, wait until j reaches 4, and then increase to the next number until the current year is reached and the last quarters are downloaded.

We now need to populate the inner for loop. The first thing we need to do is get the addresses of the zip files. Since each zip file has a different name, each address is different. No need to fear though, this is why we made those for loops. The block of code below is what we need to place within our inner loop:

   if (i < 10){
      url2 = paste(url1, "0", i, "q", sep="") 
      } else {
      url2 = paste(url1, i, "q", sep="") 
      }

fullUrl = paste(url2, j, ".zip", sep="")

url1 will act as our base to build the unique URLs off of. All of the file names start with 20 but vary once the next two digits come along. To reference the correct URLs, we will use url2, which will use i to count the years starting from 2009. 2009 is different from the other years, because we need to put a static 0 in front of the 9. This is because yearShort doesn’t return it as 09. Once we have that, we can then add the q for quarter, as specified in the file path format. Finally, we can use the variable fullUrl to add the quarter number, j, onto the rest of the URL, url2.

Next, we need to specify a destination for the zip files to download to so they can be unzipped. The code below goes right under the code we just wrote:

destination = "\\YOUR PATH\\Financial Statement Dataset"
destinationUrl = paste(destination, j, ".zip", sep="")
download.file(url = fullUrl, destfile = destinationUrl, method = "auto", quiet = FALSE)
 

destination is the first portion of the file path where we want to save the zip files to so we can extract the content inside of them. destinationUrl takes that first part of the path which is the same for all of the files and adds j, the numbers 1-4, and “.zip” to it. It doesn’t matter what we name these since they are only here to unzip the contents into other folders.

Lastly, we need to specify the folders we are going to unzip the file contents to. This code will go beneath the code we just wrote:

parentFolder = "\\YOUR PATH\\Financial Statement Dataset\\"

if (i < 10) 
unzipFolder = paste(parentFolder, "200", i, "q", j, sep="")
else
unzipFolder = paste(parentFolder, "20", i, "q", j, sep="")
unzipFolder = paste(parentFolder, "20", i, "q", j, sep="")
unzip(zipfile = destinationUrl, exdir = unzipFolder)


parentFolder is exactly what its name implies. It is the parent folder of the folders we will unzip the contents of the zip files to. This part of the file path for the folders stays the same, but then we need another variable, unzipFolder, to specify the different folders. unzipFolder uses parentFolder as a base, and then adds the beginning part of the year, the last digit(s) of the year, a “q” for quarter, and the quarter to it. This names the folders exactly what they are named on the website so there is no confusion of which is which. The last piece of code we need, the unzip() method, unzips the file in destinationUrl to unzipFolder, which is the variable we just specified.

finishedCode.png

This completes the code, and now we can move onto testing and running it!

Testing the Code

Technically, you should have been testing the code as you go, but I will share the way I tested my code with you now. I simply put a few print() statements in certain spots in my code I wanted to make sure were correct. The first one I put was outside of the inner for loop to make sure we are using the current year.

print(year)

Next, we will make sure our addresses to the files we want to download are correct. We can do this by putting a statement below the assigning of the variable fullUrl to output the URL to the console.

print(fullUrl)

Lastly, we will make sure the files are being unzipped to the correct folders. We can do this by including a print() statement right above the unzip() command.

print(unzipFolder)

Now when you run the script, the console will output the current year, the URL of the files we are downloading, and the folders we are unzipping the files to. Of course you can include more statements than this if you would like, but this is the bare minimum you should have.

Running the Code

Once the code is complete, meaning there are no errors and we are positive it is going to do what we want, we can finally run this bad boy and feel just like the wizard we are. Highlight all of your script, hit run, and you will see the files instantly start to download into your folder you specified. The output in the console will show you the files that are being downloaded as well as dialog boxes that will pop up informing you of the progress.

downloading.png

And there you have it! All of the data from the website is now downloaded and unzipped onto your computer.

It was easy, right?

DOWNLOAD THE R SCRIPT

What’s Next?

Since we have this data dynamically downloading and unzipping to our computer, we can move onto the next few steps of the data science pipeline. This includes scrubbing and exploring the data (the “S” and “E” of O.S.E.M.N.).