Athena goes R: How to run Athena queries from R
Perform machine learning or statistical analysis in R on any data set stored in AWS S3 Object Storage through the Amazon Athena SQL query service
If you work with big data, S3 object storage, and run your projects in R, chances are you'll encounter the need to run queries on your S3 repository to extract or transform data before calling it into your R environment. With gigabyte-size tables, it is often more practical to pull the data in chunks, running queries directly against the database instance. Additionally, large data sets are often stored in Parquet format rather than CSV, which SQL queries handle much better. In Amazon Web Services (AWS), this is done with Athena.
What is Athena
Amazon Athena is a service that enables a data engineer to run SQL queries against data stored in the AWS S3 Object Storage service. In plain English, this service lets you perform real-time SQL queries over data in your S3 repository. You only need the credentials for the storage, the address (including the bucket name), and a schema set up with tables pointing to the wanted data set.
After this is complete, you can send queries and receive results through a Java™ Database Connectivity (JDBC) driver (think of it as an API). Athena also needs access to the S3 bucket, which is granted through policies in AWS Identity and Access Management (IAM). For more information on Athena, see the documentation.
Setting up your Athena instance
Log in to your AWS account. I’ll assume that you already have S3 enabled with at least one bucket created and a data set in it. (Get more information on S3.) You might see a welcome screen like the one shown in the following image. Click Get Started.
Create a schema and a table as shown in the following image.
Enter the data set bucket or folder path. You can get that URL from your S3 bucket space.
Specify the data format and add the column names and data types for the table. These should ideally match the data set in S3.
Create the table. You can now run any queries from within the Athena screen.
Now, create a user within the IAM service and attach the policies required for Athena. Get more information at Attaching a Policy to an IAM Group. Be sure to include the AmazonS3FullAccess and AmazonAthenaFullAccess policies. After the user is created, you can create the AWS Access Key ID under the Security Credentials tab. Be sure to note the AWS Secret Access Key upon creation, because it is shown only once.
Preparing the R environment
You’re almost ready to dive into the coding. But before you can start, you must install Java 8. You might run into issues on your Mac when installing the R package rJava. If this happens, you can follow the steps at Installing rJava for R on the Mac. Follow the instructions, skipping the Java 9 steps (you already have Java 8 installed), and install rJava (you’ll be installing it directly from RStudio®).
Open the RStudio IDE and start a new R script. You start by setting up the environment variables.
Sys.setenv(ATHENA_USER = "<AWS_ACCESS_KEY_ID>",
           ATHENA_PASSWORD = "<AWS_SECRET_ACCESS_KEY>")
I know what you’re thinking: your credentials are exposed right there in the code. That is the quick, in-code way. If you’re concerned about your credentials being exposed in the script, you can instead save the environment variables in a file that RStudio reads at startup.
At the Linux® command-line prompt, open the file in the vim editor.
sudo vim /home/rstudio/.Renviron
Press i to enter INSERT mode and add the following lines.
ATHENA_USER=<AWS_ACCESS_KEY_ID>
ATHENA_PASSWORD=<AWS_SECRET_ACCESS_KEY>
Press Esc and type :wq! to save and exit the file editor.
Your credentials are now in the RStudio environment. If you have followed this method, you should start your code by reading the environment variables with Sys.getenv().
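As a quick sanity check, you can read the variables back into your session. This is a minimal sketch; the values returned are whatever you stored in the previous step.

```r
# Read the stored credentials back into R
access_key <- Sys.getenv("ATHENA_USER")
secret_key <- Sys.getenv("ATHENA_PASSWORD")

# Sys.getenv() returns an empty string if the variable is not set,
# so a zero-length result means the .Renviron file was not picked up
nchar(access_key) > 0
```

If the check returns FALSE, restart your R session so that RStudio re-reads the .Renviron file.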
Install the required packages.
install.packages("rJava")
install.packages("RJDBC")
install.packages("tidyverse")  # if not already installed
library(rJava)
library(RJDBC)
library(tidyverse)
Retrieve the Athena JDBC driver to set up the connection. You do this by setting the URL and downloading the driver file. Be sure to use the URL provided in the following code snippet. You can get updated drivers at Using Athena with the JDBC Driver.
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/JDBC/SimbaAthenaJDBC_2.0.7/AthenaJDBC42_2.0.7.jar'
fil <- basename(URL)
Download the driver and verify that you have it in your R environment.
if (!file.exists(fil)) download.file(URL, fil)
fil
list.files()
You’re almost there. With the file you just downloaded, you’ll now set up the JDBC connection driver. The dbConnect function uses this driver to open the JDBC connection.

drv <- JDBC(driverClass = "com.simba.athena.jdbc.Driver", fil, identifier.quote = "'")
Now that you have the driver set up, you’ll create the connection object. The parameters are:
- The JDBC driver (which you have)
- The JDBC URL associated with your region (check this link for further insight into this)
- S3 staging directory: The folder where the query results are going to be sent
- User: Which you have as an environment variable
- Password: Which you have as an environment variable
con <- dbConnect(drv,
                 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                 s3_staging_dir = "s3://s3bucket_sample/query_results/",
                 user = Sys.getenv("ATHENA_USER"),
                 password = Sys.getenv("ATHENA_PASSWORD"))
Now that you have the connection set up, you can begin sending and getting queries through Athena.
You can get a list of all of the tables that you have by running the following code.
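Assuming the connection object con created in the previous step, the standard DBI call for this is dbListTables. A minimal sketch:

```r
# List the tables visible through the Athena connection
dbListTables(con)
```

The result includes every table defined in your Athena schemas, such as the sample table created earlier.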
Now try a SELECT query to get the first 10 rows of your .csv data set. The following code runs the query, converts the result to a tibble, and displays it.
dfelb <- dbGetQuery(con, "SELECT * FROM athenasampledb.sample_table LIMIT 10") %>%
  as_tibble()
head(dfelb, 10)
The query result, shown in the following image, is also sent as a .csv file to the staging directory you set in the connection.
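For the gigabyte-size tables mentioned at the start, you don't have to fetch everything at once. A minimal sketch of chunked retrieval using the standard DBI/RJDBC pattern, reusing the sample table name from above:

```r
# Send the query without materializing the full result
res <- dbSendQuery(con, "SELECT * FROM athenasampledb.sample_table")

# Fetch the first 1,000 rows; call fetch() again for the next chunk
chunk <- fetch(res, n = 1000)

# Release the result set and close the connection when you're done
dbClearResult(res)
dbDisconnect(con)
```

Fetching in chunks keeps memory use bounded, and closing the connection frees the Athena session cleanly.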
As you can see, you can send SQL queries from R against any data set accessible through Athena on S3. This can feed any machine learning or statistical forecasting model that you develop in R with exactly the data it needs.