Connecting to MAMP install of MySQL in R

Wednesday, March 11, 2020 · 6 minute read

In the course of doing a web dev tutorial (namely Tania Rascia’s great post on building a simple CRUD app) I had to install MAMP. This gives you a local server environment consisting of Apache, MySQL, and PHP.

Since I work with databases so often, I was curious about the simplest way to connect to my newly installed instance of MySQL in R.

Which R package?

This tutorial assumes you have already installed MAMP and have it currently running on your local machine in a macOS environment.

There are lots of different ways to connect to a db in R. RStudio has a great website that walks through their enterprise packages for working with databases, but I’ve often found this site confusing. Setting up drivers, ODBC connections, and DSN files is difficult enough in Windows, let alone on a Mac.

So instead of attempting to set up an ODBC connection to MySQL, I am simply going to use the RMariaDB package coupled with DBI (I’m skipping the odbc package altogether). I tend to always trust the folks at RStudio for great package development, and this is a part of the DBI interface that in my experience gives the best performance for database work in R.

So if you haven’t done so already, install the following packages (along with nycflights13 which we’ll be using for a demo):

install.packages("tidyverse")
install.packages("DBI")
install.packages("RMariaDB")
install.packages("nycflights13")

Creating a connection to the database

Okay, let’s first see if we can connect to the database. Since I don’t want to embed my username and password, I am using environmental variables. You can learn more on how to do that on RStudio’s Databases using R page.

library(tidyverse)
library(odbc)
library(DBI)
library(RMariaDB)
library(nycflights13)
con <- DBI::dbConnect(RMariaDB::MariaDB(), user = Sys.getenv("mysql_user"), password = Sys.getenv("mysql_pass"), host = "localhost")

Error: Failed to connect: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)

Great! Immediately we have an error. According to MySQL’s documentation, Unix systems use this socket file for connecting to MySQL databases. Mine isn’t there because I used MAMP to install MySQL; MAMP places the socket file here: /Applications/MAMP/tmp/mysql/mysql.sock.

So what we can do is write a bash command that creates a hard link from the MAMP location to where the RMariaDB package expects it:

#either open terminal and run what's within quotes, or run it write from r with the system() function.
system("ln -s /Applications/MAMP/tmp/mysql/mysql.sock /tmp/mysql.sock")

Okay, let’s try to connect again.

con <- DBI::dbConnect(RMariaDB::MariaDB(), user = Sys.getenv("mysql_user"), password = Sys.getenv("mysql_pass"), host = "localhost")

dbGetQuery(conn = con, statement = "show databases;")
##             Database
## 1 information_schema
## 2               crud
## 3              mysql
## 4        nyc_flights
## 5 performance_schema
## 6                sys

Awesome! We are connected to our MySQL instance that we installed with MAMP. Note: If you restart your machine, you will have to re-run the bash command above…Not quite sure why that is, but you can probably figure out a more permanent solution.

Reading and writing from the database

The dbGetQuery function is the easiest way to read/write to the database. The first argument is just the name of your connection (mine is con), and the second is your sql statement.

Let’s create a database of the nycflights13 data, and query it!

dbGetQuery(con, "CREATE DATABASE nyc_flights;") #create the database

Great, okay my plan is to insert each of the five dataframes of nycflights13 into the db as tables. So below, I first need to modify my connection to connect to the database we just created. I’m then going to create a simple function that takes each dataframe in the package and copies it to a table in the database.

#new connection
con_flights <- DBI::dbConnect(RMariaDB::MariaDB(), user = Sys.getenv("mysql_user"), password = Sys.getenv("mysql_pass"), db = "nyc_flights", host = "localhost") #added db argument.

#function
insert_db_table <- function(x) {    #creating function insert_db_table
  copy_to(con_flights, x,           #x is the dataframe
          deparse(substitute(x)),   #here I'm using the substitute() and deparse() function just to get the dataframe name as a string
          temporary = FALSE,        #we'll make them permanent tables
          overwrite = TRUE          #and overwrite them if they already exist
      )
}

#run the function for each dataframe in nycflights13
insert_db_table(airlines)
insert_db_table(airports)
insert_db_table(flights)
insert_db_table(planes)
insert_db_table(weather)

Let’s see if the data’s there.

dbGetQuery(con_flights, statement = "select * from flights") %>% 
  as_tibble()
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Awesome!

Querying with dbplyr

So far we have been using dbGetQuery, which queries the data and brings it back in memory in R. But since the data is in a SQL database, we can actually write queries remotely and not pull the data into R until we need it; dbplyr executes dplyr code on the MySQL server and lazily evaluates it’s results. You can then use collect() to actually capture the total results in memory for plotting.

tbl_flights <- tbl(con_flights, "flights")   #dbplyr's tbl function
tbl_airports <- tbl(con_flights, "airports")

tbl_flights %>% 
  inner_join(tbl_airports, by = c("dest" = "faa")) %>% 
  count(name)
## # Source:   lazy query [?? x 2]
## # Database: mysql [root@localhost:NA/nyc_flights]
##    name                              n      
##    <chr>                             <int64>
##  1 Akron Canton Regional Airport      864   
##  2 Albany Intl                        439   
##  3 Albuquerque International Sunport  254   
##  4 Asheville Regional Airport         275   
##  5 Austin Bergstrom Intl             2439   
##  6 Baltimore Washington Intl         1781   
##  7 Bangor Intl                        375   
##  8 Birmingham Intl                    297   
##  9 Blue Grass                           1   
## 10 Bob Hope                           371   
## # … with more rows

Although in my early experience of using dbplyr commands I have found it much slower than just bringing all the data into R, I think it’s still a useful method for working with databases. You can read more about it on RStudio’s website. Cheers!