Often with education data, I need to pivot data to go from long format to wide. For instance, I often have test scores for different test subjects of a single assessment. For one student, I will have to rows, one for one subject and one for another. This also happens for strands and domains of a test.
library(tidyverse)
library(DBI)
library(odbc)
library(knitr)
con <- dbConnect(odbc::odbc(), "nicodemus")
query <- dbSendQuery(con, "select * from Demo.dbo.tbl_testScore")
testScore <- dbFetch(query)
testScore <- as_tibble(testScore)
testScore
## # A tibble: 16 x 4
## personID testName testSubject scaleScore
## <dbl> <chr> <chr> <dbl>
## 1 1000 MCA-III Math 344
## 2 1000 MCA-III Reading 350
## 3 1001 MCA-III Math 323
## 4 1001 MCA-III Reading 312
## 5 1002 MCA-III Math 375
## 6 1002 MCA-III Reading 383
## 7 1003 MCA-III Math 311
## 8 1003 MCA-III Reading 302
## 9 1004 MCA-III Math 355
## 10 1004 MCA-III Reading 352
## 11 1005 MCA-III Math 323
## 12 1005 MCA-III Reading 320
## 13 1006 MCA-III Math 312
## 14 1006 MCA-III Reading 309
## 15 1007 MCA-III Math 356
## 16 1007 MCA-III Reading 365
Above is a very simple table of test scores where personID
is the unique identifier, and records duplicate by testSubject
for each student. How would I break out testSubject
as columns and put the scaleScore
in the values of those columns? Like say then to plot on a scatterplot?
Pivoting in SQL
In SQL, the syntax is the following. It’s important to note (as opposed to the dplyr
option later) that this is an aggregate function. So the pivot
function requires some aggregate like MAX()
, MIN()
, SUM()
, or AVG()
. You choose the column to pivot out and then the values you want to place in those new columns given some aggregation. Since I know I only have one record for each student + test subject combination, I chose MAX()
to just give me that one score.
select *
from Demo.dbo.tbl_testScore
pivot(
MAX(scaleScore) for testSubject in ([Math],[Reading]) --aggregate function of one column for another column, listing the unique possibilities in brackets.
) piv
## # A tibble: 8 x 4
## personID testName Math Reading
## <dbl> <chr> <dbl> <dbl>
## 1 1000 MCA-III 344 350
## 2 1001 MCA-III 323 312
## 3 1002 MCA-III 375 383
## 4 1003 MCA-III 311 302
## 5 1004 MCA-III 355 352
## 6 1005 MCA-III 323 320
## 7 1006 MCA-III 312 309
## 8 1007 MCA-III 356 365
There you have it! Pretty straight-forward. One major downside to pivoting in SQL is that you need to know exactly the unique values of the pivoting column. In our example, I had to hardcode Math
and Reading
in the brackets, and I knew that this test only has these two subjects. But what if the pivoting column has an unknown number of values, or the data changes?
It makes sense that a database programming language wouldn’t have the function to automatically pivot out all the values, as RDBMS’s are highly structed.
Pivoting in R with dplyr
In the dplyr
package you can use the spread()
and gather()
function to pivot and unpivot columns respectively. spread()
requires that you put the pivoted column in key
and the values you want to place in the pivoted columns in value
.
testScore %>%
spread(key = testSubject, value = scaleScore)
## # A tibble: 8 x 4
## personID testName Math Reading
## <dbl> <chr> <dbl> <dbl>
## 1 1000 MCA-III 344 350
## 2 1001 MCA-III 323 312
## 3 1002 MCA-III 375 383
## 4 1003 MCA-III 311 302
## 5 1004 MCA-III 355 352
## 6 1005 MCA-III 323 320
## 7 1006 MCA-III 312 309
## 8 1007 MCA-III 356 365
Same result as above with SQL. What I struggled with the most initially with this function (coming from the SQL world) was that there must be only one value to put in the pivoted columns; otherwise, it erors. The SQL solution, on the other hand, has you put an aggregate function, so that if it meets two values, it has an aggregation to get to one.
What if you don’t have unique values to pivot in R?
Let’s say in our example that a student had two Math and Reading scores and you needed to take the maximum for each subject. See the table below where the last student has two more records.
testScoreMore <- testScore %>%
add_row(personID = 1007, testName = "MCA-III", testSubject = "Math", scaleScore = 363) %>%
add_row(personID = 1007, testName = "MCA-III", testSubject = "Reading", scaleScore = 395)
## # A tibble: 18 x 4
## personID testName testSubject scaleScore
## <dbl> <chr> <chr> <dbl>
## 1 1000 MCA-III Math 344
## 2 1000 MCA-III Reading 350
## 3 1001 MCA-III Math 323
## 4 1001 MCA-III Reading 312
## 5 1002 MCA-III Math 375
## 6 1002 MCA-III Reading 383
## 7 1003 MCA-III Math 311
## 8 1003 MCA-III Reading 302
## 9 1004 MCA-III Math 355
## 10 1004 MCA-III Reading 352
## 11 1005 MCA-III Math 323
## 12 1005 MCA-III Reading 320
## 13 1006 MCA-III Math 312
## 14 1006 MCA-III Reading 309
## 15 1007 MCA-III Math 356
## 16 1007 MCA-III Reading 365
## 17 1007 MCA-III Math 363
## 18 1007 MCA-III Reading 395
If I try to run the same code as above it gives an error:
testScoreMore %>%
spread(key = testSubject, value = scaleScore)
## Error: Duplicate identifiers for rows (15, 17), (16, 18)
I think the easiest solution is to just do your aggregation with group_by()
and summarize()
earlier in the pipeline, and then do your spread()
after.
testScoreMore %>%
group_by(personID, testName, testSubject) %>% #get to unique identifiers with these two lines.
summarize(scaleScore = max(scaleScore)) %>% #using max here.
spread(key = testSubject, value = scaleScore)
## # A tibble: 8 x 4
## # Groups: personID, testName [8]
## personID testName Math Reading
## <dbl> <chr> <dbl> <dbl>
## 1 1000 MCA-III 344 350
## 2 1001 MCA-III 323 312
## 3 1002 MCA-III 375 383
## 4 1003 MCA-III 311 302
## 5 1004 MCA-III 355 352
## 6 1005 MCA-III 323 320
## 7 1006 MCA-III 312 309
## 8 1007 MCA-III 363 395
Cool! Hey, now that we have columns for each subject, is there a relationship between them?
testScoreMore %>%
group_by(personID, testName, testSubject) %>%
summarize(scaleScore = max(scaleScore)) %>%
spread(key = testSubject, value = scaleScore) %>%
ggplot(aes(x = Math, y = Reading)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)