Hypothesis is that the further into a term, the lower grades become. Could provide evidence and estimate into how score trends might change based on term length.
The dataset includes all of our warehoused daily grades.
SELECT DISTINCT
gh.date_stored
, gh.state_id
, gh.school_number
, gh.course
, gh."section"
, gh.grade
, gh.time_stored
, gh.term
, t.end_year
, t.calendar_name
, t.name AS term
, t.start_date
, t.end_date
, gh.date_stored - t.start_date AS daysIn
FROM gradesync.grade_history gh
INNER JOIN gradesync.term t
ON gh.school_number = t.school_number
AND gh.term = t.name
AND gh.date_stored BETWEEN t.start_date AND t.end_date
WHERE gh.school_number = 1202056
ORDER BY time_stored DESC;
Load the data
gradeHistory <- read.csv("~/Downloads/gradeHistory.csv")
Summarize the data. Check to make sure we don’t need to clean anything up.
summary(gradeHistory)
## date_stored state_id school_number course
## Length:1281910 Min. :10010165 Min. :1202056 Length:1281910
## Class :character 1st Qu.:30718317 1st Qu.:1202056 Class :character
## Mode :character Median :53867416 Median :1202056 Mode :character
## Mean :54716676 Mean :1202056
## 3rd Qu.:78084032 3rd Qu.:1202056
## Max. :99726475 Max. :1202056
## section grade time_stored term
## Min. : 1.000 Min. : 0.00 Length:1281910 Length:1281910
## 1st Qu.: 1.000 1st Qu.: 79.73 Class :character Class :character
## Median : 2.000 Median : 90.85 Mode :character Mode :character
## Mean : 7.014 Mean : 85.41
## 3rd Qu.: 4.000 3rd Qu.: 96.96
## Max. :500.000 Max. :460.00
## end_year calendar_name term.1 start_date
## Min. :2020 Length:1281910 Length:1281910 Length:1281910
## 1st Qu.:2020 Class :character Class :character Class :character
## Median :2020 Mode :character Mode :character Mode :character
## Mean :2020
## 3rd Qu.:2021
## Max. :2021
## end_date daysin
## Length:1281910 Min. : 4.00
## Class :character 1st Qu.: 43.00
## Mode :character Median : 77.00
## Mean : 77.11
## 3rd Qu.:109.00
## Max. :158.00
Clean up records that have irregular scores (scores > 100). We will set those to be 100 to be uniform. We don’t want those bad grades to become outliers, but we can assumed that the grade is a 100%.
gradeHistory <- within(gradeHistory, grade[grade > 100] <- 100)
summary(gradeHistory$grade)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 79.73 90.85 85.36 96.96 100.00
Remove records that have more than the typical 120 daysIn.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gradeHistory <- filter(gradeHistory, daysin <= 120)
summary(gradeHistory$daysin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 38.00 67.00 66.48 95.00 120.00
Aggregate our data by grouping by the days into a term.
library(dplyr)
mean_data <- group_by(gradeHistory, daysin)
Calculate the mean and standard deviation of grade by the days into a term.
library(ggplot2)
summarizeddf <- summarize(mean_data
, mean = mean(grade,
na.rm=TRUE),
sd = sd(grade,
na.rm=TRUE)
, n=120)
print(summarize(mean_data, mean = mean(grade, na.rm=TRUE), sd = sd(grade, na.rm=TRUE)), n=120)
## # A tibble: 117 x 3
## daysin mean sd
## * <int> <dbl> <dbl>
## 1 4 92.7 20.3
## 2 5 92.7 20.3
## 3 6 92.3 20.9
## 4 7 89.6 24.2
## 5 8 89.4 23.3
## 6 9 88.7 23.2
## 7 10 89.3 21.7
## 8 11 90.7 18.2
## 9 12 90.6 18.2
## 10 13 90.7 18.1
## 11 14 90.2 18.6
## 12 15 87.1 22.7
## 13 16 87.8 20.7
## 14 17 88.2 20.0
## 15 18 87.7 20.0
## 16 19 87.7 19.9
## 17 20 87.8 19.7
## 18 21 87.0 20.7
## 19 22 87.0 21.3
## 20 23 87.0 20.6
## 21 24 87.3 20.2
## 22 25 87.2 20.0
## 23 26 86.2 21.8
## 24 27 87.2 20.0
## 25 28 87.1 19.9
## 26 29 87.1 18.7
## 27 30 87.2 18.4
## 28 31 86.6 19.3
## 29 32 86.6 18.7
## 30 33 86.6 18.6
## 31 34 86.6 18.7
## 32 35 86.5 18.6
## 33 36 86.1 18.6
## 34 37 86.1 18.6
## 35 38 85.8 18.5
## 36 39 86.1 18.2
## 37 40 86.1 18.2
## 38 41 86.1 18.2
## 39 42 85.7 18.5
## 40 43 85.6 18.4
## 41 44 85.5 18.2
## 42 45 85.6 18.0
## 43 46 85.5 18.0
## 44 47 85.4 17.9
## 45 48 85.4 17.9
## 46 49 85.5 17.8
## 47 50 85.2 17.5
## 48 51 85.3 17.5
## 49 52 85.4 17.3
## 50 53 85.4 17.2
## 51 54 85.3 17.2
## 52 55 85.3 17.2
## 53 56 85.2 17.2
## 54 57 85.2 17.2
## 55 58 84.9 17.3
## 56 59 84.8 17.2
## 57 60 84.8 17.1
## 58 61 84.8 17.1
## 59 62 84.8 17.1
## 60 63 84.3 17.3
## 61 64 86.2 14.4
## 62 65 84.3 17.1
## 63 66 84.4 17.0
## 64 67 84.5 16.8
## 65 68 84.5 16.8
## 66 69 84.5 16.9
## 67 70 84.2 17.1
## 68 71 84.1 17.0
## 69 72 84.0 17.0
## 70 73 84.0 17.0
## 71 74 84.0 17.0
## 72 75 84.0 16.9
## 73 76 84.0 16.9
## 74 77 84.2 17.0
## 75 78 84.2 16.9
## 76 79 84.0 17.0
## 77 80 84.1 16.9
## 78 81 84.0 16.9
## 79 82 84.1 16.8
## 80 83 84.1 16.9
## 81 84 84.2 16.8
## 82 85 85.4 14.3
## 83 86 84.3 16.8
## 84 87 84.3 16.7
## 85 88 84.4 16.5
## 86 89 84.4 16.6
## 87 90 84.4 16.5
## 88 91 84.2 16.6
## 89 92 84.2 16.5
## 90 93 84.2 16.5
## 91 94 84.2 16.5
## 92 95 84.2 16.5
## 93 96 84.2 16.5
## 94 97 84.1 16.6
## 95 98 84.0 16.7
## 96 99 84.0 16.6
## 97 100 84.1 16.6
## 98 101 84.0 16.6
## 99 102 84.1 16.6
## 100 103 84.1 16.6
## 101 104 84.1 16.6
## 102 105 84.1 16.6
## 103 106 84.1 16.6
## 104 107 84.1 16.5
## 105 108 84.0 16.6
## 106 109 84.0 16.5
## 107 110 84.0 16.6
## 108 111 83.9 16.6
## 109 112 83.8 16.6
## 110 113 83.0 17.9
## 111 114 83.9 16.5
## 112 115 83.0 17.8
## 113 116 83.0 17.8
## 114 117 83.8 16.5
## 115 118 83.8 16.5
## 116 119 83.7 16.5
## 117 120 82.7 17.5
Plot the data.
plot(summarizeddf$daysin
, summarizeddf$mean
, type="o"
, xlab="Days from Term Start"
, ylab="Average Course Grade")
Summary |
At this point, there would be enough evidence to move forward with a prediction model based on possible term lengths being proposed. We could make the assumption that the shorter the term, the better grades will be at the end of the term.