Hypothesis is that the further into a term, the lower grades become. Could provide evidence and estimate into how score trends might change based on term length.

The dataset includes all of our warehoused daily grades.

SELECT DISTINCT 
  gh.date_stored 
, gh.state_id 
, gh.school_number 
, gh.course 
, gh."section" 
, gh.grade 
, gh.time_stored 
, gh.term 
, t.end_year
, t.calendar_name
, t.name AS term
, t.start_date
, t.end_date
, gh.date_stored - t.start_date AS daysIn
FROM gradesync.grade_history gh
INNER JOIN gradesync.term  t
ON gh.school_number = t.school_number
AND gh.term = t.name
AND gh.date_stored BETWEEN t.start_date AND t.end_date
WHERE gh.school_number = 1202056
ORDER BY time_stored DESC;

Load the data

gradeHistory <- read.csv("~/Downloads/gradeHistory.csv")

Summarize the data. Check to make sure we don’t need to clean anything up.

summary(gradeHistory)
##  date_stored           state_id        school_number        course         
##  Length:1281910     Min.   :10010165   Min.   :1202056   Length:1281910    
##  Class :character   1st Qu.:30718317   1st Qu.:1202056   Class :character  
##  Mode  :character   Median :53867416   Median :1202056   Mode  :character  
##                     Mean   :54716676   Mean   :1202056                     
##                     3rd Qu.:78084032   3rd Qu.:1202056                     
##                     Max.   :99726475   Max.   :1202056                     
##     section            grade        time_stored            term          
##  Min.   :  1.000   Min.   :  0.00   Length:1281910     Length:1281910    
##  1st Qu.:  1.000   1st Qu.: 79.73   Class :character   Class :character  
##  Median :  2.000   Median : 90.85   Mode  :character   Mode  :character  
##  Mean   :  7.014   Mean   : 85.41                                        
##  3rd Qu.:  4.000   3rd Qu.: 96.96                                        
##  Max.   :500.000   Max.   :460.00                                        
##     end_year    calendar_name         term.1           start_date       
##  Min.   :2020   Length:1281910     Length:1281910     Length:1281910    
##  1st Qu.:2020   Class :character   Class :character   Class :character  
##  Median :2020   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2020                                                           
##  3rd Qu.:2021                                                           
##  Max.   :2021                                                           
##    end_date             daysin      
##  Length:1281910     Min.   :  4.00  
##  Class :character   1st Qu.: 43.00  
##  Mode  :character   Median : 77.00  
##                     Mean   : 77.11  
##                     3rd Qu.:109.00  
##                     Max.   :158.00

Clean up records that have irregular scores (scores > 100). We will set those to be 100 to be uniform. We don’t want those bad grades to become outliers, but we can assumed that the grade is a 100%.

gradeHistory <- within(gradeHistory, grade[grade > 100] <- 100)
summary(gradeHistory$grade)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   79.73   90.85   85.36   96.96  100.00

Remove records that have more than the typical 120 daysIn.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gradeHistory <- filter(gradeHistory, daysin <= 120)
summary(gradeHistory$daysin)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   38.00   67.00   66.48   95.00  120.00

Aggregate our data by grouping by the days into a term.

library(dplyr)
mean_data <- group_by(gradeHistory, daysin)

Calculate the mean and standard deviation of grade by the days into a term.

library(ggplot2)
summarizeddf <- summarize(mean_data
                          , mean = mean(grade, 
                                  na.rm=TRUE), 
                                  sd = sd(grade, 
                                  na.rm=TRUE)
                          , n=120)
print(summarize(mean_data, mean = mean(grade, na.rm=TRUE), sd = sd(grade, na.rm=TRUE)), n=120)
## # A tibble: 117 x 3
##     daysin  mean    sd
##   *  <int> <dbl> <dbl>
##   1      4  92.7  20.3
##   2      5  92.7  20.3
##   3      6  92.3  20.9
##   4      7  89.6  24.2
##   5      8  89.4  23.3
##   6      9  88.7  23.2
##   7     10  89.3  21.7
##   8     11  90.7  18.2
##   9     12  90.6  18.2
##  10     13  90.7  18.1
##  11     14  90.2  18.6
##  12     15  87.1  22.7
##  13     16  87.8  20.7
##  14     17  88.2  20.0
##  15     18  87.7  20.0
##  16     19  87.7  19.9
##  17     20  87.8  19.7
##  18     21  87.0  20.7
##  19     22  87.0  21.3
##  20     23  87.0  20.6
##  21     24  87.3  20.2
##  22     25  87.2  20.0
##  23     26  86.2  21.8
##  24     27  87.2  20.0
##  25     28  87.1  19.9
##  26     29  87.1  18.7
##  27     30  87.2  18.4
##  28     31  86.6  19.3
##  29     32  86.6  18.7
##  30     33  86.6  18.6
##  31     34  86.6  18.7
##  32     35  86.5  18.6
##  33     36  86.1  18.6
##  34     37  86.1  18.6
##  35     38  85.8  18.5
##  36     39  86.1  18.2
##  37     40  86.1  18.2
##  38     41  86.1  18.2
##  39     42  85.7  18.5
##  40     43  85.6  18.4
##  41     44  85.5  18.2
##  42     45  85.6  18.0
##  43     46  85.5  18.0
##  44     47  85.4  17.9
##  45     48  85.4  17.9
##  46     49  85.5  17.8
##  47     50  85.2  17.5
##  48     51  85.3  17.5
##  49     52  85.4  17.3
##  50     53  85.4  17.2
##  51     54  85.3  17.2
##  52     55  85.3  17.2
##  53     56  85.2  17.2
##  54     57  85.2  17.2
##  55     58  84.9  17.3
##  56     59  84.8  17.2
##  57     60  84.8  17.1
##  58     61  84.8  17.1
##  59     62  84.8  17.1
##  60     63  84.3  17.3
##  61     64  86.2  14.4
##  62     65  84.3  17.1
##  63     66  84.4  17.0
##  64     67  84.5  16.8
##  65     68  84.5  16.8
##  66     69  84.5  16.9
##  67     70  84.2  17.1
##  68     71  84.1  17.0
##  69     72  84.0  17.0
##  70     73  84.0  17.0
##  71     74  84.0  17.0
##  72     75  84.0  16.9
##  73     76  84.0  16.9
##  74     77  84.2  17.0
##  75     78  84.2  16.9
##  76     79  84.0  17.0
##  77     80  84.1  16.9
##  78     81  84.0  16.9
##  79     82  84.1  16.8
##  80     83  84.1  16.9
##  81     84  84.2  16.8
##  82     85  85.4  14.3
##  83     86  84.3  16.8
##  84     87  84.3  16.7
##  85     88  84.4  16.5
##  86     89  84.4  16.6
##  87     90  84.4  16.5
##  88     91  84.2  16.6
##  89     92  84.2  16.5
##  90     93  84.2  16.5
##  91     94  84.2  16.5
##  92     95  84.2  16.5
##  93     96  84.2  16.5
##  94     97  84.1  16.6
##  95     98  84.0  16.7
##  96     99  84.0  16.6
##  97    100  84.1  16.6
##  98    101  84.0  16.6
##  99    102  84.1  16.6
## 100    103  84.1  16.6
## 101    104  84.1  16.6
## 102    105  84.1  16.6
## 103    106  84.1  16.6
## 104    107  84.1  16.5
## 105    108  84.0  16.6
## 106    109  84.0  16.5
## 107    110  84.0  16.6
## 108    111  83.9  16.6
## 109    112  83.8  16.6
## 110    113  83.0  17.9
## 111    114  83.9  16.5
## 112    115  83.0  17.8
## 113    116  83.0  17.8
## 114    117  83.8  16.5
## 115    118  83.8  16.5
## 116    119  83.7  16.5
## 117    120  82.7  17.5

Plot the data.

plot(summarizeddf$daysin
     , summarizeddf$mean
     , type="o"
     , xlab="Days from Term Start"
     , ylab="Average Course Grade")

Summary

At this point, there would be enough evidence to move forward with a prediction model based on possible term lengths being proposed. We could make the assumption that the shorter the term, the better grades will be at the end of the term.