I’ve copied the following data dictionary from the TiduTuesday GitHub to reference more easily.
Data Dictionary
egg-production.csv
variable
class
description
observed_month
double
Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
prod_type
character
type of egg product: hatching, table eggs
prod_process
character
type of production process and housing: cage-free (organic), cage-free (non-organic), all. The value ‘all’ includes cage-free and conventional housing.
n_hens
double
number of eggs produced by hens for a given month-type-process combo
n_eggs
double
number of hens producing eggs for a given month-type-process combo
source
character
Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.
cage-free-percentages.csv
variable
class
description
observed_month
double
Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
percent_hens
double
observed or computed percentage of cage-free hens relative to all table-egg-laying hens
percent_eggs
double
computed percentage of cage-free eggs relative to all table eggs,This variable is not available for data sourced from the Egg Markets Overview report
source
character
Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.
# A tibble: 2 × 4
observed_month percent_hens percent_eggs source
<date> <dbl> <dbl> <chr>
1 2007-12-31 3.2 NA Egg-Markets-Overview-2019-10-19.pdf
2 2021-02-28 29.2 NA Egg-Markets-Overview-2021-03-05.pdf
The cage-free percentages data goes back to 2007, while the egg production data only goes back to 2016. The most recent observations in both datasets is Feb. 28, 2021.
My data with the hypothesis that the number of eggs produced per hen differs based on the type of egg product (hatchling or table eggs).
H0: There is no difference between the number of eggs produced per hen. HA: There is a difference between the number of eggs produced per hen, and table eggs are produced at higher rates compared to hatchling eggs.
The outcome interest is the rate of egg production, which I will name rate_prod for rate of production.
Let’s prepare the datasets for a merge, which will hopefully this will make things easy to work with. I’ll start by making a variable that pulls just the month and year from ‘observed_month’ to make merge the datasets more seamless.
## setting the seed set.seed(123)## Put 3/4 of the data into the training set data_split <-initial_split(eggs_df, prop =3/4)# Create data frames for the two sets:train_data <-training(data_split)test_data <-testing(data_split)egg_metrics <-metric_set(accuracy, roc_auc, mn_log_loss)#Cross validationset.seed(123)five_fold <-vfold_cv(train_data, v =5, strata = prod_type)
All the models appears to perform similarly… the logistic regression model is the might be the simplest model, so let’s use that to to test model performance against the test_data
# Make predictions for the test seteggs_results <- test_data %>%select(prod_type) %>%bind_cols(logreg_fit %>%predict(new_data = test_data)) %>%bind_cols(logreg_fit %>%predict(new_data = test_data, type ="prob"))# Print predictionseggs_results %>%slice_head(n =54)
##Is 100% accurate## trying to pull ROC-AUC to see if performance of predictive model is goodroc_auc(eggs_results,truth = prod_type,`.pred_hatching eggs`)
Warning: There were 2 warnings in `dplyr::summarise()`.
The first warning was:
ℹ In argument: `.estimate = metric_fn(...)`.
ℹ In group 1: `prod_type = table eggs`.
Caused by warning:
! No control observations were detected in `truth` with control level 'hatching eggs'.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
# A tibble: 2 × 4
prod_type .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 table eggs roc_auc binary NA
2 hatching eggs roc_auc binary NA
##estimate is NA?
As far as I can tell, it seemed like the logistic regression model was able to predict prod_type perfectly. Intuitively, it seems like these models should not performed perfectly, but in looking at the data it is very clear that hens producing table eggs lay significantly more eggs than those producing hatching eggs. Unfortunately, my knowledge of tidymodels has limited my ability to troubleshoot potential issues in these classification models…