Data

Load Libraries into R

# Load all necessary packages here:
library(tidyverse)
library(mosaic)
library(moderndive)
library(palmerpenguins)

Load Data into R

penguins_data <- palmerpenguins::penguins
# attach() is not needed: every call below passes data = explicitly

Pare down variables

penguins_data = penguins_data %>%
  select(bill_depth_mm, flipper_length_mm, species)

Data wrangling

# No filtering was required

Data Cleaning

penguins_data = penguins_data %>%
  na.omit()

Data Subsetting

# Create subsets of your data by your categories:
Adelie_penguin = filter(penguins_data, species == "Adelie")
Chinstrap_penguin = filter(penguins_data, species == "Chinstrap")
Gentoo_penguin = filter(penguins_data, species == "Gentoo")

1. Introduction

1.1 Overview of Research Question

Penguins are a well-known type of seabird that lives mostly below the equator and lacks the ability to fly (Penguin | Features, Habitat, & Facts, 2023). Though penguins may be best known for their “tuxedos,” their flippers and beaks matter far more for their survival. Flipper length varies by species, as does beak depth. Understanding the relationship between a penguin’s flipper length and beak depth may provide insights into evolutionary differences between species of penguins.

Therefore, this study is designed to examine the relationship between flipper length and culmen depth, where the culmen is the upper ridge of the penguin’s bill or “beak” (Horst & Gorman, 2020) and culmen depth is the depth of the bill. This study also examines whether the relationship between flipper length and culmen depth differs by species of penguin.

In this report, data visualizations and statistical inference techniques were used to determine if there is a statistically significant difference in culmen depth between species. Multiple linear regression was used to build a model that estimates culmen depth from flipper length while taking the penguin species into consideration. Additionally, two 90% confidence intervals were created to estimate the population means for flipper length and culmen depth for the penguins analyzed in this study. To conclude, a Two-Sample T-Test was conducted to determine if there were statistically significant differences in the means between species of penguins included in this study.

1.2 Methodology

To answer the research question, the data set made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (Horst & Gorman, 2020) was used. The data were downloaded from GitHub. The original data set recorded the following variables for three species of penguins: species, island, bill length (in mm), bill depth (in mm), flipper length (in mm), body mass (in grams), sex, and year (Horst & Gorman, 2020).

For this research study, the numerical outcome variable (y-variable) is defined as the Bill Depth (in mm), whereas the numerical explanatory variable (x-variable) is defined as the Flipper Length (in mm). The categorical variable is species, which has three levels: Adelie, Chinstrap, and Gentoo penguins.

The original data set included a sample size of 344 penguins across three species of penguins (Adelie, Chinstrap, and Gentoo). However, since two penguins had missing bill depth and flipper length measurements, we dropped them from consideration. No information was provided as to why these penguins had missing values, so we cannot comment on the impact dropping these cases has on our results. However, the impact of dropping only two penguins is likely minimal.

Each case in the cleaned data set is one of 342 penguins from three different species: Adelie, Chinstrap, and Gentoo. The main outcome variable of interest in this study is the penguins’ culmen (bill) depth, with two explanatory variables: flipper length and species. Each of the two numerical variables was measured in millimeters (mm).

1.3 Snapshot of Data

A random sample of five rows from the data set is shown below:

set.seed(118)
sample_n(penguins_data, 5)
bill_depth_mm  flipper_length_mm  species
         17.8                195  Adelie
         16.0                230  Gentoo
         17.1                190  Chinstrap
         18.8                190  Adelie
         15.0                228  Gentoo

2. Exploratory data analysis

2.1 Summary statistics

Table 1: Summary Statistics for Penguin Bill Depth (in mm)

favstats( ~ bill_depth_mm, data = penguins_data)
 min    Q1  median    Q3   max      mean        sd    n  missing
13.1  15.6    17.3  18.7  21.5  17.15117  1.974793  342        0

As we can see in Table 1, our penguins have a mean bill depth of 17.2 mm and a median of 17.3 mm. Since these two values are so close, we expect a roughly symmetric distribution with a slight left skew, since the mean is lower than the median. From the five-number summary, 25% of bill depths are less than 15.6 mm, 50% are less than 17.3 mm, and 75% are less than 18.7 mm. The standard deviation is about 2 mm. Using the empirical rule, we expect about 68% of our data to be between 15.2 mm and 19.2 mm, 95% between 13.2 mm and 21.2 mm, and 99.7% between 11.2 mm and 23.2 mm. Since our maximum value, 21.5 mm, and minimum value, 13.1 mm, both fall within the bounds where we expect 99.7% of our data to be, our data does not have a larger than normal spread.
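The empirical-rule bounds quoted above can be reproduced with a few lines of base R. This is only a sketch: it uses the rounded mean (17.2 mm) and standard deviation (2 mm) that the text uses, not the unrounded values from Table 1, and the variable names are illustrative.

```r
# Rounded summary values, as used in the text (Table 1 reports 17.15117 and 1.974793)
mean_depth <- 17.2
sd_depth   <- 2
obs_min    <- 13.1
obs_max    <- 21.5

# Bounds at 1, 2, and 3 standard deviations from the mean
k <- 1:3
empirical_bounds <- data.frame(
  coverage = c("68%", "95%", "99.7%"),
  lower    = mean_depth - k * sd_depth,
  upper    = mean_depth + k * sd_depth
)
print(empirical_bounds)

# Do the observed minimum and maximum fall inside the 3-SD bounds?
within_3sd <- obs_min >= empirical_bounds$lower[3] &&
  obs_max <= empirical_bounds$upper[3]
within_3sd
```

Using the unrounded Table 1 values shifts each bound by less than 0.15 mm and does not change the conclusion.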

Table 2: Summary Statistics for Penguin Bill Depth (in mm) by Penguin Species

favstats(bill_depth_mm ~ species, data = penguins_data)
species     min    Q1  median    Q3   max      mean         sd    n  missing
Adelie     15.5  17.5   18.40  19.0  21.5  18.34636  1.2166498  151        0
Chinstrap  16.4  17.5   18.45  19.4  20.8  18.42059  1.1353951   68        0
Gentoo     13.1  14.2   15.00  15.7  17.3  14.98211  0.9812198  123        0

In Table 2, when we analyze our data by species, we can see that both Adelie and Chinstrap penguins have higher means and medians than the overall sample. However, the Gentoo penguins’ mean and median are much lower than the mean and median of our overall data. This suggests that there might be a statistically significant difference when comparing Adelie and Chinstrap penguins to Gentoo penguins. Additionally, the standard deviations and ranges for Adelie and Chinstrap penguins are larger than the standard deviation and range of the Gentoo penguins, which suggests that there might be more variability in the Adelie and Chinstrap penguins when compared to the Gentoo penguins.

2.2 Barplot & Pie Chart

Figure 1. Sample Size Visualized by Total Count of Penguin Species

gf_bar( ~ fct_infreq(species), col = c("darkred","forestgreen", "darkblue"), fill = c("pink", "lightgreen", "skyblue"), data = penguins_data) + 
  labs(x = "Species", y = "Total Penguins") +
  theme_classic()

Looking at Figure 1, we can clearly see that the largest group is Adelie penguins with n = 151, followed by Gentoo penguins with n = 123, and then Chinstrap penguins, the smallest group, with n = 68. Furthermore, the number of Adelie penguins in this dataset is more than double the number of Chinstrap penguins.

Figure 2. Sample Size Visualized by Percentage of Penguin Species

par(mar=c(2,2,2,2))
species_tally <- tally(~ species, data = penguins_data)
my_lbls <- paste(names(species_tally), ", (N = ", species_tally,", ", round(species_tally/nrow(penguins_data)*100,0), "%)", sep="")
pie(species_tally, labels = my_lbls, col = c("pink", "skyblue", "lightgreen"), main="Penguin Species by Proportion")

Looking at Figure 2, we can clearly see the proportion of each penguin species. Once again, the largest group is Adelie penguins with n = 151, which makes up 44% of the total number of penguins included in the analysis. Gentoo penguins, with n = 123, make up the next largest proportion at 36% of the total. Finally, Chinstrap penguins, with n = 68, represent the smallest proportion, making up only about 20% of all the penguins included in this analysis. From this graph, we can see more clearly that Adelie and Gentoo penguins each make up roughly double the proportion of Chinstrap penguins.
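As a quick check on the percentages discussed above, the proportions can be recomputed from the species counts in Table 2. The sketch below uses base R only; the vector names are illustrative.

```r
# Species counts taken from Table 2
species_counts <- c(Adelie = 151, Chinstrap = 68, Gentoo = 123)

# Percentage of the 342 penguins in each species, rounded to whole percents
species_pct <- round(100 * prop.table(species_counts))
species_pct

# Size of each group relative to the Chinstrap group
ratio_to_chinstrap <- species_counts / species_counts["Chinstrap"]
round(ratio_to_chinstrap, 2)
```

The ratios show that the Adelie group is a little more than twice the size of the Chinstrap group, and the Gentoo group a little less than twice.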

2.3 Boxplot

Figure 3. Boxplot for Penguin Bill Depth (in mm)

gf_boxplot(~ bill_depth_mm, color = "black", fill = "gold", data = penguins_data) + 
    labs(x = "Bill Depth (in mm)", 
       title = "Bill Depth Distribution For Penguins") +
  theme_minimal() +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

Using Figure 3, we can see the overall distribution for all the penguins included in this analysis is roughly symmetric, with no clear outliers present.

Using the Five-Number Summary for bill depth, the minimum is 13.10 mm and the maximum is 21.50 mm. The first quartile (25th percentile) is 15.60 mm, the third quartile (75th percentile) is 18.70 mm, and the median is 17.30 mm. This means that 25% of penguins have a bill depth less than 15.60 mm, 50% of penguins have a bill depth less than 17.30 mm, and 75% of penguins have a bill depth less than 18.70 mm.

Furthermore, the interquartile range (IQR) is Q3 - Q1 = 18.70 - 15.60 = 3.10 mm. This means that the middle 50% of all penguins fall within a 3.10 mm range from the 25th percentile to the 75th percentile. With such a small IQR, the bill depth for all penguins does not appear to be very spread out.

Since there are no points beyond the fences of our boxplots, we do not appear to have any outlier points in our overall dataset analyzing bill depth for all penguins.

Additionally, since Figure 3 does not compare groups, we cannot see whether bill depth differs between our penguin species. To assess this, we will use Figure 4 to take a more detailed look at the individual species of penguins.

Figure 4. Boxplot for Penguin Bill Depth (in mm) by Penguin Species

gf_boxplot(bill_depth_mm ~ species, color = ~ species, fill = ~ species, data = penguins_data) + 
  scale_color_manual(values =  c("darkred", "darkblue","forestgreen")) +
  scale_fill_manual(values=c("pink", "skyblue", "lightgreen")) +
  guides(fill=guide_legend(title="Species"), color=guide_legend(title="Species")) +
  labs(x = "Penguin Species", y = "Bill Depth (in mm)", title = "Bill Depth Distribution For Penguins", subtitle = "Comparison by Penguin Species") +
  theme_minimal()

Looking at Figure 4, coupled with the Five-Number Summary, we can see that all three boxplots are roughly symmetric.

Using the Five-Number Summary for each penguin species, we can see that for Adelie penguins the minimum bill depth is 15.50 mm and the maximum is 21.50 mm. The first quartile is 17.50 mm, the third quartile is 19.00 mm, and the median is 18.40 mm. This means that 25% of Adelie penguins have a bill depth less than 17.50 mm, 50% have a bill depth less than 18.40 mm, and 75% have a bill depth less than 19.00 mm.

For Chinstrap penguins, the minimum bill depth is 16.40 mm and the maximum is 20.80 mm. The first quartile is 17.50 mm, the third quartile is 19.40 mm, and the median is 18.45 mm. This means that 25% of Chinstrap penguins have a bill depth less than 17.50 mm, 50% have a bill depth less than 18.45 mm, and 75% have a bill depth less than 19.40 mm.

For Gentoo penguins, every value in the Five-Number Summary is much lower. The minimum bill depth is 13.10 mm and the maximum is 17.30 mm. This maximum does not even reach the bottom of the boxes (the first quartiles) for either of the other two species. The first quartile is 14.20 mm, the third quartile is 15.70 mm, and the median is 15.00 mm. This means that 25% of Gentoo penguins have a bill depth less than 14.20 mm, 50% have a bill depth less than 15.00 mm, and 75% have a bill depth less than 15.70 mm.

Furthermore, the interquartile ranges (IQRs) are 1.5 mm, 1.9 mm, and 1.5 mm for Adelie, Chinstrap, and Gentoo penguins, respectively. Comparing the three species, we can see that Chinstrap penguins have the widest spread since they have the largest IQR. However, all species have such a small IQR that the bill depth does not appear to be very spread out for any of the three species.

Since there are no points beyond the fences of our boxplots for Gentoo and Chinstrap penguins, we do not appear to have any outlier points for the bill depth of these two species. However, we can see one distinct point above the upper fence for Adelie penguins, which indicates this species does have an outlier. The outlier for Adelie penguins appears to be the maximum value for this species: a penguin with a bill depth of 21.5 mm.
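The outlier observations above can be checked numerically with the usual 1.5 × IQR fence rule. The sketch below applies it to the quartiles reported in Table 2. It assumes those quartiles are close to the hinges the boxplot uses, which may differ slightly depending on the quantile method, and the data frame and column names are illustrative.

```r
# Five-number summary values copied from Table 2
five_num <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  min     = c(15.5, 16.4, 13.1),
  q1      = c(17.5, 17.5, 14.2),
  q3      = c(19.0, 19.4, 15.7),
  max     = c(21.5, 20.8, 17.3)
)

# Fences sit 1.5 * IQR beyond the quartiles
five_num$iqr         <- five_num$q3 - five_num$q1
five_num$lower_fence <- five_num$q1 - 1.5 * five_num$iqr
five_num$upper_fence <- five_num$q3 + 1.5 * five_num$iqr

# A species has an outlier if its min or max lies beyond a fence
five_num$low_outlier  <- five_num$min < five_num$lower_fence
five_num$high_outlier <- five_num$max > five_num$upper_fence
print(five_num)
```

Under these assumptions only the Adelie maximum (21.5 mm) exceeds its upper fence (21.25 mm), which agrees with the single point seen above the Adelie box.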

With no overlap between the box for Gentoo penguins and the boxes for Chinstrap or Adelie penguins, coupled with the median value for Gentoo penguins being much lower at 15.00 mm than either the Chinstrap (at 18.45 mm) or Adelie (at 18.40 mm), there does appear to be a statistically significant difference in bill depth for Gentoo penguins when compared to the Adelie and Chinstrap penguins.

However, when comparing only Adelie penguins with Chinstrap penguins, there do not appear to be any statistically significant differences in bill depth. This is because the median values for Adelie and Chinstrap penguins are nearly identical and their boxes almost completely overlap.

2.4 Histogram

Figure 5. Histogram for Penguin Bill Depth (in mm)

# Bin Width Calculation
data_bindwidth = (max(~ bill_depth_mm, data = penguins_data) - min(~ bill_depth_mm, data = penguins_data)) / sqrt(nrow(penguins_data))

#Overall Histogram
gf_histogram(~ bill_depth_mm, color = "black", fill = "gold", binwidth = data_bindwidth, data = penguins_data) +
  geom_density(aes(y = after_stat(density) * (nrow(penguins_data) * data_bindwidth)), col = "black", linewidth = 1) +
  labs(x = "Bill Depth (in mm)", y = "Count of Penguins", title = "Bill Depth Distribution For Penguins") +
  theme_minimal()
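The bin width computed above divides the range of the data by the square root of the sample size, which corresponds to the common square-root rule for choosing the number of histogram bins. As a rough sketch, the same value can be reproduced from the summary statistics in Table 1 (variable names here are illustrative):

```r
# Summary values from Table 1
obs_min <- 13.1
obs_max <- 21.5
n       <- 342

# Square-root rule: range divided by sqrt(n)
bin_width <- (obs_max - obs_min) / sqrt(n)
round(bin_width, 3)

# Number of bins needed to cover the range at that width
n_bins <- ceiling((obs_max - obs_min) / bin_width)
n_bins
```

Other rules (Sturges, Freedman-Diaconis) would give different widths; the square-root rule is simply the one this report's code implements.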

Looking at Figure 5, the mean of the sample is 17.15 mm and the median of the sample is 17.3 mm. Because the mean is slightly lower than the median, the distribution appears slightly skewed to the left.

There appear to be two peaks, meaning the shape of the histogram could be bimodal: one at approximately 14 mm, and one at 18.5 mm. However, since the peak around 14 mm is lower than the most prominent peak at 18.5 mm, more testing will need to be conducted in order to determine if we truly have a bimodal distribution. It will be interesting to compare the penguins that fall into each of these portions of the distribution of bill depth.

For spread, the standard deviation for the sample is 1.97 mm, and bill depth ranges from 13.1 mm to 21.5 mm.

Figure 6. Histogram for Penguin Bill Depth (in mm) by Penguin Species

gf_histogram(~ bill_depth_mm, color = ~ species, fill = ~ species, binwidth = data_bindwidth, data = penguins_data) +
  geom_density(aes(y = 0.33*after_stat(density) * (nrow(penguins_data) * data_bindwidth)), col = "black", linewidth = 0.5, alpha = 0) +
  facet_grid(species ~ .) +
  scale_color_manual(values =  c("darkred", "darkblue","forestgreen")) +
  scale_fill_manual(values = c("pink", "skyblue", "lightgreen"))+
  labs(x = "Bill Depth (in mm)", y = "Count of Penguins", title = "Bill Depth Distribution For Penguins", subtitle = "Comparison by Penguin Species") +
  theme_minimal()

Looking at Figure 6, the centers of the Adelie and Chinstrap distributions (medians of 18.40 mm and 18.45 mm) are both higher than the center of the overall penguin data. The center for Gentoo penguins is lower at approximately 15.00 mm, which accounts for the lower peak we saw in the bimodal overall distribution. Additionally, the shapes of the Adelie and Gentoo distributions seem fairly symmetric and unimodal, whereas Chinstrap penguins present a flatter distribution, suggesting they might be closer to uniformly distributed. Finally, the spread for each penguin species is not abnormally large given their standard deviations and means.

We can compare the coefficient of variation (CV) for each category to determine which species has the highest level of variation in comparison to the other species:

#Coefficient of Variation: Standard Deviation / Mean by Category
Adelie_CV = sd(~bill_depth_mm, data = Adelie_penguin)/mean(~bill_depth_mm, data = Adelie_penguin)
print(paste('Adelie Penguin CV: ', round(Adelie_CV,3)*100,'%'))
## [1] "Adelie Penguin CV:  6.6 %"
Chinstrap_CV = sd(~bill_depth_mm, data = Chinstrap_penguin)/mean(~bill_depth_mm, data = Chinstrap_penguin)
print(paste('Chinstrap Penguin CV: ', round(Chinstrap_CV,3)*100,'%'))
## [1] "Chinstrap Penguin CV:  6.2 %"
Gentoo_CV = sd(~bill_depth_mm, data = Gentoo_penguin)/mean(~bill_depth_mm, data = Gentoo_penguin)
print(paste('Gentoo Penguin CV: ', round(Gentoo_CV,3)*100,'%'))
## [1] "Gentoo Penguin CV:  6.5 %"

As we can see, the CVs for all species are very low, suggesting the data are not very spread out relative to their mean values. Additionally, the CVs for all species are within 0.4 percentage points of each other, suggesting they all have a very similar relative spread.

Furthermore, using Figure 6, we can see that the distribution for Gentoo penguins sits much lower than those for Adelie or Chinstrap penguins. This suggests, once again, that there might be statistically significant differences when comparing Gentoo penguins to either Adelie or Chinstrap penguins.

2.5 Scatterplot

Figure 7. Scatter plot for Penguin Bill Depth (in mm) vs Penguin Flipper Length (in mm) with Linear Regression Model

gf_point(bill_depth_mm ~ flipper_length_mm, fill = "gold", color = "black", pch = 23, cex = 2.5, alpha = 0.75, data = penguins_data) + 
  geom_smooth(method = 'lm', se = FALSE, color = "black", cex = 1) +
  labs(x = "Flipper Length (in mm)", y = "Bill Depth (in mm)",title = "Bill Depth vs. Flipper Length For Penguins") +
  theme_minimal()

In Figure 7, we generated a scatterplot to see the overall relationship between our numerical outcome variable bill depth and our numerical explanatory variable flipper length. Overall as the penguins’ flipper length increased, there was an associated decrease in the bill depth.

There do not appear to be any major outliers in our graph, as all points seem to be relatively close to the fitted linear regression line. However, it will be interesting to see if any outliers are present in our boxplots, shown in Figure 4.

Finally, the overall trend shows two separate clusters of points. One group of data points spans flipper lengths from approximately 170 mm to 205 mm, and another spans approximately 205 mm to 230 mm. Bill depths differ noticeably between these two groups.

2.6 Colored scatterplot

Figure 8. Scatter plot for Penguin Bill Depth (in mm) vs Penguin Flipper Length (in mm) by Penguin Species with Linear Regression Model

gf_point(bill_depth_mm ~ flipper_length_mm, color = ~species, fill = ~ species, pch = ~ species, cex = 2.5, alpha = 0.75, data = penguins_data)  + 
  geom_smooth(method = 'lm', se = FALSE, cex = 0.5) +
  scale_shape_manual(values=c(21, 22, 24)) + 
  scale_fill_manual(values=c("pink", "skyblue", "lightgreen") ) +
  scale_color_manual(values = c("darkred", "darkblue","forestgreen")) +
  labs(x = "Flipper Length (in mm)", y = "Bill Depth (in mm)", color = "Penguin Species", fill = "Penguin Species", pch = "Penguin Species",title = "Bill Depth vs. Flipper Length For Penguins", subtitle = "Comparison by Penguin Species") +
  theme_minimal()

We also generated a colored scatterplot displaying the relationship between all three variables at once in Figure 8. Taking into account the species of penguins, we can see that the relationship reverses from an apparent negative direction overall to a positive direction within each individual species, a pattern known as Simpson’s paradox. This suggests that species has an effect on our overall regression model.
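To illustrate how a trend can be positive within every group yet negative overall, here is a small toy example. The numbers are hypothetical and chosen only to make the effect obvious; they are not the penguin measurements.

```r
# Toy data (hypothetical): group B has larger x but smaller y than group A
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = c(1:5, 11:15),
  y     = c(11:15, 3:7)
)

# Correlation ignoring the grouping variable
overall_r <- cor(toy$x, toy$y)

# Correlation computed separately within each group
within_r <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

overall_r   # negative
within_r    # positive in both groups
```

The difference between the group means dominates the pooled correlation, which is the same mechanism at work when Gentoo penguins (long flippers, shallow bills) are pooled with the other two species.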

Looking at the species of penguins, some potential outliers have been made more evident. For Adelie penguins, there appears to be a cluster of points with a bill depth of more than 20 mm that are quite far from the fitted regression line, suggesting these could be outliers. For Chinstrap penguins, we have a couple of points around 200 mm in flipper length and 16 mm in bill depth that could be outliers as well. Finally, for Gentoo penguins there do not appear to be any major outliers, but there might be an outlier at around 220 mm in flipper length and 17.5 mm in bill depth. However, we will need to confirm this with our boxplot analysis in Figure 4.

Finally, for an overall trend, we can clearly see that Gentoo penguins tend to have longer flippers and shallower bills, whereas Adelie and Chinstrap penguins have shorter flippers and deeper bills.


3. Statistical Inference

3.1 Methods

The components of our multiple linear regression model are the following:

  • Outcome variable y = Penguin bill depth (in mm)
  • Numerical explanatory variable x1 = Penguin flipper length (in mm)
  • Categorical explanatory variable x2 = Penguin species

3.2 Model Results

Table 3. Regression table of linear model of bill depth as a function of flipper length and penguin species.

penguin_multiple_model = lm(bill_depth_mm ~ flipper_length_mm + species, data = penguins_data)
get_regression_table(penguin_multiple_model)
term                estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept              2.717      1.524      1.782    0.076    -0.282     5.715
flipper_length_mm      0.082      0.008     10.267    0.000     0.067     0.098
species: Chinstrap    -0.409      0.151     -2.713    0.007    -0.705    -0.112
species: Gentoo       -5.605      0.249    -22.546    0.000    -6.094    -5.116

3.3 Interpreting the regression table

Our regression equation is:

\[\widehat{Bill \ Depth} = 2.717 + (0.082)*(Flipper \ Length) + (-0.409)*(Species_{Chinstrap}) + (-5.605)*(Species_{Gentoo}) \]
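Because the two species terms are 0/1 indicator variables, the single equation above can also be read as three parallel lines that share the same slope and differ only in their intercepts. Substituting the indicator values for each species gives:

\[\widehat{Bill \ Depth}_{Adelie} = 2.717 + (0.082)*(Flipper \ Length)\]

\[\widehat{Bill \ Depth}_{Chinstrap} = (2.717 - 0.409) + (0.082)*(Flipper \ Length) = 2.308 + (0.082)*(Flipper \ Length)\]

\[\widehat{Bill \ Depth}_{Gentoo} = (2.717 - 5.605) + (0.082)*(Flipper \ Length) = -2.888 + (0.082)*(Flipper \ Length)\]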

Make a Prediction Using the Linear Model

We can use our regression equation to make a prediction using a flipper length found in our data:

Suppose we have a Gentoo penguin with a Flipper Length of 215 mm. What is the expected Bill Depth (in mm)?

Flipper_Length_calc = 215
Species_Chinstrap_calc = 0 #1 means species IS a Chinstrap, 0 means species is NOT Chinstrap
Species_Gentoo_calc = 1 #1 means species IS a Gentoo, 0 means species is NOT Gentoo
#If both Species_Chinstrap_calc and Species_Gentoo_calc are 0, then the species is Adelie

Bill_Depth_calc = 2.717 + (0.082)*(Flipper_Length_calc) + (-0.409)*(Species_Chinstrap_calc) + (-5.605)*(Species_Gentoo_calc)
Bill_Depth_calc
## [1] 14.742

Based on our equation, we expect a Gentoo penguin with a Flipper Length of 215 mm to have a Bill Depth of \(14.742 \ mm\).

Residual Analysis: In our data set, we have a Gentoo penguin with a flipper length of 215 mm whose Bill Depth is 14.5 mm. Calculating the residual, we can see that the observed value of 14.5 mm is less than the predicted value of 14.7 mm, meaning our linear regression model overestimated the observed value. However, the prediction was off by only about 0.24 mm, so the model provided a fairly good estimate for this data point:

\[Residual = Observed - \widehat{Predicted}\]
\[Residual = 14.5 - 14.742\]
\[Residual = - 0.242\]

Interpreting the Linear Model’s Coefficients

Coefficient for the \(Intercept\): \(\beta_{0} = 2.717\)

  • For an Adelie penguin (the reference species) with a flipper length of 0 mm, the expected bill depth is 2.72 mm.

  • However, this does not make sense in this context because penguins cannot have flippers of 0 mm in length (assuming they have flippers). Additionally, the smallest flipper among penguins in our data set was 172 mm. Therefore, this is an example of extrapolation because we made a prediction beyond the bounds of our data.

Coefficient for the \(Flipper \ Length\): \(\beta_{1} = 0.082\)

  • For every 1 mm increase in a penguin’s flipper length, we expect bill depth to increase by 0.082 mm, on average, while holding species constant.

Coefficient for the \(Species_{Chinstrap}\): \(\beta_{2} = -0.409\)

  • The expected bill depth for a Chinstrap penguin is 0.409 mm lower, on average, than for an Adelie penguin with the same flipper length.

Coefficient for the \(Species_{Gentoo}\): \(\beta_{3} = -5.605\)

  • The expected bill depth for a Gentoo penguin is 5.605 mm lower, on average, than for an Adelie penguin with the same flipper length.
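The coefficient interpretations above can be verified numerically. The sketch below hard-codes the rounded estimates from Table 3 into a small helper, `predict_bill_depth()`. This helper is a hypothetical function written for illustration, not part of any package, and in practice `predict()` on the fitted model would give unrounded results.

```r
# Rounded coefficient estimates copied from Table 3
coefs <- c(intercept = 2.717, flipper = 0.082, Chinstrap = -0.409, Gentoo = -5.605)

# Hypothetical helper: evaluates the regression equation for given inputs
predict_bill_depth <- function(flipper_length_mm, species) {
  # Adelie is the reference species, so its offset is 0
  species_offset <- c(Adelie = 0,
                      Chinstrap = coefs[["Chinstrap"]],
                      Gentoo = coefs[["Gentoo"]])
  unname(coefs[["intercept"]] +
           coefs[["flipper"]] * flipper_length_mm +
           species_offset[species])
}

# Same flipper length, different species: differences equal the species coefficients
at_200 <- predict_bill_depth(200, c("Adelie", "Chinstrap", "Gentoo"))
at_200 - at_200[1]

# Same species, flipper length 1 mm longer: difference equals the slope
predict_bill_depth(201, "Adelie") - predict_bill_depth(200, "Adelie")
```

Holding flipper length fixed isolates the species offsets, and holding species fixed isolates the slope, which is exactly what "holding all other variables constant" means in the bullet points.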

3.4 Correlation Coefficient and Coefficient of Determination

Simple Regression Model

#Create simple linear model
penguin_simple_model = lm(bill_depth_mm ~ flipper_length_mm, data = penguins_data)

Table 4. Overall Correlation Coefficient for Bill Depth and Flipper Length

#Calculate R for simple linear model
get_correlation(bill_depth_mm ~ flipper_length_mm, data = penguins_data, na.rm = TRUE)
cor
-0.5838512

Table 5. Coefficient of Determination for Simple Linear Regression

#Calculate R-Squared for simple linear model
get_regression_summaries(penguin_simple_model)
r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
0.341 0.339 2.562917 1.600911 1.606 175.841 0 1 342

Consistent with the downward-sloping regression line in Figure 7, bill depth and flipper length are negatively correlated overall. This is confirmed by our correlation coefficient of \(R= -0.584\).

With \(R = -0.584\), for all penguins, there is a negative and moderate correlation between Bill Depth and Flipper Length. The direction of the correlation is negative because \(R<0\). Additionally, the strength of the correlation is moderate because \(0.40 \leq |R| < 0.80\).

With \(R^{2} = 0.341 \rightarrow 34.1\%\), we have 34.1% of the variability in our penguin bill depth (in mm) accounted for by the relationship with penguin flipper length (in mm). This value is very far from 1.00, which means this simple linear model is not good at predicting bill depth from a penguin’s flipper length alone.

With \(1 - R^{2} = 1 - 0.341 = 0.659 \rightarrow 65.9\%\), we have approximately 66% of our variation unaccounted for, which suggests that other variables are contributing to the variability in bill depths of penguins.
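For a simple linear regression with one predictor, the coefficient of determination is the square of the correlation coefficient, so Tables 4 and 5 can be cross-checked against each other. A minimal check using the value reported in Table 4:

```r
# Correlation coefficient from Table 4
r <- -0.5838512

# In simple linear regression, R-squared equals r squared
r_squared <- r^2
round(r_squared, 3)       # compare with r_squared in Table 5

# Share of variability left unexplained
round(1 - r_squared, 3)
```

This identity holds only for the one-predictor model; it does not carry over to the multiple regression model.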

Multiple Regression Model

Assessing the Multiple Regression Model’s Strength

By adding in the variable of species, let’s see if the multiple linear regression model helps to account for some of the variability left unexplained by the simple linear model above.

Table 6. Coefficient of Determination for Multiple Linear Regression

get_regression_summaries(penguin_multiple_model)
r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
0.756 0.754 0.9492077 0.9742729 0.98 348.869 0 3 342

With \(R^{2} = 0.756 \rightarrow 75.6\%\), we have 75.6% of the variability in our penguin bill depth (in mm) accounted for by the relationship with penguin flipper length (in mm) and species. This value is much closer to 1.00, which means this multiple linear model is much better at predicting bill depth from a penguin’s flipper length and species. By including one additional variable, our model is much better at making predictions.

Furthermore, with \(1 - R^{2} = 1 - 0.756 = 0.244 \rightarrow 24.4\%\), we have less than 25% of our variation in bill depth unaccounted for. While this does mean it is likely that other variables are contributing to the variability in bill depths of penguins, only about a quarter of the variation is left unaccounted for, and we can conclude that adding species to our linear regression model greatly improved its predictive power.
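Since adding any predictor can only increase \(R^{2}\), the adjusted \(R^{2}\) values in Tables 5 and 6 are the fairer comparison between the two models. The sketch below recomputes them from the rounded \(R^{2}\) values using the standard formula; `adjusted_r_squared()` is an illustrative helper, and `p` is the number of model terms excluding the intercept (the `df` column in the tables).

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# Simple model: flipper length only (p = 1)
adj_simple <- adjusted_r_squared(0.341, n = 342, p = 1)

# Multiple model: flipper length plus two species indicators (p = 3)
adj_multiple <- adjusted_r_squared(0.756, n = 342, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 3)
```

Even after the penalty for the two extra terms, the multiple regression model explains far more of the variability in bill depth.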

Additionally, the overall p-value from Table 6 is reported as 0 (that is, it rounds to 0.000), which means the overall model is highly statistically significant. Looking at the p-values from Table 3, all coefficients are statistically significant at the \(\alpha = 0.10\) level (shown below), supporting that we have a fairly good model for predicting the bill depth of penguins using the variables included in this model.

  • P-value for Intercept: \(0.076 < 0.10 \rightarrow\) this coefficient is statistically significant at the \(\alpha = 0.10\) level of significance (90% level of confidence), though not at \(\alpha = 0.05\). This means the intercept differs significantly from zero only at the 90% level.

  • P-value for the Coefficient for Flipper Length: \(p < 0.001 \rightarrow\) the table reports 0.000 because the value is rounded to three decimal places. This coefficient is highly statistically significant, even at \(\alpha = 0.001\). This means Flipper Length is a statistically significant predictor of a penguin’s bill depth in this model.

  • P-value for the Coefficient for Chinstrap penguins: \(0.007 < 0.01 \rightarrow\) this coefficient is highly statistically significant at the \(\alpha = 0.01\) level of significance (99% level of confidence). This means Chinstrap penguins differ significantly from Adelie penguins in expected bill depth at a given flipper length.

  • P-value for the Coefficient for Gentoo penguins: \(p < 0.001 \rightarrow\) this coefficient is highly statistically significant, even at \(\alpha = 0.001\). This means Gentoo penguins differ significantly from Adelie penguins in expected bill depth at a given flipper length.
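The p-values in Table 3 come from two-sided t-tests on each coefficient, with residual degrees of freedom equal to the sample size minus the number of estimated coefficients. As a sketch, they can be approximately recovered from the rounded test statistics in Table 3 (small discrepancies from rounding are expected; variable names are illustrative):

```r
# Residual degrees of freedom: 342 observations, 4 estimated coefficients
n        <- 342
n_coef   <- 4
df_resid <- n - n_coef

# Rounded t statistics copied from Table 3
t_stats <- c(intercept = 1.782, flipper_length_mm = 10.267,
             Chinstrap = -2.713, Gentoo = -22.546)

# Two-sided p-value: probability of a t statistic at least this extreme
p_values <- 2 * pt(-abs(t_stats), df = df_resid)
signif(p_values, 2)

# Which coefficients are significant at alpha = 0.10?
p_values < 0.10
```

The flipper length and Gentoo p-values are far smaller than 0.001, which is why the regression table displays them as 0.000.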

Furthermore, when we filter by penguin species, we can see their correlation coefficients change to a positive direction, which reflects what we saw in our Colored Scatter plot from Figure 8.

Table 7. Correlation Coefficient for Adelie Penguins

get_correlation(bill_depth_mm ~ flipper_length_mm, data = Adelie_penguin, na.rm = TRUE)
cor
0.3076202

With \(R = 0.308\), there is a positive and weak correlation between Bill Depth and Flipper Length for Adelie penguins. The direction of the correlation is positive because \(R>0\). Additionally, the strength of the correlation is weak because \(0.10 \leq |R| < 0.40\).

Table 8. Correlation Coefficient for Chinstrap Penguins

get_correlation(bill_depth_mm ~ flipper_length_mm, data = Chinstrap_penguin, na.rm = TRUE)
cor
0.5801429

With \(R = 0.580\), there is a positive and moderate correlation between Bill Depth and Flipper Length for Chinstrap penguins. The direction of the correlation is positive because \(R>0\). Additionally, the strength of the correlation is moderate because \(0.40 \leq |R| < 0.80\).

Table 9. Correlation Coefficient for Gentoo Penguins

get_correlation(bill_depth_mm ~ flipper_length_mm, data = Gentoo_penguin, na.rm = TRUE)
cor
0.7065634

With \(R = 0.707\), there is a positive and moderate correlation between Bill Depth and Flipper Length for Gentoo penguins. The direction of the correlation is positive because \(R>0\). Additionally, the strength of the correlation is moderate because \(0.40 \leq R < 0.80\). Even though this is still a moderate correlation, Gentoo penguins have the highest correlation of all the penguin species.
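
The three correlations in Tables 7 to 9 can also be computed and labeled in a single pass. The following is a minimal sketch using base R rather than `get_correlation()`. The 0.10, 0.40, and 0.80 cut-offs come from the text above; the "negligible" and "strong" labels outside those bounds are assumptions added for completeness, and the object names are illustrative.

```r
library(palmerpenguins)

# Same three columns and NA handling as the Data section of this report
penguins_complete <- na.omit(
  penguins[, c("bill_depth_mm", "flipper_length_mm", "species")]
)

# Pearson correlation between bill depth and flipper length within each species
species_cor <- sapply(
  split(penguins_complete, penguins_complete$species),
  function(d) cor(d$bill_depth_mm, d$flipper_length_mm)
)

# Strength labels: [0.10, 0.40) is weak and [0.40, 0.80) is moderate, as in the text;
# the labels below 0.10 and from 0.80 upward are assumptions
strength <- cut(
  abs(species_cor),
  breaks = c(0, 0.10, 0.40, 0.80, 1),
  labels = c("negligible", "weak", "moderate", "strong"),
  right = FALSE,
  include.lowest = TRUE
)

print(data.frame(cor = round(species_cor, 3), strength = strength))
```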


4. 90% Confidence Intervals

For both confidence intervals, we have the same sample size, alpha, degrees of freedom, and t-critical value. We will calculate these first.

Sample_Size_Penguins = nrow(penguins_data)
df = (Sample_Size_Penguins - 1)
alpha = (1-(90/100))
t_critical_value = qt(1-alpha/2, df)

4.1 Numerical Outcome Variable (Bill Depth (in mm))

Sample_Mean_Bill_Depth = mean(~ bill_depth_mm, data = penguins_data)
Sample_StdDev_Bill_Depth = sd(~ bill_depth_mm, data = penguins_data)
SE_Bill_Depth = Sample_StdDev_Bill_Depth/sqrt(Sample_Size_Penguins)  #Finds the standard error for bill depth: S/√n
ME_Bill_Depth = t_critical_value*SE_Bill_Depth #Finds the margin of error for bill depth: (t-critical value)*(Standard Error)

print(paste('Margin of Error: ', round(ME_Bill_Depth,3), ' mm'))
## [1] "Margin of Error:  0.176  mm"
#Confidence Intervals have an upper and lower bound that is found by adding and subtracting the margin of error from the sample mean.
UB_Bill_Depth = Sample_Mean_Bill_Depth + ME_Bill_Depth #Upper Bound for Bill Depth
LB_Bill_Depth = Sample_Mean_Bill_Depth - ME_Bill_Depth #Lower Bound for Bill Depth

print(paste('Upper Bound: ', round(UB_Bill_Depth,3), ' mm'))
## [1] "Upper Bound:  17.327  mm"
print(paste('Lower Bound: ', round(LB_Bill_Depth,3), ' mm'))
## [1] "Lower Bound:  16.975  mm"

Margin of Error: ± 0.176 mm

90% Confidence Interval: (16.98 mm, 17.33 mm)

We are 90% confident that the true population mean Bill Depth for Adelie, Chinstrap and Gentoo penguins is between 16.98 mm and 17.33 mm.
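
The manual steps above (standard error, margin of error, bounds) can be wrapped in a small helper and cross-checked against R's built-in one-sample t interval. This is a sketch only; `mean_ci()` is a hypothetical helper written for this illustration and is not part of any package.

```r
library(palmerpenguins)

# t-based confidence interval for a mean, mirroring the steps above:
# standard error = s / sqrt(n), margin of error = t-critical * standard error
mean_ci <- function(x, conf_level = 0.90) {
  x <- x[!is.na(x)]                      # drop missing measurements
  n <- length(x)
  standard_error <- sd(x) / sqrt(n)
  t_critical <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  margin <- t_critical * standard_error
  c(lower = mean(x) - margin, upper = mean(x) + margin, margin = margin)
}

manual_ci <- mean_ci(penguins$bill_depth_mm)

# Built-in check: the one-sample t-test reports the same interval
builtin_ci <- stats::t.test(penguins$bill_depth_mm, conf.level = 0.90)$conf.int

print(round(manual_ci, 3))
print(round(as.numeric(builtin_ci), 3))
```

Both approaches should agree, which is a quick way to catch an error in the degrees of freedom or in the tail probability passed to `qt()`.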

4.2 Numerical Explanatory Variable (Flipper Length (in mm))

Sample_Mean_Flipper_Length = mean(~ flipper_length_mm, data = penguins_data)
Sample_StdDev_Flipper_Length = sd(~ flipper_length_mm, data = penguins_data)
SE_Flipper_Length = Sample_StdDev_Flipper_Length/sqrt(Sample_Size_Penguins)  #Finds the standard error for flipper length: S/√n
ME_Flipper_Length = t_critical_value*SE_Flipper_Length #Finds the margin of error for flipper length: (t-critical value)*(Standard Error)

print(paste('Margin of Error: ', round(ME_Flipper_Length,3), ' mm'))
## [1] "Margin of Error:  1.254  mm"
#Confidence Intervals have an upper and lower bound that is found by adding and subtracting the margin of error from the sample mean.
UB_Flipper_Length = Sample_Mean_Flipper_Length + ME_Flipper_Length #Upper Bound for Flipper Length
LB_Flipper_Length = Sample_Mean_Flipper_Length - ME_Flipper_Length #Lower Bound for Flipper Length

print(paste('Upper Bound: ', round(UB_Flipper_Length,3), ' mm'))
## [1] "Upper Bound:  202.169  mm"
print(paste('Lower Bound: ', round(LB_Flipper_Length,3), ' mm'))
## [1] "Lower Bound:  199.661  mm"

Margin of Error: ± 1.254 mm

90% Confidence Interval: (199.66 mm, 202.17 mm)

We are 90% confident that the true population mean Flipper Length for Adelie, Chinstrap and Gentoo penguins is between 199.66 mm and 202.17 mm.
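
Both intervals in this section pool all three species. As a hedged illustration of how the same 90% interval could be computed within each species, the sketch below uses base R's `t.test()` on bill depth; the object names `species_ci` and `pooled_ci` are illustrative and the per-species intervals are not part of the original analysis.

```r
library(palmerpenguins)

# Same three columns and NA handling as the Data section of this report
penguins_complete <- na.omit(
  penguins[, c("bill_depth_mm", "flipper_length_mm", "species")]
)

# 90% t-interval for mean bill depth within each species (one column per species)
species_ci <- sapply(
  split(penguins_complete$bill_depth_mm, penguins_complete$species),
  function(x) as.numeric(stats::t.test(x, conf.level = 0.90)$conf.int)
)
rownames(species_ci) <- c("lower", "upper")

# Pooled 90% interval across all penguins, for comparison
pooled_ci <- as.numeric(
  stats::t.test(penguins_complete$bill_depth_mm, conf.level = 0.90)$conf.int
)

print(round(species_ci, 2))
print(round(pooled_ci, 2))
```

Printing the two results side by side shows whether any species-level interval overlaps the pooled one.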


5. Conclusions

5.1 Summary

We found that there was a significant association between flipper length and bill depth among Antarctic penguins after controlling for penguin species. The linear model indicated that, after controlling for species, each additional millimeter in flipper length was associated with an average increase in bill depth of about 0.082 mm. This, however, does not mean that changes in flipper length cause changes in bill depth, merely that they are associated.

Among the species, Gentoo penguins were associated with significantly smaller bill depths, on average. Adelie penguins had an average bill depth of 18.40 mm and Chinstrap penguins had an average bill depth of 18.45 mm, which are fairly close to each other. Gentoo penguins, however, had an average of 15.00 mm, more than 3 mm below the other two species. For all three species, the spread of the data was small, with standard deviations of 1.22 mm, 1.14 mm, and 0.98 mm for Adelie, Chinstrap, and Gentoo penguins respectively. Additionally, the coefficients of variation for each species show that the variability relative to the mean is very low.

Additionally, when species is not considered, the overall trend shows a negative association, suggesting that as flipper length increases, bill depth decreases. However, this is not the case within any of the three species. When considering species, we can see that the association is in fact positive.
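
The sign reversal described above can be checked by comparing the flipper-length slope with and without species in the model. The sketch below assumes the adjusted model is the parallel-slopes fit; `pooled_model`, `adjusted_model`, and `slopes` are illustrative names.

```r
library(palmerpenguins)

# Same three columns and NA handling as the Data section of this report
penguins_complete <- na.omit(
  penguins[, c("bill_depth_mm", "flipper_length_mm", "species")]
)

# Simple regression: species is ignored
pooled_model <- lm(bill_depth_mm ~ flipper_length_mm, data = penguins_complete)

# Assumption: parallel-slopes model that controls for species
adjusted_model <- lm(
  bill_depth_mm ~ flipper_length_mm + species,
  data = penguins_complete
)

# Flipper-length slope from each model
slopes <- c(
  pooled = unname(coef(pooled_model)["flipper_length_mm"]),
  adjusted = unname(coef(adjusted_model)["flipper_length_mm"])
)
print(round(slopes, 3))
```

A negative pooled slope next to a positive adjusted slope is the pattern expected when the groups differ in both variables, as the species do here.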

Finally, we saw from our R-Squared value of 75.6% that we developed a relatively good model for predicting bill depth using a penguin’s flipper length and species. Furthermore, our 90% confidence intervals estimate that the population mean bill depth is between 16.98 mm and 17.33 mm and that the population mean flipper length is between 199.66 mm and 202.17 mm. These intervals describe the overall means, not the range in which most individual penguins fall. When we compare them with our scatter plots, we can see that they do not reflect what we observe within each species, which confirms the necessity of controlling for species in this analysis.

5.2 Statistical Significance

In our Exploratory data analysis, we began to see differences between species. The differences were notable starting with our summary statistics, where we saw large differences in the means between Adelie and Chinstrap when compared to Gentoo penguins. Furthermore, the histogram grouped by species showcased a difference in the centers when comparing Adelie and Chinstrap to the Gentoo penguins, where Adelie and Chinstrap penguins had much higher centers than Gentoo penguins overall.

This was also seen in the box plots comparing Adelie and Chinstrap with Gentoo penguins, where the boxes for Adelie and Chinstrap penguins did not overlap at all with the Gentoo box. In this box plot, we also saw that the median lines for Adelie and Chinstrap penguins were much higher than the Gentoo median. Finally, in our scatter plot grouped by species, we could visually see two clusters of points that showed a distinct difference between the two groupings of species: Adelie/Chinstrap vs Gentoo penguins.

In our Multiple Linear Regression Model, we saw statistical significance when we examined our correlations and model coefficients. When examining the correlation coefficients by species, the p-values were all below the 0.05 level of significance, suggesting that the correlations between bill depth and flipper length were statistically significant for each species of penguin.

Overall, these findings suggest that penguin species are not uniform and are associated with different characteristics. However, when we performed residual analysis after fitting the model, we found some evidence of a slight violation of the “constant variability among residuals” condition, likely due to more variability among Adelie penguins.

5.3 Future Work

Overall, the study was designed well. Our sample size of 342 penguins was much larger than 30, so we had a large enough sample to use statistical inference to draw conclusions. Furthermore, two of our species were well represented, with a near even proportion across Gentoo and Adelie penguins, and Chinstrap penguins still represented a large enough proportion (about 20%) of the overall data set.

Furthermore, the scatter plot and grouped scatter plot analysis showed a relatively linear relationship between bill depth and flipper length. The points did not appear to follow a curved pattern such as an exponential, logarithmic, quadratic, or trigonometric function. Therefore, a linear model seems appropriate for predicting bill depth, since within each species the points increase in a linear fashion.

Finally, this study is limited to the three penguin species included and only examined the effect two variables (flipper length and species) had on bill depth. Since there are so many other species of penguins, it might be beneficial to add in additional species to see if there is a continued trend of statistically significant differences between species. This could lead to further understanding of how penguins use their flipper lengths and bill depths to survive in harsh climates.

Additionally, with about a quarter of the variation in bill depth not accounted for in our multiple linear regression model, there might be other variables that could be added to the model to increase the amount of variability accounted for. The other variables that could be examined include penguin gender, penguin height, habitat location, or even type of diet (such as the type of fish they may consume). Adding in these additional variables might account for the missing variability in bill depth and could lead to more accurate predictions made by the model.
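
As a hedged sketch of how such an extension could be evaluated, the code below adds one candidate variable that is already available in the `palmerpenguins` data, the `sex` column (the variable referred to above as penguin gender), and compares adjusted R-squared values. This comparison is not part of the original analysis. Both models are fit to the same rows, and the base model is assumed to be the parallel-slopes fit.

```r
library(palmerpenguins)

# Keep only rows where every variable, including sex, is observed,
# so both models are fit to the same penguins
penguins_with_sex <- na.omit(
  penguins[, c("bill_depth_mm", "flipper_length_mm", "species", "sex")]
)

# Assumption: the base model is the parallel-slopes fit used in this report
base_model <- lm(
  bill_depth_mm ~ flipper_length_mm + species,
  data = penguins_with_sex
)

# Extended model with sex as an additional explanatory variable
extended_model <- lm(
  bill_depth_mm ~ flipper_length_mm + species + sex,
  data = penguins_with_sex
)

# Adjusted R-squared penalizes the extra term, so it is a fairer comparison
fit_comparison <- c(
  base = summary(base_model)$adj.r.squared,
  extended = summary(extended_model)$adj.r.squared
)
print(round(fit_comparison, 3))
```

Adjusted R-squared is used here because plain R-squared can never decrease when a term is added, so it cannot by itself show that the new variable is worthwhile.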


References

  1. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

  2. Penguin | Features, habitat, & facts. (2023, October 18). Encyclopedia Britannica. https://www.britannica.com/animal/penguin/Natural-history

  3. Empirical Rule (68-95-99.7) & Empirical Research - Statistics How to. (2023, March 9). Statistics How To. https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/empirical-rule/