My pseudo-dataset is fashioned on variables I often encounter in my projects for my thesis work. I work with species functional traits and how those traits affect some life history activity of the species. In this case, I am making a dataset on measured traits of species within a rich fen plant community in Vermont.
A “rich fen” is a type of wetland characterized by a high concentration of minerals in its groundwater, resulting in a diverse plant community and typically a higher pH compared to other fens; it is essentially a fen that is fed by mineral-rich water, usually containing high levels of calcium, creating a more alkaline environment compared to a “poor fen” which receives less mineral-rich water. Rich Fens typically occur on gentle slopes and have shallow peat accumulations of less than three feet, although in some cases the peat is deeper. The peat is saturated throughout the growing season, and there may be small, shallow pools scattered over the generally concave surface of the fen.
Rich Fens are dominated by brown mosses and low sedges and grasses. Characteristic mosses are starry campylium, Scorpidium revolvens, Calliergonella cuspidata, Philonotis fontana, Ptychostomum pseudotriquetrum, and Tomenthypnum nitens. The low, herbaceous cover is primarily sedges, with inland sedge, porcupine sedge, yellow sedge, and Hudson Bay bulrush present in most fens. Other characteristic herbs include bristle-stalked sedge, water avens, single-spike muhlenbergia, Kalm’s lobelia, golden ragwort, and blue flag. Many other herbaceous plants may be present. Red-osier dogwood is a shrub that occurs in most Rich Fens but is seldom very abundant. Shrubby cinquefoil and alder-leaved buckthorn are scattered across many fens, and they are locally abundant in others.
In this study, we sampled species traits along with other environmental variables from single 2m x 2m quadrats set up in three parts of the Eshqua Bog Natural Area. One part with invasive plant species present, the second with invasives previously present but removed and the last without invasives present.
There will be a significant difference in the traits of species found within the different sites sampled and this will be influenced by the presence or absence of invasive plants within the different habitats.
Names of the Sites Sampled From = site_id
Names of the Species sampled = species_names
Plant Height = plant_height
Stem Width = stem_width
Number of Flowers = number_of_flowers
Number of Leaves = number_of_leaves
Number of Fruits = number_of_fruits
soil pH = soil_pH
Calcium concentrations within the soils in the sampled quadrats = Ca_concentration
For more information about rich fen plant communities, please visit ANR’s website on rich fens in Vermont
library(skimr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
site_id <- c("invaded", "non-invaded", "removed")
species_name <- c("Starry campylium", "Scorpidium revolvens", "Calliergonella cuspidata", "Bebris thunbergii", "Reynoutria japonica", "inland sedge", "porcupine sedge", "yellow sedge", "bristle-stalked sedge", "water avens", "Philonotis fontana", "Ptychostomum pseudotriquetrum", "Carex schweinitzii", "Eriophorum gracile", "Malaxis unifolia", "Elodium blandowii", "Salix pedicellaris", "Pyrola asarifolia ssp. asarifolia", "Cladium mariscoides", "Salix candida", "bristle-stalked sedge", "Starry campylium", "Scorpidium revolvens", "Reynoutria japonica", "Elodium blandowii", "inland sedge", "yellow sedge", "Malaxis unifolia", "Cypripedium parviflorum var. makasin", "Meesia triquetra")
plant_height <- c(round(rnorm(30, mean=15)))
print(plant_height)
## [1] 15 13 16 14 14 17 16 15 16 15 16 16 17 15 14 13 15 15 13 15 15 14 14 15 14
## [26] 16 16 16 15 14
stem_width <- c(runif(30))
print(stem_width)
## [1] 0.28100728 0.93440328 0.82222774 0.07240229 0.41737969 0.90075674
## [7] 0.80201923 0.12082624 0.35281072 0.83965886 0.69739285 0.49007923
## [13] 0.04414660 0.97920091 0.18392634 0.09298427 0.22861106 0.71188001
## [19] 0.37551254 0.37153156 0.14617333 0.55383641 0.89122724 0.53960237
## [25] 0.71807039 0.01337633 0.10658419 0.29943843 0.31452851 0.55065740
number_of_leaves <- c(round(rnorm(30, mean=15)))
print(number_of_leaves)
## [1] 16 15 15 16 15 15 14 14 16 15 14 15 16 16 15 14 17 15 15 15 16 15 14 15 15
## [26] 15 14 15 15 17
number_of_flowers <- c(round(rnorm(30, mean = 5)))
print(number_of_flowers)
## [1] 3 5 5 5 5 7 7 5 5 5 4 4 6 5 5 5 3 5 3 7 5 5 5 5 4 5 5 6 5 5
number_of_fruits <- c(round(rnorm(30, mean=10)))
print(number_of_fruits)
## [1] 10 10 10 9 9 10 8 8 9 8 11 9 11 9 9 10 10 9 10 11 8 8 12 10 9
## [26] 10 7 10 9 10
soil_pH <- c(runif(30, max = 7.8, min = 5.8)) #this is usually the range of pH within rich fen plant communities
print(soil_pH)
## [1] 6.179223 5.924271 6.953491 6.116898 5.991086 7.276564 7.374307 7.671802
## [9] 7.070537 7.567088 6.898847 5.983518 6.580458 7.647405 7.555940 7.555285
## [17] 6.537887 6.743895 7.316962 7.438233 7.182692 7.023356 7.796270 6.453866
## [25] 6.962816 6.103290 6.184326 7.539018 6.854978 7.322008
Ca_concentration <- c(runif(30))
print(Ca_concentration)
## [1] 0.887074623 0.288292418 0.008934012 0.222280391 0.804271344 0.570927101
## [7] 0.691769784 0.998101380 0.734300998 0.971595553 0.605159111 0.777046305
## [13] 0.725952657 0.675790500 0.504934371 0.063716586 0.670083543 0.782609522
## [19] 0.684893243 0.875211860 0.347135382 0.985216903 0.959211319 0.892157953
## [25] 0.727104066 0.165423878 0.121526581 0.511594281 0.411209329 0.326339227
# make the dataframe
data.frame(site_id, species_name, plant_height, stem_width, number_of_leaves, number_of_flowers, number_of_fruits, soil_pH, Ca_concentration) -> trait_data
view(trait_data) # awesome, fake dataset created!
skim(trait_data)
Name | trait_data |
Number of rows | 30 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
site_id | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
species_name | 0 | 1 | 11 | 36 | 0 | 22 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
plant_height | 0 | 1 | 14.97 | 1.10 | 13.00 | 14.00 | 15.00 | 16.00 | 17.00 | ▂▆▇▆▂ |
stem_width | 0 | 1 | 0.46 | 0.31 | 0.01 | 0.20 | 0.40 | 0.72 | 0.98 | ▇▇▅▃▇ |
number_of_leaves | 0 | 1 | 15.13 | 0.82 | 14.00 | 15.00 | 15.00 | 15.75 | 17.00 | ▃▇▁▃▁ |
number_of_flowers | 0 | 1 | 4.97 | 1.00 | 3.00 | 5.00 | 5.00 | 5.00 | 7.00 | ▁▁▇▁▁ |
number_of_fruits | 0 | 1 | 9.43 | 1.10 | 7.00 | 9.00 | 9.50 | 10.00 | 12.00 | ▅▆▇▂▁ |
soil_pH | 0 | 1 | 6.93 | 0.59 | 5.92 | 6.47 | 6.99 | 7.42 | 7.80 | ▇▃▆▆▇ |
Ca_concentration | 0 | 1 | 0.60 | 0.29 | 0.01 | 0.36 | 0.68 | 0.80 | 1.00 | ▃▃▃▇▆ |
summary(trait_data)
## site_id species_name plant_height stem_width
## Length:30 Length:30 Min. :13.00 Min. :0.01338
## Class :character Class :character 1st Qu.:14.00 1st Qu.:0.19510
## Mode :character Mode :character Median :15.00 Median :0.39645
## Mean :14.97 Mean :0.46174
## 3rd Qu.:16.00 3rd Qu.:0.71652
## Max. :17.00 Max. :0.97920
## number_of_leaves number_of_flowers number_of_fruits soil_pH
## Min. :14.00 Min. :3.000 Min. : 7.000 Min. :5.924
## 1st Qu.:15.00 1st Qu.:5.000 1st Qu.: 9.000 1st Qu.:6.475
## Median :15.00 Median :5.000 Median : 9.500 Median :6.993
## Mean :15.13 Mean :4.967 Mean : 9.433 Mean :6.927
## 3rd Qu.:15.75 3rd Qu.:5.000 3rd Qu.:10.000 3rd Qu.:7.422
## Max. :17.00 Max. :7.000 Max. :12.000 Max. :7.796
## Ca_concentration
## Min. :0.008934
## 1st Qu.:0.363154
## Median :0.680342
## Mean :0.599662
## 3rd Qu.:0.798856
## Max. :0.998101
# make a presence and absence column for the invasive species
trait_data$invasive_presence <- ifelse(trait_data$species_name %in% c("Bebris thunbergii", "Reynoutria japonica"), 1, 0)
# The most basic thing you could when exploring your data would be some kind of box plots (depending on your type of data of course!), so that what i'll do in this case. But first;
# let's convert our data into a longer format
long_data <- trait_data %>% pivot_longer(!c(site_id, species_name), names_to = "trait", values_to = "value")
# visualize traits by sites in a single figure
ggplot(data=long_data,
aes(x=trait,y=value, fill=site_id))+
geom_boxplot() +
theme_classic() +
theme(axis.text.x = element_blank()) +
facet_wrap(trait~.,scale = "free")
# looks like invasive species do leave behind an effect on the traits of species within the habitats they are found based on these box plots. But are these real differences? We will have to test our data for actual differences.
manova <- manova(cbind(plant_height, stem_width, number_of_leaves, number_of_flowers, number_of_fruits) ~ site_id, data = trait_data)
summary(manova) #looks like our box plots deceived us after all and our hypothesis may not be correct after all, plant traits are similar across sites regardless of the presence of invasives (p>0.1899).
## Df Pillai approx F num Df den Df Pr(>F)
## site_id 2 0.23105 0.62695 10 48 0.7834
## Residuals 27
mod1 <- glm(invasive_presence ~ plant_height + stem_width + number_of_leaves + number_of_flowers + number_of_fruits, data = trait_data)
summary(mod1)
##
## Call:
## glm(formula = invasive_presence ~ plant_height + stem_width +
## number_of_leaves + number_of_flowers + number_of_fruits,
## data = trait_data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.489282 1.476732 0.331 0.743
## plant_height -0.071671 0.059240 -1.210 0.238
## stem_width -0.141534 0.201682 -0.702 0.490
## number_of_leaves 0.038001 0.076623 0.496 0.624
## number_of_flowers 0.046880 0.067560 0.694 0.494
## number_of_fruits -0.006273 0.054882 -0.114 0.910
##
## (Dispersion parameter for gaussian family taken to be 0.1035038)
##
## Null deviance: 2.7000 on 29 degrees of freedom
## Residual deviance: 2.4841 on 24 degrees of freedom
## AIC: 24.398
##
## Number of Fisher Scoring iterations: 2
plot(mod1)
# looks like our hypothesis will be rejected as there aren't any significant differences in traits across the sites irrespective of the presence of invasives. However, i wonder if the results will be the same if we increased our sample sizes.
set.seed(123)
# Set up storage for results
results <- data.frame(sample_size = integer(),
effect_size = numeric(),
trait = character(),
p_value = numeric())
# Define the traits to analyze
traits <- c("plant_height", "stem_width", "number_of_leaves",
"number_of_flowers", "number_of_fruits")
# randomly sample data from trait_data, and run a linear mixed effects model on them
for (n in seq(10, nrow(trait_data), by = 5)) { # Increase sample size by 5 each iteration
# Randomly sample n observations
temp_data <- trait_data[sample(1:nrow(trait_data), n, replace = TRUE), ]
# Fit logistic regression model
mod <- glm(invasive_presence ~ plant_height + stem_width + number_of_leaves +
number_of_flowers + number_of_fruits, family = binomial, data = temp_data)
# Extract effect sizes and p-values for all traits
for (trait in traits) {
effect_size <- coef(mod)[trait]
p_value <- summary(mod)$coefficients[trait, 4] # p-value for the trait
# Store results
results <- rbind(results, data.frame(sample_size = n, trait = trait,
effect_size = effect_size, p_value = p_value))
}
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(results) # looks like to actually detect real difference i might want to further increase sample size beyond 30 because there's nothing significant popping up. Hopefully my graphs tell a better narrative of my observations.
## sample_size trait effect_size p_value
## plant_height 10 plant_height -18.5332267 0.9972969
## stem_width 10 stem_width 1.6086051 0.8763597
## number_of_leaves 10 number_of_leaves -10.4848255 0.9999998
## number_of_flowers 10 number_of_flowers 9.3813917 0.9972634
## number_of_fruits 10 number_of_fruits -0.4527097 0.7396533
## plant_height1 15 plant_height -80.8796463 0.9994660
## stem_width1 15 stem_width -250.0291975 0.9998209
## number_of_leaves1 15 number_of_leaves 30.8753950 0.9999580
## number_of_flowers1 15 number_of_flowers 74.6277971 0.9995312
## number_of_fruits1 15 number_of_fruits 11.4362910 0.9999389
## plant_height2 20 plant_height -26.4745363 0.9995847
## stem_width2 20 stem_width -164.1496877 0.9993961
## number_of_leaves2 20 number_of_leaves 26.3623151 0.9995585
## number_of_flowers2 20 number_of_flowers 14.6976330 0.9999190
## number_of_fruits2 20 number_of_fruits 6.5331718 0.9998921
## plant_height3 25 plant_height -50.1627082 0.9971155
## stem_width3 25 stem_width -1.0098463 0.8648878
## number_of_leaves3 25 number_of_leaves -0.2616747 0.9037458
## number_of_flowers3 25 number_of_flowers 17.2768117 0.9975784
## number_of_fruits3 25 number_of_fruits -48.7532383 0.9971966
## plant_height4 30 plant_height -1.7187274 0.2805717
## stem_width4 30 stem_width -8.6949963 0.1907439
## number_of_leaves4 30 number_of_leaves 2.0421660 0.1790735
## number_of_flowers4 30 number_of_flowers 1.7117770 0.4268387
## number_of_fruits4 30 number_of_fruits -1.7046172 0.2392787
# Plotting P-Values for Each Trait Across Sample Sizes
ggplot(results, aes(x = sample_size, y = p_value, color = trait)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
labs(title = "Impact of Sample Size on the Statistical Significance for All Traits",
x = "Sample Size", y = "P-Value") +
theme_classic()
# Plotting Effect Sizes for Each Trait Across Sample Sizes
ggplot(results, aes(x = sample_size, y = effect_size, color = trait)) +
geom_point() +
geom_line() +
labs(title = "Effect Size vs. Sample Size for All Traits",
x = "Sample Size", y = "Effect Size (Estimate)") +
theme_classic()
# yep! Never lied! But mind you, this is for an increase by 5 at each iteration, what if we increased it by 10?
set.seed(123)
# Set up storage for results
results1 <- data.frame(sample_size = integer(),
effect_size = numeric(),
trait = character(),
p_value = numeric())
# Define the traits to analyze
traits <- c("plant_height", "stem_width", "number_of_leaves",
"number_of_flowers", "number_of_fruits")
# randomly sample data from trait_data, and run a linear mixed effects model on them
for (n in seq(10, nrow(trait_data), by = 10)) { # Increase sample size by 10 each iteration
# Randomly sample n observations
temp_data <- trait_data[sample(1:nrow(trait_data), n, replace = TRUE), ]
# Fit logistic regression model
mod1 <- glm(invasive_presence ~ plant_height + stem_width + number_of_leaves +
number_of_flowers + number_of_fruits, family = binomial, data = temp_data)
# Extract effect sizes and p-values for all traits
for (trait in traits) {
effect_size1 <- coef(mod1)[trait]
p_value1 <- summary(mod1)$coefficients[trait, 4] # p-value for the trait
# Store results
results1 <- rbind(results, data.frame(sample_size = n, trait = trait,
effect_size = effect_size1, p_value = p_value1))
}
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(results1)
## sample_size trait effect_size p_value
## plant_height 10 plant_height -18.5332267 0.9972969
## stem_width 10 stem_width 1.6086051 0.8763597
## number_of_leaves 10 number_of_leaves -10.4848255 0.9999998
## number_of_flowers 10 number_of_flowers 9.3813917 0.9972634
## number_of_fruits 10 number_of_fruits -0.4527097 0.7396533
## plant_height1 15 plant_height -80.8796463 0.9994660
## stem_width1 15 stem_width -250.0291975 0.9998209
## number_of_leaves1 15 number_of_leaves 30.8753950 0.9999580
## number_of_flowers1 15 number_of_flowers 74.6277971 0.9995312
## number_of_fruits1 15 number_of_fruits 11.4362910 0.9999389
## plant_height2 20 plant_height -26.4745363 0.9995847
## stem_width2 20 stem_width -164.1496877 0.9993961
## number_of_leaves2 20 number_of_leaves 26.3623151 0.9995585
## number_of_flowers2 20 number_of_flowers 14.6976330 0.9999190
## number_of_fruits2 20 number_of_fruits 6.5331718 0.9998921
## plant_height3 25 plant_height -50.1627082 0.9971155
## stem_width3 25 stem_width -1.0098463 0.8648878
## number_of_leaves3 25 number_of_leaves -0.2616747 0.9037458
## number_of_flowers3 25 number_of_flowers 17.2768117 0.9975784
## number_of_fruits3 25 number_of_fruits -48.7532383 0.9971966
## plant_height4 30 plant_height -1.7187274 0.2805717
## stem_width4 30 stem_width -8.6949963 0.1907439
## number_of_leaves4 30 number_of_leaves 2.0421660 0.1790735
## number_of_flowers4 30 number_of_flowers 1.7117770 0.4268387
## number_of_fruits4 30 number_of_fruits -1.7046172 0.2392787
## number_of_fruits5 30 number_of_fruits -39.8042697 0.9982792
# Plotting P-Values for Each Trait Across Sample Sizes
ggplot(results1, aes(x = sample_size, y = p_value, color = trait)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
labs(title = "Impact of Sample Size on the Statistical Significance for All Traits",
x = "Sample Size", y = "P-Value") +
theme_classic()
# Plotting Effect Sizes for Each Trait Across Sample Sizes
ggplot(results1, aes(x = sample_size, y = effect_size, color = trait)) +
geom_point() +
geom_line() +
labs(title = "Effect Size vs. Sample Size for All Traits",
x = "Sample Size", y = "Effect Size (Estimate)") +
theme_classic() # don't think nothing changed!