Description of Dataset

My pseudo-dataset is fashioned on variables I often encounter in my projects for my thesis work. I work with species functional traits and how those traits affect some life history activity of the species. In this case, I am making a dataset on measured traits of species within a rich fen plant community in Vermont.

A “rich fen” is a type of wetland characterized by a high concentration of minerals in its groundwater, resulting in a diverse plant community and typically a higher pH compared to other fens; it is essentially a fen that is fed by mineral-rich water, usually containing high levels of calcium, creating a more alkaline environment compared to a “poor fen” which receives less mineral-rich water. Rich Fens typically occur on gentle slopes and have shallow peat accumulations of less than three feet, although in some cases the peat is deeper. The peat is saturated throughout the growing season, and there may be small, shallow pools scattered over the generally concave surface of the fen.

Rich Fens are dominated by brown mosses and low sedges and grasses. Characteristic mosses are starry campylium, Scorpidium revolvens, Calliergonella cuspidata, Philonotis fontana, Ptychostomum pseudotriquetrum, and Tomenthypnum nitens. The low, herbaceous cover is primarily sedges, with inland sedge, porcupine sedge, yellow sedge, and Hudson Bay bulrush present in most fens. Other characteristic herbs include bristle-stalked sedge, water avens, single-spike muhlenbergia, Kalm’s lobelia, golden ragwort, and blue flag. Many other herbaceous plants may be present. Red-osier dogwood is a shrub that occurs in most Rich Fens but is seldom very abundant. Shrubby cinquefoil and alder-leaved buckthorn are scattered across many fens, and they are locally abundant in others.

Study Area and Design

In this study, we sampled species traits along with other environmental variables from single 2m x 2m quadrats set up in three parts of the Eshqua Bog Natural Area. One part with invasive plant species present, the second with invasives previously present but removed and the last without invasives present.

Hypothesis:

There will be a significant difference in the traits of species found within the different sites sampled and this will be influenced by the presence or absence of invasive plants within the different habitats.

Variables in my dataset:

Names of the Sites Sampled From = site_id

Names of the Species sampled = species_names

Plant Height = plant_height

Stem Width = stem_width

Number of Flowers = number_of_flowers

Number of Leaves = number_of_leaves

Number of Fruits = number_of_fruits

soil pH = soil_pH

Calcium concentrations within the soils in the sampled quadrats = Ca_concentration

For more information about rich fen plant communities, please visit ANR’s website on rich fens in Vermont

loading necessary libraries

library(skimr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Create the fake dataset (data simulation)

site_id <- c("invaded", "non-invaded", "removed")

species_name <- c("Starry campylium", "Scorpidium revolvens", "Calliergonella cuspidata", "Bebris thunbergii", "Reynoutria japonica", "inland sedge", "porcupine sedge", "yellow sedge", "bristle-stalked sedge", "water avens", "Philonotis fontana", "Ptychostomum pseudotriquetrum", "Carex schweinitzii", "Eriophorum gracile", "Malaxis unifolia", "Elodium blandowii", "Salix pedicellaris", "Pyrola asarifolia ssp. asarifolia", "Cladium mariscoides", "Salix candida", "bristle-stalked sedge", "Starry campylium", "Scorpidium revolvens", "Reynoutria japonica", "Elodium blandowii", "inland sedge", "yellow sedge", "Malaxis unifolia", "Cypripedium parviflorum var. makasin", "Meesia triquetra")

plant_height <- c(round(rnorm(30, mean=15)))

print(plant_height)
##  [1] 15 13 16 14 14 17 16 15 16 15 16 16 17 15 14 13 15 15 13 15 15 14 14 15 14
## [26] 16 16 16 15 14
stem_width <- c(runif(30))
print(stem_width)
##  [1] 0.28100728 0.93440328 0.82222774 0.07240229 0.41737969 0.90075674
##  [7] 0.80201923 0.12082624 0.35281072 0.83965886 0.69739285 0.49007923
## [13] 0.04414660 0.97920091 0.18392634 0.09298427 0.22861106 0.71188001
## [19] 0.37551254 0.37153156 0.14617333 0.55383641 0.89122724 0.53960237
## [25] 0.71807039 0.01337633 0.10658419 0.29943843 0.31452851 0.55065740
number_of_leaves <- c(round(rnorm(30, mean=15)))

print(number_of_leaves)
##  [1] 16 15 15 16 15 15 14 14 16 15 14 15 16 16 15 14 17 15 15 15 16 15 14 15 15
## [26] 15 14 15 15 17
number_of_flowers <- c(round(rnorm(30, mean = 5)))
print(number_of_flowers)
##  [1] 3 5 5 5 5 7 7 5 5 5 4 4 6 5 5 5 3 5 3 7 5 5 5 5 4 5 5 6 5 5
number_of_fruits <- c(round(rnorm(30, mean=10)))
print(number_of_fruits)
##  [1] 10 10 10  9  9 10  8  8  9  8 11  9 11  9  9 10 10  9 10 11  8  8 12 10  9
## [26] 10  7 10  9 10
soil_pH <- c(runif(30, max = 7.8, min = 5.8)) #this is usually the range of pH within rich fen plant communities
print(soil_pH)
##  [1] 6.179223 5.924271 6.953491 6.116898 5.991086 7.276564 7.374307 7.671802
##  [9] 7.070537 7.567088 6.898847 5.983518 6.580458 7.647405 7.555940 7.555285
## [17] 6.537887 6.743895 7.316962 7.438233 7.182692 7.023356 7.796270 6.453866
## [25] 6.962816 6.103290 6.184326 7.539018 6.854978 7.322008
Ca_concentration <- c(runif(30))
print(Ca_concentration)
##  [1] 0.887074623 0.288292418 0.008934012 0.222280391 0.804271344 0.570927101
##  [7] 0.691769784 0.998101380 0.734300998 0.971595553 0.605159111 0.777046305
## [13] 0.725952657 0.675790500 0.504934371 0.063716586 0.670083543 0.782609522
## [19] 0.684893243 0.875211860 0.347135382 0.985216903 0.959211319 0.892157953
## [25] 0.727104066 0.165423878 0.121526581 0.511594281 0.411209329 0.326339227
# make the dataframe
data.frame(site_id, species_name, plant_height, stem_width, number_of_leaves, number_of_flowers, number_of_fruits, soil_pH, Ca_concentration) -> trait_data 
  
view(trait_data) # awesome, fake dataset created!

skim(trait_data)
Data summary
Name trait_data
Number of rows 30
Number of columns 9
_______________________
Column type frequency:
character 2
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
site_id 0 1 7 11 0 3 0
species_name 0 1 11 36 0 22 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
plant_height 0 1 14.97 1.10 13.00 14.00 15.00 16.00 17.00 ▂▆▇▆▂
stem_width 0 1 0.46 0.31 0.01 0.20 0.40 0.72 0.98 ▇▇▅▃▇
number_of_leaves 0 1 15.13 0.82 14.00 15.00 15.00 15.75 17.00 ▃▇▁▃▁
number_of_flowers 0 1 4.97 1.00 3.00 5.00 5.00 5.00 7.00 ▁▁▇▁▁
number_of_fruits 0 1 9.43 1.10 7.00 9.00 9.50 10.00 12.00 ▅▆▇▂▁
soil_pH 0 1 6.93 0.59 5.92 6.47 6.99 7.42 7.80 ▇▃▆▆▇
Ca_concentration 0 1 0.60 0.29 0.01 0.36 0.68 0.80 1.00 ▃▃▃▇▆
summary(trait_data)
##    site_id          species_name        plant_height     stem_width     
##  Length:30          Length:30          Min.   :13.00   Min.   :0.01338  
##  Class :character   Class :character   1st Qu.:14.00   1st Qu.:0.19510  
##  Mode  :character   Mode  :character   Median :15.00   Median :0.39645  
##                                        Mean   :14.97   Mean   :0.46174  
##                                        3rd Qu.:16.00   3rd Qu.:0.71652  
##                                        Max.   :17.00   Max.   :0.97920  
##  number_of_leaves number_of_flowers number_of_fruits    soil_pH     
##  Min.   :14.00    Min.   :3.000     Min.   : 7.000   Min.   :5.924  
##  1st Qu.:15.00    1st Qu.:5.000     1st Qu.: 9.000   1st Qu.:6.475  
##  Median :15.00    Median :5.000     Median : 9.500   Median :6.993  
##  Mean   :15.13    Mean   :4.967     Mean   : 9.433   Mean   :6.927  
##  3rd Qu.:15.75    3rd Qu.:5.000     3rd Qu.:10.000   3rd Qu.:7.422  
##  Max.   :17.00    Max.   :7.000     Max.   :12.000   Max.   :7.796  
##  Ca_concentration  
##  Min.   :0.008934  
##  1st Qu.:0.363154  
##  Median :0.680342  
##  Mean   :0.599662  
##  3rd Qu.:0.798856  
##  Max.   :0.998101
# make a presence and absence column for the invasive species
trait_data$invasive_presence <- ifelse(trait_data$species_name %in% c("Bebris thunbergii", "Reynoutria japonica"), 1, 0)

Exploring our data

# The most basic thing you could when exploring your data would be some kind of box plots (depending on your type of data of course!), so that what i'll do in this case. But first;

# let's convert our data into a longer format
long_data <- trait_data %>% pivot_longer(!c(site_id, species_name), names_to = "trait", values_to = "value")

# visualize traits by sites in a single figure
ggplot(data=long_data, 
       aes(x=trait,y=value, fill=site_id))+
  geom_boxplot() + 
  theme_classic() + 
  theme(axis.text.x = element_blank()) +
  facet_wrap(trait~.,scale = "free")

# looks like invasive species do leave behind an effect on the traits of species within the habitats they are found based on these box plots. But are these real differences? We will have to test our data for actual differences.

Testing for differences in traits across sites using a MANOVA

manova <- manova(cbind(plant_height, stem_width, number_of_leaves, number_of_flowers, number_of_fruits) ~ site_id, data = trait_data)

summary(manova) #looks like our box plots deceived us after all and our hypothesis may not be correct after all, plant traits are similar across sites regardless of the presence of invasives (p>0.1899).
##           Df  Pillai approx F num Df den Df Pr(>F)
## site_id    2 0.23105  0.62695     10     48 0.7834
## Residuals 27

Testing for the effects of invasives on species traits using a logistic regression model

mod1 <- glm(invasive_presence ~ plant_height + stem_width + number_of_leaves + number_of_flowers + number_of_fruits, data = trait_data)

summary(mod1)
## 
## Call:
## glm(formula = invasive_presence ~ plant_height + stem_width + 
##     number_of_leaves + number_of_flowers + number_of_fruits, 
##     data = trait_data)
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)        0.489282   1.476732   0.331    0.743
## plant_height      -0.071671   0.059240  -1.210    0.238
## stem_width        -0.141534   0.201682  -0.702    0.490
## number_of_leaves   0.038001   0.076623   0.496    0.624
## number_of_flowers  0.046880   0.067560   0.694    0.494
## number_of_fruits  -0.006273   0.054882  -0.114    0.910
## 
## (Dispersion parameter for gaussian family taken to be 0.1035038)
## 
##     Null deviance: 2.7000  on 29  degrees of freedom
## Residual deviance: 2.4841  on 24  degrees of freedom
## AIC: 24.398
## 
## Number of Fisher Scoring iterations: 2
plot(mod1)

# looks like our hypothesis will be rejected as there aren't any significant differences in traits across the sites irrespective of the presence of invasives. However, i wonder if the results will be the same if we increased our sample sizes.

Explore Sample size effects when we increase sample size by 5 each iteration

set.seed(123)
# Set up storage for results
results <- data.frame(sample_size = integer(),
                      effect_size = numeric(),
                       trait = character(),
                      p_value = numeric())

# Define the traits to analyze
traits <- c("plant_height", "stem_width", "number_of_leaves", 
            "number_of_flowers", "number_of_fruits")

# randomly sample data from trait_data, and run a linear mixed effects model on them
for (n in seq(10, nrow(trait_data), by = 5)) {  # Increase sample size by 5 each iteration

  # Randomly sample n observations
  temp_data <- trait_data[sample(1:nrow(trait_data), n, replace = TRUE), ]

  # Fit  logistic regression model
  mod <- glm(invasive_presence ~ plant_height + stem_width + number_of_leaves + 
             number_of_flowers + number_of_fruits, family = binomial, data = temp_data)
  
  # Extract effect sizes and p-values for all traits
  for (trait in traits) {
    effect_size <- coef(mod)[trait]
    p_value <- summary(mod)$coefficients[trait, 4]  # p-value for the trait

    # Store results
    results <- rbind(results, data.frame(sample_size = n, trait = trait, 
                                         effect_size = effect_size, p_value = p_value))
  }
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(results) # looks like to actually detect real difference i might want to further increase sample size beyond 30 because there's nothing significant popping up. Hopefully my graphs tell a better narrative of my observations.
##                    sample_size             trait  effect_size   p_value
## plant_height                10      plant_height  -18.5332267 0.9972969
## stem_width                  10        stem_width    1.6086051 0.8763597
## number_of_leaves            10  number_of_leaves  -10.4848255 0.9999998
## number_of_flowers           10 number_of_flowers    9.3813917 0.9972634
## number_of_fruits            10  number_of_fruits   -0.4527097 0.7396533
## plant_height1               15      plant_height  -80.8796463 0.9994660
## stem_width1                 15        stem_width -250.0291975 0.9998209
## number_of_leaves1           15  number_of_leaves   30.8753950 0.9999580
## number_of_flowers1          15 number_of_flowers   74.6277971 0.9995312
## number_of_fruits1           15  number_of_fruits   11.4362910 0.9999389
## plant_height2               20      plant_height  -26.4745363 0.9995847
## stem_width2                 20        stem_width -164.1496877 0.9993961
## number_of_leaves2           20  number_of_leaves   26.3623151 0.9995585
## number_of_flowers2          20 number_of_flowers   14.6976330 0.9999190
## number_of_fruits2           20  number_of_fruits    6.5331718 0.9998921
## plant_height3               25      plant_height  -50.1627082 0.9971155
## stem_width3                 25        stem_width   -1.0098463 0.8648878
## number_of_leaves3           25  number_of_leaves   -0.2616747 0.9037458
## number_of_flowers3          25 number_of_flowers   17.2768117 0.9975784
## number_of_fruits3           25  number_of_fruits  -48.7532383 0.9971966
## plant_height4               30      plant_height   -1.7187274 0.2805717
## stem_width4                 30        stem_width   -8.6949963 0.1907439
## number_of_leaves4           30  number_of_leaves    2.0421660 0.1790735
## number_of_flowers4          30 number_of_flowers    1.7117770 0.4268387
## number_of_fruits4           30  number_of_fruits   -1.7046172 0.2392787

visualize the effects of sample size on Statistical Significance for All Traits

# Plotting P-Values for Each Trait Across Sample Sizes
ggplot(results, aes(x = sample_size, y = p_value, color = trait)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
  labs(title = "Impact of Sample Size on the Statistical Significance for All Traits",
       x = "Sample Size", y = "P-Value") +
  theme_classic()

# Plotting Effect Sizes for Each Trait Across Sample Sizes
ggplot(results, aes(x = sample_size, y = effect_size, color = trait)) +
  geom_point() +
  geom_line() +
  labs(title = "Effect Size vs. Sample Size for All Traits",
       x = "Sample Size", y = "Effect Size (Estimate)") +
  theme_classic()

# yep! Never lied! But mind you, this is for an increase by 5 at each iteration, what if we increased it by 10?

Explore Sample size effects when we increase sample size by 10 each iteration

set.seed(123)

# Set up storage for results
results1 <- data.frame(sample_size = integer(),
                      effect_size = numeric(),
                       trait = character(),
                      p_value = numeric())

# Define the traits to analyze
traits <- c("plant_height", "stem_width", "number_of_leaves", 
            "number_of_flowers", "number_of_fruits")

# randomly sample data from trait_data, and run a linear mixed effects model on them
for (n in seq(10, nrow(trait_data), by = 10)) {  # Increase sample size by 10 each iteration

  # Randomly sample n observations
  temp_data <- trait_data[sample(1:nrow(trait_data), n, replace = TRUE), ]

  # Fit  logistic regression model
  mod1 <- glm(invasive_presence ~ plant_height + stem_width + number_of_leaves + 
             number_of_flowers + number_of_fruits, family = binomial, data = temp_data)
  
  # Extract effect sizes and p-values for all traits
  for (trait in traits) {
    effect_size1 <- coef(mod1)[trait]
    p_value1 <- summary(mod1)$coefficients[trait, 4]  # p-value for the trait

    # Store results
    results1 <- rbind(results, data.frame(sample_size = n, trait = trait, 
                                         effect_size = effect_size1, p_value = p_value1))
  }
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(results1) 
##                    sample_size             trait  effect_size   p_value
## plant_height                10      plant_height  -18.5332267 0.9972969
## stem_width                  10        stem_width    1.6086051 0.8763597
## number_of_leaves            10  number_of_leaves  -10.4848255 0.9999998
## number_of_flowers           10 number_of_flowers    9.3813917 0.9972634
## number_of_fruits            10  number_of_fruits   -0.4527097 0.7396533
## plant_height1               15      plant_height  -80.8796463 0.9994660
## stem_width1                 15        stem_width -250.0291975 0.9998209
## number_of_leaves1           15  number_of_leaves   30.8753950 0.9999580
## number_of_flowers1          15 number_of_flowers   74.6277971 0.9995312
## number_of_fruits1           15  number_of_fruits   11.4362910 0.9999389
## plant_height2               20      plant_height  -26.4745363 0.9995847
## stem_width2                 20        stem_width -164.1496877 0.9993961
## number_of_leaves2           20  number_of_leaves   26.3623151 0.9995585
## number_of_flowers2          20 number_of_flowers   14.6976330 0.9999190
## number_of_fruits2           20  number_of_fruits    6.5331718 0.9998921
## plant_height3               25      plant_height  -50.1627082 0.9971155
## stem_width3                 25        stem_width   -1.0098463 0.8648878
## number_of_leaves3           25  number_of_leaves   -0.2616747 0.9037458
## number_of_flowers3          25 number_of_flowers   17.2768117 0.9975784
## number_of_fruits3           25  number_of_fruits  -48.7532383 0.9971966
## plant_height4               30      plant_height   -1.7187274 0.2805717
## stem_width4                 30        stem_width   -8.6949963 0.1907439
## number_of_leaves4           30  number_of_leaves    2.0421660 0.1790735
## number_of_flowers4          30 number_of_flowers    1.7117770 0.4268387
## number_of_fruits4           30  number_of_fruits   -1.7046172 0.2392787
## number_of_fruits5           30  number_of_fruits  -39.8042697 0.9982792

visualize the effects of sample size on Statistical Significance for All Traits

# Plotting P-Values for Each Trait Across Sample Sizes
ggplot(results1, aes(x = sample_size, y = p_value, color = trait)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
  labs(title = "Impact of Sample Size on the Statistical Significance for All Traits",
       x = "Sample Size", y = "P-Value") +
  theme_classic()

# Plotting Effect Sizes for Each Trait Across Sample Sizes
ggplot(results1, aes(x = sample_size, y = effect_size, color = trait)) +
  geom_point() +
  geom_line() +
  labs(title = "Effect Size vs. Sample Size for All Traits",
       x = "Sample Size", y = "Effect Size (Estimate)") +
  theme_classic() # don't think nothing changed!