DNA Damage in VSMCs

Author

Kaitlyn Pipkin

Project Overview

The goal of this project is to assess the extent of DNA damage in vascular smooth muscle cells (VSMCs) following treatment with Doxorubicin. I will be using a data set that consists of 96 CSV files generated from ImageJ analysis. Each sample includes two channels: the DAPI (blue) channel, which stains cell nuclei, and the γH2AX (green) channel, an established marker of DNA damage. VSMCs were treated with Doxorubicin at four time points: days 3, 4, 7, and 11. VSMCs were also treated at different concentrations of Doxorubicin: 0 μM, 0.1 μM, 0.2 μM, and 0.4 μM. This study aims to determine the percentage of γH2AX signal normalized to nuclear area across different conditions to determine if there is any impact from dosage and duration on DNA damage.

Set the working directory

#setwd("C:/Users/kpipkin1/Documents/GitHub/kaitlyn/projects/DNA Damage in VSMCs")

Load the necessary libraries

library(tidyverse) 

Read in all of the CSV Files

file_paths <- list.files(pattern = "\\.csv$", full.names = TRUE)

In this code chunk, I am using Base R. My variable here is “file_paths”. This variable will contain a list of file names that I have stored in my current working directory. The “list.files” function searches for files in that current working directory. The “pattern=(2 back slashes).csv$” argument will allow for only the files that end in “.csv” to be included in my “file_paths” variable. This is because in regular expressions, a period means “any character” so we have to escape that with 2 backslashes to match to a period. Because the filename ends after csv, I use a dollar sign at the end of the pattern to indicate that csv is the end of the string so it will only pull files that end in .csv, not just a file that contains .csv. I use the “full.names=TRUE” peice to return the entire file path instead of just the file name itself. # Read all of the .csv Files into one Data set

Read all Files into one Data Set

all_data<-file_paths %>%
  set_names() %>%
  map_dfr(read_csv, .id = "filename")

My new variable here is “all_data”, which is where all of my file_paths are combined into one big data set. The file_paths%>% will start with the list of the .csv file paths and allow them to be piped in the next function, “set_names()”. This allows R to know which row came from which file. The “map_dfr(read_csv,.id=filename)” Makes R read each csv file in the list and combine all of the files together into one data frame. In addition, R then adds a new column to the big data frame called “filename”.The map_dfr function is from a package in tidyverse called purrr. dfr means data frame row bind.

Use File Names to Pull Metadata

all_data <- all_data %>%
  mutate(
    base_name = filename %>%
      basename() %>%
      str_remove(" \\(blue\\)| \\(green\\)") %>%
      str_remove("-summary\\.csv$"),
    day = str_extract(filename, "\\d+d"),
    dosage = case_when(
      str_detect(filename, "0\\.1uM") ~ "0.1uM",
      str_detect(filename, "0\\.2uM") ~ "0.2uM",
      str_detect(filename, "0\\.4uM") ~ "0.4uM",
      str_detect(filename, "no doxo") ~ "0uM",
      TRUE ~ NA_character_
    ),
    channel = case_when(
      str_detect(filename, "blue") ~ "blue",
      str_detect(filename, "green") ~ "green",
      TRUE ~ NA_character_
    )
  )

The “mutate()” function adds new columns to the big dataset by extracting the information from the filename column. The basename column uses only the dosage, sample ID, and timepoint to help join files together later. The timepoint is extracted from the filename and matches to patterns listed in the file name which in this case is “11d” for example. By using “case_when()” you can add a column for dosage by finding patterns like “0.1uM” or “no doxo” and defining what they mean. The channel column is based on the fluorescence color also pulled from the filename. Both the mutate and case_when functions are from packages in tidyverse called dplyr and stringr.

Separate the Data set into Blue and Green Channels

blue <- all_data %>%
  filter(channel == "blue") %>%
  select(base_name, day, dosage, Count, `Total Area`, `Average Size`, `%Area`, Mean) %>%
  rename_with(~ paste0(., "_blue"), -c(base_name, day, dosage))

green <- all_data %>%
  filter(channel == "green") %>%
  select(base_name, Count, `Total Area`, `Average Size`, `%Area`, Mean) %>%
  rename_with(~ paste0(., "_green"), -base_name)

This code split my dataset into two tables: one with images from the blue channel and one with images from the green channel. Then I rename the measurment columns with the appropriate channel name to help join them later. The filter and rename_with functions are both from packages in tidyverse called dplyr.

Join Green and Blue Channel Images Together

combined <- inner_join(blue, green, by = "base_name")

The “inner_join” function merges the blue and green data tables together based on their basename, which includes the dosage, timepoint, and sample ID. This function is from the dplyr package in tidyverse. The joining of these two datasets is a new varibale called “combined”.

Calculate the Percent of Green area (yH2AX) normalized to the Blue area (DAPI)

combined <- combined %>%
  mutate(percent_gammaH2AX = (`%Area_green` / `%Area_blue`) * 100)

Here, I add a new column called “percent_gammaH2AX to the”combined” dataset using the “mutate()” function. This column will divide the precent of the green area (the yH2AX signal, which measures DNA damage) by the percent of the blue area (DAPI signal, which measures the total nuclei). I then multiply this by 100 to get a percentage which gives me the normalized value that gives me the percent of DNA damage relative to the number of cells.

Clean Data

combined <- combined %>%
  filter(!is.na(dosage), !is.na(percent_gammaH2AX), !is.na(day)) %>%
  mutate(
    dosage = factor(dosage, levels = c("0uM", "0.1uM", "0.2uM", "0.4uM")),
    day = factor(day, levels = c("3d", "4d", "7d", "11d"))
  ) %>%
  droplevels()

In this step I remove any rows that do not have dosage, percent_gammaH2AX, or timepoints using “filter()”. Next, “mutate()” is used to convert the dosage and day columns into categorical variables and set them in ascending order. “droplevels” is then used to remove any dosage or day lables that were no longer used in my new filtered dataset. Drop levels, is.na, and factor are all functions in base R.

Pick Colors for the Plot

box_fill <- c(
  "3d" = "#66c2a5",  # green
  "4d" = "#fc8d62",  # orange
  "7d" = "#8da0cb",  # blue
  "11d" = "#e78ac3"  # pink
)

Here I made a variable called “box_fil” where each day is assigned to a specific color using the “c()” function.

Generate a Combined Plot looking at the Percent of DNA Damage Across Dosage of Doxorubicin

ggplot(combined, aes(x = dosage, y = percent_gammaH2AX, fill = day)) +
  geom_boxplot(position = position_dodge(width = 0.8), alpha = 0.4, width = 0.4, outlier.shape = NA) +
  geom_jitter(aes(color = day), 
              position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.8), 
              size = 2, alpha = 0.8) +
  scale_fill_manual(values = box_fill) +
  scale_color_manual(values = box_fill) +
  labs(
    title = "Assessing Dose- and Time-Dependent DNA Damage in VSMCs Following Doxorubicin Treatment",
    x = "Doxorubicin Dose (uM)",
    y = "% γH2AX / % DAPI",
    fill = "Day",
    color = "Day"
  ) +
  theme_minimal()

In this step, I am ready to graph so I will use the ggplot2 package in tidyverse. The “ggplot()” function creates a box plot. The “aes()” function maps dosage to the x axis and percent_gammaH2AX to the y axis and the boxes are filled based on their corresponding “box_fill” color. The “geom_boxplot()” function shoes the distribution f the values in each group and “geom_jitter() allows me to add the individual values. To make the individual points a darker color than their corresponding box_fill, I use”scale_fill_manual()” and “scale_color_manual()”. “labs()” allows me to name my title and axis lables. I used the “theme_minimal()”function to keep the plot clean and simple.

Plot by the Individual Days Separately

plot_day <- function(day_label) {
  combined %>%
    filter(day == day_label) %>%
    ggplot(aes(x = dosage, y = percent_gammaH2AX)) +
    geom_boxplot(fill = box_fill[day_label], alpha = 0.4, width = 0.3, outlier.shape = NA) +
    geom_jitter(color = box_fill[day_label], width = 0.15, size = 2, alpha = 0.8) +
    labs(
      title = paste("Dose-Dependent DNA Damage in VSMCs on", day_label),
      x = "Doxorubicin Dose (uM)",
      y = "% γH2AX / % DAPI"
    ) +
    theme_minimal()
}


plot_day("3d")

plot_day("4d")

plot_day("7d")

plot_day("11d")

In this final step, I wrote a function called “plot_day()”. I then use the argument, “day_label” to create a boxplot showing DNA damage across Doxorubicin treatments on that specific day. this filters the combined data set to only include rows matching the day.