Hourly Average Pollution Across All Stations for Each Hour of the Day

Understanding the Problem and Requirements

The goal is to calculate hourly average pollution across multiple stations for a full year. The dataset, pollution_contamimants_hourly, contains hourly air pollution measurements from 8 stations in 2022. For each hour of the day, we want the average pollution across all stations over the entire year.

Section 1: Preparing the Dataset

Before proceeding with the calculation, it’s essential to prepare the dataset by cleaning and reshaping it into a suitable format. The provided code snippet attempts to create a new dataframe hourly_average_pollution using a loop that iterates over each day of the year. However, there are several issues with this approach.

Step 1: Load Required Libraries

The first step is to load the necessary libraries in R. read.table() comes from the utils package, which ships with R and is attached by default, so it needs no library() call. We'll also use dplyr for data manipulation and lubridate for date-time handling.

# Load required libraries
library(dplyr)
library(lubridate)

Step 2: Read and Prepare the Dataset

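The dataset is provided as whitespace-separated text, with each line representing one observation. read.table() can parse it into a data frame, treating each single-quoted date as one field. Here a small sample is supplied inline through the text argument.

# A small sample of the dataset, supplied as inline text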
lines <- "CODI_CONTAMINANT ESTACIO ANY MES DIA_Hours Pollution date_daily 
193627 8 4 '2022-3-4' 19 31.00 '2022-03-04 19:00:00' 
45404 6 54 '2022-8-7' 24 0.20 '2022-08-08 00:00:00' 
65161 6 57 '2022-9-26' 4 0.40 '2022-09-26 04:00:00' 
308579 12 54 '2022-8-11' 22 16.00 '2022-08-11 22:00:00' 
497690 998 43 '2022-8-5' 6 4999.00 '2022-08-05 06:00:00'
402858 101 57 '2022-11-6' 3 1.98 '2022-11-06 03:00:00'"

# Convert the lines into a data frame
DF <- read.table(text = lines, header = TRUE)
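
Because read.table() matches header names to values purely by position, it is worth confirming how the sample was parsed before going further. A minimal check, reusing two of the sample rows above (the names sample_lines, sample_df, and col_types are introduced here for illustration):

```r
# Two rows from the sample above, parsed the same way
sample_lines <- "CODI_CONTAMINANT ESTACIO ANY MES DIA_Hours Pollution date_daily
193627 8 4 '2022-3-4' 19 31.00 '2022-03-04 19:00:00'
45404 6 54 '2022-8-7' 24 0.20 '2022-08-08 00:00:00'"

sample_df <- read.table(text = sample_lines, header = TRUE)

# Show each column's type and first values
str(sample_df)

# Column classes as a named character vector
col_types <- sapply(sample_df, class)
print(col_types)
```

With this header, the quoted date '2022-3-4' is read into the fourth column (MES), so it is worth comparing the names against the real file before relying on any column other than Pollution and date_daily.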

Step 3: Clean and Transform the Data

Next, keep only the columns needed for the calculation and parse the timestamp so the hour can be extracted later.

# Keep the station, measurement, and timestamp columns
pollution_data <- DF[, c("ESTACIO", "Pollution", "date_daily")]

# Parse the timestamp as a date-time; ymd() would not keep the time of day
pollution_data$date_daily <- ymd_hms(pollution_data$date_daily)

# List the distinct stations (assuming they are in the 'ESTACIO' column)
stations <- unique(pollution_data$ESTACIO)
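
The date_daily values carry a time of day, so a date-time parser is needed to keep the hour. A small sketch, assuming lubridate's default UTC time zone is acceptable for this data (the variable names are illustrative):

```r
library(lubridate)

# Timestamps copied from the sample rows
stamps <- c("2022-03-04 19:00:00", "2022-08-08 00:00:00", "2022-11-06 03:00:00")

# ymd_hms() keeps the time component; the result is POSIXct in UTC by default
parsed <- ymd_hms(stamps)

# Count values that failed to parse (they would be NA)
n_failed <- sum(is.na(parsed))
print(n_failed)

# The calendar date is still available when needed
print(as_date(parsed))
```

Checking the NA count right after parsing makes malformed timestamps visible early, before they silently drop out of the averages.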

Section 2: Grouping and Aggregating Data

The data is already in long format, with one row per observation, so we only need to extract the hour of the day from each timestamp and group by it. Grouping by hour alone pools the readings from every station.

# Extract the hour of the day (0-23) from each timestamp
pollution_data$hour <- hour(pollution_data$date_daily)

# Group by hour only, so each group contains readings from all stations
station_hours <- group_by(pollution_data, hour)
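
hour() returns values from 0 to 23, so midnight is hour 0. The sample's DIA_Hours column appears to use a 1 to 24 convention (the row with 24 is stamped 00:00:00 on the following day). Assuming that reading is right, the two can be reconciled with a modulo, as in this sketch:

```r
library(lubridate)

# Values taken from the sample rows: DIA_Hours and the matching timestamp
dia_hours <- c(19, 24, 4, 22, 6, 3)
stamps <- ymd_hms(c("2022-03-04 19:00:00", "2022-08-08 00:00:00",
                    "2022-09-26 04:00:00", "2022-08-11 22:00:00",
                    "2022-08-05 06:00:00", "2022-11-06 03:00:00"))

# Hour of day from the timestamp, 0 to 23
hour_of_day <- hour(stamps)

# Map the 1-24 convention onto 0-23 (24 becomes 0)
hour_from_dia <- dia_hours %% 24

# TRUE when both sources agree for every row
print(all(hour_of_day == hour_from_dia))
```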

Step 4: Calculating Average Pollution Across All Stations

Now we can calculate the average pollution for each hour of the day across all stations.

# Calculate the mean pollution for each hour, ignoring missing values
average_pollution <- summarise(station_hours, avg_pollution = mean(Pollution, na.rm = TRUE))

# Drop any remaining grouping and keep the two result columns
result <- select(ungroup(average_pollution), hour, avg_pollution)
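
A single mean() per hour pools every observation, so a station with more readings weighs more. If each station should count equally, one option is to average within each station first and then across stations. The sketch below uses made-up numbers (not from the dataset) and the column names station, hour, and pollution purely for illustration:

```r
library(dplyr)

# Hypothetical readings: station A reports three times at hour 8, station B once
toy <- data.frame(
  station   = c("A", "A", "A", "B"),
  hour      = c(8, 8, 8, 8),
  pollution = c(10, 10, 10, 30)
)

# Pooled mean: every reading has equal weight
pooled <- toy %>%
  group_by(hour) %>%
  summarise(avg_pollution = mean(pollution), .groups = "drop")

# Two-stage mean: every station has equal weight
by_station <- toy %>%
  group_by(station, hour) %>%
  summarise(station_avg = mean(pollution), .groups = "drop") %>%
  group_by(hour) %>%
  summarise(avg_pollution = mean(station_avg), .groups = "drop")

print(pooled$avg_pollution)      # 15
print(by_station$avg_pollution)  # 20
```

Which of the two is appropriate depends on whether the stations report equally often; when they do, both give the same answer.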

Section 3: Handling Duplicate Hours

Over a full year, each hour of the day occurs many times and at several stations, so every hour has many observations. Grouping by hour collapses these into a single mean per hour. The same calculation can be written as one pipeline:

# Group by hour and calculate the mean pollution across all stations
result <- pollution_data %>%
    group_by(hour) %>%
    summarise(avg_pollution = mean(Pollution, na.rm = TRUE), .groups = "drop")

# Keep the two result columns
final_result <- select(result, hour, avg_pollution)
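
When many observations share an hour, it helps to report how many readings went into each mean and to decide explicitly how missing values are treated. A sketch with hypothetical values (the data frame toy and the column n_readings are illustrative, not part of the dataset):

```r
library(dplyr)

# Hypothetical readings, including one missing value at hour 0
toy <- data.frame(
  hour      = c(0, 0, 0, 1, 1),
  pollution = c(2, 4, NA, 5, 7)
)

summary_by_hour <- toy %>%
  group_by(hour) %>%
  summarise(
    n_readings    = sum(!is.na(pollution)),        # readings actually used
    avg_pollution = mean(pollution, na.rm = TRUE), # mean over non-missing values
    .groups = "drop"
  )

print(summary_by_hour)
```

Without na.rm = TRUE, a single missing reading would turn the whole hour's average into NA.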

Section 4: Visualizing the Results

To visualize the results, we can create a bar chart with the average pollution for each hour of the day across all stations.

# Load the ggplot2 library for visualization
library(ggplot2)

# Copy the hours and their mean values into a plain data frame
hourly_means <- data.frame(hour = final_result$hour, avg_pollution = final_result$avg_pollution)

# Plot the bar chart using ggplot2
ggplot(hourly_means, aes(x = hour, y = avg_pollution)) +
    geom_bar(stat = "identity") +
    labs(x = "Hour of Day", y = "Average Pollution")
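
With sparse data, some hours may have no observations and would silently vanish from the chart. One way to keep all 24 positions on the axis is to join the averages onto the full 0 to 23 range first. The sketch below uses hypothetical averages, and the names sparse_means and all_hours are introduced for illustration:

```r
library(ggplot2)

# Hypothetical averages for only three hours of the day
sparse_means <- data.frame(hour = c(3, 6, 19), avg_pollution = c(1.2, 4.5, 3.1))

# Left-join onto the full 0-23 range so hours without data appear as NA
all_hours <- merge(data.frame(hour = 0:23), sparse_means, by = "hour", all.x = TRUE)

# Treat hour as a discrete axis so every hour gets its own slot
p <- ggplot(all_hours, aes(x = factor(hour), y = avg_pollution)) +
    geom_col(na.rm = TRUE) +
    labs(x = "Hour of Day", y = "Average Pollution")
```

Printing p draws the chart; hours without data show up as empty slots rather than disappearing.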

Section 5: Combining All Steps into a Single Function

# Create a function to calculate the average pollution across all stations for each hour of the day
calculate_hourly_average_pollution <- function() {
    # Load required libraries
    library(dplyr)
    library(lubridate)

    # A small sample of the dataset, supplied as inline text
    lines <- "CODI_CONTAMINANT ESTACIO ANY MES DIA_Hours Pollution date_daily 
193627 8 4 '2022-3-4' 19 31.00 '2022-03-04 19:00:00' 
45404 6 54 '2022-8-7' 24 0.20 '2022-08-08 00:00:00' 
65161 6 57 '2022-9-26' 4 0.40 '2022-09-26 04:00:00' 
308579 12 54 '2022-8-11' 22 16.00 '2022-08-11 22:00:00' 
497690 998 43 '2022-8-5' 6 4999.00 '2022-08-05 06:00:00'
402858 101 57 '2022-11-6' 3 1.98 '2022-11-06 03:00:00'"

    # Convert the lines into a data frame
    DF <- read.table(text = lines, header = TRUE)

    # Keep the station, measurement, and timestamp columns
    pollution_data <- DF[, c("ESTACIO", "Pollution", "date_daily")]

    # Parse the timestamp as a date-time so the time of day is kept
    pollution_data$date_daily <- ymd_hms(pollution_data$date_daily)

    # Extract the hour of the day (0-23) from each timestamp
    pollution_data$hour <- hour(pollution_data$date_daily)

    # Group by hour only, so each group contains readings from all stations
    station_hours <- group_by(pollution_data, hour)

    # Calculate the mean pollution for each hour, ignoring missing values
    average_pollution <- summarise(station_hours, avg_pollution = mean(Pollution, na.rm = TRUE))

    # Keep the two result columns
    final_result <- select(average_pollution, hour, avg_pollution)

    # Return the final result
    return(final_result)
}

# Call the function to calculate the hourly average pollution across all stations for each hour of the day
hourly_average_pollution <- calculate_hourly_average_pollution()
print(hourly_average_pollution)
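
Because the function above builds its input internally, it cannot be reused on the full dataset as written. A possible variant takes the data frame as an argument. The function name hourly_average and its parameters time_col and value_col are hypothetical, introduced here for illustration, and the third row of the demo data is made up:

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: average `value_col` by the hour of day found in `time_col`
hourly_average <- function(data, time_col = "date_daily", value_col = "Pollution") {
  data %>%
    mutate(hour = lubridate::hour(lubridate::ymd_hms(.data[[time_col]]))) %>%
    group_by(hour) %>%
    summarise(avg_pollution = mean(.data[[value_col]], na.rm = TRUE),
              .groups = "drop")
}

# Two rows from the sample plus one made-up extra reading at 19:00
demo <- data.frame(
  Pollution  = c(31, 0.2, 29),
  date_daily = c("2022-03-04 19:00:00", "2022-08-08 00:00:00", "2022-03-05 19:00:00")
)

res <- hourly_average(demo)
print(res)
```

Passing the data in keeps reading and parsing separate from the aggregation, so the same helper could be applied to the full year once it is loaded.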

Last modified on 2023-05-16