This blog is run by Jason Jon Benedict and Doug Beare to share insights and developments on open source software that can be used to analyze patterns and trends in in all types of data from the natural world. Jason currently works as a geospatial professional based in Malaysia. Doug lives in the United Kingdom and is currently Director of Globefish Consultancy Services which provides scientific advice to organisations that currently include the STECF [Scientific, Technical and Economic Committee for Fisheries,] and ICCAT,

Tuesday, 2 June 2015

The commuter raindance?

A recently published scientific paper by Haberlie et al. spotted by Jason ( showed that urbanization and patterns of commuting had an impact on local prevalence of thunderstorms. Areas that have been heavily built-on had more storms of greater intensity than those left, either under agricultural production, or more naturally vegetated. The researchers also reported a connection between commuting activity and extreme weather with more storms occurring during weekdays. The study mentioned focused on the sub-tropical city of Atlanta, Georgia in the US and we wondered if the same findings would apply to tropical Malaysia. 

We started by looking at the 'event' data that are part of the weather data collected by Weather Underground that can be pulled into R using the 'weatherData' package for Penang and noticed that thunderstorm events have actually decreased since 2001, see below.

Apart from the downward long-term trend between 2001 and 2015, the strong bimodal pattern of seasonality is also clear with peaks in April/May and September/October being typical each year. 

We then looked at the data for the capital city of Malaysia, Kuala Lumpur, using data from the International airport; an area which has undergone tremendous land use change over the last 10-15 years or so. Interestingly the data for Kuala Lumpur showed a significant increase in thunderstorm activity which is opposite the that observed in Penang.

We do not know the reason for the differences between Penang and Kuala Lumpur. Penang has also undergone substantial urbanization and land-use change but it is of course an island and the surrounding ocean presumably moderates its climate to some degree. 

In addition to the long-term changes there are also seasonal and daily cycles in the storm event data. The seasonality at KL is summarised below and the peaks in April and October/November reflect the south-west and north-east monsoons.

The daily cycles we observed in the thunderstorm event data are more interesting and perhaps puzzling. The circle plot below summarises the frequencies of storm events by time of day between 2001 and 2015 at KL International Airport. There are many more storms in the afternoon and early evening between 14:00 and 20:00 with a big peak at 17:00, ie. Rush hour. However if rush hour is the cause of the peak in storm events in the evening then why do you not see a similar pattern in the morning ?

We also investigate whether storms were more prevalent at KL during week days. The storm event data are 'counts' so we used a Generalised Linear Model from the Poisson family as follows. First we calculated a 'trend' vector and appended it to the data so January 2001=1, January 2002 = 13 and so on. We then fitted the following series of 'nested' models to the data:

z0 <- glm(no_event~1,data=storm0,family=quasipoisson) # The NULL model
z1 <- glm(no_event~trend,data=storm0,family=quasipoisson) # Is trend sig?
z2 <- glm(no_event~trend+monthf,data=storm0,family=quasipoisson)
z3 <- glm(no_event~trend+monthf+day,data=storm0,family=quasipoisson)
z4 <- glm(no_event~trend*monthf,data=storm0,family=quasipoisson)
z5 <- glm(no_event~trend*monthf+day,data=storm0,family=quasipoisson)

z0 is the NULL model against which the others are tested. Z1 tests whether trend is signficant and so on. Z3 and z5 test the effect of 'day of week' given that trend and monthf (month as a 'factor') are in the model. Z5 is only different from z3 in the sense that it 'allows' trend and monthf to interact or covary with one another. In English that means that the dependence on month can vary each year.

The results (below) show that there is no statistically significant effect of storm-event, ie. The number of storm events is the same on every day of the week. According to our analysis, therefore, commuting activity does not appear to be affecting thunderstorm initiation in Kuala Lumpur and the data do not support the findings of Haberlie et al.


Analysis of Deviance Table

Model 1: no_event ~ 1
Model 2: no_event ~ year
Model 3: no_event ~ trend + monthf
Model 4: no_event ~ trend + monthf + day
Model 5: no_event ~ trend * monthf
Model 6: no_event ~ trend * monthf + day
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 5098 6324.4
2 5097 6201.6 1 122.789 < 2.2e-16 ***
3 5086 5967.8 11 233.824 < 2.2e-16 ***
4 5080 5965.0 6 2.767 0.8558 NOT SIG
5 5075 5934.6 5 30.379 2.694e-05 ***
6 5069 5931.8 6 2.781 0.8543 NOT SIG

# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The R code used to produce the plots shown above are also provided below if you wish to replicate this at another location.

# Load required libraries
# Load fonts
# Get weather data from Weather Underground using weatherData library
we <- getWeatherForDate("WMKK", "2001-01-01","2014-12-31", opt_detailed=T, opt_custom_columns=T, custom_columns=c(11))
# Convert to character
we$Events <- as.character(we$Events) 
# Assign values of '1' to Thunderstorm Event and '0' to Non-Thunderstorm Event
we$Events[we$Events == "" ] <- NA
we$Events[we$Events == "Rain"] <- 0
we$Events[we$Events == "Rain-Thunderstorm"] <- 1
we$Events[we$Events == "Thunderstorm"] <- 1
we$Events[$Events)] <- 0
# Convert to numeric
# Calculate number of events
No_Events <- tapply(diff(c(0, we$Events)) == 1, as.Date((we$Time),tz="Asia/Kuala_Lumpur"), sum)
we.df <- data.frame(Date = as.Date(names(No_Events)), No_Event = unname(No_Events))
# Rename columns
colnames(we.df) <- c("date","no_event")
# Add date columns
we.df$month <- as.numeric(as.POSIXlt(we.df$date)$mon+1)
we.df$year <- strftime(we.df$date, "%Y")
we.df$week <- as.Date(cut(we.df$date,breaks = "week",start.on.monday = TRUE))
we.df$monthf <- factor(we.df$month,levels=as.character(1:12),
we.df$day <- strftime(we.df$date,'%A')
we.df$weekend = chron::is.weekend(we.df$date)
# Aggregate no events by month
month.agg <- aggregate(no_event ~ month + year, we.df, FUN = sum)
month.agg$date <- as.POSIXct(paste(month.agg$year, month.agg$month, "01", sep = "-"))
month.agg$monthf <- factor(month.agg$month,levels=as.character(01:12),
# Plot number of events by month
ts <- ggplot(month.agg,aes(x=date,y=no_event,fill=no_event)) +
      geom_bar(stat="identity") +
      stat_smooth(method = "glm.nb",formula = y ~ x, data = month.agg, se = TRUE,colour="blue",size=0.5)+
      scale_x_datetime(labels = date_format("%Y"),breaks = "1 year")+
      ggtitle("Number of Monthly Thunderstorm Events from 2001 - 2014\n(Observations at KLIA, Selangor)\n")+
      theme_bw()+ ylab("No of events\n") + xlab("")+
      theme(axis.text.x  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
      axis.text.y  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
      axis.title.y=element_text(size=12,colour="grey20",face="bold",family="Clear Sans"),
      panel.grid.major = element_line(colour = "grey",size=0.25,linetype='longdash'),
      plot.title = element_text(lineheight=1, face="bold",size = 15, colour = "grey20",family="Clear Sans"))
month.mean <- ddply(month.agg,.(monthf),summarize,mean_events=mean(no_event,na.rm=T))
boxplot.title = 'Number of Thunderstorm Events by Month (2001 - 2014)'
boxplot.subtitle = 'Data source : Weather Underground (2015)'
# Box plot number of thunderstorm events by month
b <- ggplot(month.agg,aes(monthf,no_event,col=year)) +
     geom_boxplot(outlier.shape = NA,fill="ivory2",range=0.5,col="black")+
     ggtitle("Number of Thunderstorm Events by Month at KLIA, Selangor (2001 - 2014) \n Source: Weather Underground (2015)\n")+
     ylab("No of events\n")+ xlab("") + stat_boxplot(geom ='errorbar',col="black")+
     geom_text(aes(label=year),hjust=-0.25, vjust=0,size=3,face="bold",family="Segoe UI Semibold")+
     theme(axis.text.x  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
     axis.text.y  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
     axis.title.y=element_text(size=12,colour="grey20",face="bold",family="Clear Sans"),
     panel.grid.major = element_blank(),
     panel.grid.minor = element_blank(),
     plot.title = element_text(lineheight=1, face="bold",size = 15, colour = "grey20",family="Clear Sans"))
# Reformat weather event data from Weather Underground to get count of hourly thunderstorm events
we$hour <- as.numeric(format(strptime(we$Time, format = "%Y-%m-%d %H:%M"),format = "%H"))
we$min <- as.numeric(format(strptime(we$Time, format = "%Y-%m-%d %H:%M"),format = "%M"))
# Subset only events on the hour
we2<- subset(we, min == 0)
we3 <- ddply(we2,.(hour),summarize, event = sum(Events,na.rm=T))
# Plot number of thunderstorm events by hour of day
hr <- ggplot(we3,aes(x=hour,y=event))+
      geom_bar(breaks=seq(0,24),width = 1,colour="black",stat="identity",fill="blue",drop=TRUE)+ 
      xlab("Source: Weather Underground (2015)")+
      ggtitle("Thunderstorm Event Frequency by Hour\nobserved at KLIA, Selangor\n")+
      guides(colour = guide_legend(show = FALSE)) +
      axis.title.y=element_text(size=12,hjust=0.65,colour="grey20",family="Clear Sans"),
      axis.title.x=element_text(size=10,colour="grey20",family="Clear Sans"),
      axis.text.y=element_text(size=10,colour="grey20",family="Clear Sans"),
      axis.text.x=element_text(size=10,colour="grey20",face="bold",family="Clear Sans"),
      plot.title = element_text(lineheight=1.2, face="bold",size = 14, colour = "grey20"),
      plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"))
Created by Pretty R at