This blog is run by Jason Jon Benedict and Doug Beare to share insights and developments on open source software that can be used to analyze patterns and trends in in all types of data from the natural world. Jason currently works as a GIS Consultant based in Kuala Lumpur. Doug lives in the United Kingdom and is currently Director of Globefish Consultancy Services which provides scientific advice to organisations that currently include the STECF [Scientific, Technical and Economic Committee for Fisheries,] and ICCAT,

Tuesday, 2 June 2015

The commuter raindance?

A recently published scientific paper by Haberlie et al. spotted by Jason ( showed that urbanization and patterns of commuting had an impact on local prevalence of thunderstorms. Areas that have been heavily built-on had more storms of greater intensity than those left, either under agricultural production, or more naturally vegetated. The researchers also reported a connection between commuting activity and extreme weather with more storms occurring during weekdays. The study mentioned focused on the sub-tropical city of Atlanta, Georgia in the US and we wondered if the same findings would apply to tropical Malaysia. 

We started by looking at the 'event' data that are part of the weather data collected by Weather Underground that can be pulled into R using the 'weatherData' package for Penang and noticed that thunderstorm events have actually decreased since 2001, see below.

Apart from the downward long-term trend between 2001 and 2015, the strong bimodal pattern of seasonality is also clear with peaks in April/May and September/October being typical each year. 

We then looked at the data for the capital city of Malaysia, Kuala Lumpur, using data from the International airport; an area which has undergone tremendous land use change over the last 10-15 years or so. Interestingly the data for Kuala Lumpur showed a significant increase in thunderstorm activity which is opposite the that observed in Penang.

We do not know the reason for the differences between Penang and Kuala Lumpur. Penang has also undergone substantial urbanization and land-use change but it is of course an island and the surrounding ocean presumably moderates its climate to some degree. 

In addition to the long-term changes there are also seasonal and daily cycles in the storm event data. The seasonality at KL is summarised below and the peaks in April and October/November reflect the south-west and north-east monsoons.

The daily cycles we observed in the thunderstorm event data are more interesting and perhaps puzzling. The circle plot below summarises the frequencies of storm events by time of day between 2001 and 2015 at KL International Airport. There are many more storms in the afternoon and early evening between 14:00 and 20:00 with a big peak at 17:00, ie. Rush hour. However if rush hour is the cause of the peak in storm events in the evening then why do you not see a similar pattern in the morning ?

We also investigate whether storms were more prevalent at KL during week days. The storm event data are 'counts' so we used a Generalised Linear Model from the Poisson family as follows. First we calculated a 'trend' vector and appended it to the data so January 2001=1, January 2002 = 13 and so on. We then fitted the following series of 'nested' models to the data:

z0 <- glm(no_event~1,data=storm0,family=quasipoisson) # The NULL model
z1 <- glm(no_event~trend,data=storm0,family=quasipoisson) # Is trend sig?
z2 <- glm(no_event~trend+monthf,data=storm0,family=quasipoisson)
z3 <- glm(no_event~trend+monthf+day,data=storm0,family=quasipoisson)
z4 <- glm(no_event~trend*monthf,data=storm0,family=quasipoisson)
z5 <- glm(no_event~trend*monthf+day,data=storm0,family=quasipoisson)

z0 is the NULL model against which the others are tested. Z1 tests whether trend is signficant and so on. Z3 and z5 test the effect of 'day of week' given that trend and monthf (month as a 'factor') are in the model. Z5 is only different from z3 in the sense that it 'allows' trend and monthf to interact or covary with one another. In English that means that the dependence on month can vary each year.

The results (below) show that there is no statistically significant effect of storm-event, ie. The number of storm events is the same on every day of the week. According to our analysis, therefore, commuting activity does not appear to be affecting thunderstorm initiation in Kuala Lumpur and the data do not support the findings of Haberlie et al.


Analysis of Deviance Table

Model 1: no_event ~ 1
Model 2: no_event ~ year
Model 3: no_event ~ trend + monthf
Model 4: no_event ~ trend + monthf + day
Model 5: no_event ~ trend * monthf
Model 6: no_event ~ trend * monthf + day
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 5098 6324.4
2 5097 6201.6 1 122.789 < 2.2e-16 ***
3 5086 5967.8 11 233.824 < 2.2e-16 ***
4 5080 5965.0 6 2.767 0.8558 NOT SIG
5 5075 5934.6 5 30.379 2.694e-05 ***
6 5069 5931.8 6 2.781 0.8543 NOT SIG

# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The R code used to produce the plots shown above are also provided below if you wish to replicate this at another location.

# Load required libraries
# Load fonts
# Get weather data from Weather Underground using weatherData library
we <- getWeatherForDate("WMKK", "2001-01-01","2014-12-31", opt_detailed=T, opt_custom_columns=T, custom_columns=c(11))
# Convert to character
we$Events <- as.character(we$Events) 
# Assign values of '1' to Thunderstorm Event and '0' to Non-Thunderstorm Event
we$Events[we$Events == "" ] <- NA
we$Events[we$Events == "Rain"] <- 0
we$Events[we$Events == "Rain-Thunderstorm"] <- 1
we$Events[we$Events == "Thunderstorm"] <- 1
we$Events[$Events)] <- 0
# Convert to numeric
# Calculate number of events
No_Events <- tapply(diff(c(0, we$Events)) == 1, as.Date((we$Time),tz="Asia/Kuala_Lumpur"), sum)
we.df <- data.frame(Date = as.Date(names(No_Events)), No_Event = unname(No_Events))
# Rename columns
colnames(we.df) <- c("date","no_event")
# Add date columns
we.df$month <- as.numeric(as.POSIXlt(we.df$date)$mon+1)
we.df$year <- strftime(we.df$date, "%Y")
we.df$week <- as.Date(cut(we.df$date,breaks = "week",start.on.monday = TRUE))
we.df$monthf <- factor(we.df$month,levels=as.character(1:12),
we.df$day <- strftime(we.df$date,'%A')
we.df$weekend = chron::is.weekend(we.df$date)
# Aggregate no events by month
month.agg <- aggregate(no_event ~ month + year, we.df, FUN = sum)
month.agg$date <- as.POSIXct(paste(month.agg$year, month.agg$month, "01", sep = "-"))
month.agg$monthf <- factor(month.agg$month,levels=as.character(01:12),
# Plot number of events by month
ts <- ggplot(month.agg,aes(x=date,y=no_event,fill=no_event)) +
      geom_bar(stat="identity") +
      stat_smooth(method = "glm.nb",formula = y ~ x, data = month.agg, se = TRUE,colour="blue",size=0.5)+
      scale_x_datetime(labels = date_format("%Y"),breaks = "1 year")+
      ggtitle("Number of Monthly Thunderstorm Events from 2001 - 2014\n(Observations at KLIA, Selangor)\n")+
      theme_bw()+ ylab("No of events\n") + xlab("")+
      theme(axis.text.x  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
      axis.text.y  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
      axis.title.y=element_text(size=12,colour="grey20",face="bold",family="Clear Sans"),
      panel.grid.major = element_line(colour = "grey",size=0.25,linetype='longdash'),
      plot.title = element_text(lineheight=1, face="bold",size = 15, colour = "grey20",family="Clear Sans"))
month.mean <- ddply(month.agg,.(monthf),summarize,mean_events=mean(no_event,na.rm=T))
boxplot.title = 'Number of Thunderstorm Events by Month (2001 - 2014)'
boxplot.subtitle = 'Data source : Weather Underground (2015)'
# Box plot number of thunderstorm events by month
b <- ggplot(month.agg,aes(monthf,no_event,col=year)) +
     geom_boxplot(outlier.shape = NA,fill="ivory2",range=0.5,col="black")+
     ggtitle("Number of Thunderstorm Events by Month at KLIA, Selangor (2001 - 2014) \n Source: Weather Underground (2015)\n")+
     ylab("No of events\n")+ xlab("") + stat_boxplot(geom ='errorbar',col="black")+
     geom_text(aes(label=year),hjust=-0.25, vjust=0,size=3,face="bold",family="Segoe UI Semibold")+
     theme(axis.text.x  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
     axis.text.y  = element_text(size=11,family="Clear Sans",face="bold"), legend.position="none",
     axis.title.y=element_text(size=12,colour="grey20",face="bold",family="Clear Sans"),
     panel.grid.major = element_blank(),
     panel.grid.minor = element_blank(),
     plot.title = element_text(lineheight=1, face="bold",size = 15, colour = "grey20",family="Clear Sans"))
# Reformat weather event data from Weather Underground to get count of hourly thunderstorm events
we$hour <- as.numeric(format(strptime(we$Time, format = "%Y-%m-%d %H:%M"),format = "%H"))
we$min <- as.numeric(format(strptime(we$Time, format = "%Y-%m-%d %H:%M"),format = "%M"))
# Subset only events on the hour
we2<- subset(we, min == 0)
we3 <- ddply(we2,.(hour),summarize, event = sum(Events,na.rm=T))
# Plot number of thunderstorm events by hour of day
hr <- ggplot(we3,aes(x=hour,y=event))+
      geom_bar(breaks=seq(0,24),width = 1,colour="black",stat="identity",fill="blue",drop=TRUE)+ 
      xlab("Source: Weather Underground (2015)")+
      ggtitle("Thunderstorm Event Frequency by Hour\nobserved at KLIA, Selangor\n")+
      guides(colour = guide_legend(show = FALSE)) +
      axis.title.y=element_text(size=12,hjust=0.65,colour="grey20",family="Clear Sans"),
      axis.title.x=element_text(size=10,colour="grey20",family="Clear Sans"),
      axis.text.y=element_text(size=10,colour="grey20",family="Clear Sans"),
      axis.text.x=element_text(size=10,colour="grey20",face="bold",family="Clear Sans"),
      plot.title = element_text(lineheight=1.2, face="bold",size = 14, colour = "grey20"),
      plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"))
Created by Pretty R at

Monday, 13 April 2015

Boat Lights from the VIIRS: Shedding some light on night-time fishing activity in the Indo-Malayan waters

Light emitted at night due to human activities can easily be seen from space and detected by satellite. Many compelling pictures of the Earth at night have been taken revealing cities, roads, and conurbations. Light emitted at sea has, however, received less attention. Applications relating to fisheries and their management might prove particularly useful. NOAA/NASA and partners have recently developed a boat detection product derived from satellite imagery at high temporal and spatial resolution and the data can be downloaded for free from here as either kml/csv files and processed in R. 

For many years (since 1972 to be precise), global low light night imaging was collected by the US Air Force's Defence Meteorological Satellite Programmes Operational Linescan System (DMSP-OLS). While night-time lights collected from DMSP-OLS has major shortcomings, there were no reasonable alternative products for mapping night-lights of lit fishing boats. This all changed in 2011 with the launch of NASA/NOAA Suomi National Polar Partnership satellite equipped with the Visible Infrared Radiometer Suite (VIIRS) as the primary imaging sensor. You can find out many of the advantages the new VIIRS sensor has over the DMPS-OLS here. One of the most important improvements is that the 'Day/Night band' on the VIIRS that is used for night-time light detection has a detection footprint 45 times smaller than that on the DMSP, which ultimately means it can detect light sources a lot more smaller than what the DMPS-OLS could! The VIIRS also has a better dynamic range and lower detection limits which means it can identify and discern more sources of light at night with less saturation when it gets too bright with surrounding lights. These two improvements are key in enabling better detection of boat lights at night.

Commercial fishermen the world over use very bright light to attract squid and small pelagics to their nets, for example, and these are visible from space. Here we use the NOAA boat light data to animate nightly boat lights during February 2015 (cumulative for entire night) for a part of south-east Asia including Indonesia, Malaysia and the northern tip of Australia. We think that many of these lights reflect fishing activity, but more work needs to be done to detect and isolate shipping etc. A concerted fishers' knowledge/consultation process would be essential for enabling us to understand these data properly. We think it would it would be worth the effort, though, and could yield fascinating insights into fleet behavior and the species targeted.

As usual, we provide the R code below on how we produced the above animation of the boat lights detection product.

##### Code to read and produce animation of boat night light detection data from NGDC site #####
# Load required libraries
# Set working directory
# Load fonts
# Set date range for csv files to be downloaded from NGDC site
date.range <- seq.Date(from=as.Date('2015-02-01'),as.Date('2015-02-28'),by='1 day')
# Get dates into relevant format
date <- as.character(date.range)
date1 <- gsub("-","",date)
# Get urls to scrape boat light csv files off  NGDC site
urls <- sprintf((paste("",c(date1),"/VNB_npp_d",c(date1),
# Read boat light data files in csv format and bind them
boats <-,lapply(urls,read.csv))
# Extract useful columns
boats.df <- boats.csv[-c(18,19)]
# Reformat date and add other date and time columns
boats.df$Date <- as.POSIXct(boats.df$Date_Mscan, format="%Y-%m-%d %H:%M:%S",tz="GMT")
boats.df$Dates <- as.Date(boats.df$Date)
boats.df$Year <- as.numeric(as.POSIXlt(boats.df$Date)$year+1900)
boats.df$Month <- as.numeric(as.POSIXlt(boats.df$Date)$mon+1)
boats.df$Monthf <- factor(boats.df$Month,levels=as.character(1:12),
boats.df$Weekday = as.numeric(format(as.POSIXlt(boats.df$Date),"%u"))
boats.df$Yearmonth <- as.yearmon(boats.df$Date)
boats.df$Yearmonthf <- factor(boats.df$Yearmonth)
boats.df$Week <- as.numeric(format(as.POSIXlt(boats.df$Date),"%W"))
boats.df$Day <- strftime(boats.df$Date,'%A')
boats.df$Jday <- strptime(boats.df$Date, "%Y-%m-%d")$yday+1
boats.df$Hour <- as.numeric(format(strptime(boats.df$Date, format = "%Y-%m-%d %H:%M"),format = "%H"))
# Extract only detections that are classified as 'boat' from quality flag for nightboat detection algorithm
boats.df1 <- subset(boats.df,QF_Detect==1)
# Import shapefiles
world <- readOGR(dsn="D:/GeoSpatial/Data/World_Boundaries", layer="TM_WORLD_BORDERS-0.3") # World boundary shp's download from
eez <- readOGR(dsn="D:/Projects/BoatLights", layer="EEZs") # EEZ boundary shp's downloaded from
mpa <- readOGR(dsn="D:/Projects/BoatLights", layer="MPAs") # MPA boundary shp's downloaded from CT Atlas website -
# Fortify layers for ggplot mapping
world.df <- fortify(world)
eez.df <- fortify(eez)
mpa.df <- fortify(mpa)
# Set theme options for map
theme_opts <- list(theme(panel.grid.minor = element_blank(),
                         panel.grid.major = element_blank(),
                         panel.background = element_rect(fill="black"),
                         panel.border = element_rect(colour="black"),
                         legend.text = element_text(size = 8,face="bold",family="Myriad Pro"),
                         legend.position = "right",
                         axis.line = element_blank(),
                         axis.text.x = element_blank(),
                         axis.text.y = element_blank(),
                         axis.ticks = element_blank(),
                         axis.title.x = element_blank(),
                         plot.title = element_text(size=20,face="bold",family="Graph Black",hjust=0,vjust=-0.5),
                         axis.title.y = element_blank()))
# Run loop to produce boat light maps for each night in February 2015
for (i in date)
b <- ggplot(subset(boats.df1,Dates ==i),aes(Lon_DNB,Lat_DNB))+
     geom_polygon(data=mpa.df, aes(long,lat, group=group,fill="MPA"),colour=NA,alpha=0.5)+
     geom_point(color = "gold", size = 0.5, alpha = 0.5) +
     geom_polygon(data=eez.df, aes(long,lat, group=group,colour="EEZ"),fill=NA,size=0.25,linetype="dashed")+
     geom_polygon(data=world.df, aes(long,lat, group=group),fill="black",colour="darkgray",size=0.35)+
     theme_bw()+ ylab("")+ xlab("")+
     ggtitle((paste("Boat Lights detected from VIIRS on ",i,"\n"))) +
bg <- arrangeGrob(b, sub = textGrob("Source: VIIRS Boat Detection (VBD) Data (PROTOTYPE) - NGDC (2015)\n", hjust = -0.33,
      vjust=-0.3,gp = gpar(fontface = "bold", fontsize = 8, fontfamily="Tw Cen MT Condensed")))
# Set path for Image Magick convert program <- paste0(shortPathName("C:/Program Files/ImageMagick-6.9.0-Q16)/"),"convert.exe")
# Convert png files to GIF animation
files = sprintf('boatlights_%s.png', c(date))
im.convert(files, output = 'BoatLights_Feb2015.gif')
Created by Pretty R at

Friday, 6 March 2015

Will I choke? – when is the best time for outdoor activities on Penang Island

Key points of post
  • To assess whether air pollution index (API) levels in Penang during 2014 are unhealthy
  • To assess how API varies seasonally and diurnally

Last year our blog post on air quality on Penang Island was particularly popular. In that post we used a comparatively crude ‘visibility index’ collected at Penang International Airport since 2001.  The data suggested that pollution (visibility) had gotten worse since 2001 and that it tended to be worse in June and July.

We discovered recently, however, that the Malaysian Government’s Department of the Environment (DOE) collects much more detailed data on air pollution at 52 stations across the country, see  Conveniently, one of the stations is located at Universiti Sains Malaysia (USM) and we've managed to acquire this data scraped off the DOE site by Keith Rozario (nice work Keith!) using python and MySQL. He provides a nice summary of the nationwide API data from August 2013 to February 2015 and the link to all the scripts he used (on github) to scrape the data on his site - please see He was also generous enough to make a zipped copy of the dataset available for the public to use under the Creative Commons Attribution 4.0 license.

Inspired by an analysis ( by ‘Typhoon Tony’ from Taiwan, we sought to investigate the magnitude of air pollution on Penang Island, and how it varies seasonally and diurnally

Air pollution is typically quantified using the air pollution index (API) which is calculated from the following five major air pollutants: sulphur dioxide, nitrogen dioxide, carbon monoxide, particulate matter, and ozone, for details see APIs between 101 and 200 are considered to be ‘unhealthy’ especially for those with existing heart and/or lung problems.

So how does Penang stack up? We’re always hearing how terrible pollution is on the island, but is it actually true ? In the ‘calendar’ plot below, we merged the API data with hourly observations on wind-speed and direction from Weather Underground.  The seasonality in API is clear. We had high API values in February and March 2014, but the highest were in June and July which corresponds to the south-west monsoon transporting smoke from forest fires in Sumatera.  

We summarize diurnal variability in API (and to a certain extent seasonality too) in the ‘rose’ plot below.  It suggests that only rarely can air pollution on the island be considered ‘unhealthy’ since it is typically below 100.  If you are, however, concerned your best bet for strenuous outdoor activities is during the night or in the morning up until 1pm when levels of API start to increase.  The least healthy time of day to be out is rush hour between 4pm and 6pm. You may also want to avoid February, March, June and July!

Do note, however, that the API monitoring station at USM is relatively isolated from the island’s urban pell-mell and may not be a good overall reflection/summary of air pollution on the island. 

The R code used to produce the plots shown above are provided below. We are currently working on trying to scrape the API data off the Malaysian DOE website using R (which would make things a lot more convenient!) and once we've managed to do that, we will update the code accordingly.

# Load required libraries
# Set working directory
# Read data from csv file 
df <- read.table('USM.csv',header=F,sep=":")
# Rename columns
colnames(df) <- c("station_id","station_name","dates","hour","api","unk","unk1")
# Remove last 2 columns - not relevant information
df <- df[-c(6,7)]
# Add minute column
df$minute <- as.character("00")
df$time <- as.character(paste(df$hour,df$minute,sep = ":" ))
df$date <- as.POSIXct(paste(df$dates,df$time, format = "%Y-%m-%d %H:%M" ))
# Convert 0 to NA
df$api[df$api == 0] <- NA
# Date and time columns
df$weekday = as.numeric(format(as.POSIXlt(df$date),"%u"))
df$week <- as.numeric(format(as.POSIXlt(df$date),"%W"))
df$day = strftime(df$date,'%A')
df$jday <- strptime(df$date, "%Y-%m-%d")$yday+1
df$hour <- as.numeric(format(strptime(df$date, format = "%Y-%m-%d %H:%M"),format = "%H"))
##### Creating roseplot of pollutant magnitude over 24 hours #####
# Subset data for 2014
df2014 <- subset(df,year==2014)
# Set Color Table
colortable = c("#99FFFF", "#00FFFF", "#00FF00", "#CCFF33", "#FFFF00",
               "#FFCC00", "#FF6600", "#FF3333", "#FF33CC", "#660033")
# Cut data into ten parts
Time_of_Day = df2014$hour[df2014$station_name==STN]
mag = cut_number(round(df2014$api,100)[df2014$station_name==STN],n = 10)
rosedata = data.frame(dir=Time_of_Day,mag=mag)
# Plot rose chart
rc <- ggplot(rosedata,aes(x=Time_of_Day,fill=mag))+  geom_bar(binwidth = 1, drop = TRUE) +
      coord_polar() + xlab("") + ylab("") + 
      scale_x_continuous(breaks = seq(0,23),
      "10:00pm","11:00pm")) +
      ggtitle("\nDaily API readings in 2014") +  scale_fill_manual(values=colortable,name="API Reading")+
      legend.title=element_text(size=13,colour="grey20",face="bold",family="Myriad Pro"),
      legend.text=element_text(size=11,colour="grey20",face="bold",family="Myriad Pro"),
      axis.text.y=element_text(size=10,colour="grey20",family="Clear Sans"),
      axis.text.x=element_text(size=10,colour="grey20",face="bold",family="Clear Sans"),
      plot.title = element_text(lineheight=1, face="bold",size = 20, colour = "grey20",family="Graphik-Black"),
      plot.margin = unit(c(-0.5,0.25,-0.25,0.25), "in"),
# Save plot to png
### Reading weather data from Weather Underground using weatherData library ####
# Get weather data for 2014
we <- getWeatherForDate("WMKP", "2014-01-01","2015-12-31", opt_detailed=T, opt_custom_columns=T, custom_columns=c(1,6,7,8,9,13,14))
# Rename columns
colnames(we) <- c("date","date_myt","vis_km","wda","ws","gspd_kmh","wd","date_utc")
# Create date and time columns
we$hour <- as.numeric(format(strptime(we$date, format = "%Y-%m-%d %H:%M"),format = "%H"))
we$min <- as.numeric(format(strptime(we$date, format = "%Y-%m-%d %H:%M"),format = "%M"))
# Only use data on the hour
we_df<- subset(we, min == 0)
# Remove duplicate data
# Merge API and weather data
we_api <- merge(we_df, df2014, by=c("date","date")) 
# Reformat yearmonth column for use in openair library
we_api$yearmonth <- as.factor(we_api$yearmonth)
# Reformat wind speed column
we_api$ws[we_api$ws == "Calm"] <- 0
we_api$ws <- as.numeric(we_api$ws)
# Plot 'calendar plot' and save as png
png(filename = "D:/Projects/AP/USM_API_2014_CalendarPlot.png",height=8,width=10,
    bg = "white",units='in', res = 400, family = "",  type = "cairo-png")
calendarPlot(we_api, pollutant = "api",year="2014",
             main = "Daily average API readings at USM\n
             with wind-speed scaled wind direction overlay (2014)\n",
             key.header = "API\n(Air Pollutant Index)",
             cols="heat", annotate = "ws")
Created by Pretty R at