This blog is run by Jason Benedict and Doug Beare, who are based in Penang, Malaysia. It shares insights and developments on open source software that can be used to analyze patterns and trends in all types of data from the natural world, with particular emphasis on developing countries. The work featured here was undertaken as part of the CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS).

However, the views presented here are those of the authors only and do not reflect the official position of CCAFS, CGIAR, its members, partners or donors.

Friday, 6 March 2015

Will I choke? – when is the best time for outdoor activities on Penang Island

Key points of post
  • To assess whether air pollution index (API) levels in Penang during 2014 are unhealthy
  • To assess how API varies seasonally and diurnally

Last year our blog post on air quality on Penang Island was particularly popular. In that post we used a comparatively crude ‘visibility index’ collected at Penang International Airport since 2001. The data suggested that pollution had worsened (i.e. visibility had declined) since 2001 and that it tended to be worst in June and July.

We discovered recently, however, that the Malaysian Government’s Department of the Environment (DOE) collects much more detailed data on air pollution at 52 stations across the country. Conveniently, one of the stations is located at Universiti Sains Malaysia (USM), and we've managed to acquire this data, scraped off the DOE site by Keith Rozario (nice work, Keith!) using Python and MySQL. On his site he provides a nice summary of the nationwide API data from August 2013 to February 2015, along with links to all the scripts (on GitHub) he used to scrape the data. He was also generous enough to make a zipped copy of the dataset available to the public under the Creative Commons Attribution 4.0 license.

Inspired by an analysis by ‘Typhoon Tony’ from Taiwan, we sought to investigate the magnitude of air pollution on Penang Island, and how it varies seasonally and diurnally.

Air pollution is typically quantified using the air pollution index (API), which is calculated from the following five major air pollutants: sulphur dioxide, nitrogen dioxide, carbon monoxide, particulate matter, and ozone. APIs between 101 and 200 are considered ‘unhealthy’, especially for those with existing heart and/or lung problems.
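As a quick illustration, the widely published Malaysian API bands can be encoded with R's cut(). This is our own sketch for readers unfamiliar with the categories, not part of the original analysis:

```r
# Classify API readings into the standard Malaysian API bands
api_band <- function(api) {
  cut(api,
      breaks = c(0, 50, 100, 200, 300, Inf),
      labels = c("Good", "Moderate", "Unhealthy", "Very Unhealthy", "Hazardous"),
      include.lowest = TRUE)
}
as.character(api_band(c(35, 78, 150)))  # "Good" "Moderate" "Unhealthy"
```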

So how does Penang stack up? We’re always hearing how terrible pollution is on the island, but is it actually true? In the ‘calendar’ plot below, we merged the API data with hourly observations on wind-speed and direction from Weather Underground. The seasonality in API is clear: values were high in February and March 2014, but the highest were in June and July, which corresponds to the south-west monsoon transporting smoke from forest fires in Sumatera.

We summarize diurnal variability in API (and to a certain extent seasonality too) in the ‘rose’ plot below. It suggests that air pollution on the island can only rarely be considered ‘unhealthy’, since it is typically below 100. If you are concerned, however, your best bet for strenuous outdoor activities is at night or in the morning up until 1pm, when API levels start to increase. The least healthy time of day to be out is rush hour, between 4pm and 6pm. You may also want to avoid February, March, June and July!
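The calculation behind picking the cleanest hours is simply a mean API per hour of day. A minimal, self-contained sketch, with synthetic data standing in for the USM readings (column names are illustrative):

```r
# Synthetic hourly API data: ten days of readings with a diurnal cycle
# peaking in the late afternoon (around rush hour)
set.seed(1)
hours <- rep(0:23, times = 10)
api   <- 70 + 15 * sin((hours - 11) * pi / 12) + rnorm(length(hours), 0, 3)
df    <- data.frame(hour = hours, api = api)

# Mean API for each hour of the day, sorted from cleanest to dirtiest
hourly <- aggregate(api ~ hour, data = df, FUN = mean)
hourly <- hourly[order(hourly$api), ]
head(hourly, 3)  # the three cleanest hours
```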

Do note, however, that the API monitoring station at USM is relatively isolated from the island’s urban pell-mell and may not be a good overall reflection of air pollution across the island.

The R code used to produce the plots shown above is provided below. We are currently working on scraping the API data off the Malaysian DOE website using R (which would make things a lot more convenient!), and once we've managed that, we will update the code accordingly.

# Load required libraries
# Set working directory
# Read data from csv file 
df <- read.table('USM.csv',header=F,sep=":")
# Rename columns
colnames(df) <- c("station_id","station_name","dates","hour","api","unk","unk1")
# Remove last 2 columns - not relevant information
df <- df[-c(6,7)]
# Add minute column
df$minute <- as.character("00")
df$time <- as.character(paste(df$hour,df$minute,sep = ":" ))
df$date <- as.POSIXct(paste(df$dates, df$time), format = "%Y-%m-%d %H:%M")
# Convert 0 to NA
df$api[df$api == 0] <- NA
# Date and time columns
df$weekday = as.numeric(format(as.POSIXlt(df$date),"%u"))
df$week <- as.numeric(format(as.POSIXlt(df$date),"%W"))
df$day = strftime(df$date,'%A')
df$jday <- strptime(df$date, "%Y-%m-%d")$yday+1
df$hour <- as.numeric(format(strptime(df$date, format = "%Y-%m-%d %H:%M"),format = "%H"))
##### Creating roseplot of pollutant magnitude over 24 hours #####
# Add year column and subset data for 2014
df$year <- as.numeric(format(df$date, "%Y"))
df2014 <- subset(df, year == 2014)
# Set Color Table
colortable = c("#99FFFF", "#00FFFF", "#00FF00", "#CCFF33", "#FFFF00",
               "#FFCC00", "#FF6600", "#FF3333", "#FF33CC", "#660033")
# Define station and cut data into ten parts
STN <- "USM"  # station name as it appears in the scraped data (assumed)
Time_of_Day = df2014$hour[df2014$station_name == STN]
mag = cut_number(df2014$api[df2014$station_name == STN], n = 10)
rosedata = data.frame(dir = Time_of_Day, mag = mag)
# Plot rose chart
rc <- ggplot(rosedata, aes(x = dir, fill = mag)) +
      geom_bar(binwidth = 1, drop = TRUE) +
      coord_polar() + xlab("") + ylab("") +
      scale_x_continuous(breaks = seq(0, 23),
      labels = c("12:00am","1:00am","2:00am","3:00am","4:00am","5:00am",
                 "6:00am","7:00am","8:00am","9:00am","10:00am","11:00am",
                 "12:00pm","1:00pm","2:00pm","3:00pm","4:00pm","5:00pm",
                 "6:00pm","7:00pm","8:00pm","9:00pm","10:00pm","11:00pm")) +
      ggtitle("\nDaily API readings in 2014") +
      scale_fill_manual(values=colortable,name="API Reading") +
      theme(legend.title=element_text(size=13,colour="grey20",face="bold",family="Myriad Pro"),
      legend.text=element_text(size=11,colour="grey20",face="bold",family="Myriad Pro"),
      axis.text.y=element_text(size=10,colour="grey20",family="Clear Sans"),
      axis.text.x=element_text(size=10,colour="grey20",face="bold",family="Clear Sans"),
      plot.title = element_text(lineheight=1, face="bold",size = 20, colour = "grey20",family="Graphik-Black"),
      plot.margin = unit(c(-0.5,0.25,-0.25,0.25), "in"))
# Save plot to png
### Reading weather data from Weather Underground using weatherData library ####
# Get weather data for 2014
we <- getWeatherForDate("WMKP", "2014-01-01","2014-12-31", opt_detailed=T, opt_custom_columns=T, custom_columns=c(1,6,7,8,9,13,14))
# Rename columns
colnames(we) <- c("date","date_myt","vis_km","wda","ws","gspd_kmh","wd","date_utc")
# Create date and time columns
we$hour <- as.numeric(format(strptime(we$date, format = "%Y-%m-%d %H:%M"),format = "%H"))
we$min <- as.numeric(format(strptime(we$date, format = "%Y-%m-%d %H:%M"),format = "%M"))
# Only use data on the hour
we_df<- subset(we, min == 0)
# Remove duplicate data
# Merge API and weather data
we_api <- merge(we_df, df2014, by = "date")
# Create yearmonth factor column for use in openair library
we_api$yearmonth <- as.factor(format(we_api$date, "%Y-%m"))
# Reformat wind speed column
we_api$ws[we_api$ws == "Calm"] <- 0
we_api$ws <- as.numeric(we_api$ws)
# Plot 'calendar plot' and save as png
png(filename = "D:/Projects/AP/USM_API_2014_CalendarPlot.png",height=8,width=10,
    bg = "white",units='in', res = 400, family = "",  type = "cairo-png")
calendarPlot(we_api, pollutant = "api",year="2014",
             main = "Daily average API readings at USM\n
             with wind-speed scaled wind direction overlay (2014)\n",
             key.header = "API\n(Air Pollutant Index)",
             cols="heat", annotate = "ws")
Created by Pretty R at

Monday, 23 February 2015

Why are the British tweeting about climate change at night?

Points of post:

  • To demonstrate how the twitteR library can be combined with the mapping capabilities of ggplot.
  • To reveal the locations of people tweeting about climate change at two points in time.
  • To show that climate change tweeters are mostly from the developed world.
  • To show that some (sad) tweeters are twittering about climate change in the middle of the night.

In a previous blog post, we used twitteR and wordcloud to summarize which other words were occurring in tweets (combined from around the world) together with the keywords, ‘climate-change’ and ‘Bangladesh’ at two arbitrarily selected time-points. 

The geo-locations of tweets are, however, also often available and are potentially very interesting and revealing. 

Fifteen-hundred (1500) is the maximum number of tweets that can be captured with a single call to the twitteR library. There are probably ways to get more data, but we guess you would then have to spend money.

Here we used #climatechange because it coincided with the last days of COP20 in Lima, Peru, which ran from 1st to 12th December 2014. It is an extremely important global forum at which nations can meet and discuss their options for reducing carbon emissions.

The first ‘tweet map’ (below) we produced is based on approximately 1500 geo-located tweets that contained the hash-tag, #climatechange, and which were ‘tweeted’ at about 10am GMT on the 11th December 2014. It shows that #climatechange tweets were coming from four main areas: North America, Europe, India and Australia. There didn’t appear to be many tweets coming out of Lima either, which surprised us. Maybe the delegates were too busy enjoying the South American hospitality, and catching up with old mates, to take much interest in climate change!

Geo-located tweets with #climatechange tweeted at around 10am GMT 11th December 2014

The second ‘tweet-map’ (below), also based on approximately 1500 geo-located #climatechange tweets, is a snapshot taken at around 3am GMT on the final day of the conference (12th December 2014). The overall pattern between the maps remains the same, but the relative frequency of #climatechange tweeters from Europe, as compared to North America, has increased. People in the United Kingdom were particularly keen, twittering like mad about climate change at 3am. Why? We don’t know.

Geo-located tweets with #climatechange tweeted at around 3am GMT 12th December 2014

Note that tweets are geo-located either by exploiting the user’s location as defined in their profile, or by ascertaining the exact location directly, if the user allows it, either from GPS-enabled software (which many people have installed on their smart-phones) or from an IP address. This means that not all tweets can be geo-located with any great precision. Some are only geo-located at the national and/or regional level, as is evident from the large circle in the middle of Australia. That’s to say, these cautious tweeters only gave ‘Australia’ as their location.

As we explained in an earlier blog post on wordclouds and twitteR, to pull the data from Twitter using its API you will need to have a Twitter account and carry out a 'twitter authentication'. The R code to search Twitter for the selected 'term(s)' and map out the results is detailed below.

# Load required libraries
# Set working directory
#### Fonts on Windows ####
windowsFonts(ClearSans="TT Clear Sans")
# Load Credentials
load("D:/ClimData/Twitter/twitter authentification.Rdata")
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
# Search term on twitter
searchTerm <- "#climatechange"
searchResults <- searchTwitter(searchTerm,n=1500,since='2014-12-11', until='2014-12-12')  
tweetFrame <- twListToDF(searchResults) 
userInfo <- lookupUsers(tweetFrame$screenName)  
userFrame <- twListToDF(userInfo)
locatedUsers <- !$location)
# Geocode locations using 'ggmap' library
locations <- geocode(userFrame$location[locatedUsers])
locations_robin <- project(as.matrix(locations), "+proj=robin")
locations_robin_df <-
# Import world boundaries
world <- readOGR(dsn="D:/Data/ne_10m_admin_0_countries", layer="ne_10m_admin_0_countries")
world_robin <- spTransform(world, CRS("+proj=robin"))
world_robin_df <- fortify(world_robin)
counts <- aggregate(locations_robin_df$V1,by=list(x=locations_robin_df$V1,y=locations_robin_df$V2),length)
names(counts)[3] <- "count"
# Theme options for Map
theme_opts <- list(theme(panel.grid.minor = element_blank(),
                         panel.grid.major = element_blank(),
                         panel.background = element_blank(),
                         panel.border = element_blank(),
                         plot.background = element_blank(),
                         axis.line = element_blank(),
                         axis.text.x = element_blank(),
                         axis.text.y = element_blank(),
                         axis.ticks = element_blank(),
                         axis.title.x = element_blank(),
                         axis.title.y = element_blank(),
                         legend.position = "bottom",
                         legend.key = element_blank(),
                         legend.title = element_text(colour="black", size=12, face="bold",family="Clear Sans"),
                         legend.text = element_text(colour="black", size=10, face="bold",family="Clear Sans"),
                         plot.title = element_text(size=15,face="bold",lineheight=0.5,family="Clear Sans")))
# Plot map and tweet counts 
tp <- ggplot(world_robin_df)+
      geom_polygon(aes(x = long, y = lat, group = group), fill = "grey20")+
      geom_path(aes(x = long, y = lat, group = group),colour = "grey40", lwd = 0.2)+
      geom_point(data= counts, aes(x=x,y=y,size=count),color="#32caf6", alpha=I(8/10))+
      scale_size_continuous(name="Number of tweets")+
      ggtitle("Twitter Map of #climatechange\n")+
      xlab("")+ ylab("")+
      theme_bw() + 
      guides(size = guide_legend(title.position = "top", title.hjust = 0.5)) +
      theme_opts
# Save to png

Friday, 30 January 2015

The recent deluge in Malaysia - using raincpc to map extreme rainfall events

Key point of post

  • To describe an application of raincpc to map the rainfall that led to the recent (December 2014) floods in Malaysia

Raincpc is a new library for R that exploits the Climate Prediction Center’s (CPC) global precipitation datasets (1979 to present, ~50 km resolution). It renders CPC’s rainfall data more readily available for plotting and visualization, allowing any user to conveniently side-step problems relating to changing data formats, file-naming conventions etc. And all this free of charge!

We thought it would be fun to demonstrate the use of raincpc, focusing on Malaysia which experienced devastating floods over the Christmas and New Year period, leading to the evacuation of thousands of people. The damage has been estimated at ~2 billion RM.

In the plot below we used raincpc to show the amount of rain that fell over south-east Asia between 17th and 24th December 2014. It confirms that rainfall was indeed particularly heavy along the east coast of peninsular Malaysia; but also over northern Sumatera. Penang was certainly wet during December but the island had nothing like the amount of rainfall endured by communities on Malaysia’s east coast.

We do not know what caused the extreme rainfall that led to the flooding. Meteorologists think that it is related to the ‘Madden-Julian Oscillation’ and its interaction with the north-east monsoon. Very heavy rain is of course common in the tropics, but it doesn’t necessarily lead to flooding if drainage is adequate. Some experts think that rampant deforestation in Malaysia has led to more siltation of rivers, in effect blocking Malaysia’s drains, which exacerbates the impact of extreme rainfall events.

As usual the R-code for producing the map is outlined below.
## Load package libraries
# Set working directory
## Get raw CPC rain data - data has a 2 day lag
cpc_get_rawdata(2014, 12, 17, 2014, 12, 24)
## Read raw CPC rain data into raster grids
rain1 <- cpc_read_rawdata(2014, 12, 17)
rain2 <- cpc_read_rawdata(2014, 12, 18)
rain3 <- cpc_read_rawdata(2014, 12, 19)
rain4 <- cpc_read_rawdata(2014, 12, 20)
rain5 <- cpc_read_rawdata(2014, 12, 21)
rain6 <- cpc_read_rawdata(2014, 12, 22)
rain7 <- cpc_read_rawdata(2014, 12, 23)
rain8 <- cpc_read_rawdata(2014, 12, 24)
# Combine multiple day rasters
rain_tot <- rain1 + rain2 + rain3 + rain4 + rain5 + rain6 + rain7 + rain8
# Convert raster grid to a data frame for ggplot
raster_ggplot <- function(rastx) {
  stopifnot(class(rastx) == "RasterLayer")
  gfx_data <- getXYcoords(rastx)
  # lats need to be flipped
  gfx_data <- expand.grid(lons = gfx_data$x, lats = rev(gfx_data$y), 
                          stringsAsFactors = FALSE, KEEP.OUT.ATTRS = FALSE)
  gfx_data$rain <- rastx@data@values
  return(gfx_data)
}
rain_gg <- raster_ggplot(rain_tot)
# Read shapefile of country boundaries (shapefiles can be downloaded from
bounds <- readOGR(dsn="D:/Data/World_Borders", layer="TM_WORLD_BORDERS-0.3")
## Extents of ggplot map (set xmin, xmax, ymin and ymax to the region of interest first)
interval <- (xmax - xmin) / 5
lon_vals <- seq(xmin, xmax, 0.5)
lat_vals <- seq(ymin, ymax, 0.5)
# Set theme options
theme_opts <- list(theme(panel.grid.minor = element_blank(),
                         panel.grid.major = element_blank(),
                         panel.background = element_rect(fill="grey95"),
                         panel.border = element_rect(colour="black"),
                         axis.line = element_blank(),
                         axis.text.x = element_blank(),
                         axis.text.y = element_blank(),
                         axis.ticks = element_blank(),
                         axis.title.x = element_blank(),
                         legend.text=element_text(size=14,family="Myriad Pro Cond"),
                         legend.title=element_text(size=16,family="Myriad Pro Cond"),
                         plot.title = element_text(size=23,face="bold",vjust = -10,hjust=0.96,family="Graph Black"),
                         legend.position = c(0.17, 0), 
                         legend.justification = c(1, 0), 
                         legend.background = element_blank(),
                         axis.title.y = element_blank()))
# Plot rainfall map
rf <-  ggplot()+
       geom_raster(data=rain_gg,aes(x=lons,y=lats,fill=rain),alpha=0.8) +
       scale_fill_gradientn(colours=c("#FFFFFF","#ADD8E6","#95BBE9","#7E9EEC","#6781F0","#5064F3","#3948F6","#222BFA","#0B0EFD","#0A02FE","#1F06FC","#350AFA","#4A0EF8","#6013F6","#7517F3"),limits=c(0,1200),na.value=NA, name="Rainfall (mm)\n")+
       geom_polygon(data=bounds, aes(long,lat, group=group),colour="grey30",fill="transparent",size=0.35)+
       coord_equal(xlim=c(xmin,xmax),ylim=c(ymin,ymax)) + theme_bw()+
       ggtitle("Extreme rainfall event over Malaysia\n(17th to 24th of December 2014)\n")+
       xlab("") + ylab("")+ theme_opts +
       annotate("text", x = 115.00, y = -9.5, label = "Data source: Climate Prediction Center - NOAA (2014)",size=5,family="Myriad Pro Cond") 
# Save rainfall map to png file

Thursday, 8 January 2015

2014’s weather in Penang: a brief summary

Key points of post:

  • 2014 was the 2nd warmest year in Penang since 1975.
  • During 2014 average daily air temperatures exceeded record highs (since 1975) on 54 days.
  • Nearly 50% of days in June 2014 broke record temperature highs.
  • Two unusually cold days were experienced in late December 2014.
  • 2014 was also the driest since 2005.

We’ve been blogging about weather in Penang on and off for a year now and thought it would be interesting to review the past year as we move into 2015.

In March 2014, we described the unusually dry start to 2014, the forest fires on Penang Island, and the weak negative relationship between rainfall and temperature. In a subsequent blog post, we speculated on whether the total cumulative rainfall in Penang in 2014 would be able to catch up with more ‘normal’ amounts. Since then, we’ve experienced a rather wet December in Penang, which reflects the more substantial precipitation that fell elsewhere in Malaysia.

The plot below summarizes the seasonal change in air temperature at Penang International Airport based on daily observations. The 365 vertical black lines are the average air temperatures each day, ± 2 standard deviations, and represent the ‘normal’ temperature ranges for Penang based on 40 years of observations (i.e. 1975 to 2014). The buff colored lines denote the range between the record temperature highs and lows each day.

The solid black line describes the average daily air temperature for 2014. The red circles on this line represent days in 2014 when the average temperatures exceeded the 40-year record high, i.e. they were exceptionally warm days. Similarly, the blue circles denote days that were extraordinarily ‘cold’.
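The per-day bands in a plot like this reduce to a simple summary over the 40 years of data: for each calendar day, the long-term mean ± 2 standard deviations (the ‘normal’ range) plus the record high and low. A sketch with synthetic data (column names are illustrative, not taken from the code further below):

```r
# Synthetic daily temperatures for 1975-2014
set.seed(42)
temps <- expand.grid(year = 1975:2014, daynum = 1:365)
temps$temp <- 28 + 1.5 * sin(temps$daynum * 2 * pi / 365) +
              rnorm(nrow(temps), 0, 0.8)

# For each calendar day: mean +/- 2 sd, and the record high/low
day_stats <- aggregate(temp ~ daynum, data = temps,
                       FUN = function(x) c(mean   = mean(x),
                                           lo     = mean(x) - 2 * sd(x),
                                           hi     = mean(x) + 2 * sd(x),
                                           rec_lo = min(x),
                                           rec_hi = max(x)))
bands <- data.frame(daynum = day_stats$daynum,$temp))
head(bands, 3)
```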

Overall 2014 was very warm, the black line being well above the ‘normal’ range for most months. Both June and July 2014 were unusually hot in Penang. [Luckily I was in the UK on leave at that time!] To put this into perspective, during June 2014 average temperatures broke record highs on 14 out of 30 days, that’s to say on nearly 50% of days. Overall in 2014, average air temperatures exceeded the 40-year record high on an amazing 54 days (~15% of all days).

Those of us who were in Penang during Christmas 2014 will recall the unusual amount of rain which, certainly in the tropics, tends to depress air temperatures. This feature of Penang’s recent climate is also well captured by our graphic, the black line being below the normal range for the last two weeks of December, when two days also experienced record lows. Indeed I was on Monkey Beach on 23rd December. It felt more like Scotland and I wished I’d brought a substantial anorak.

Cumulative rainfall for the last decade is plotted below. It shows that 2014 never caught up and was an unusually dry year overall. Annual rainfall in Penang typically ranges from 2250 to 2900 mm, but 2014’s total was well below 2000 mm! The high temperatures observed in June also coincided with no rain; see the flat line during June in the plot below.
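The cumulative curves boil down to a running total of daily rainfall within each year; in base R that is just cumsum() applied per year (synthetic data below; the code further down does the equivalent with ddply):

```r
# Two years of synthetic daily rainfall (mm)
set.seed(7)
rain <- data.frame(year = rep(2013:2014, each = 365),
                   day  = rep(1:365, 2),
                   mm   = rpois(730, 6))

# Running total within each year
rain$cumRain <- ave(rain$mm, rain$year, FUN = cumsum)
rain$cumRain[rain$year == 2013][365]  # annual total for 2013
```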

The temperature plot produced above is based on Tufte’s illustration of New York’s weather in 2003, published in the New York Times on January 4, 2004, and also on his classic book The Visual Display of Quantitative Information, 2nd Ed. (page 30).

The code presented below, used to produce the temperature plot, has been modified slightly from the code included in a post published on RPubs by Brad Boehmke. The temperature and precipitation data used for the plots above were acquired from the usually dependable NOAA NCDC Global Summary of the Day (GSOD).

# Code to produce Temperature Plot
# Load required libraries
# Load font
windowsFonts(GraphBlack="TT Graph Black")
# Set working directory
# Read weather data downloaded from NOAA NCDC GSOD
# Rename columns
# Reformat columns
dat$yearmoda <- strptime(dat$yearmoda,format="%Y%m%d")
dat$tempdc <- (dat$temp-32) * (5/9)
dat$year <- as.numeric(format(dat$yearmoda,"%Y"))
dat$month <- as.numeric(format(dat$yearmoda,"%m"))
dat$day <- as.numeric(format(dat$yearmoda,"%d"))
temp <- dat[c(23,24,25,26)]
names(temp) <- c("temp", "year", "month", "day")
temp <- temp %>% group_by(year) %>% mutate(daynum = seq_along(year))
# Set up plain chart with min-max range and 95% CI
(p <- ggplot(temp, aes(x = daynum, y = temp)) + 
      stat_summary(geom = "linerange", 
      fun.ymin = min, 
      fun.ymax = max, 
      color = "wheat2") +
      stat_summary(geom = "linerange", 
      fun.ymin = function(x) mean(x) - 2 * sd(x)/sqrt(length(x)),
      fun.ymax = function(x) mean(x) + 2 * sd(x)/sqrt(length(x)), 
      color = "wheat4") + 
      geom_line(data = filter(temp, year == 2014)))
# Data frame containing all days in 2014 with extreme weather
df_maxmin <- temp %>%
             group_by(daynum) %>%
             mutate(max_temp = max(temp), 
             min_temp = min(temp)) %>%
             filter(year == 2014, (temp %in% c(max_temp, min_temp))) %>%
             mutate(max_min = temp == max_temp) # Dummy variable to be mapped to color
# Data frame with x-axis breaks and labels
df_xaxis <- temp %>% filter(year == 2014, month != lead(month)) %>%     # Last day of month
            mutate(days_in_month = daynum - lag(daynum),                # Days in month
            midpoint = lag(daynum) + days_in_month / 2)                 # Month midpoints
df_xaxis$midpoint[1] <- 31 / 2
(p <- p  +
      geom_vline(xintercept = 0, color = "wheat4", size = 1) +
      geom_hline(yintercept = seq(22, 32, 2), color = "white") +
      geom_vline(xintercept = df_xaxis$daynum, 
      color = "wheat4", linetype = "dotted", size = 0.5) +
      geom_point(data = df_maxmin, aes(color = max_min), show_guide = FALSE))
(p <- p +
      scale_x_continuous(expand = c(0,0), labels =,
      breaks = c(df_xaxis$midpoint, df_xaxis$daynum[11] + (31/2))) +
      scale_y_continuous(expand = c(0,0), breaks = seq(22, 32, 2),
      labels = function(x) parse(text = paste0(x, "*degree"))) +
      scale_color_manual(values = c("blue3", "firebrick3")))
(p <- p + theme(axis.ticks = element_blank(), 
      panel.grid = element_blank(),
      panel.background = element_blank(),
      panel.border = element_blank(),
      axis.text = element_text(color = "gray30"),
      plot.title = element_text(face = "bold", hjust = 0.012, 
      vjust = 0.8, color = "#3C3C3C", size = 25,family="Graph Black")) +
      labs(x = NULL, y = NULL, title = "Penang's Weather in 2014"))
desc <- "Data represents average daily temperatures. Temperature data used starts from 
January 1, 1975. Average temperature for the year was 28.4° making 2014
the 2nd warmest year since 1975" %>% 
strwrap(width = 0.75 * getOption("width")) %>% 
paste0(collapse = "\n")
# Data frame with annotations
df_annotate <- data_frame(
               daynum = c(17, 287), temp = c(24.5, 30.5), max_min = c(FALSE, TRUE), 
               label = c("We had 4 days that were the\ncoldest since 1975", 
               "We had 54 days that were\nthe hottest since 1975"))
(p <- p + 
      annotate("text", x = 5, y = 31.7, size = 4, fontface = "bold", 
      hjust = 0, vjust = 0, label = "Temperature",family="Clear Sans") +
      annotate("text", x = 5, y = 31.6, size = 3, color = "gray30", 
      hjust = 0, vjust = 1, label = desc,fontface = "bold",family="Clear Sans") +
      geom_segment(data = df_annotate,aes(x = c(15, 285), xend = c(10, 282), 
      y = c(24.5, 30.5), yend = c(25.5, 29.8),
      color = c(FALSE, TRUE)), show_guide = FALSE) + 
      geom_text(data = df_annotate, aes(color = max_min, label = label), 
      size = 3, hjust = 0, show_guide = FALSE,family="Clear Sans",fontface="bold"))
# Data frame with legend label coordinates
df_leg_text <- data_frame(daynum = c(186, 145, 184, 184), 
               temp = c(23.5, 23.5, 24,23), 
               label = c("NORMAL RANGE", "2014 TEMPERATURE", 
               "RECORD HIGH", "RECORD LOW"))
# Data frame with legend shape coordinates
df_leg_seg <- data_frame(daynum = c(181, 181, 183, 183, 185), 
              xend = c(181, 181, 185, 185, 185),
              temp = c(23, 23.25, 23.75, 23.25, 23.25),
              yend = c(24, 23.75, 23.75, 23.25, 23.75), 
              size = c(3, 3, 0.5, 0.5, 0.5), 
              color = c("wheat2", rep("wheat4", 4)))
p1 <- p + 
      geom_segment(data = df_leg_seg, aes(xend = xend, yend = yend), 
      size = df_leg_seg$size, color = df_leg_seg$color) +
      geom_line(data = data_frame(daynum = seq(175, 182), temp = rnorm(8,23.5,0.15))) +
      geom_text(data = df_leg_text, aes(label = label), hjust = 0, size = 2,fontface = "bold",family="Clear Sans")
# Save plot to png


# Code to produce cumulative precipitation plot
# Load required libraries
# Setting work directory
# Reading and reformatting GSOD raw data downloaded from NCDC
dat$yearmoda <- strptime(dat$yearmoda,format="%Y%m%d")
dat$prcp <- as.character(dat$prcp)
dat$prcp1 <-as.numeric(substr(dat$prcp,1,4))
dat$prcpflag <- substr(dat$prcp,5,5)
# Convert precipitation from inches to mms
dat$rain  <- dat$prcp1*25.4
# Remove erronous values
dat$rain[dat$rain > 1000 ] <- NA
dat$year <- as.numeric(format(dat$yearmoda,"%Y"))
dat$month <- as.numeric(format(dat$yearmoda,"%m"))
dat$day <- as.numeric(format(dat$yearmoda,"%d"))
# Getting cumulative sum of rain/year
# Subsetting required period
dat2 <- subset(dat, year >= 2005 )
# Extracting required columns for transforming data
dat3 <- dat2[, c(25,29)]
# Replace na's with 0's for ddply function
dat3$rain[$rain)] <- 0
dat3 <- ddply(dat3,.(year(date)),transform, cumRain = cumsum(rain))
dat4 <- ddply(dat3,.(date,year(date)),summarize, max = max(cumRain))
dat5 <- dat4[c(diff(as.numeric(substr(dat4$date, 9, 10))) < 0, TRUE), ]
dat5$year <- as.numeric(format(dat5$date,"%Y"))
dat5$month <- as.numeric(format(dat5$date,"%m"))
dat5$day <- as.numeric(format(dat5$date,"%d"))
# Calculate julian day
dat5$jday <- strptime(dat5$date, "%Y-%m-%d")$yday+1
# Data frame with x-axis breaks and labels
dataxis <- dat5 %>% group_by(year) %>% mutate(daynum = seq_along(year))
df_xaxis <- dataxis %>% filter(year == 2014, month != lead(month)) %>%  # Last day of month
            mutate(days_in_month = daynum - lag(daynum),                # Days in month
            midpoint = lag(daynum) + days_in_month / 2)                 # Month midpoints
df_xaxis$midpoint[1] <- 31 / 2
# Plot cumulative rainfall
cr<-  ggplot(dat3, aes(x = yday(date), y = cumRain, color = factor(year(date)))) +
      geom_line(size=0.5,linetype='solid') + geom_point(size=1.2) + theme_bw() +
      ggtitle("Penang's Cumulative Rainfall by Year (2005 - 2014)") + 
      guides(color = guide_legend(title = "Year", title.position = "top")) +  
      geom_hline(yintercept = seq(0,3000, by=500), color = "wheat4",linetype="dotted",size=0.5) +
      geom_vline(xintercept = df_xaxis$jday, color = "wheat4", linetype = "dotted", size = 0.5) +
      geom_vline(xintercept = 0, color = "grey20", size = 1) + 
      scale_x_continuous(expand = c(0, 0), limits=c(0,380),
      breaks = c(15,45,75,105,135,165,195,228,258,288,320,350),
      labels = c("January", "February", "March", "April",
      "May", "June", "July", "August", "September",
      "October", "November", "December"))+
      xlab("") + ylab("Rainfall (mms)\n")+ 
      theme(panel.border = element_rect(colour="grey20",size=0.5),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.title.y=element_text(size=14,face="plain",family="Clear Sans"),
      axis.text.x=element_text(size=12,face="plain",family="Clear Sans"),
      axis.text.y=element_text(size=12,face="plain",family="Clear Sans"),
      legend.text=element_text(size=10,face="plain",family="Clear Sans"),
      legend.title=element_text(size=10,face="bold",family="Clear Sans"),
      plot.title=element_text(size=20,face="bold",family="Clear Sans",hjust = 0.012, vjust = 1.2))
cr <- cr + geom_text(data = subset(dat5, jday > 350), aes(x = yday(date), y = max, label = year(date)), size=4, vjust=-0.2, hjust=-0.2, fontface="bold", family="Clear Sans")
# Save plot to png
ggsave(cr, file="Cumulative_RF_Penang_r1.png", dpi=500,width=15, height=7,type = "cairo-png")