Questions as an initiator of data projects

Unlike projects in many other fields, data-science projects should start with focused questions. A table of data is of little use in itself unless we analyze and understand it with a specific goal in mind. Without a predefined goal we would not know where to start: what to look for in the data, how to analyze it, which extraneous elements to remove, and so on.

Take an example. The dataset at the following link gives the monthly, seasonal and annual maximum temperatures for India from 1901 to 2017. The default format is CSV.

https://data.gov.in/resources/monthly-seasonal-and-annual-max-temp-series-1901-2017-0

We will import the data with R and check some fields to get a feel for the data given.

# Import the data and look at the first six rows and the last six rows.

> weather_data <- read.csv(file = 'E:/localhost/datascience/Max_Temp_IMD_2017.csv')
> head(weather_data)
  YEAR   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT
1 1901 22.40 24.14 29.07 31.91 33.41 33.18 31.21 30.39 30.47 29.97
2 1902 24.93 26.58 29.77 31.78 33.73 32.91 30.92 30.73 29.80 29.12
3 1903 23.44 25.03 27.83 31.39 32.91 33.00 31.34 29.98 29.85 29.04
4 1904 22.50 24.73 28.21 32.02 32.64 32.07 30.36 30.09 30.04 29.20
5 1905 22.00 22.83 26.68 30.01 33.32 33.25 31.44 30.68 30.12 30.67
6 1906 22.28 23.69 27.31 31.93 34.11 32.19 31.01 30.30 29.92 29.55
    NOV   DEC ANNUAL JAN.FEB MAR.MAY JUN.SEP OCT.DEC
1 27.31 24.49  28.96   23.27   31.46   31.27   27.25
2 26.31 24.04  29.22   25.75   31.76   31.09   26.49
3 26.08 23.65  28.47   24.24   30.71   30.92   26.26
4 26.36 23.63  28.49   23.62   30.95   30.66   26.40
5 27.52 23.82  28.30   22.25   30.00   31.33   26.57
6 27.60 24.72  28.73   23.03   31.11   30.86   27.29

> tail(weather_data)
    YEAR   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT
112 2012 23.61 26.44 30.20 32.46 34.30 33.60 31.88 30.96 30.65 30.20
113 2013 24.56 26.59 30.62 32.66 34.46 32.44 31.07 30.76 31.04 30.27
114 2014 23.83 25.97 28.95 32.74 33.77 34.15 31.85 31.32 30.68 30.29
115 2015 24.58 26.89 29.07 31.87 34.09 32.48 31.88 31.52 31.55 31.04
116 2016 26.94 29.72 32.62 35.38 35.72 34.03 31.64 31.79 31.66 31.98
117 2017 26.45 29.46 31.60 34.95 35.84 33.82 31.88 31.72 32.22 32.29
      NOV   DEC ANNUAL JAN.FEB MAR.MAY JUN.SEP OCT.DEC
112 28.11 25.34  29.81   25.03   32.33   31.77   27.88
113 27.83 25.37  29.81   25.58   32.58   31.33   27.83
114 28.05 25.08  29.72   24.90   31.82   32.00   27.81
115 28.10 25.67  29.90   25.74   31.68   31.87   28.27
116 30.11 28.01  31.63   28.33   34.57   32.28   30.03
117 29.60 27.18  31.42   27.95   34.13   32.41   29.69
> 
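
Beyond head() and tail(), a few base R calls give a quick structural overview. The sketch below is only an illustration: sample_data is a hypothetical name, and it holds just the YEAR and APR values from the twelve rows printed above, not the full file. With the real data you would call the same functions on weather_data.

```r
# YEAR and APR values copied from the twelve rows printed above
# (first six and last six years only, not the full dataset).
sample_data <- data.frame(
  YEAR = c(1901:1906, 2012:2017),
  APR  = c(31.91, 31.78, 31.39, 32.02, 30.01, 31.93,
           32.46, 32.66, 32.74, 31.87, 35.38, 34.95)
)

str(sample_data)              # column names and types
dim(sample_data)              # number of rows and columns
colSums(is.na(sample_data))   # missing values per column
range(sample_data$YEAR)       # first and last year covered
```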

As we can see, the data lists maximum temperatures for each year from 1901 to 2017. Each row holds the twelve monthly values, an annual value, and four seasonal values (JAN.FEB, MAR.MAY, JUN.SEP, OCT.DEC). So now that we have the data, where do we start? Well, honestly, I do not know. Unless someone asks me a specific question relating to this dataset, I would just be staring at it. But what if someone asks me a particular question?

Has the maximum temperature in the month of April increased over the years?

With that particular question in hand, I’m ready to answer it using the data given. I can now ignore most of the columns in the dataset and keep only the year and the ‘April’ column.

# Modify the dataframe to only include columns 1 and 5 (YEAR, APR)
> weather_data <- weather_data[, c(1, 5)]
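
Selecting by position works here, but it depends on the column order of the file. As a small sketch, the same subset can be taken by column name, which keeps working if the columns are ever reordered. The data frame toy is a hypothetical stand-in built from the first two rows printed earlier.

```r
# Hypothetical two-row stand-in using values from the first two rows above.
toy <- data.frame(
  YEAR = c(1901, 1902),
  JAN  = c(22.40, 24.93),
  FEB  = c(24.14, 26.58),
  MAR  = c(29.07, 29.77),
  APR  = c(31.91, 31.78)
)

by_position <- toy[, c(1, 5)]           # relies on APR being the fifth column
by_name     <- toy[, c("YEAR", "APR")]  # relies only on the column names

# Both approaches return the same two columns for this layout.
print(all.equal(by_position, by_name))
```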

As this is a simple time series, we can plot it easily using ggplot2. Plotting April's maximum temperature against the year, with a smoothed trend line (loess) added, gives the following graph. The smoothed curve shows that the maximum temperature in April has indeed increased over the years.

library(ggplot2)

weather_data <- read.csv(file = 'E:/localhost/data/Max_Temp_IMD_2017.csv')

ggplot(data = weather_data, aes(x = YEAR, y = APR)) +
    ggtitle("Maximum temperature in April from 1901 - 2017") +
    geom_line() + 
    geom_smooth(method="loess")
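
A plot answers the question visually; a rough numerical check is to fit a straight line and look at the sign of its slope. The sketch below is a swapped-in technique (ordinary linear regression with lm), not part of the original analysis, and it uses only the twelve YEAR and APR values printed earlier under the hypothetical name sample_data. Because those rows cover just the first and last six years, the slope is illustrative only; with the real data, pass weather_data instead.

```r
# YEAR and APR values from the twelve rows printed earlier (illustrative subset).
sample_data <- data.frame(
  YEAR = c(1901:1906, 2012:2017),
  APR  = c(31.91, 31.78, 31.39, 32.02, 30.01, 31.93,
           32.46, 32.66, 32.74, 31.87, 35.38, 34.95)
)

# Fit APR as a linear function of YEAR.
fit <- lm(APR ~ YEAR, data = sample_data)

# Slope in degrees per year; a positive value means April maxima rose over time.
slope <- coef(fit)[["YEAR"]]

cat("Estimated change per decade:", round(slope * 10, 3), "\n")
cat("Trend is increasing:", slope > 0, "\n")
```

A straight line is a cruder summary than the loess curve, but it reduces the question to a single number whose sign can be checked directly.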
