Crime in Montreal: Lower in Winter, and Lower at Night

Since I’m currently visiting the Montreal area, I figured why not become more familiar with the city and its crime. But why crime? Surely there are nicer things to learn about Montreal than its crime rates? The reason I’ve picked crime is that police data on crimes is usually publicly accessible and contains geographic and date information. It’s an easy source of data to explore.

Background on Data

I first discovered this visualization when looking for Montreal crime data. Besides the fact it is only in French, it only allows you to filter by type of category and date. Once I found this visualization though, it was only a hop and a skip to finding the raw data behind it: http://donnees.ville.montreal.qc.ca/dataset/actes-criminels

Also incredibly helpful is the fact that this page provides some background on methodology and a data dictionary. You can read the full details on the website, but essentially, for privacy reasons, the location of each incident is reported at a street intersection and the exact time is grouped into periods of day, evening, and night.

I noticed right away that some information provided in the raw data isn’t captured by the current visualization, such as the time of day and the neighborhood station number that covers the area where the criminal act occurred. Perhaps we can explore this information more.
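As a rough sketch of the kind of exploration the raw data allows, the snippet below tallies incidents by time of day and by station number using only Python's standard library. The column names (CATEGORIE, DATE, QUART, PDQ) and the sample rows are assumptions made for illustration, so check them against the data dictionary on the portal before pointing this at the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the column names are assumed, not taken from the post.
SAMPLE_CSV = """CATEGORIE,DATE,QUART,PDQ
Introduction,2018-01-15,jour,21
Vol de véhicule à moteur,2018-01-20,nuit,38
Méfait,2018-07-04,soir,21
Introduction,2018-07-09,jour,12
Méfait,2018-08-11,jour,21
"""


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_period = count_by(rows, "QUART")   # day / evening / night
by_station = count_by(rows, "PDQ")    # neighborhood station number

print(by_period.most_common())
print(by_station.most_common())
```

To run it on the downloaded file, replace the `io.StringIO(SAMPLE_CSV)` handle with an opened copy of the CSV.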

Building the Visualizations

I decided to use Tableau to build my visualizations because 1) it is pretty, 2) it is easy to publish and share, and 3) the time investment is lower compared to other options (I didn’t want to spend more than a couple of hours on it).

One thing I had hoped to do in Tableau was to let the user switch the visualization between English and French. Unfortunately, while you can change the language of the Tableau software, you can’t really have labels/titles shift based on a language selection parameter (https://community.tableau.com/ideas/1649). Apparently the closest you can come is hacking the XML, but that still results in separate workbooks for each language (https://tableauandbehold.com/2015/08/20/full-localization-of-tableau-workbooks/). Obviously the level of workaround is quite high for this particular issue and doesn’t even result in an ideal finished product. Cue the hate part of the hate/love relationship with Tableau.

Despite this initial disappointment, I was at least able to easily change the aliases/titles to their English translations (apologies in advance for any mistranslations; I was relying on Google Translate). After some dragging/dropping and playing around with colors, I was able to produce this.

Conclusion

Based on my visualizations, it looks like overall, most crime incidents in Montreal happen during the day (i.e. not evening or night).

This in itself isn’t terribly surprising. Many people commute into Montreal for work, so they aren’t staying there during evening/night hours. One might suppose that when there are fewer people in Montreal, there tend to be fewer incidents. When we look at incidents by category, however, we begin to see a different story emerge. For example, breaking and entering into a residence tends to have an equal number of incidents between day hours and evening hours.

Also defying the overall trend, it seems that acts of robbery occur the most often in the evening.

And for incidents like murder, we don’t see a particular pattern at all.

We see another story when we look at criminal acts reported by time (i.e. month). It seems that, in general, fewer criminal acts are registered in the winter months.

Visually, it looks like January and February are low points in incidents across all categories, although the drop is more severe in some categories than in others. If I had to guess why, I would say this is largely due to how cold it is in Montreal in the winter. People are less likely to hang around outdoors and there is probably also a drop in tourism to the island.
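For anyone who would rather check the seasonal pattern outside of Tableau, here is a minimal sketch of the monthly tally. The dates are made-up stand-ins for the date column of the real data, and treating December through February as "winter" is my own assumption.

```python
from collections import Counter
from datetime import date

# Hypothetical incident dates standing in for the date column of the real data.
incident_dates = [
    date(2018, 1, 15), date(2018, 2, 3), date(2018, 6, 21),
    date(2018, 7, 4), date(2018, 7, 19), date(2018, 8, 11),
]

WINTER_MONTHS = {12, 1, 2}  # assumption: December-February counts as winter


def incidents_per_month(dates):
    """Return a Counter mapping month number (1-12) to incident count."""
    return Counter(d.month for d in dates)


per_month = incidents_per_month(incident_dates)
winter_total = sum(n for month, n in per_month.items() if month in WINTER_MONTHS)
other_total = sum(per_month.values()) - winter_total

print(sorted(per_month.items()))
print("winter:", winter_total, "rest of year:", other_total)
```

Since winter covers three months and the rest of the year covers nine, a fair comparison on real data would look at the average per month rather than the raw totals.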

With regard to my last visualization, I won’t go into too much detail, other than to say that as you filter between different categories, you will notice that certain station numbers deal with certain incidents more often than others.

Future Directions

Hopefully this was a good crash course on exploring the SPVM data on registered criminal acts, but there is always room for improvement. If I wanted to invest some more time, I could probably put in some different views by station number and add in the geography for those neighborhoods by finding a secondary source. I could also add in data on temperature highs by date from another source.

Tableau: Pros and Cons

Over the last few years, we have seen a rise in the use of Tableau in industry. For those not familiar with Tableau, it is a software product that you use for data visualizations. I highly recommend you check out the public gallery so you can get an idea of the range of visualizations that can be created: https://public.tableau.com/en-us/s/gallery

As someone who has used Tableau extensively over the last couple of years, I like to think I have a pretty good grasp on Tableau.

Pros

Let’s be honest here. The best thing about Tableau is how visually appealing the visualizations you can create are. It really is the most attractive quality it has. Plus, it is so easy to make it interactive and to quickly change features on it. That is not something that I would say about visualizations I have to code by hand (R Shiny, etc.)

Cons

It is proprietary software. You are learning how to use it, not how to code something up. And what makes it so easy to use can become a curse if you are trying to do something very specific and custom. Do you want a diagonal line in your plot? Well, be prepared to go through a bunch of hoops and hacks to do so in software that is supposed to be intuitive and easy to use. But if I’m going to be honest, one of the things that bugs me the most is how confusing all the software and pricing options are, and how often they change: https://www.tableau.com/pricing

Pricing

As of right now, Tableau Creator is an individual license for Tableau Desktop and Tableau Prep with the ability to create for Tableau Server/Online. Tableau Desktop is actually the software you use to create visualizations. Tableau Prep is what you can use to prep the data for visualization work. And you can use Tableau Server or Tableau Online to share your visualizations with others. It should be noted that these options are separate from Tableau Public. Tableau Public is where visualizations can be shared publicly with anyone (no privacy options). You could use Tableau Public and not download Tableau Desktop, but the functionality in Tableau Public is fairly limited.

Before we move on in the pricing conversation, I feel like I should talk about Tableau Prep. This is new software that has come out in the last couple of years in response to complaints about how hard it is to manipulate data and its shape within Tableau Desktop. When I was told they were working on something to make it easier, I assumed it was going to be within Tableau Desktop, but instead we got a separate piece of software. In all honesty, I don’t really use Tableau Prep. As someone from the data analysis field, I have other tools (e.g. R and Python) that are free to use and often already built into my data pipelines. Why use proprietary software when I can just as easily do data manipulation on my own? Because of this, I wish they offered a license option without Tableau Prep.

Then of course, you have the group options which can allow explorers/viewers. You can set up something on-premises/cloud (Tableau Server) or have one hosted by Tableau (Tableau Online). However, you will need at least one ‘Tableau Creator’ and the fine print specifies that for the explorers/viewers categories there are required minimums. Now what are the distinctions between a Tableau viewer and explorer? A Tableau viewer is someone who can view a visualization within a group server and interact with it but they can’t download the data or edit existing visualizations, or manage user permissions. A Tableau explorer is someone who has all the ability of a Tableau viewer, and can download data, edit existing visualizations, and manage user permissions. In the workplaces I’ve been in though, not many people fall under the Tableau explorer category. The people who are going to edit an existing visualization are likely the same people who are going to create and publish new workbooks, and would need a creator license anyway.

Note: If you work in academia (student/instructor/researcher) and are using Tableau for non-commercial purposes, you can get ahold of Tableau Desktop (i.e. Creator license) for free.

Student: https://www.tableau.com/academic/students

Instructor: https://www.tableau.com/academic/teaching

It is a little unclear which method researchers should use to get ahold of Tableau Desktop. There used to be a separate method, but it looks like at the moment going the Instructor route is the most appropriate.

Conclusion

As always with Tableau, it is a bit of a hate/love relationship. If you want something custom (i.e. something that deviates from the options readily available), be prepared to spend some time on it, along with extensive googling. However, the beauty of Tableau can’t be denied, and there is something very satisfying about how quickly one can generate a scatter plot and filter it by a variety of metrics. If you have some coding experience already, I recommend learning how to use Tableau, as it is widely in demand but not often taught.

R vs. Python

R and Python are major programming languages in the world of data science and both happen to be open source (i.e. not proprietary and free to use). But which one should you use? As always, there is no clear-cut answer, but hopefully after reading this you will have a better idea of which one to pick.

Python

Python is an object-oriented programming (OOP) language. What does this mean? It means that programs are organized around objects, which bundle data together with the functions that operate on it, rather than around the logic alone. Other languages that fall under this categorization are Java, C++, etc. However, unlike Java, Python’s syntax is much simpler. Additionally, while Python is open source, it takes a “batteries included” approach: a large amount of functionality ships with the standard library to begin with. In terms of interface, Python is often run through the command line, but you can write in Jupyter notebooks or use an IDE such as PyCharm if you prefer that interface instead.
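To make the object-oriented idea a little more concrete, here is a small sketch. The `Incident` class and its fields are invented purely for illustration; the point is only that the data (attributes) and the behavior (methods) travel together in one object.

```python
class Incident:
    """A tiny class bundling data (attributes) with behavior (methods)."""

    def __init__(self, category, period):
        self.category = category
        self.period = period

    def is_daytime(self):
        """Return True when the incident happened during the day."""
        return self.period == "day"


incidents = [Incident("robbery", "evening"), Incident("mischief", "day")]
daytime = [incident.category for incident in incidents if incident.is_daytime()]
print(daytime)
```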

There is a high demand in industry for Python in areas such as web development, AI, and machine learning, because Python is more focused on deployment and production. Learning this language makes you a very attractive employee to industry. If you are interested in getting a job programming in Python, then you should learn the syntax, common modules (e.g. numpy), the differences between Python 2 and 3, and the uses of Python in industry.

R

Unlike Python, R is not primarily an object-oriented language but a more procedural, function-based one (that being said, R does have its own object systems, such as S3 and S4, and there are packages you can use to do other styles of object-oriented programming). R can be hard to learn at first for people not familiar with programming. However, it has an extensive and active online community and a wide variety of packages. The main interface used for R is RStudio, which allows easy viewing of data and variables. Additionally, R (and by extension R Shiny) is great at making visualizations.

R focuses on data analysis and statistics and, while certainly used in industry, is perhaps not valued as highly as Python. R tends to be more favored in academia or research and development.

Conclusion

So which one should you learn? While you can accomplish many of the same things in both languages, I would recommend learning both. Why?

R (via RStudio) allows easy navigation of data and has a strong selection of data visualization packages, making it a strong workhorse of research and development.

Python has a wide variety of applications in industry and is an important component of production-based work.

Like any other skill, you should keep developing your programming languages and keep up with new developments. If you don’t have the opportunity to learn a language via your current job, then take the time to develop it on your own. It will pay dividends later on in your career.

Working with JSON in python

It seems only fair that if we are going to talk about how to handle pseudo JSON files in R, we should also talk about how to handle them in Python. Similar to our previous example in R, we will use the JIRA API to pull some JSON-like data from JIRA.

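import json
from urllib.request import urlopen  # Python 3; Python 2 used urllib2.urlopen

url = "http://jira.atlassian.com/rest/api/latest/issue/JRA-9"
# Fetch the issue and parse the JSON response body into a Python object
with urlopen(url) as response:
    data = json.load(response)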

What we are doing here is essentially parsing the JSON text returned by the API into a Python object (a dictionary, in this case). If you want to explore other ways to represent JSON in a Python object, I recommend taking a look at this page: https://pythonspot.com/json-encoding-and-decoding-with-python/
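If you want to see what that parsing step gives you without hitting the network, the sketch below feeds a short, made-up body through the same `json.load` call. The sample text is a hypothetical and heavily trimmed stand-in for what the JIRA endpoint returns, not its real output.

```python
import io
import json

# Hypothetical, heavily trimmed stand-in for the body the JIRA endpoint returns.
SAMPLE_BODY = (
    '{"key": "JRA-9", "fields": {"summary": "Example issue", '
    '"labels": ["a", "b"], "status": {"name": "Closed"}}}'
)

# json.load reads from any file-like object, which is why it accepts the
# response returned by urlopen; io.StringIO plays that role here.
data = json.load(io.StringIO(SAMPLE_BODY))

print(type(data).__name__)               # JSON objects become dicts
print(sorted(data["fields"].keys()))     # nested objects become nested dicts
print(data["fields"]["status"]["name"])  # drill down one key at a time
```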

Let’s print this and see what we have.

print(data)

This isn’t the whole output, but you get the idea of the structure.

Keep in mind, this isn’t a real JSON file; it is simply in a JSON-like structure. If you don’t have nested objects, you might be able to convert this pretty easily to a csv (http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-json-to-csv-using-python/).

Unfortunately this is not one of those cases. If you have a situation like this, I recommend taking the time to understand how the fields are nested within each other because that will inform how you want to pull the information out and store it in a csv (and maybe you don’t even want all the fields). If you are looking for inspiration, I recommend the following references:

https://stackoverflow.com/questions/1871524/how-can-i-convert-json-to-csv

https://stackoverflow.com/questions/40588852/pandas-read-nested-json

Most of these methods require a bit of hard coding on which fields you want, which means your code won’t be very flexible if you try to use it for other applications. There is a promising answer under the first link though, that describes how to create a function that will flatten JSON objects. Since I think both these stack overflow questions have answers that provide a lot of detail about what you can do (more than what I can provide), I’m not going to put an example here on how to convert pseudo JSON (pulled using JIRA API) into a csv.
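For a rough idea of what such a flattening function can look like, here is a minimal sketch. It is my own simplified version rather than the exact code from the Stack Overflow answer, and the sample records are made up to loosely resemble JIRA issues. Nested keys are joined with dots and list items are keyed by their position.

```python
import csv
import io


def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts/lists into one dict with dotted keys."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            items.update(flatten(value, new_key, sep))
    else:
        items[parent_key] = obj  # scalar leaf: str, number, bool, or None
    return items


# Hypothetical records shaped loosely like JIRA issues.
issues = [
    {"key": "JRA-9", "fields": {"status": {"name": "Closed"}, "labels": ["a", "b"]}},
    {"key": "JRA-10", "fields": {"status": {"name": "Open"}, "labels": []}},
]

rows = [flatten(issue) for issue in issues]
columns = sorted({column for row in rows for column in row})

# An in-memory buffer keeps the sketch self-contained; an opened file works too.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because the column list is built from whatever keys show up, nothing is hard coded, but the trade-off is that wide or deeply nested data can produce a lot of sparse columns.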

Personally, I think dealing with pseudo JSON in R is easier than trying to deal with it in Python, especially if you want to visualize what the data looks like. There are good reasons, though, why you might want to work in Python instead of R. Next post I will discuss the differences between Python and R, and why you might want to use one over the other.

NOTE: You might also run into a situation where the data is actually a stream of JSON-like data. This might be a good resource if you have that situation:

https://stackoverflow.com/questions/19697846/how-to-convert-csv-file-to-multiline-json
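One common shape for such a stream is one JSON object per line rather than a single array. Assuming that is what you have (the sample stream below is made up), a minimal sketch for reading it is to parse each non-empty line on its own:

```python
import io
import json

# Hypothetical stream: one JSON object per line rather than a single array.
STREAM = '{"key": "JRA-9", "status": "Closed"}\n{"key": "JRA-10", "status": "Open"}\n\n'


def read_json_lines(handle):
    """Yield one parsed object per non-empty line of a file-like object."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_json_lines(io.StringIO(STREAM)))
print([record["key"] for record in records])
```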

Working with JSON in R

I’m not sure if it is just the type of projects I’ve been taking on lately, or if there is some cosmic force out to get me, but I’ve run into JSON (JavaScript Object Notation) a couple of times in the last month.

JSON has its roots in the late 1990s and early 2000s, and is built on two data structures: a collection of name/value pairs and an ordered list of values. Now you are probably saying at this moment “Let’s hurry it up to where I can get a usable data frame out of this,” but I think it is useful to review what this actually means. It means that objects can be nested as one of the values, so you can get a somewhat tree-like structure of data (for more detail on JSON in general, Louis Lazaris has a particularly good page that explains it: https://www.impressivewebs.com/what-is-json-introduction-guide-for-beginners/).
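The two structures and the nesting are easier to see with a tiny example. The sketch below is in Python purely for concreteness, and the sample document is invented: name/value pairs come through as a dict, ordered lists as a list, and a small helper counts how many levels deep the tree goes.

```python
import json

# Hypothetical document mixing the two structures: objects ({}) and arrays ([]).
TEXT = '{"issue": {"key": "JRA-9", "watchers": [{"name": "ana"}, {"name": "li"}]}}'


def depth(node):
    """Return how many levels of objects/arrays are nested inside node."""
    if isinstance(node, dict):
        return 1 + max((depth(value) for value in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(value) for value in node), default=0)
    return 0  # strings, numbers, booleans, and null are leaves


parsed = json.loads(TEXT)
print(type(parsed["issue"]).__name__)              # name/value pairs -> dict
print(type(parsed["issue"]["watchers"]).__name__)  # ordered list -> list
print(depth(parsed))
```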

I’m not really a fan of JSON and I’m not the only one with strong feelings, but it is the reality of the world that JSON exists and that we have to deal with data in that format.

Now R does have a few packages to help you parse out JSON data (mainly JSON, RJSONIO, and jsonlite, although it looks like JSON is not available for the most recent version of R). Now, it should be as simple as:

library(httr)
library(jsonlite)

rm(list=ls())

# Using URI path for JIRA's public issue tracker
json <- httr::GET("http://jira.atlassian.com/rest/api/latest/issue/JRA-9")
mydf <- fromJSON(json)

However if you are running this yourself, you will notice that instead of getting a data frame, you got an error.

Error: Argument 'txt' must be a JSON string, URL or file.

As far as I’ve been able to tell, this is a particular issue with JSON from JIRA, and not necessarily reflective of all JSON data. But it does mean we need to find a way around it. So I did what most people do. I googled. Google provided some possible (albeit painful) solutions, but what quickly became apparent is that this is an issue that is frustrating many people.

The conclusion I finally came to is that when using the JIRA API, the file I’m getting is not truly a JSON file. It is a pseudo JSON file. Thanks JIRA… This is why we can’t have nice things. To be fair to JIRA, apparently it is not uncommon to use an API to get data and end up with something that contains JSON lines but is not actually a JSON array. If you want to take a look at the Stack Overflow post I ended up basing my solution on: https://stackoverflow.com/questions/26519455/error-parsing-json-file-with-the-jsonlite-package

library(jsonlite)

rm(list=ls())

# Using URL path for JIRA's public issue tracker
df <- jsonlite::stream_in(file("https://jira.atlassian.com/rest/api/latest/issue/JRA-9"), verbose=FALSE)

Voilà! You have data. Albeit maybe somewhat messily named data, but we can’t win at everything. Unfortunately, this method is somewhat problematic if you need to use GET for username and password authentication. While I’m not sure it is the best solution, I have used the following method when pulling in pseudo JSON files using GET.

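# avoid hardcoding username/password for security purposes
# single quotes wrap the URL so the double-quoted placeholders inside it are legal R
query <- httr::GET(
    'http://host:port/rest/api/latest/search?jql=project="Project Name"&maxResults="num of Results you want to limit to"',
    httr::authenticate("username", "password", type = "basic")
)

# pull out the raw response body, convert it to text, then parse and flatten it
raw_body <- httr::content(query, as = "raw")
json <- jsonlite::fromJSON(rawToChar(raw_body))
df <- jsonlite::flatten(as.data.frame(json))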

As you can see, there were a lot of things jsonlite::stream_in was handling on the backend that we actually had to specify after pulling in data using httr::GET. Again, this is probably not the most elegant solution, but it gets the job done.

When not dealing with pseudo JSON files, I do recommend this link: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html

What is particularly nice about this page is that it walks through some steps that tend to get skipped over in most example solutions for converting JSON into a data frame.

Using JIRA API in R

Recently at work, a problem with reading data from JIRA via R came up and I thought I would write up some of my thoughts about it here.

For those of you who aren’t familiar with the term API, it stands for Application Programming Interface, and most websites have some form of API. You can use an API to talk directly to a server.

In this specific post, I want to talk directly with the JIRA servers. If you are reading this post then you are probably already familiar with JIRA, but if you aren’t, it is essentially software provided by Atlassian for “agile teams”. Putting aside my feelings about the overuse of the terms agile and teams together, let’s dig into the API.

Atlassian does actually provide a page on JIRA API.

The REST here means Representational State Transfer, which really is just fancy language describing the architectural style of the API. If you want to learn more about REST I recommend looking it up as I won’t go in more detail about it here.

Now, there are several ways to pull data into R using JIRA API (curl, GET, etc.). In this particular example I will use GET.

library(httr)

# Using URI path for JIRA's public issue tracker
json <- httr::GET("http://jira.atlassian.com/rest/api/latest/issue/JRA-9")

I like using GET because the option of using a username/password to authenticate access is built in to the function. For obvious reasons, though, you should avoid hardcoding username/password information into your code. If you are having trouble finding your company/institute’s specific URI path, you will want something like this:

http://host:port/rest/api/2

OR

http://host:port/rest/api/latest

You can also specify a particular project to look up and how many results to return (the JIRA API will automatically restrict you to 50 issues, but you can override this default).

http://host:port/rest/api/latest/search?jql=project="Project Name"&maxResults="num of Results you want to limit to"

Just make sure to replace the "Project Name" and "num of Results you want to limit to" (including the quotes) with the appropriate inputs.
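Spaces and quotes in a project name need to be URL-encoded before they go into the query string. As a small illustration of that step (sketched in Python, with a hypothetical helper name and the same host:port placeholder as above), the standard library can do the encoding for you:

```python
from urllib.parse import urlencode


def build_search_url(base_url, project, max_results=50):
    """Build a JIRA search URL with the JQL and result limit URL-encoded."""
    params = {
        "jql": f'project="{project}"',  # quotes let project names contain spaces
        "maxResults": max_results,
    }
    return f"{base_url}/rest/api/latest/search?{urlencode(params)}"


# "http://host:port" is the placeholder used above, not a real server.
url = build_search_url("http://host:port", "Project Name", max_results=100)
print(url)
```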

And voilà, you have used the JIRA API to bring data from JIRA into R. Probably right about now you are asking yourself, “Wait a minute, what is this? How am I supposed to use this?”. What you are looking at is JSON (JavaScript Object Notation) data. Unfortunately, pretty much whenever you use an API you will get JSON data. And frankly, as someone in analytics, I find it an awful format. While I understand there are some benefits to using it (lightweight format, easy to parse/generate, etc.), I like working in data frames and data tables. Which means that in the rare moments I do encounter JSON, it is the bane of my existence.

My next post will cover some of the best ways to convert the JSON data into a usable data frame.

EDIT: I’ve reviewed this post and found that the links I originally provided for username and password authentication are not the most helpful. I’ve updated accordingly and provided an example below.

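# single quotes wrap the URL so the double-quoted placeholders inside it are legal R
query <- httr::GET(
    'http://host:port/rest/api/latest/search?jql=project="Project Name"&maxResults="num of Results you want to limit to"',
    httr::authenticate("username", "password", type = "basic")
)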

Obviously, it isn’t a great idea to hardcode a password and username into your code. I would recommend only using this method if you are prompting the user for username and password input to use in this query (and not actually saving the input). If you need more ideas on how to use password authentication with JIRA API I would recommend checking the following post:

https://community.atlassian.com/t5/Jira-questions/Passing-username-and-password-via-URL-to-jira/qaq-p/16679
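To illustrate the "prompt, don't hardcode" idea in code, here is a minimal sketch in Python using only the standard library. The function names and the example.com URL are hypothetical, and the demonstration at the bottom uses dummy credentials and makes no network call. The point is that the username and password are asked for at run time and never written into the script.

```python
import base64
import getpass
from urllib.request import Request


def basic_auth_request(url, username, password):
    """Return a Request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})


def prompt_and_build(url):
    """Ask for credentials at run time so they never live in the source file."""
    username = input("JIRA username: ")
    password = getpass.getpass("JIRA password: ")  # typed input is not echoed
    return basic_auth_request(url, username, password)


# Demonstration with dummy credentials; the URL is hypothetical and nothing is sent.
request = basic_auth_request(
    "http://jira.example.com/rest/api/latest/search", "user", "secret"
)
print(request.get_header("Authorization"))
```

In real use you would call `prompt_and_build(...)` and pass the result to `urllib.request.urlopen`, ideally over https so the credentials are not sent in the clear.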