Introduction

UPDATED April 2019

I obtained the Chicago Police Department’s gang database from ProPublica. It’s a dataset of people the CPD has classified as gang members. However, it has been found to be full of errors and inaccuracies, including implausible entries such as gang members listed as still active at the ripe old age of 132.

My questions for the data

What are the differences between the 2 gang databases provided? One file is dated November 2017 and the other is from March 2018, and it seems they contain slightly different information.
Glimpse and outline the variables in each dataset
Compare them to see where they overlap
Where in Chicago are the most folks classified as being in a gang, whether by arrest or by officer classification?
Map which police beats contain the most folks classified as gang members, whether by arrest or by officer classification.

Method

I made this gang database into a choropleth map of Chicago, which displays shading based on the number of individuals classified as gang members in each of Chicago’s police beats (a police beat is basically a patrol area assigned to a group of officers).

This map provides an accessible way to display how many people were put into the database and the police beat under which they were classified.

To see the map and conclusions, skip to the bottom! What follows is my data analysis process.

Data processing

Read in the two CPD Gang database files & the police beat shape file

Read in the first gang data file from March 2018 (CPD gang database 3-18.xlsx).

The data goes from March 1999 – March 2018.
This dataset contains:
Age
Race
Gang Name
“O Beat” (I’m taking this to mean police officer beat)
Date the entry was created (ranges from 1999 - 2018)
I’ve named it df_gang_318
This dataset APPEARS to be about officer-made gang classifications, since no mention of arrest is here.

Read in the second gang data file from November 2017 (CPD gang database 11-17.xlsx).

The data goes from 1984 – 2017.
This dataset contains:
Age
Race
Gang Name
Faction Name
Beat of First Arrest
First Arrest (this is a date that ranges from 1984 - 2017)
I’ve named it df_gang_1117
This dataset APPEARS to be about gang classifications coming from actual officer-made arrests.
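
One way to see where the two files overlap is to compare their column names directly. The sketch below uses two tiny stand-in data frames rather than the real spreadsheets. Apart from O_BEAT and BEAT_FIRST_ARREST, the column names are guesses for illustration and may not match the real headers. str() is the base R counterpart to dplyr's glimpse().

```r
# Toy stand-ins for df_gang_318 and df_gang_1117.
# Only O_BEAT and BEAT_FIRST_ARREST are column names taken from this post;
# every other name and all values are hypothetical placeholders.
toy_318 <- data.frame(
  AGE = c(25, 31),
  RACE = c("A", "B"),
  GANG_NAME = c("X", "Y"),
  O_BEAT = c("0111", "1024"),
  CREATED_DATE = as.Date(c("2001-05-01", "2017-11-30")),
  stringsAsFactors = FALSE
)
toy_1117 <- data.frame(
  AGE = c(25, 40),
  RACE = c("A", "C"),
  GANG_NAME = c("X", "Z"),
  FACTION_NAME = c("F1", "F2"),
  BEAT_FIRST_ARREST = c("0111", "0423"),
  FIRST_ARREST = as.Date(c("1999-03-15", "2010-07-04")),
  stringsAsFactors = FALSE
)

# Columns the two files share, and columns unique to each
shared_cols  <- intersect(names(toy_318), names(toy_1117))
only_in_318  <- setdiff(names(toy_318), names(toy_1117))
only_in_1117 <- setdiff(names(toy_1117), names(toy_318))

str(toy_318)        # quick outline of one data frame's variables
print(shared_cols)
print(only_in_318)
print(only_in_1117)
```

On the real data, the same three calls on names(df_gang_318) and names(df_gang_1117) would list the shared and file-specific variables.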

Then we read in the shapefile of the CPD police beats, so we can map them.

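# load the packages used throughout this analysis
library(readxl)   # read_excel()
library(dplyr)    # filter(), inner_join(), group_by(), summarize(), %>%
library(stringr)  # str_pad()
library(sf)       # st_read()
library(leaflet)  # leaflet(), addPolygons(), colorNumeric()

#read in both databases
df_gang_318 <- read_excel("CPD Gang Data/CPD gang database 3-18.xlsx", sheet=1)
df_gang_1117 <- read_excel("CPD Gang Data/CPD gang database 11-17.xlsx", sheet=1)

# Set up the Chicago Police Beats Shapefile, downloaded from the Chicago data portal
map_filepath <- "Boundaries_Police Beats/geo_export_CPD_BEATS.shp"
cpd_beats <- st_read(map_filepath)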

## Reading layer `geo_export_CPD_BEATS' from data source `/Users/PrincessO/code/CPD Gang database/Boundaries_Police Beats/geo_export_CPD_BEATS.shp' using driver `ESRI Shapefile'
## Simple feature collection with 277 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs

After reading in the data, I’m cleaning it below by dropping any rows that don’t have a police beat listed for the officer who created the entry or the location of the arrest, since we can’t map police beats that aren’t in the dataset.

#Cleaning the data -- take out any rows with NA for a police beat since we can't map those 
df_gangs_318_ONLYBEATS <- filter(df_gang_318, !is.na(O_BEAT))
df_gangs_1117_ONLYBEATS <- filter(df_gang_1117, !is.na(BEAT_FIRST_ARREST))

Interestingly, after filtering out the nulls, I found that there seem to be a lot of rows/entries that don’t have a police beat.

For 3-18, removing rows without a police beat made the dataset go from 128836 rows to 89787 rows.
For 11-17, removing rows without a police beat made the dataset go from 128037 rows to 89771 rows.

So each dataset has roughly 38,000 to 39,000 entries with no police beat (39,049 in 3-18 and 38,266 in 11-17), which is about 30% of each file. This may be something to look into later!
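
The arithmetic behind those figures can be checked in a few lines. This is only a sketch that re-derives the differences from the row counts quoted above; the vector names are made up for illustration.

```r
# Row counts reported above, before and after dropping rows with no beat
rows_before <- c("3-18" = 128836, "11-17" = 128037)
rows_after  <- c("3-18" = 89787,  "11-17" = 89771)

# How many rows were dropped from each file, and what share that is
rows_dropped  <- rows_before - rows_after
share_dropped <- round(100 * rows_dropped / rows_before, 1)

print(rows_dropped)   # 3-18: 39049, 11-17: 38266
print(share_dropped)  # percent of each file that had no beat
```

On the real data frames, sum(is.na(df_gang_318$O_BEAT)) and sum(is.na(df_gang_1117$BEAT_FIRST_ARREST)) should give the same dropped-row counts.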

Taking a quick look at the Police beats shapefile

To start, let’s take a look at the CPD Police Beats shapefile. This preliminary map includes a popup of the police beat, and will be the base for further exploration and mapping.

#this is just a quick map to map the police beats 

cpd_beats %>% 
  leaflet() %>% 
  addTiles() %>% 
  addPolygons(popup=~beat_num)

Standardizing/Cleaning the police beat number

Before we join the gang database to the location of the police beats, we need to standardize the police beat number. Both databases contain police beat numbers that are either 3 or 4 digits long, while the shapefile uses 4 digits throughout. So this step just adds a “0” to the beginning of any 3-digit number, to keep everything uniform (Example: Police Beat “111” becomes “0111” after this step).

# it looks like both dfs need a padded 0 in front of some of the numbers to make them 4 digits instead of 3. So let's clean that up before we join it with cpd_beats (which is all 4 digits)

df_gangs_318_ONLYBEATS$O_BEAT <- str_pad(df_gangs_318_ONLYBEATS$O_BEAT, 4, pad = "0")
df_gangs_1117_ONLYBEATS$BEAT_FIRST_ARREST <- str_pad(df_gangs_1117_ONLYBEATS$BEAT_FIRST_ARREST, 4, pad = "0")
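
A quick way to confirm the padding did what was intended is to check the string lengths afterwards. The sketch below uses a made-up vector of beats and a base R stand-in for str_pad() so that it runs without stringr. pad_beat is a hypothetical helper, not something from the original analysis.

```r
# Hypothetical helper: left-pad beat numbers with zeros to a fixed width.
# For this use case it mirrors str_pad(x, 4, pad = "0"), in base R.
pad_beat <- function(x, width = 4) {
  x <- as.character(x)
  n_missing <- pmax(0, width - nchar(x))     # zeros needed per value
  out <- paste0(strrep("0", n_missing), x)
  out[is.na(x)] <- NA_character_             # keep missing beats missing
  out
}

toy_beats <- c(111, 1024, 423, 2533)  # made-up mix of 3- and 4-digit beats
padded <- pad_beat(toy_beats)

print(padded)
print(table(nchar(padded)))  # every value should now be 4 characters long
```

Running table(nchar(...)) on the real O_BEAT and BEAT_FIRST_ARREST columns after padding would flag any values that are still not 4 characters.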

Joining the dataframes and creating a summary count for the map

Now that both datasets have a uniform police beat number, I can join them to the location of the police beats and map them out. I then created a new dataframe from the joined 11-17 data with a summary of the number of arrests per police beat (num_arrests).

Steps below include:

Joining/combining the shapefile to the datasets
Counting the number of classifications/arrests from each police beat

#inner join the two data frames. Used an inner join to drop any rows where the two police beat numbers didn't match up

beats_plus_gangs_318 <- inner_join(cpd_beats, df_gangs_318_ONLYBEATS, by=c('beat_num'='O_BEAT'))
beats_plus_gangs_1117 <- inner_join(cpd_beats, df_gangs_1117_ONLYBEATS, by=c('beat_num'='BEAT_FIRST_ARREST'))

Looks like the inner join dropped a few thousand rows from each database because they didn’t match up. I’m making another note to investigate what got dropped, since I expected the cleaning to make every beat match the cpd_beats file. Maybe a few of the beats were entered wrong/user error? But an inner join dropping anywhere from 1k - 5k rows from a dataset of roughly 90k rows doesn’t seem too bad or alarming.
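
One way to look at what the inner join dropped is to list the rows whose beat has no match in the shapefile. The sketch below uses made-up beat values and base R; dplyr's anti_join() does the same kind of filtering on real data frames. The toy_ object names are placeholders.

```r
# Toy stand-ins: toy_shape_beats plays the role of cpd_beats$beat_num,
# toy_gang$O_BEAT the role of df_gangs_318_ONLYBEATS$O_BEAT.
# All values are made up.
toy_shape_beats <- c("0111", "0423", "1024")
toy_gang <- data.frame(
  O_BEAT = c("0111", "0423", "9999", "1024", "9999", "0000"),
  stringsAsFactors = FALSE
)

# Rows whose beat has no match in the shapefile; an inner join drops these
unmatched <- toy_gang[!(toy_gang$O_BEAT %in% toy_shape_beats), , drop = FALSE]

# Which beat codes are responsible, most frequent first
unmatched_counts <- sort(table(unmatched$O_BEAT), decreasing = TRUE)

print(nrow(unmatched))
print(unmatched_counts)
```

If a handful of codes account for most of the unmatched rows, that would point to retired or mistyped beat numbers rather than a problem with the padding step.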

#get the number of arrests per police beat and create a new data frame (CPD_gang_map2) with just that info

CPD_gang_map2 <- beats_plus_gangs_1117 %>%
  group_by(beat_num) %>%
  summarize(num_arrests=n())
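
To pull out the beats with the highest counts from a summary like this, the counts can be sorted in descending order. The sketch below does this in base R on made-up beat values; it is an illustration of the idea, not the code used to produce the figures in this post.

```r
# Made-up beat values standing in for the joined 11-17 rows (one per entry)
toy_joined <- data.frame(
  beat_num = c("0423", "1024", "0423", "0824", "0423", "1024"),
  stringsAsFactors = FALSE
)

# Count entries per beat, the base R analogue of group_by() + summarize(n())
counts <- as.data.frame(table(beat_num = toy_joined$beat_num),
                        responseName = "num_arrests",
                        stringsAsFactors = FALSE)

# Sort so the beats with the most entries come first, then keep the top few
counts <- counts[order(-counts$num_arrests), ]
top_beats <- head(counts, 3)
print(top_beats)
```

With dplyr, adding arrange(desc(num_arrests)) after the summarize() step should give the same ordering.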

Plotting the CPD data

Darker blue areas indicate more gang arrests, as recorded in the 11-17 database. Select a police beat to view a popup of the beat number and the number of arrests.

#set the color palette as Blues.

col_pal <- colorNumeric("Blues", domain=CPD_gang_map2$num_arrests)

#Add the pop-up text with Police Beat and Number of Arrests in that area
popup_sb <- paste0("Police Beat: ", as.character(CPD_gang_map2$beat_num), "<br>Number of arrests: ", as.character(CPD_gang_map2$num_arrests))


# build the choropleth map with leaflet

leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(-87.628598, 41.855372, zoom = 10) %>% 
  addPolygons(data = CPD_gang_map2, 
              fillColor = ~col_pal(CPD_gang_map2$num_arrests), 
              fillOpacity = 0.7, 
              weight = 0.2, 
              smoothFactor = 0.2, 
              popup = ~popup_sb) %>%
  addLegend(pal = col_pal, 
            values = CPD_gang_map2$num_arrests, 
            position = "bottomright", 
            title = "Number of Arrests")

Note: I’ve used “Number of Arrests” in the map I created above, since I’ve only used data from the 11-17 database to create the map. The 11-17 database specifically refers to “arrests,” while the 3-18 database merely has the date the person was entered into the database and does not mention arrest as the classification means, so it’s unclear if they were entered because of an arrest or were classified as a gang member by other means.

Conclusions

I determined that the 11-17 database coded gang classification by the date of arrest, while the 3-18 database only coded gang classification by the date the record was created.
I was able to summarize the data to determine which police beats reported the most gang member arrests in the 11-17 database. Some notably high arrest levels:
Police Beat 1024 had 966 arrests
Police Beat 0423 had 983 arrests
Police Beat 0824 had 806 arrests
I chose a map view as an easy way to look at the data and see which Chicago areas have the most gang arrests entered into the database. Each police beat is mapped and can be selected to see the number of arrests it contains.

Future exploration

About 39k entries in each database didn’t have a police beat entered. It would be interesting to explore these to see if there is a reason or pattern to why someone is in the database without a corresponding beat/area in which they were classified. Does it match up with their gang or faction name? (Admittedly, looking at this is pretty easy, but I just ran out of time on this project!)
I’d like to overlay a map of the 3-18 database onto the current map of the 11-17 database to explore if there are significant differences in gang classifications between the two databases.
I’d love to make a few charts of the differences between the two databases, maybe merging them on similar dates to see if they follow the same rates of classifications by day or month.
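
As a starting point for the first idea above, the share of entries with no beat can be tabulated by gang name. This is a minimal sketch on toy data. GANG_NAME is a hypothetical column name and may not match the real spreadsheet header. It would need to run on the unfiltered data frame (before the rows without a beat are dropped).

```r
# Toy data; GANG_NAME is a hypothetical column name, O_BEAT is from this post
toy_gang <- data.frame(
  GANG_NAME = c("X", "X", "Y", "Y", "Y", "Z"),
  O_BEAT    = c("0111", NA, NA, NA, "1024", "0423"),
  stringsAsFactors = FALSE
)

# Share of each gang's entries that have no beat recorded
missing_share <- tapply(is.na(toy_gang$O_BEAT), toy_gang$GANG_NAME, mean)
missing_share <- sort(missing_share, decreasing = TRUE)

print(round(missing_share, 2))
```

A gang or faction with a much higher missing share than the rest would be a natural place to start digging.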

Mapping the CPD Gang Database

Princess Ojiaku

8/26/2018