How well do our police departments represent the populations they serve?

by James Marchant \ for CMSC320 fall 2020

From minor interactions like passing an officer on the sidewalk to being arrested, we frequently encounter the police. Do you ever stop to think if an officer looks like you?

I live near a police department often have cars fly by my house, never really sure how far they're going. Recently, Karon Hylton-Brown was hit by a car a few blocks from me. In June 2020, protests arose around the country--around the world, in fact--spurred by the death of George Floyd and fueled by the death of Breonna Taylor and too many others.

Police brutality is nothing new. Rodney King's beating spurred unrest in LA in '92. Countless songs, films, plays invoke police injustice in some capacity.

Let's take a step back. I asked, earlier, "Do you ever stop to think if an officer looks like you?" The police represent us in the sense that they exist to protect and serve, but how well do they represent us? I would expect police departments to reflect pretty closely the demographics they serve. It turns out, though, that many don't live in the same city they serve.

I can't say if this is a problem--if it affects how the police police--but we can take a look at whether this disparity exists, and, if it does, where it's at its worst, its best, and in-between.

Our goal: compare the demographics of police departments across the US to the counties they serve.

Let's take a look, shall we?

Getting Started

We'll start with a few imports. We'll use a few standard ones--pandas, sklearn, seaborn, json, urllib--and a few less typical ones--uszipcode, plotly.

All of these will be used in a pretty straightforward manner:

  • pandas for maintaining/modifying dataframes
  • sklearn for using a RandomForestClassifier to predict missing data
  • seaborn for general graphing
  • json for using a GeoJSON
  • urllib for grabbing our GeoJSON from a url
  • uszipcode to map zipcodes to counties
  • plotly for graphing a choropleth
In [1]:
# importing as necessary
import pandas as pd                 # for dataframes
import seaborn as sea               # for graphing
import json                         # for graphing with plotly
#!pip install uszipcode             # if uszipcode needs to be installed, uncomment this line!
from uszipcode import SearchEngine                  # for mapping zipcode to county
from sklearn.ensemble import RandomForestClassifier # for predicting missing data
#!pip install plotly                                # if plotly needs to be installed, uncomment this line!
import plotly.express as px                         # for creating a choropleth
from urllib.request import urlopen                  # used to grab our json
In [2]:
# Setting a few things up for usage later
sea.set(rc={'figure.figsize':(20,15)})         # modifying the default seaborn figure size 
zip_search = SearchEngine(simple_zipcode=True) # setting up a search engine for zipcodes for usage a bit later

Population Data

We'll get started with population data. This comes courtesy of the US Census Bureau. We'll be, specifically, using this csv which contains population estimates by county for age, race, sex and hispanic origin.

This csv contains a lot--a lot--of information, and we'll narrow it down quite a bit. But! First, let's open it and take a peek.

In [3]:
pop_data_file = 'cc-est2019-alldata.csv'    # name of csv file

Well, before we open it, a quick aside: the encoding must be set to latin or else pandas cannot read it.

In [4]:
pop_data = pd.read_csv(pop_data_file, encoding='latin')
In [5]:
pop_data.head()
Out[5]:
SUMLEV STATE COUNTY STNAME CTYNAME YEAR AGEGRP TOT_POP TOT_MALE TOT_FEMALE ... HWAC_MALE HWAC_FEMALE HBAC_MALE HBAC_FEMALE HIAC_MALE HIAC_FEMALE HAAC_MALE HAAC_FEMALE HNAC_MALE HNAC_FEMALE
0 50 1 1 Alabama Autauga County 1 0 54571 26569 28002 ... 607 538 57 48 26 32 9 11 19 10
1 50 1 1 Alabama Autauga County 1 1 3579 1866 1713 ... 77 56 9 5 4 1 0 0 2 1
2 50 1 1 Alabama Autauga County 1 2 3991 2001 1990 ... 64 66 2 3 2 7 2 3 2 0
3 50 1 1 Alabama Autauga County 1 3 4290 2171 2119 ... 51 57 13 7 5 5 2 1 1 1
4 50 1 1 Alabama Autauga County 1 4 4290 2213 2077 ... 48 44 7 5 0 2 2 1 3 1

5 rows × 80 columns

Most of this information is not particularly useful to us. The column names are unclear without the key, and will have to be renamed.

So what are we looking for?

Later, we'll be merging this dataframe with one containing police demographic data, so we need to remove information that is not contained within that dataframe (or which does not have a logical pair in the police dataframe). We only want to keep county, state, and population information for the year 2016.

We'll be removing quite a bit of, honestly, relevant information, unfortunately. The Census breaks down racial classification quite a bit more than we can use and allows for overlap between categories (which is important! but not usable here). We'll remove categories which do not focus on one race alone (e.g. black, white, hispanic) to limit overlap between categories. Additionally, when choosing columns, we'll use the "Not Hispanic" variables in order to better differentiate between each category/eliminate overlap where we can.

Below, we'll modify our dataframe to contain only relevant categories, and rename categories to a more readable format.

A couple of notes:

  1. categories in all caps after being renamed will be dropped;
  2. (ST/CTY)FIPS information classifies counties by numerical code (useful later!);
  3. Amer_In_AK_Nat_X represents American Indians (the preferred term for people native to the continental US), and Alaska Native (the preferred term for people native to Alaska);
  4. and Nat_HI_Pac_Isl_X represents Native Hawaiians or people native to other Pacific Islands.
In [6]:
# narrowing pop_data to relevant columns only
pop_data = pop_data[['STATE', 'COUNTY', 'STNAME', 'CTYNAME', 'YEAR', 'AGEGRP', 'TOT_POP', 
                     'TOT_MALE', 'TOT_FEMALE', 'NHWA_MALE', 'NHWA_FEMALE', 'NHBA_MALE', 
                     'NHBA_FEMALE', 'NHIA_MALE', 'NHIA_FEMALE', 'NHAA_MALE', 'NHAA_FEMALE', 
                     'NHNA_MALE', 'NHNA_FEMALE', 'H_MALE', 'H_FEMALE']]

# renaming pop_data columns 
pop_data.columns = ['STFIPS', 'CTYFIPS', 'State', 'County', 'YEAR', 'AGEGRP', 'Total_Pop', 
                    'Total_Male', 'Total_Fem', 'White_Male', 'White_Fem', 'Black_Male', 
                    'Black_Fem', 'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male', 
                    'Asian_Fem', 'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male', 
                    'Hispanic_Fem']

Great! What's next?

We'll map the full state names to there abbreviation. This will assist in our merge later on, as the police dataframe contains abbreviations only.

After that, we'll concatenate FIPS information, drop redundant & no longer useful information.

First, let's map each state to its abbreviation:

In [7]:
# dictionary mapping state name to its abbreviation 
states = {'Alabama' : 'AL', 'Alaska' : 'AK', 'Arizona' : 'AZ', 
          'Arkansas' : 'AK', 'California' : 'CA', 'Colorado' : 'CO', 
          'Connecticut' : 'CT', 'Delaware' : 'DE', 'District of Columbia' : 'DC', 
          'Florida' : 'FL', 'Georgia' : 'GA', 'Hawaii' : 'HI', 'Idaho' : 'ID', 
          'Illinois' : 'IL', 'Indiana' : 'IN', 'Iowa' : 'IA', 'Kansas' : 'KS', 
          'Kentucky' : 'KY', 'Louisiana' : 'LA', 'Maine' : 'ME', 'Maryland' : 'MD',
          'Massachusetts' : 'MA', 'Michigan' : 'MI', 'Minnesota' : 'MN', 
          'Mississippi' : 'MS', 'Missouri' : 'MO', 'Montana' : 'MT', 
          'Nebraska' : 'NE', 'Nevada' : 'NV', 'New Hampshire' : 'NH', 
          'New Jersey' : 'NJ', 'New Mexico' : 'NM', 'New York' : 'NY', 
          'North Carolina' : 'NC', 'North Dakota' : 'ND', 'Ohio' : 'OH', 
          'Oklahoma' : 'OK', 'Oregon' : 'OR', 'Pennsylvania' : 'PA', 
          'Rhode Island' : 'RI', 'South Carolina' : 'SC', 'South Dakota' : 'SD', 
          'Tennessee' : 'TN', 'Texas' : 'TX', 'Utah' : 'UT', 'Vermont' : 'VT', 
          'Virginia' : 'VA', 'Washington' : 'WA', 'West Virginia' : 'WV', 
          'Wisconsin' : 'WI', 'Wyoming' : 'WY'}

# find_state returns the abbreviation for the state of the supplied row
def find_state(row):
    return states[row['State']]

# by applying the above method, we're effectively mapping each state to its abbreviation
pop_data['State'] = pop_data.apply(find_state, axis=1)

Next, let's convert STFIPS and CTYFIPS to a more useable format. We'll convert each column to strings and pad them with 0s. All that's left after that is to concatenate them.

In [8]:
# converting STFIPS and CTYFIPS to strings
pop_data['STFIPS'] = pop_data['STFIPS'].astype(str)
pop_data['CTYFIPS'] = pop_data['CTYFIPS'].astype(str)
In [9]:
# padding as appropriate with 0s. FIPS is a five digit code--think zipcode for 
# counties--in which the first two digits represent the state and the remaining 
# three represent the county. 
pop_data['STFIPS'] = pop_data['STFIPS'].str.pad(width=2, side='left', fillchar='0')
pop_data['CTYFIPS'] = pop_data['CTYFIPS'].str.pad(width=3, side='left', fillchar='0')
In [10]:
# now that they're padded, we can concatenate to get our full FIPS category
pop_data['FIPS'] = pop_data['STFIPS'] + pop_data['CTYFIPS']

Now that we're done with FIPS, let's move on to removing redundant information and rearranging the information a bit. We no longer need STFIPS or CTFIPS for example. First, we'll sort by state just to group similar information near each other. This is more of an aesthetic decision than anything else. Next, we'll drop years other than 2016 and AGEGRPs other than 0 (which represents all age groups). Finally, we'll drop the YEAR, AGEGRP, STFIPS, CTYFIPS columns and reset the index.

In [11]:
pop_data.sort_values(by=['State'], inplace=True) # sorting by state value
In [12]:
pop_data = pop_data[pop_data['YEAR'] == 9]       # dropping years other than 2016 (9)
In [13]:
pop_data = pop_data[pop_data['AGEGRP'] == 0]     # dropping rows for specific AGEGRPs 
pop_data = pop_data.drop(columns=['YEAR', 'AGEGRP', 'STFIPS', 'CTYFIPS'])  # dropping columns which are no longer useful
In [14]:
pop_data.reset_index(drop=True, inplace=True) # resetting the index

Let's take a look! We've gone from 80 columns to 18 (!) and have removed quite a lot of rows in the process as well.

We're not totally done with this dataframe and will add & modify columns below, but, for now, it's in a much more usable format.

Before we move on, let's talk about a few issues in how I've chosen to drop data. By removing identification categories in general, it is almost certainly the case that people are being removed/are no longer represented. Initially, I'd opted to have an 'other' category, but the overlap was too significant, making a not too useful category. Additionally, I have a bit of a problem thinking about race as a one or the other concept. No person can be classified as one category alone, and racial classifications can differ from country-to-country (in the US, race is largely color-oriented, whereas in Russia, it is often based on region).

While I am largely focusing on race, it is absolutely worth noting, too, sex/gender information is extremely limited. There's no inclusion of trans, genderqueer, or nonbinary people. Instead, people are categorized as either male or female. Some demographic information on trans people (from 2016, in fact), can be found here and here, and some relevant demographic information on both nonbinary and trans people can be found here (though I have not fully read this & cannot totally account for it).

Still, the information we can glean by separating into individual categories will be useful. It just needs to be contextualized with the understanding that it does not present the full picture.

In [15]:
pop_data.head()
Out[15]:
State County Total_Pop Total_Male Total_Fem White_Male White_Fem Black_Male Black_Fem Amer_In_AK_Nat_Male Amer_In_AK_Nat_Fem Asian_Male Asian_Fem Nat_HI_Pac_Isl_Male Nat_HI_Pac_Isl_Fem Hispanic_Male Hispanic_Fem FIPS
0 AK Baxter County 41131 19844 21287 18801 20226 76 41 132 120 76 139 6 11 469 462 05005
1 AK Benton County 259212 128503 130709 93498 97736 2363 2033 1885 2003 5545 4707 699 706 21744 20614 05007
2 AK Ashley County 20530 9958 10572 6870 7111 2331 2796 28 33 18 26 2 2 588 493 05003
3 AK Bradley County 10959 5350 5609 2950 3145 1401 1616 15 13 10 16 2 3 914 753 05011
4 AK Boone County 37259 18377 18882 17292 17770 73 54 131 142 80 132 15 12 426 476 05009

Police Data

We'll be using a tsv fils containing police demographic information by city/department. This file comes from the Bureau of Justics Statistics, though I've downloaded it from here--the variables lookup tab is quite handy.

It's worth noting, before diving into the data, that similar files exist for other years as well. These police surveys are held every 3-4 years. Unfortunately, the only other one which falls in the same range as the population data about is from 2013 and does not contain the same breadth of information as the 2016 file.

Onto the file we're working with. Quite a bit of information is contained within this file, so we'll have to narrow it down significantly. Some of what we're removing contains information on training and resource allocation which would be quite useful in analzying crime in genneral.

However, as we're looking primarily at demographic representation, we'll remove that information. I would encourage anyone interested to investigate further by looking through the variable list above.

As before, we'll start by creating a dataframe.

In [16]:
police_data_file = '37323-0001-Data.tsv'  # file name as I've saved it
In [17]:
police_data = pd.read_csv(police_data_file, sep='\t') # as this is a tsv, it's tab-separated
In [18]:
police_data.head()
Out[18]:
LEAR_ID AGENCYNAME CITY ZIPCODE STATE COUNTY FIPS ORI9 POPSERVED POPGROUP ... NEW_TOT_HIRES NEW_TOT_SEP FINALWGT_NTH_NTS FLAG1 FLAG2 FLAG3 FLAG4 FLAG5 FLAG6 FLAG7
0 635592 DAVIS POLICE DEPARTMENT DAVIS 95618 CA YOLO 6113 CA0570100 68111 5 ... -9 -9 7.533981 1 0 1 0 0 0 0
1 645110 WEST NEW YORK POLICE WEST NEW YORK 7093 NJ HUDSON 34017 NJ0091200 53343 5 ... -9 -9 1.183673 1 0 1 0 0 0 0
2 631270 WESTOVER POLICE DEPARTMENT WESTOVER 26501 WV MONONGALIA 54061 WV0310400 4243 8 ... -9 -9 7.607534 1 0 0 0 0 0 0
3 631316 BARABOO POLICE DEPARTMENT BARABOO 53913 WI SAUK 55111 WI0570200 12173 7 ... -9 -9 7.225225 1 0 0 0 0 0 0
4 631684 OREGON POLICE DEPARTMENT OREGON 53575 WI DANE 55025 WI0137400 3334 8 ... -9 -9 7.347126 1 0 0 0 0 0 0

5 rows × 434 columns

434 columns is far too many, but before we narrow it down, let's add one more. Pers_Total is as it sounds: the total number of police officers. It's not totally all encompassing--only full-time officers are included--but we'll be working exclusively with full-time officers (more on that below)

In [19]:
police_data['Pers_Total'] = police_data['PERS_MALE'] + police_data['PERS_FEMALE']  # adding a category for total full-time officers

Our dataframe contains information on both part-time and full-time police staff. It does not, however, provide information on what constitutes part-time employment. I, for example, work one day a week as a barista and am part-time, but I used to work five days a week and was still considered part-time because I worked fewer than 36 hours. Because of this range, I am opting to include only full-time officers; that is, officers for whom policing must be a primary form of employment, even if they have other jobs.

We'll remove most information which does not have a clear match in our population dataframe. Some additional columns will be kept but not used, specifically AGENCYNAME and CITY. I've included those simply because they are relevant and could be used to take a deeper dive into a some state's demographic data.

At this point, it's worth mentioning that not all police departments are included in this dataframe. I do no know why--some may have been too small, some may have chosen not to respond to the survey--as such, it is likely the case that some counties will not have full police demographic data.

Ideally, we'll have access to the primary/largest police departments in each county, but it could absolutely be the case that we do not have that information. From here on, we will have to operate under the assumption that our county police data (as we'll compute it) is truly representative of the county. The impact could be minimal or could be substantial, but it's difficult to know for sure.

In [20]:
# removing irrelevant categories
police_data = police_data[['AGENCYNAME', 'CITY', 'ZIPCODE', 'STATE', 'COUNTY','Pers_Total', 'PERS_MALE', 
                                     'PERS_FEMALE', 'PERS_WHITE_MALE', 'PERS_WHITE_FEM', 'PERS_BLACK_MALE', 'PERS_BLACK_FEM', 
                                     'PERS_HISP_MALE', 'PERS_HISP_FEM', 'PERS_AMIND_MALE', 'PERS_AMIND_FEM', 
                                     'PERS_ASIAN_MALE', 'PERS_ASIAN_FEM', 'PERS_HAWPI_MALE', 'PERS_HAWPI_FEM']]

# renaming into a more readable format/a format aligned with our pop_data dataframe
police_data.columns = ['Agency', 'City', 'Zipcode', 'State', 'County', 'Total_Police', 'Total_P_Male', 'Total_P_Fem', 
                            'P_White_Male', 'P_White_Fem', 'P_Black_Male', 'P_Black_Fem', 'P_Hispanic_Male', 
                            'P_Hispanic_Fem', 'P_Amer_IN_AK_Nat_Male', 'P_Amer_In_AK_Nat_Fem', 
                            'P_Asian_Male', 'P_Asian_Fem', 'P_Nat_HI_Pac_Isl_Male', 
                            'P_Nat_HI_Pac_Isl_Fem']
In [21]:
police_data.head()
Out[21]:
Agency City Zipcode State County Total_Police Total_P_Male Total_P_Fem P_White_Male P_White_Fem P_Black_Male P_Black_Fem P_Hispanic_Male P_Hispanic_Fem P_Amer_IN_AK_Nat_Male P_Amer_In_AK_Nat_Fem P_Asian_Male P_Asian_Fem P_Nat_HI_Pac_Isl_Male P_Nat_HI_Pac_Isl_Fem
0 DAVIS POLICE DEPARTMENT DAVIS 95618 CA YOLO 60 54 6 43 6 1 0 4 0 0 0 6 0 0 0
1 WEST NEW YORK POLICE WEST NEW YORK 7093 NJ HUDSON 110 103 7 24 0 1 0 78 7 0 0 0 0 0 0
2 WESTOVER POLICE DEPARTMENT WESTOVER 26501 WV MONONGALIA 12 11 1 11 0 0 1 0 0 0 0 0 0 0 0
3 BARABOO POLICE DEPARTMENT BARABOO 53913 WI SAUK 29 25 4 24 4 0 0 0 0 0 0 0 0 0 0
4 OREGON POLICE DEPARTMENT OREGON 53575 WI DANE 18 16 2 16 2 0 0 0 0 0 0 0 0 0 0

We're close! But not there yet! County data is inomplete--not all counties are represented and the formatting is off from what we'd like. Next, we'll quickly lookup county information using zip_search.by_zipcode(zipcode) which returns a zipcode object & grabbing its county field.

In [22]:
# find_county uses the by_zipcode function to search for
# information on the provided zipcode. As we only want the 
# county info, we'll return the county field--a string
#
# luckily, the county field is identical to the pop. data's
# county field, making it easy to merge on them later.
def find_county(row):
    return zip_search.by_zipcode(row['Zipcode']).county
In [23]:
police_data['County'] = police_data.apply(find_county, axis=1) # modifying county data/removing null county information 

For whatever reason, some of our rows have negative values for total police and for total female police officers. As such, we'll remove those from our dataframe. It is not clear why these values are negative, so I feel it's better to exclude them entirely.

In [24]:
# removing negative demographic information
police_data = police_data[(police_data['Total_Police'] >= 0) & (police_data['Total_P_Fem'] >= 0)]

The below county was not mapped by zipcode--the zipcode object had no county data--so we'll set it manually ourselves. With that our county information is complete!

In [25]:
police_data.at[2590, 'County'] = 'Salt Lake County'

Grouping Police Data by County

As mentioned above, the data here may not be fully indicative of actual county police demographic data, but without access to more information, we will operate under the assumption that it is.

Before we can merge this with our population data, we'll need to convert it from city-level data to county-level data. This is pretty simple. We'll groupby state (sorry) and county and sum across each group. The resultant dataframe will be county-level police data.

Once we have our properly grouped dataframe, we can drop the zipcode column, as we'll no longer be needing it.

In [26]:
pol_county_data = police_data.groupby(['State', 'County']).sum() # creating county-level dataframe
In [27]:
pol_county_data.drop(columns=['Zipcode'], inplace=True)          # dropping zipcode column

Getting Ready for the Merge

Now that we've got our narrowed our police dataframe to the county-level, we're just about ready to merge! But first, we'll modify our columns such that each category is now a percentage. This pretty simple. Each category is just divided by the total number officers (e.g. total female = total female / total officers gives a percentage rather than a number.). Doing this puts the information in a more easily digestible format and makes it easier to compare to our population data later on.

Next, we'll add total population for each race by adding the male and female subcategories of each race (e.g. total hispanic = hispanic female + hispanic male). After that, our police data will be just about ready.

In [28]:
# converting each demographic category to a percentage
pol_county_data['Total_P_Male'] = pol_county_data['Total_P_Male'] / pol_county_data['Total_Police']
pol_county_data['Total_P_Fem'] = pol_county_data['Total_P_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_White_Male'] = pol_county_data['P_White_Male'] / pol_county_data['Total_Police']
pol_county_data['P_White_Fem'] = pol_county_data['P_White_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Black_Male'] = pol_county_data['P_Black_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Black_Fem'] = pol_county_data['P_Black_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Hispanic_Male'] = pol_county_data['P_Hispanic_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Hispanic_Fem'] = pol_county_data['P_Hispanic_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Amer_In_AK_Nat_Male'] = pol_county_data['P_Amer_IN_AK_Nat_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Amer_In_AK_Nat_Fem'] = pol_county_data['P_Amer_In_AK_Nat_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Asian_Male'] = pol_county_data['P_Asian_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Asian_Fem'] = pol_county_data['P_Asian_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Nat_HI_Pac_Isl_Male'] = pol_county_data['P_Nat_HI_Pac_Isl_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Nat_HI_Pac_Isl_Fem'] = pol_county_data['P_Nat_HI_Pac_Isl_Fem'] / pol_county_data['Total_Police']

# adding population categories for each represented race
pol_county_data['P_White_Pop'] = pol_county_data['P_White_Male'] + pol_county_data['P_White_Fem']
pol_county_data['P_Black_Pop'] = pol_county_data['P_Black_Male'] + pol_county_data['P_Black_Fem']
pol_county_data['P_Hispanic_Pop'] = pol_county_data['P_Hispanic_Male'] + pol_county_data['P_Hispanic_Fem']
pol_county_data['P_Amer_In_AK_Nat_Pop'] = pol_county_data['P_Amer_In_AK_Nat_Male'] + pol_county_data['P_Amer_In_AK_Nat_Fem']
pol_county_data['P_Asian_Pop'] = pol_county_data['P_Asian_Male'] + pol_county_data['P_Asian_Fem']
pol_county_data['P_Nat_HI_Pac_Isl_Pop'] = pol_county_data['P_Nat_HI_Pac_Isl_Male'] + pol_county_data['P_Nat_HI_Pac_Isl_Fem']
In [29]:
pol_county_data.reset_index(inplace=True) # resetting the index--mostly for aesthetic reasons

For our final modification, we'll replace any na values with 0. If no information was reported for a category, I am comfortable assuming that there was nothing to report (though, of course, this may not be the case).

In [30]:
pol_county_data.fillna(0, inplace=True)  # missing data treated as 0

Ready or Not

Here comes the merge! Or not! Turns out, we need to modify our population data a bit too. No sweat! All we need to do is modify it in the same we modified our police county data above.

We'll convert all demographic columns to percentages and add population columns for each demographic. In addition to that, we'll add a column for the majority population of each county.

In [31]:
# converting each demographic column to percentage of total population
pop_data['Total_Male'] = pop_data['Total_Male'] / pop_data['Total_Pop']
pop_data['Total_Fem'] = pop_data['Total_Fem'] / pop_data['Total_Pop']
pop_data['White_Male'] = pop_data['White_Male'] / pop_data['Total_Pop']
pop_data['White_Fem'] = pop_data['White_Fem'] / pop_data['Total_Pop']
pop_data['Black_Male'] = pop_data['Black_Male'] / pop_data['Total_Pop']
pop_data['Black_Fem'] = pop_data['Black_Fem'] / pop_data['Total_Pop']
pop_data['Hispanic_Male'] = pop_data['Hispanic_Male'] / pop_data['Total_Pop']
pop_data['Hispanic_Fem'] = pop_data['Hispanic_Fem'] / pop_data['Total_Pop']
pop_data['Amer_In_AK_Nat_Male'] = pop_data['Amer_In_AK_Nat_Male'] / pop_data['Total_Pop']
pop_data['Amer_In_AK_Nat_Fem'] = pop_data['Amer_In_AK_Nat_Fem'] / pop_data['Total_Pop']
pop_data['Asian_Male'] = pop_data['Asian_Male'] / pop_data['Total_Pop']
pop_data['Asian_Fem'] = pop_data['Asian_Fem'] / pop_data['Total_Pop']
pop_data['Nat_HI_Pac_Isl_Male'] = pop_data['Nat_HI_Pac_Isl_Male'] / pop_data['Total_Pop']
pop_data['Nat_HI_Pac_Isl_Fem'] = pop_data['Nat_HI_Pac_Isl_Fem'] / pop_data['Total_Pop']
In [32]:
# adding population columns for each demographic
pop_data['White_Pop'] = pop_data['White_Male'] + pop_data['White_Fem']
pop_data['Black_Pop'] = pop_data['Black_Male'] + pop_data['Black_Fem']
pop_data['Hispanic_Pop'] = pop_data['Hispanic_Male'] + pop_data['Hispanic_Fem']
pop_data['Amer_In_AK_Nat_Pop'] = pop_data['Amer_In_AK_Nat_Male'] + pop_data['Amer_In_AK_Nat_Fem']
pop_data['Asian_Pop'] = pop_data['Asian_Male'] + pop_data['Asian_Fem']
pop_data['Nat_HI_Pac_Isl_Pop'] = pop_data['Nat_HI_Pac_Isl_Male'] + pop_data['Nat_HI_Pac_Isl_Fem']

In order to add our majority population column, we'll construct a list by iterating over each row in our dataframe. All we'll do is check which population is greatest for any given county and append that to our list.

The final list will be our majority population column, and will be added to our pop_data dataframe.

In [33]:
maj_pop = []

for index, row in pop_data.iterrows():
        if (row['White_Pop'] > row['Black_Pop'] 
            and row['White_Pop'] > row['Hispanic_Pop']
            and row['White_Pop'] > row['Amer_In_AK_Nat_Pop']
            and row['White_Pop'] > row['Asian_Pop']
            and row['White_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
            maj_pop.append('White')
        elif (row['Black_Pop'] > row['Hispanic_Pop'] 
              and row['Black_Pop'] > row['Amer_In_AK_Nat_Pop']
              and row['Black_Pop'] > row['Asian_Pop']
              and row['Black_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
            maj_pop.append('Black')
        elif (row['Hispanic_Pop'] > row['Amer_In_AK_Nat_Pop']
              and row['Hispanic_Pop'] > row['Asian_Pop']
              and row['Hispanic_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
            maj_pop.append('Hispanic')
        elif (row['Amer_In_AK_Nat_Pop'] > row['Asian_Pop']
              and row['Amer_In_AK_Nat_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
            maj_pop.append('American Indian or Alaska Native')
        elif row['Asian_Pop'] > row['Nat_HI_Pac_Isl_Pop']:
            maj_pop.append('Asian')
        else:
            maj_pop.append('Native Hawaiian or Pacific Islander')
            
pop_data['Maj_Pop'] = maj_pop    

A Glance Before the Merge

Let's take a look at our dataframes before we merge. We can see, now, how each column in one dataframe corresponds to another in the other dataframe. Our counties don't totally line up, either because certain counties aren't present or simply aren't visible.

Still, we can see, from our brief look at the head of our population data, that demographics tend to split pretty evenly between male and female, with a female tending to be the larger of the two demographics. Our police data (representing different counties, admittedly) skews largely male. Greater than 80 percent of police officers in each county shown in our sample are male. That's wild! We'll need to look deepr.

But first, take a look at Matanuska-Susitna Borough County in the pol_county_data. It is missing all racial demographic information. We could drop the row, as it will clearly be an outlier in our representations of race. However, its population is relatively small, and I would like to use its sex demographics. I am willing to let it skew our already skewed data a bit by letting it stay. Should I? Who knows?

In [34]:
pop_data.head()
Out[34]:
State County Total_Pop Total_Male Total_Fem White_Male White_Fem Black_Male Black_Fem Amer_In_AK_Nat_Male ... Hispanic_Male Hispanic_Fem FIPS White_Pop Black_Pop Hispanic_Pop Amer_In_AK_Nat_Pop Asian_Pop Nat_HI_Pac_Isl_Pop Maj_Pop
0 AK Baxter County 41131 0.482458 0.517542 0.457100 0.491746 0.001848 0.000997 0.003209 ... 0.011403 0.011232 05005 0.948846 0.002845 0.022635 0.006127 0.005227 0.000413 White
1 AK Benton County 259212 0.495745 0.504255 0.360701 0.377050 0.009116 0.007843 0.007272 ... 0.083885 0.079526 05007 0.737751 0.016959 0.163411 0.014999 0.039551 0.005420 White
2 AK Ashley County 20530 0.485046 0.514954 0.334632 0.346371 0.113541 0.136191 0.001364 ... 0.028641 0.024014 05003 0.681003 0.249732 0.052655 0.002971 0.002143 0.000195 White
3 AK Bradley County 10959 0.488183 0.511817 0.269185 0.286979 0.127840 0.147459 0.001369 ... 0.083402 0.068711 05011 0.556164 0.275299 0.152112 0.002555 0.002372 0.000456 White
4 AK Boone County 37259 0.493223 0.506777 0.464103 0.476932 0.001959 0.001449 0.003516 ... 0.011433 0.012775 05009 0.941034 0.003409 0.024209 0.007327 0.005690 0.000725 White

5 rows × 25 columns

In [35]:
pol_county_data.head()
Out[35]:
State County Total_Police Total_P_Male Total_P_Fem P_White_Male P_White_Fem P_Black_Male P_Black_Fem P_Hispanic_Male ... P_Asian_Fem P_Nat_HI_Pac_Isl_Male P_Nat_HI_Pac_Isl_Fem P_Amer_In_AK_Nat_Male P_White_Pop P_Black_Pop P_Hispanic_Pop P_Amer_In_AK_Nat_Pop P_Asian_Pop P_Nat_HI_Pac_Isl_Pop
0 AK Anchorage Municipality 388 0.853093 0.146907 0.695876 0.110825 0.038660 0.010309 0.043814 ... 0.010309 0.002577 0.0 0.018041 0.806701 0.048969 0.054124 0.023196 0.046392 0.002577
1 AK Juneau City and Borough 47 0.893617 0.106383 0.744681 0.063830 0.021277 0.021277 0.021277 ... 0.021277 0.021277 0.0 0.063830 0.808511 0.042553 0.021277 0.063830 0.042553 0.021277
2 AK Matanuska-Susitna Borough 25 0.960000 0.040000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 AK Northwest Arctic Borough 12 0.833333 0.166667 0.750000 0.083333 0.083333 0.000000 0.000000 ... 0.000000 0.000000 0.0 0.000000 0.833333 0.083333 0.000000 0.083333 0.000000 0.000000
4 AL Autauga County 81 0.938272 0.061728 0.851852 0.024691 0.086420 0.037037 0.000000 ... 0.000000 0.000000 0.0 0.000000 0.876543 0.123457 0.000000 0.000000 0.000000 0.000000

5 rows × 24 columns

C'Mon! Let's Merge Already

We are all set to merge. Using the pandas merge method, we'll do an inner join State and County. Any non-matched State/County pair will be dropped from our merged dataframe, but all matched data will remain. Neat.

This is where the fun begins (or continues).

In [36]:
merged_data = pd.merge(pop_data, pol_county_data, on=['State', 'County']) # merging on State and County columns
In [37]:
merged_data.head()
Out[37]:
State County Total_Pop Total_Male Total_Fem White_Male White_Fem Black_Male Black_Fem Amer_In_AK_Nat_Male ... P_Asian_Fem P_Nat_HI_Pac_Isl_Male P_Nat_HI_Pac_Isl_Fem P_Amer_In_AK_Nat_Male P_White_Pop P_Black_Pop P_Hispanic_Pop P_Amer_In_AK_Nat_Pop P_Asian_Pop P_Nat_HI_Pac_Isl_Pop
0 AK Northwest Arctic Borough 7691 0.540762 0.459238 0.067742 0.041997 0.004681 0.005201 0.411650 ... 0.000000 0.000000 0.0 0.000000 0.833333 0.083333 0.000000 0.083333 0.000000 0.000000
1 AK Anchorage Municipality 297249 0.509980 0.490020 0.307231 0.280169 0.029218 0.024794 0.037390 ... 0.010309 0.002577 0.0 0.018041 0.806701 0.048969 0.054124 0.023196 0.046392 0.002577
2 AK Matanuska-Susitna Borough 104119 0.522354 0.477646 0.420500 0.379431 0.006377 0.004312 0.032674 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 AK Juneau City and Borough 32412 0.510243 0.489757 0.335863 0.314390 0.007127 0.004720 0.054177 ... 0.021277 0.021277 0.0 0.063830 0.808511 0.042553 0.021277 0.063830 0.042553 0.021277
4 AL Dale County 49362 0.493659 0.506341 0.349398 0.344860 0.092318 0.106256 0.003505 ... 0.000000 0.000000 0.0 0.000000 0.250000 0.250000 0.000000 0.000000 0.000000 0.000000

5 rows × 47 columns

Visualization Fun

We obviously have quite a few more columns now, but that's OK. We've named our columns such that it's easy enough to remember how they correspond to one another--adding a 'P' says it's police info.

Now that we've assembled our information as we'd like, let's move on to visualizing.

After, of course, we add another category.

Sex Difference is a measure of how much each demographic's police representation differs from its population percentage. In order to calculate this, we'll take the absolute value of the difference between the percentage of total population and percentage of police officers and weight it by each demographics percentage of total population. We'll multiply be 100 to make it more readable.

Importantly, difference is always positive, as it's summed by demographic (in this case, male and female). If we allowed for negative differences, it would be possible for positive & negative differences to cancel one another out.

Another way to think about this is just as the difference in percentage of population (total vs police) weighted by total population percentage.

In [38]:
# adding a sex difference category, which can be used to visualize representation by gender. 
merged_data['Sex_Difference'] = ((100 * abs(merged_data['Total_Male'] - merged_data['Total_P_Male']) 
                                  * merged_data['Total_Male']) 
                                 + (100 * abs(merged_data['Total_Fem'] - merged_data['Total_P_Fem']) 
                                  * merged_data['Total_Fem']))

Visualizing Representation by Sex Demographic

Let's start with a scatterplot. We'll map points by FIPS county code and sex difference. This will be an extremely crowded x-axis. It is important to remember, too, that FIPS will be sorted by state & county. Points near each other on the x-axis are possibly in the same state.

We'll think more about our plot later.

First, let's plot.

We'll use seaborn to plot our scatterplot. I've imported seaborn as sea. I think it's typically imported as sns, but I read that as sneeze and prefer sea for that silly reason.

We'll set our x-axis as 'FIPS', y-axis as 'Sex_Difference' and hue as 'Maj_Pop' which we found earlier. By using it as a hue, we get a visual sense for how different majority populations are represented by their police using sex difference (and we'll do this again using race later).

In [39]:
# plotting sex difference by FIPS county code
sea.scatterplot(x = 'FIPS', y = 'Sex_Difference', hue='Maj_Pop', 
                data = merged_data).set(title = 'Sex Difference by County Code (FIPS)', 
                                        ylabel = 'Difference in Representation by Sex',
                                        xlabel = 'FIPS County Code')
Out[39]:
[Text(0, 0.5, 'Difference in Representation by Sex'),
 Text(0.5, 0, 'FIPS County Code'),
 Text(0.5, 1.0, 'Sex Difference by County Code (FIPS)')]

Let's do this again, but this time with 'State' as our x-axis instead of 'FIPS'. this will give us an easier to interpret plot, hopefully.

You'll notice that points align pretty closely between plots, as state order and FIPS align.

In [40]:
# plotting sex_difference by state
sea.scatterplot(x = 'State', y = 'Sex_Difference', hue='Maj_Pop', 
                data = merged_data).set(title = 'Sex Difference by State', 
                                        ylabel = 'Difference in Representation by Sex',
                                        xlabel = 'State')
Out[40]:
[Text(0, 0.5, 'Difference in Representation by Sex'),
 Text(0.5, 0, 'State'),
 Text(0.5, 1.0, 'Sex Difference by State')]

What does this show us?

Well, not too much relationally. There's no clear relationship between sex and representation. It seems that relatively few counties represent the sex demographics of their constituents well. More often than not, the difference is in the range of 30 to 50, demonstrating relatively poor representation.

This doesn't show us any information on things like population size or which demographic is being overly represented.

What we do know, though, is that this is pretty universal across racial demographics, at least in terms of majority population. We don't know how race impacts sex representation (e.g. are black women typically better represented than Hispanic women?), and we'll primarily be operating on sex and race distinctly. Still, it's important to keep in mind.

Working on Our Relations

Let's see if we can find any clear relationship between population and sex data. Instead of plotting difference against state let's plot corresponding categories against one another and see if we can learn anything useful.

For this, we'll use a lmplot, again with seaborn.

For our first plot, let's plot male police officers against male population percentage. Setting logistic to True will fit a logistic regression model, giving us a decent visual indicator of any relationship.

In [41]:
# plotting police male percentage by total male percentage
sea.lmplot(y = 'Total_P_Male', x = 'Total_Male', logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Male vs. Pop. Percentage Male', 
                                                                       ylabel = 'Police Percentage Male',
                                                                       xlabel = 'Population Percenage Male')
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x7f6cbeaae460>

Let's do the same as the above but with female population data for police and population.

In [42]:
# plotting police female percentage by total female percentage
sea.lmplot(y = 'Total_P_Fem', x = 'Total_Fem',  logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Female vs. Pop. Percentage Female', 
                                                                       ylabel = 'Police Percentage Female',
                                                                       xlabel = 'Population Percenage Female')
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x7f6cbee46d60>

Reflecting on Sex Demographic Visualizations

This is much more useful than our above scatter plots. Both plots show a pretty clear relationship in which, as each demographic increases, police representation (i.e. the percentage of police officers of the same demographic) increases. Notably. the range is wildly different for each. Female representation rarely ventures above 20%; conversely, male representation rarely ventures below 80%.

What does this tell us, and why is this important? Well, it tells us that women are far less likely to be well-represented by their police force. More often than not, men will police women (and men) regardless of population dynamics.

This is not new information as it turns out, but it is important. The information we have does not extend itself well to answering the why--why are men more likely to be police officers than women. It may have to do with on the job experiences, societal structures encouraging men to be authoritative and women to be subservient, or anything else (all of the above!). We do not have enough information here to make a reasonable assessment, but our finding is still important because it provides motive to research further.

If we know that a gap exists, it is reasonable to seek answers.

Visualizing Racial Demographics

Unlike with sex demographics, we won't create scatter plots by FIPS or by State (we'll have a better visualization tool later on). Instead, let's break racial demographics apart and plot lmplots in much the same way we did for sex.

The goal of these plots is to simply give us a relationship between police representation as a function of demographic population.

In [43]:
# plotting white police percentage by white population percentage
sea.lmplot(y = 'P_White_Pop', x = 'White_Pop',  logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage White vs. Pop. Percentage White', 
                                                        ylabel = 'Police Percentage White',
                                                        xlabel = 'Population Percenage White')
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1ac4ee0>
In [44]:
# plotting black police percentage by black population percentage
sea.lmplot(y = 'P_Black_Pop', x = 'Black_Pop', logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Black vs. Pop. Percentage Black', 
                                                        ylabel = 'Police Percentage Black',
                                                        xlabel = 'Population Percenage Black')
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1aa6550>
In [45]:
# plotting Hispanic police percentage by Hispanic population percentage
sea.lmplot(y = 'P_Hispanic_Pop', x = 'Hispanic_Pop', logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Hispanic vs. Pop. Percentage Hispanic', 
                                                        ylabel = 'Police Percentage Hispanic',
                                                        xlabel = 'Population Percenage Hispanic')
Out[45]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1ebe1f0>
In [46]:
# plotting American Indian or Alaska Native police percentage by American Indian or Alaska Native population percentage
sea.lmplot(y = 'P_Amer_In_AK_Nat_Pop', x = 'Amer_In_AK_Nat_Pop', logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage American Indian or Alaska Native vs. Pop. Percentage American Indian or Alaska Native', 
                                                        ylabel = 'Police Percentage American Indian or Alaska Native',
                                                        xlabel = 'Population Percenage American Indian or Alaska Native')
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1ebe520>
In [47]:
# plotting Asian police percentage by Asian population percentage
sea.lmplot(y = 'P_Asian_Pop', x = 'Asian_Pop', logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Asian vs. Pop. Percentage Asian', 
                                                        ylabel = 'Police Percentage Asian',
                                                        xlabel = 'Population Percenage Asian')
Out[47]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1da5eb0>
In [48]:
# plotting Native Hawaiian or Other Pacific Island police percentage by Native Hawaiian or Other Pacific Island population percentage
sea.lmplot(y = 'P_Nat_HI_Pac_Isl_Pop', x = 'Nat_HI_Pac_Isl_Pop', logistic=True, 
           data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Native Hawaiian or Other Pacific Island vs. Pop. Percentage Native Hawaiian or Other Pacific Island', 
                                                        ylabel = 'Police Percentage Native Hawaiian or Other Pacific Island',
                                                        xlabel = 'Population Percenage Native Hawaiian or Other Pacific Island')
Out[48]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1fe9c10>

What Would We Expect? What Do We See?

In a perfect representation, we would expect to see a one-to-one relationship between each police demographic and each population demographic. Instead, we see quite a bit of disparate representation relationships.

Starting with the relationship between white demographics, we can see that even when barely represented (if at all) in the total population, white people make up around 20% of the police force. In fact, as population increases, white police demographics tend to be significantly higher than their population representation, at least until the white population is a sizable majority. As with males in sex demographics, we can see that white people are disproportionately represented by the police.

In each other demographic, the relationship is pretty much the opposite. As their population increases, they are represented better by the police department, but the population percentage is almost always greater than the police percentage.

Black people are always underrepresented, sometimes significantly. Even in populations in which black people make up the vast majority of the population, they are underrepresented by their police forces.

Hispanic people, interestingly enough, have a near-linear relationship from around 60% total population on.

For American Indian or Alaska Native, Asian and Native Hawaiian or Other Pacific Island, we see mostly underrepresentation, with Asian people being the best represented.

This is not the one-to-one relationship we expected. In fact, unless you're white, you're likely to be underrepresented, sometimes significantly. This is a clear problem. Police departments have now been shown to both underrepresent non-male people and now non-white people as well. In essence, if you are not a white male, you are almost certainly poorly represented by your police force.

I Lied Early; Sorry About That

Yes, I lied. We are in fact going to plot race demographic differences in much the same way that we did sex demographic differences. We're doing this simply as a quick visualization tool--a way to see how much variation we see in representation before we move into the more interesting visualization tools.

First, let's create a 'Race_Difference' column in our merged_data DataFrame. This will be done the same way that we computed our 'Sex_Difference' column, using race population data instead of sex population data

In [49]:
merged_data['Race_Difference'] = ((100 * abs(merged_data['White_Pop'] - merged_data['P_White_Pop']) 
                                   * merged_data['White_Pop'])
                                + (100 * abs(merged_data['Black_Pop'] - merged_data['P_Black_Pop']) 
                                   * merged_data['Black_Pop'])
                                + (100 * abs(merged_data['Hispanic_Pop'] - merged_data['P_Hispanic_Pop']) 
                                   * merged_data['Hispanic_Pop'])
                                + (100 * abs(merged_data['Amer_In_AK_Nat_Pop'] - merged_data['P_Amer_In_AK_Nat_Pop']) 
                                   * merged_data['Amer_In_AK_Nat_Pop'])
                                + (100 * abs(merged_data['Asian_Pop'] - merged_data['P_Asian_Pop']) 
                                   * merged_data['Asian_Pop'])
                                + (100 * abs(merged_data['Nat_HI_Pac_Isl_Pop'] - merged_data['P_Nat_HI_Pac_Isl_Pop']) 
                                   * merged_data['Nat_HI_Pac_Isl_Pop']))

Now that we have that, let's take a moment to visualize race difference by FIPS and by state, as we did earlier.

In [50]:
# plotting race_difference by FIPS
sea.scatterplot(x = 'FIPS', y = 'Race_Difference', hue='Maj_Pop', 
                data = merged_data).set(title = 'Race Difference by State', 
                                        ylabel = 'Difference in Representation by Race',
                                        xlabel = 'State')
Out[50]:
[Text(0, 0.5, 'Difference in Representation by Race'),
 Text(0.5, 0, 'State'),
 Text(0.5, 1.0, 'Race Difference by State')]
In [51]:
# plotting race_difference by State
sea.scatterplot(x = 'State', y = 'Race_Difference', hue='Maj_Pop', 
                data = merged_data).set(title = 'Race Difference by State', 
                                        ylabel = 'Difference in Representation by Race',
                                        xlabel = 'State')
Out[51]:
[Text(0, 0.5, 'Difference in Representation by Race'),
 Text(0.5, 0, 'State'),
 Text(0.5, 1.0, 'Race Difference by State')]

What can we glean from this?

While racial differences certainly exist, there is less of an outright glaring disparity. But, as before, we'll need to dig a little deeper before claiming too much.

Let's plot race difference by demographic populations. This will hopefully give us greater insight into how well different demographics are represented. To do this, we'll go back to our old friend the lmlplot, using 'X_Pop' as our x-axis and 'Race_Difference' as our y-axis.

Remember our perfect representation world as we go through each plot. A positive correlation implies worsening representation as demographic population increases, a negative correlation the opposite.

In [52]:
sea.lmplot(x = 'Black_Pop', y = 'Race_Difference', data = merged_data, 
           height=10, aspect=2).set(title = 'Race Difference as Black Demographic Percentage Increases', 
                                    ylabel = 'Race Difference',
                                    xlabel = 'Black Demographic Percentage of Total Pop.')
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc37f7d00>
In [53]:
sea.lmplot(x = 'Hispanic_Pop', y = 'Race_Difference', data = merged_data, 
           height=10, aspect=2).set(title = 'Race Difference as Hispanic Demographic Percentage Increases', 
                                    ylabel = 'Race Difference',
                                    xlabel = 'Hispanic Demographic Percentage of Total Pop.')
Out[53]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc34b32b0>
In [54]:
sea.lmplot(x = 'White_Pop', y = 'Race_Difference', data = merged_data, 
           height=10, aspect=2).set(title = 'Race Difference as White Demographic Percentage Increases', 
                                    ylabel = 'Race Difference',
                                    xlabel = 'White Demographic Percentage of Total Pop.')
Out[54]:
<seaborn.axisgrid.FacetGrid at 0x7f6cbf6a5a30>
In [55]:
sea.lmplot(x = 'Amer_In_AK_Nat_Pop', y = 'Race_Difference', data = merged_data, 
           height=10, aspect=2).set(title = 'Race Difference as American Indian or Alaska Native Demographic Percentage Increases', 
                                    ylabel = 'Race Difference',
                                    xlabel = 'American Indian or Alaska Native Demographic Percentage of Total Pop.')
Out[55]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1cd7f10>
In [56]:
sea.lmplot(x = 'Asian_Pop', y = 'Race_Difference', data = merged_data, 
           height=10, aspect=2).set(title = 'Race Difference as Asian Demographic Percentage Increases', 
                                    ylabel = 'Race Difference',
                                    xlabel = 'Asian Demographic Percentage of Total Pop.')
Out[56]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc2453040>
In [57]:
sea.lmplot(x = 'Nat_HI_Pac_Isl_Pop', y = 'Race_Difference', data = merged_data, 
           height=10, aspect=2).set(title = 'Race Difference as Native Hawaiian or Other Pacific Island Demographic Percentage Increases', 
                                    ylabel = 'Race Difference',
                                    xlabel = 'Native Hawaiian or Other Pacific Island Demographic Percentage of Total Pop.')
Out[57]:
<seaborn.axisgrid.FacetGrid at 0x7f6cc1cd77c0>

Demographic Representation, Disparity, Race

Some of the above graphs are not quite as useful (namely Native Hawaiian or Other Pacific Island and American Indian or Alaska Native for which high population data is scarce), so I won't go onto their regression plots.

Let's sink a bit more deeply into our black, white, Hispanic and Asian plots. In all but our white plot, we can see that as the given demographic population increases, so too does racial disparity in police representation. The only demographic for which this is not the case is for white people, for whom racial disparity in representation decreases as population increases.

This suggests that non-white demographics are poorly represented even in counties with higher concentration of any given non-white demographic. Due to our previous graphs, this is a bit expected. Poorly represented people continue to be shown to be poorly represented. What we've gleaned, however, is that this does not change as a demographic's population increases--representation (and it's important to mention here that race difference does not necessarily show the difference of the race it's plotted against) in fact seems to worsen for non-white demographics.

Can We Map It? Let's Map It

Why use a scatter plot with FIPS when we can map it by county! In order to make this map a bit clearer, we'll first computer the standard deviation and mean of the Race_Difference category. We'll then add a Z_score column for race using those two values.

We'll use plotly to create a choropleth, mapping z_score to each county.

In [58]:
# computing mean and standard deviation using the .mean() and .std() functions
race_dif_mean = merged_data['Race_Difference'].mean()
race_dif_stddev = merged_data['Race_Difference'].std()

# z_score = (val - mean) / stddev
# rounding, casting as int for use later on
merged_data['Z_Score_Race'] = round((merged_data['Race_Difference'] - race_dif_mean) / race_dif_stddev).astype(int)  
In [59]:
# let's do the same thing for sex as well. 
# we'll compute the mean and standard deviation for sex_difference
sex_dif_mean = merged_data['Sex_Difference'].mean()
sex_dif_stddev = merged_data['Sex_Difference'].std()

# z_score for sex_difference
merged_data['Z_Score_Sex'] = round((merged_data['Sex_Difference'] - sex_dif_mean) / sex_dif_stddev).astype(int)  

Luckily for us, plotly included a county fips GeoJSON in their choropleth tutorial. We can use this to map our FIPS county codes against their provided JSON.

We'll open the url and load the json.

In [60]:
# opening the counties-fips geojson for use in mapping our choropleths 
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
In [61]:
# setting location to FIPS, color to our race z_score, color range -> std dev range
z_score_race_fig = px.choropleth(merged_data, geojson=counties, locations='FIPS', color='Z_Score_Race',
                           color_continuous_scale='OrRd',
                           range_color=(-3, 3),
                           scope='usa',
                           labels={'Z_Score_Race':'Z Score (Race)'}
                          )
In [62]:
z_score_race_fig.update_layout(title_text = 'Z Score for Race by County', margin={'r':0,'t':0,'l':0,'b':0})
z_score_race_fig.show()
In [63]:
# setting location to FIPS, color to our sex z_score, color range -> std dev range
z_score_sex_fig = px.choropleth(merged_data, geojson=counties, locations='FIPS', color='Z_Score_Sex',
                           color_continuous_scale='OrRd',
                           range_color=(-3, 3),
                           scope='usa',
                           labels={'Z_Score_Sex':'Z Score (Sex)'},
                          )
z_score_sex_fig.update_layout(title_text = 'Z Score for Sex by County', margin={'r':0,'t':0,'l':0,'b':0})
z_score_sex_fig.show()

Choropleths? So What!

Now that we can visualize Z score by county, we can visualize not only how representation differs but where!

One thing to keep in mind: our range starts at 0, so the lower the Z score, the better the representation.

Unfortunately, we're missing quite a bit of data, and that's no good. We can see that many counties are not represented--much of the map is empty. Let's predict their values, and map them.

A note: I could not add titles to my maps. I tried; I failed.

In [64]:
# y value for race predictions
y_race = round(merged_data['Z_Score_Race']).astype(int).values

# y value for sex predictions
y_sex = round(merged_data['Z_Score_Sex']).astype(int).values

# the same X will be used to predict both
X = merged_data[['Total_Pop', 'White_Male', 'White_Fem', 'Black_Male', 'Black_Fem', 
                 'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male', 'Asian_Fem', 
                 'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male', 
                 'Hispanic_Fem']].values

Missing Data? Where?

We'll be using a RandomForestClassifier to predict missing data.

Something very important: I am going to do something that I likely should not.

Because so many counties are missing, I am not going to split my classification into training and testing datasets. Instead, our full, present dataset will be passed, fitted to the classifier. The classifier is not being cross-validated, so it's tough to say how accurate it actually is--and, indeed, whether it's worth using at all.

Prior to this, I did use a train, test split but could not get greater than ~60 percent accuracy. I can't test how accurate my current model is because it is fit to all known values.

Anyway, let's dive-in, keeping in mind that the prediction could very well be inaccurate.

In [65]:
# random forest classifier, fitting first with race z scores
rf_clf = RandomForestClassifier(max_depth=50, random_state=0)
rf_clf.fit(X, y_race)
Out[65]:
RandomForestClassifier(max_depth=50, random_state=0)

Now, we need to know which counties are not present in our initial visualization.

To do that, we'll use isin--any FIPS present in both pop_data and merged_data will be dropped, and the resulting dataframe will be assigned to missing_data, our fancy new dataframe of unrepresented counties.

In [66]:
# constructing a dataframe of non-represented counties
missing_data = pop_data.drop(pop_data[pop_data['FIPS'].isin(merged_data['FIPS'])].index)

Now, we need to predict z scores for both race and sex, and add them to our missing data dataframe.

To do this, we'll simply pass appropriate columns to rf_clf.predict(), and store the predicted z scores in an appropriate column.

In [67]:
# predicting z_score for race
missing_data['Z_Score_Race'] = rf_clf.predict(missing_data[['Total_Pop', 'White_Male', 'White_Fem', 'Black_Male', 'Black_Fem', 
                 'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male', 'Asian_Fem', 
                 'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male', 
                 'Hispanic_Fem']].values)
In [68]:
rf_clf.fit(X, y_sex)  # refitting our classifier to our sex z scores

# predicting z_score for sex
missing_data['Z_Score_Sex'] = rf_clf.predict(missing_data[['Total_Pop', 'White_Male', 'White_Fem', 'Black_Male', 'Black_Fem', 
                 'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male', 'Asian_Fem', 
                 'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male', 
                 'Hispanic_Fem']].values)
In [69]:
# Now, let's add a predicted column to both merged_ and missing_data--false and true respectively.
# this will help us keep track of them once they're combined
merged_data['Predicted'] = 'false'
missing_data['Predicted'] = 'true'
missing_data.head()
Out[69]:
State County Total_Pop Total_Male Total_Fem White_Male White_Fem Black_Male Black_Fem Amer_In_AK_Nat_Male ... White_Pop Black_Pop Hispanic_Pop Amer_In_AK_Nat_Pop Asian_Pop Nat_HI_Pac_Isl_Pop Maj_Pop Z_Score_Race Z_Score_Sex Predicted
0 AK Baxter County 41131 0.482458 0.517542 0.457100 0.491746 0.001848 0.000997 0.003209 ... 0.948846 0.002845 0.022635 0.006127 0.005227 0.000413 White -1 1 true
1 AK Benton County 259212 0.495745 0.504255 0.360701 0.377050 0.009116 0.007843 0.007272 ... 0.737751 0.016959 0.163411 0.014999 0.039551 0.005420 White 0 0 true
2 AK Ashley County 20530 0.485046 0.514954 0.334632 0.346371 0.113541 0.136191 0.001364 ... 0.681003 0.249732 0.052655 0.002971 0.002143 0.000195 White 1 1 true
3 AK Bradley County 10959 0.488183 0.511817 0.269185 0.286979 0.127840 0.147459 0.001369 ... 0.556164 0.275299 0.152112 0.002555 0.002372 0.000456 White 2 1 true
4 AK Boone County 37259 0.493223 0.506777 0.464103 0.476932 0.001959 0.001449 0.003516 ... 0.941034 0.003409 0.024209 0.007327 0.005690 0.000725 White -1 1 true

5 rows × 28 columns

Admiring Our Predictions (ish)

Let's map our prediction maps. For this, we'll do exactly the same thing as we did above, using data from our missing_data dataframe in place of our merged_data dataframe.

One thing should be pretty obvious: some counties are darker than our scale. As it turns out, our classifier is not too hot. Because we're using population to predict z_score (as we lack adequate police data), populations with demographics not present in our merged_data dataframe are being predicted somewhat inaccurately.

Having said that, while the z_score may be wrong, we'll take the direction as correct. That is to say, a z score predicted to be positive will be assumed to be positive, and negative negative.

This is not quite as useful as a totally accurate z scor predition, but we can at least make an assumption about representation.

In [70]:
# mapping predicted race z score the same way as the actual z score was mapped above
pred_z_score_race_fig = px.choropleth(missing_data, geojson=counties, locations='FIPS', color='Z_Score_Race',
                           color_continuous_scale='OrRd',
                           range_color=(-3, 3),
                           scope='usa',
                           labels={'Z_Score_Race':'Pred. Z Score (Race)'}
                          )
pred_z_score_race_fig.update_layout(title_text = 'Predicted Z Score for Race by County', margin={'r':0,'t':0,'l':0,'b':0})
pred_z_score_race_fig.show()
In [71]:
# mapping predicted sex z score the same way as the actual z score was mapped above
pred_z_score_sex_fig = px.choropleth(missing_data, geojson=counties, locations='FIPS', color='Z_Score_Sex',
                           color_continuous_scale='OrRd',
                           range_color=(-3, 3),
                           scope='usa',
                           labels={'Z_Score_Race':'Pred. Z Score (Sex)'}
                          )
pred_z_score_sex_fig.update_layout(title_text = 'Predicted Z Score for Sex by County', margin={'r':0,'t':0,'l':0,'b':0})
pred_z_score_sex_fig.show()

Hey! You Got Predictions in My Results!

Now that we have predicted results for both sex and race z scores, let's concatenate our actual results with our predicted results. This will allow us to visualize the two together easily.

All we'll do use the use pandas' concat method to concatenate on our two dataframes into one dataframe. To do this, we'll create an array of the dataframes we wish to use. Let's limit our dataframes to just the information we want: FIPS, our two z scores, majority population and our predicted boolean.

In [72]:
# array of dataframes to be concatenated
concat = [missing_data[['FIPS', 'Z_Score_Race', 'Z_Score_Sex', 'Maj_Pop', 'Predicted']], merged_data[['FIPS', 'Z_Score_Race', 'Z_Score_Sex', 'Maj_Pop', 'Predicted']]]
concat_data = pd.concat(concat) # concatenated dataframe -> concat_data

Let's Map--Again

For our final task, now that we have our dataframe, let's create a few more choropleths and a box plot.

When looking at these, it's important to remember that the z score range is inaccurate. Our predicted values skewed the range somewhat. Instead, try to think about how it relates to its surrounding counties. Are all surrounding counties positive while its negative? Vice versa? Does it perfectly relate to its surrounding counties?

From our z score maps, we can get a decent idea of where in the country population demographics are well reflected by their police forces.

Let's create, too, a choropleth using majority demographic. Does this line up at all with our z score maps? And does that make sense?

In [73]:
# mapping actual & predicted race z score as above
concat_z_score_race = px.choropleth(concat_data, geojson=counties, locations='FIPS', color='Z_Score_Race',
                           color_continuous_scale='OrRd',
                           range_color=(-3, 3),
                           scope='usa',
                           labels={'Z_Score_Race':'Z Score (Race): Pred. & Actual'}
                          )
concat_z_score_race.update_layout(title_text = 'Predicted & Actual Z Score for Race by County', margin={'r':0,'t':0,'l':0,'b':0})
concat_z_score_race.show()
In [74]:
# mapping actual & predicted race z score as above
concat_z_score_sex = px.choropleth(concat_data, geojson=counties, locations='FIPS', color='Z_Score_Sex',
                           color_continuous_scale='OrRd',
                           range_color=(-3, 3),
                           scope='usa',
                           labels={'Z_Score_Sex':'Z Score (Sex): Pred. & Actual'}
                          )
concat_z_score_sex.update_layout(title_text = 'Predicted & Actual Z Score for Sex by County', margin={'r':0,'t':0,'l':0,'b':0})
concat_z_score_sex.show()
In [75]:
# mapping majority demographic
maj_pop = px.choropleth(concat_data, geojson=counties, locations='FIPS', color='Maj_Pop',
                           range_color=(-3, 3),
                           scope='usa',
                           labels={'Maj_Pop' : 'Majority Demographic Population'}
                          )
maj_pop.update_layout(title_text = 'Majority Demographic by County', margin={'r':0,'t':0,'l':0,'b':0})
maj_pop.show()

Examining Our Final Choropleths

Notice anything?

First, our sex z score map is not too useful. There's not a whole lot of variation & the tendency is very much for police sex demographics to poorly represent population sex demographics.

But moving on to our race map, we can see somewhat plainly that cities with non-white majority populations tend to be poorly represented in their police force demographics. A problem, to be sure.

If we accept that this prediction is accurate (and as I've mentioned, it is not without its problems), it is clear from these maps that non-white demographics are almost always poorly represented, but are especially poorly represented in cities in which they are the majority population.

Let's see if a box plot can't help us visualize this a bit.

We'll call up our old friend seaborn, and, if they're not busy, make a boxplot using 'Maj_Pop' and 'Z_Score_Race'

In [76]:
# box plot -> grouping race z score by majority demogrpahic
sea.boxplot(x = 'Maj_Pop', y = 'Z_Score_Race', 
            data = concat_data).set(title = 'Race Z Score by Majority Population', 
                                    ylabel = 'Race Z Score',
                                    xlabel = 'Majority Population')
Out[76]:
[Text(0, 0.5, 'Race Z Score'),
 Text(0.5, 0, 'Majority Population'),
 Text(0.5, 1.0, 'Race Z Score by Majority Population')]

This boxplot is just another way of visualizing the disparity between representation and majority population. Majority white populations tend to be much better represented than any other population, with only majority Hispanic populations having some similarly well-represented police departments.

Majority black, Asian and American Indian or Alaska Native Populations tend to be poorly represented, especially American Indian or Alaska Native.

It is, however, tough to extrapolate too much from this boxplot, as it uses z scores outside of the range (i.e. its scale is inaccurate).

Still, it's a nice, concise visual representation--scale aside--displaying neatly the disparity in representation.

Conclusion

It is tempting to make the blanket statement, "If you are not white and not a man, your police department does not represent you." We've seen that police departments are overwhelmingly male, and that white populations tend to be better or over represented by their police departments.

Let me qualify my initial statement, though. By represent, I can only speak demographically. Nothing here is evidence enough to make a grander statement about police intention. It can, however, be used as a backdrop to police violence, in which case we could ask if poor representation correlates to higher concentrations of police violence.

Why is representation important? As long as they're doing their job, what else is necessary?

These are big questions! Representation shapes how we view not only ourselves but one another. And, okay, maybe that doesn't totally relate to the police--I hear you. If the police poorly reflect the community they serve, we might expect to see an emergence of (or increase in) in-group preference. The police, ideally, are an unbiased group, applying the laws exactly as much as necessary. Biases exist in all of us, sure, but we know that people in authority can wreak havoc thoughtlessly. How can you believe the police have your best intentions in mind when no or few police officers seem to represent you?

These are questions which I would need more information to answer. I can say, thought, that there is a clear problem in demographic representation within our police departments. If you have a New York Times subscription, this may be relevant (I do not, so cannot verify for certain).

I can say that we need more complete data on our police departments. While the data we used contained quite a bit of information, much was still missing with no clear indication of why.