by James Marchant \ for CMSC320 fall 2020
From minor interactions like passing an officer on the sidewalk to being arrested, we frequently encounter the police. Do you ever stop to think if an officer looks like you?
I live near a police department and often have cars fly by my house, never really sure how far they're going. Recently, Karon Hylton-Brown was hit by a car a few blocks from me. In June 2020, protests arose around the country--around the world, in fact--spurred by the death of George Floyd and fueled by the death of Breonna Taylor and too many others.
Police brutality is nothing new. Rodney King's beating spurred unrest in LA in '92. Countless songs, films, plays invoke police injustice in some capacity.
Let's take a step back. I asked, earlier, "Do you ever stop to think if an officer looks like you?" The police represent us in the sense that they exist to protect and serve, but how well do they represent us? I would expect police departments to reflect pretty closely the demographics they serve. It turns out, though, that many don't live in the same city they serve.
I can't say if this is a problem--if it affects how the police police--but we can take a look at whether this disparity exists, and, if it does, where it's at its worst, its best, and in-between.
Our goal: compare the demographics of police departments across the US to the counties they serve.
Let's take a look, shall we?
We'll start with a few imports. We'll use a few standard ones--pandas, sklearn, seaborn, json, urllib--and a few less typical ones--uszipcode, plotly.
All of these will be used in a pretty straightforward manner:
# importing as necessary
import pandas as pd # for dataframes
import seaborn as sea # for graphing
import json # for graphing with plotly
#!pip install uszipcode # if uszipcode needs to be installed, uncomment this line!
from uszipcode import SearchEngine # for mapping zipcode to county
from sklearn.ensemble import RandomForestClassifier # for predicting missing data
#!pip install plotly # if plotly needs to be installed, uncomment this line!
import plotly.express as px # for creating a choropleth
from urllib.request import urlopen # used to grab our json
# Setting a few things up for usage later
sea.set(rc={'figure.figsize':(20,15)}) # modifying the default seaborn figure size
zip_search = SearchEngine(simple_zipcode=True) # setting up a search engine for zipcodes for usage a bit later
We'll get started with population data. This comes courtesy of the US Census Bureau. We'll be, specifically, using this csv which contains population estimates by county for age, race, sex and hispanic origin.
This csv contains a lot--a lot--of information, and we'll narrow it down quite a bit. But! First, let's open it and take a peek.
pop_data_file = 'cc-est2019-alldata.csv' # name of csv file
Well, before we open it, a quick aside: the encoding must be set to latin (ISO-8859-1), or else pandas cannot decode the file.
pop_data = pd.read_csv(pop_data_file, encoding='latin')
pop_data.head()
Most of this information is not particularly useful to us. The column names are unclear without the key, and will have to be renamed.
So what are we looking for?
Later, we'll be merging this dataframe with one containing police demographic data, so we need to remove information that is not contained within that dataframe (or which does not have a logical pair in the police dataframe). We only want to keep county, state, and population information for the year 2016.
We'll be removing quite a bit of, honestly, relevant information, unfortunately. The Census breaks down racial classification quite a bit more than we can use and allows for overlap between categories (which is important! but not usable here). We'll remove categories which do not focus on one race alone (e.g. black, white, hispanic) to limit overlap between categories. Additionally, when choosing columns, we'll use the "Not Hispanic" variables in order to better differentiate between each category/eliminate overlap where we can.
Below, we'll modify our dataframe to contain only the relevant categories and rename them to a more readable format.
# narrowing pop_data to relevant columns only
pop_data = pop_data[['STATE', 'COUNTY', 'STNAME', 'CTYNAME', 'YEAR', 'AGEGRP', 'TOT_POP',
'TOT_MALE', 'TOT_FEMALE', 'NHWA_MALE', 'NHWA_FEMALE', 'NHBA_MALE',
'NHBA_FEMALE', 'NHIA_MALE', 'NHIA_FEMALE', 'NHAA_MALE', 'NHAA_FEMALE',
'NHNA_MALE', 'NHNA_FEMALE', 'H_MALE', 'H_FEMALE']]
# renaming pop_data columns
pop_data.columns = ['STFIPS', 'CTYFIPS', 'State', 'County', 'YEAR', 'AGEGRP', 'Total_Pop',
'Total_Male', 'Total_Fem', 'White_Male', 'White_Fem', 'Black_Male',
'Black_Fem', 'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male',
'Asian_Fem', 'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male',
'Hispanic_Fem']
Great! What's next?
We'll map the full state names to their abbreviations. This will assist in our merge later on, as the police dataframe contains abbreviations only.
After that, we'll concatenate FIPS information, drop redundant & no longer useful information.
First, let's map each state to its abbreviation:
# dictionary mapping state name to its abbreviation
states = {'Alabama' : 'AL', 'Alaska' : 'AK', 'Arizona' : 'AZ',
'Arkansas' : 'AR', 'California' : 'CA', 'Colorado' : 'CO',
'Connecticut' : 'CT', 'Delaware' : 'DE', 'District of Columbia' : 'DC',
'Florida' : 'FL', 'Georgia' : 'GA', 'Hawaii' : 'HI', 'Idaho' : 'ID',
'Illinois' : 'IL', 'Indiana' : 'IN', 'Iowa' : 'IA', 'Kansas' : 'KS',
'Kentucky' : 'KY', 'Louisiana' : 'LA', 'Maine' : 'ME', 'Maryland' : 'MD',
'Massachusetts' : 'MA', 'Michigan' : 'MI', 'Minnesota' : 'MN',
'Mississippi' : 'MS', 'Missouri' : 'MO', 'Montana' : 'MT',
'Nebraska' : 'NE', 'Nevada' : 'NV', 'New Hampshire' : 'NH',
'New Jersey' : 'NJ', 'New Mexico' : 'NM', 'New York' : 'NY',
'North Carolina' : 'NC', 'North Dakota' : 'ND', 'Ohio' : 'OH',
'Oklahoma' : 'OK', 'Oregon' : 'OR', 'Pennsylvania' : 'PA',
'Rhode Island' : 'RI', 'South Carolina' : 'SC', 'South Dakota' : 'SD',
'Tennessee' : 'TN', 'Texas' : 'TX', 'Utah' : 'UT', 'Vermont' : 'VT',
'Virginia' : 'VA', 'Washington' : 'WA', 'West Virginia' : 'WV',
'Wisconsin' : 'WI', 'Wyoming' : 'WY'}
# find_state returns the abbreviation for the state of the supplied row
def find_state(row):
return states[row['State']]
# by applying the above method, we're effectively mapping each state to its abbreviation
pop_data['State'] = pop_data.apply(find_state, axis=1)
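As an aside, an equivalent (and typically faster) alternative to apply here is Series.map, which looks each value up in the dictionary directly. A small sketch on toy data, using an abbreviated version of the states dictionary:

```python
import pandas as pd

# abbreviated mapping for the sketch; the full `states` dict works the same way
states = {'Maryland': 'MD', 'Virginia': 'VA'}
df = pd.DataFrame({'State': ['Maryland', 'Virginia', 'Maryland']})
df['State'] = df['State'].map(states)  # per-value dictionary lookup
print(df['State'].tolist())  # ['MD', 'VA', 'MD']
```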
Next, let's convert STFIPS and CTYFIPS to a more useable format. We'll convert each column to strings and pad them with 0s. All that's left after that is to concatenate them.
# converting STFIPS and CTYFIPS to strings
pop_data['STFIPS'] = pop_data['STFIPS'].astype(str)
pop_data['CTYFIPS'] = pop_data['CTYFIPS'].astype(str)
# padding as appropriate with 0s. FIPS is a five digit code--think zipcode for
# counties--in which the first two digits represent the state and the remaining
# three represent the county.
pop_data['STFIPS'] = pop_data['STFIPS'].str.pad(width=2, side='left', fillchar='0')
pop_data['CTYFIPS'] = pop_data['CTYFIPS'].str.pad(width=3, side='left', fillchar='0')
# now that they're padded, we can concatenate to get our full FIPS category
pop_data['FIPS'] = pop_data['STFIPS'] + pop_data['CTYFIPS']
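As a quick sanity check, the pad-and-concatenate step can be verified on a couple of toy codes (hypothetical values, not rows from the Census file):

```python
import pandas as pd

# toy state/county codes mirroring the padding logic above
df = pd.DataFrame({'STFIPS': [1, 24], 'CTYFIPS': [3, 510]})
df['STFIPS'] = df['STFIPS'].astype(str).str.pad(width=2, side='left', fillchar='0')
df['CTYFIPS'] = df['CTYFIPS'].astype(str).str.pad(width=3, side='left', fillchar='0')
df['FIPS'] = df['STFIPS'] + df['CTYFIPS']  # two state digits + three county digits
print(df['FIPS'].tolist())  # ['01003', '24510']
```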
Now that we're done with FIPS, let's move on to removing redundant information and rearranging things a bit. We no longer need STFIPS or CTYFIPS, for example. First, we'll sort by state just to group similar information near each other. This is more of an aesthetic decision than anything else. Next, we'll drop years other than 2016 and AGEGRPs other than 0 (which represents all age groups). Finally, we'll drop the YEAR, AGEGRP, STFIPS, and CTYFIPS columns and reset the index.
pop_data.sort_values(by=['State'], inplace=True) # sorting by state value
pop_data = pop_data[pop_data['YEAR'] == 9] # dropping years other than 2016 (9)
pop_data = pop_data[pop_data['AGEGRP'] == 0] # dropping rows for specific AGEGRPs
pop_data = pop_data.drop(columns=['YEAR', 'AGEGRP', 'STFIPS', 'CTYFIPS']) # dropping columns which are no longer useful
pop_data.reset_index(drop=True, inplace=True) # resetting the index
Let's take a look! We've gone from 80 columns to 18 (!) and have removed quite a lot of rows in the process as well.
We're not totally done with this dataframe and will add & modify columns below, but, for now, it's in a much more usable format.
Before we move on, let's talk about a few issues in how I've chosen to drop data. By removing identification categories in general, it is almost certainly the case that people are being removed/are no longer represented. Initially, I'd opted to have an 'other' category, but the overlap was too significant, making a not too useful category. Additionally, I have a bit of a problem thinking about race as a one or the other concept. No person can be classified as one category alone, and racial classifications can differ from country-to-country (in the US, race is largely color-oriented, whereas in Russia, it is often based on region).
While I am largely focusing on race, it is absolutely worth noting, too, sex/gender information is extremely limited. There's no inclusion of trans, genderqueer, or nonbinary people. Instead, people are categorized as either male or female. Some demographic information on trans people (from 2016, in fact), can be found here and here, and some relevant demographic information on both nonbinary and trans people can be found here (though I have not fully read this & cannot totally account for it).
Still, the information we can glean by separating into individual categories will be useful. It just needs to be contextualized with the understanding that it does not present the full picture.
pop_data.head()
We'll be using a TSV file containing police demographic information by city/department. This file comes from the Bureau of Justice Statistics, though I've downloaded it from here--the variables lookup tab is quite handy.
It's worth noting, before diving into the data, that similar files exist for other years as well. These police surveys are held every 3-4 years. Unfortunately, the only other one which falls in the same range as the population data above is from 2013, and it does not contain the same breadth of information as the 2016 file.
On to the file we're working with. Quite a bit of information is contained within it, so we'll have to narrow it down significantly. Some of what we're removing contains information on training and resource allocation which would be quite useful in analyzing crime in general.
However, as we're looking primarily at demographic representation, we'll remove that information. I would encourage anyone interested to investigate further by looking through the variable list above.
As before, we'll start by creating a dataframe.
police_data_file = '37323-0001-Data.tsv' # file name as I've saved it
police_data = pd.read_csv(police_data_file, sep='\t') # as this is a tsv, it's tab-separated
police_data.head()
434 columns is far too many, but before we narrow it down, let's add one more. Pers_Total is as it sounds: the total number of police officers. It's not totally all-encompassing--only full-time officers are included--but we'll be working exclusively with full-time officers (more on that below).
police_data['Pers_Total'] = police_data['PERS_MALE'] + police_data['PERS_FEMALE'] # adding a category for total full-time officers
Our dataframe contains information on both part-time and full-time police staff. It does not, however, provide information on what constitutes part-time employment. I, for example, work one day a week as a barista and am part-time, but I used to work five days a week and was still considered part-time because I worked fewer than 36 hours. Because of this range, I am opting to include only full-time officers; that is, officers for whom policing must be a primary form of employment, even if they have other jobs.
We'll remove most information which does not have a clear match in our population dataframe. Some additional columns will be kept but not used, specifically AGENCYNAME and CITY. I've included those simply because they are relevant and could be used to take a deeper dive into a given state's demographic data.
At this point, it's worth mentioning that not all police departments are included in this dataframe. I do not know why--some may have been too small, some may have chosen not to respond to the survey--and, as such, it is likely the case that some counties will not have full police demographic data.
Ideally, we'll have access to the primary/largest police departments in each county, but it could absolutely be the case that we do not have that information. From here on, we will have to operate under the assumption that our county police data (as we'll compute it) is truly representative of the county. The impact could be minimal or could be substantial, but it's difficult to know for sure.
# removing irrelevant categories
police_data = police_data[['AGENCYNAME', 'CITY', 'ZIPCODE', 'STATE', 'COUNTY','Pers_Total', 'PERS_MALE',
'PERS_FEMALE', 'PERS_WHITE_MALE', 'PERS_WHITE_FEM', 'PERS_BLACK_MALE', 'PERS_BLACK_FEM',
'PERS_HISP_MALE', 'PERS_HISP_FEM', 'PERS_AMIND_MALE', 'PERS_AMIND_FEM',
'PERS_ASIAN_MALE', 'PERS_ASIAN_FEM', 'PERS_HAWPI_MALE', 'PERS_HAWPI_FEM']]
# renaming into a more readable format/a format aligned with our pop_data dataframe
police_data.columns = ['Agency', 'City', 'Zipcode', 'State', 'County', 'Total_Police', 'Total_P_Male', 'Total_P_Fem',
'P_White_Male', 'P_White_Fem', 'P_Black_Male', 'P_Black_Fem', 'P_Hispanic_Male',
'P_Hispanic_Fem', 'P_Amer_IN_AK_Nat_Male', 'P_Amer_In_AK_Nat_Fem',
'P_Asian_Male', 'P_Asian_Fem', 'P_Nat_HI_Pac_Isl_Male',
'P_Nat_HI_Pac_Isl_Fem']
police_data.head()
We're close! But not there yet! County data is incomplete--not all counties are represented, and the formatting differs from what we'd like. Next, we'll quickly look up county information by calling zip_search.by_zipcode(zipcode), which returns a zipcode object, and grabbing its county field.
# find_county uses the by_zipcode function to search for
# information on the provided zipcode. As we only want the
# county info, we'll return the county field--a string
#
# luckily, the county field is identical to the pop. data's
# county field, making it easy to merge on them later.
def find_county(row):
return zip_search.by_zipcode(row['Zipcode']).county
police_data['County'] = police_data.apply(find_county, axis=1) # modifying county data/removing null county information
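One caveat: depending on the uszipcode version, a failed lookup may return None (or an object whose county field is None), in which case find_county as written would raise. A more defensive variant is sketched below; StubSearch is a hypothetical stand-in for uszipcode's SearchEngine so the sketch runs without the zipcode database:

```python
# hypothetical stand-in for uszipcode's SearchEngine, purely for illustration;
# by_zipcode returns an object with a .county field, or None when unknown
class StubSearch:
    class Result:
        def __init__(self, county):
            self.county = county
    def by_zipcode(self, zipcode):
        return self.Result('Salt Lake County') if zipcode == '84101' else None

def find_county_safe(row, search):
    # defensive variant of find_county: returns None instead of raising
    # when the lookup fails or the result has no county attribute
    result = search.by_zipcode(row['Zipcode'])
    return getattr(result, 'county', None)

stub = StubSearch()
print(find_county_safe({'Zipcode': '84101'}, stub))  # Salt Lake County
print(find_county_safe({'Zipcode': '00000'}, stub))  # None
```

Rows left with a None county could then be inspected or patched by hand, as we do for one zipcode below.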
For whatever reason, some of our rows have negative values for total police and for total female police officers. As such, we'll remove those from our dataframe. It is not clear why these values are negative, so I feel it's better to exclude them entirely.
# removing negative demographic information
police_data = police_data[(police_data['Total_Police'] >= 0) & (police_data['Total_P_Fem'] >= 0)]
The below county was not mapped by zipcode--the zipcode object had no county data--so we'll set it manually. With that, our county information is complete!
police_data.at[2590, 'County'] = 'Salt Lake County'
As mentioned above, the data here may not be fully indicative of actual county police demographic data, but without access to more information, we will operate under the assumption that it is.
Before we can merge this with our population data, we'll need to convert it from city-level data to county-level data. This is pretty simple. We'll groupby state (sorry) and county and sum across each group. The resultant dataframe will be county-level police data.
Once we have our properly grouped dataframe, we can drop the zipcode column, as we'll no longer be needing it.
pol_county_data = police_data.groupby(['State', 'County']).sum() # creating county-level dataframe
pol_county_data.drop(columns=['Zipcode'], inplace=True) # dropping zipcode column
Now that we've narrowed our police dataframe to the county level, we're just about ready to merge! But first, we'll modify our columns so that each category is a percentage. This is pretty simple: each category is divided by the total number of officers (e.g. total female / total officers gives a percentage rather than a raw count). Doing this puts the information in a more easily digestible format and makes it easier to compare to our population data later on.
Next, we'll add total population for each race by adding the male and female subcategories of each race (e.g. total hispanic = hispanic female + hispanic male). After that, our police data will be just about ready.
# converting each demographic category to a percentage
pol_county_data['Total_P_Male'] = pol_county_data['Total_P_Male'] / pol_county_data['Total_Police']
pol_county_data['Total_P_Fem'] = pol_county_data['Total_P_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_White_Male'] = pol_county_data['P_White_Male'] / pol_county_data['Total_Police']
pol_county_data['P_White_Fem'] = pol_county_data['P_White_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Black_Male'] = pol_county_data['P_Black_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Black_Fem'] = pol_county_data['P_Black_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Hispanic_Male'] = pol_county_data['P_Hispanic_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Hispanic_Fem'] = pol_county_data['P_Hispanic_Fem'] / pol_county_data['Total_Police']
# the rename above left 'P_Amer_IN_AK_Nat_Male' with inconsistent capitalization; fix it here
pol_county_data.rename(columns={'P_Amer_IN_AK_Nat_Male': 'P_Amer_In_AK_Nat_Male'}, inplace=True)
pol_county_data['P_Amer_In_AK_Nat_Male'] = pol_county_data['P_Amer_In_AK_Nat_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Amer_In_AK_Nat_Fem'] = pol_county_data['P_Amer_In_AK_Nat_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Asian_Male'] = pol_county_data['P_Asian_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Asian_Fem'] = pol_county_data['P_Asian_Fem'] / pol_county_data['Total_Police']
pol_county_data['P_Nat_HI_Pac_Isl_Male'] = pol_county_data['P_Nat_HI_Pac_Isl_Male'] / pol_county_data['Total_Police']
pol_county_data['P_Nat_HI_Pac_Isl_Fem'] = pol_county_data['P_Nat_HI_Pac_Isl_Fem'] / pol_county_data['Total_Police']
# adding population categories for each represented race
pol_county_data['P_White_Pop'] = pol_county_data['P_White_Male'] + pol_county_data['P_White_Fem']
pol_county_data['P_Black_Pop'] = pol_county_data['P_Black_Male'] + pol_county_data['P_Black_Fem']
pol_county_data['P_Hispanic_Pop'] = pol_county_data['P_Hispanic_Male'] + pol_county_data['P_Hispanic_Fem']
pol_county_data['P_Amer_In_AK_Nat_Pop'] = pol_county_data['P_Amer_In_AK_Nat_Male'] + pol_county_data['P_Amer_In_AK_Nat_Fem']
pol_county_data['P_Asian_Pop'] = pol_county_data['P_Asian_Male'] + pol_county_data['P_Asian_Fem']
pol_county_data['P_Nat_HI_Pac_Isl_Pop'] = pol_county_data['P_Nat_HI_Pac_Isl_Male'] + pol_county_data['P_Nat_HI_Pac_Isl_Fem']
pol_county_data.reset_index(inplace=True) # resetting the index--mostly for aesthetic reasons
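The repeated divide-then-sum pattern above could also be written as a loop over demographic name prefixes, since the columns follow a consistent naming scheme. An equivalent sketch on toy counts (two races shown; the real dataframe has six):

```python
import pandas as pd

# toy officer counts mirroring the real column-naming scheme
df = pd.DataFrame({'Total_Police': [10],
                   'P_White_Male': [4], 'P_White_Fem': [1],
                   'P_Black_Male': [3], 'P_Black_Fem': [2]})
for race in ['White', 'Black']:
    for sex in ['Male', 'Fem']:
        # count -> share of all officers
        df[f'P_{race}_{sex}'] = df[f'P_{race}_{sex}'] / df['Total_Police']
    # combined share for the race, summed across sexes
    df[f'P_{race}_Pop'] = df[f'P_{race}_Male'] + df[f'P_{race}_Fem']
print(df[['P_White_Pop', 'P_Black_Pop']].iloc[0].tolist())  # [0.5, 0.5]
```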
For our final modification, we'll replace any na values with 0. If no information was reported for a category, I am comfortable assuming that there was nothing to report (though, of course, this may not be the case).
pol_county_data.fillna(0, inplace=True) # missing data treated as 0
Here comes the merge! Or not! Turns out, we need to modify our population data a bit too. No sweat! All we need to do is modify it in the same way we modified our police county data above.
We'll convert all demographic columns to percentages and add population columns for each demographic. In addition to that, we'll add a column for the majority population of each county.
# converting each demographic column to percentage of total population
pop_data['Total_Male'] = pop_data['Total_Male'] / pop_data['Total_Pop']
pop_data['Total_Fem'] = pop_data['Total_Fem'] / pop_data['Total_Pop']
pop_data['White_Male'] = pop_data['White_Male'] / pop_data['Total_Pop']
pop_data['White_Fem'] = pop_data['White_Fem'] / pop_data['Total_Pop']
pop_data['Black_Male'] = pop_data['Black_Male'] / pop_data['Total_Pop']
pop_data['Black_Fem'] = pop_data['Black_Fem'] / pop_data['Total_Pop']
pop_data['Hispanic_Male'] = pop_data['Hispanic_Male'] / pop_data['Total_Pop']
pop_data['Hispanic_Fem'] = pop_data['Hispanic_Fem'] / pop_data['Total_Pop']
pop_data['Amer_In_AK_Nat_Male'] = pop_data['Amer_In_AK_Nat_Male'] / pop_data['Total_Pop']
pop_data['Amer_In_AK_Nat_Fem'] = pop_data['Amer_In_AK_Nat_Fem'] / pop_data['Total_Pop']
pop_data['Asian_Male'] = pop_data['Asian_Male'] / pop_data['Total_Pop']
pop_data['Asian_Fem'] = pop_data['Asian_Fem'] / pop_data['Total_Pop']
pop_data['Nat_HI_Pac_Isl_Male'] = pop_data['Nat_HI_Pac_Isl_Male'] / pop_data['Total_Pop']
pop_data['Nat_HI_Pac_Isl_Fem'] = pop_data['Nat_HI_Pac_Isl_Fem'] / pop_data['Total_Pop']
# adding population columns for each demographic
pop_data['White_Pop'] = pop_data['White_Male'] + pop_data['White_Fem']
pop_data['Black_Pop'] = pop_data['Black_Male'] + pop_data['Black_Fem']
pop_data['Hispanic_Pop'] = pop_data['Hispanic_Male'] + pop_data['Hispanic_Fem']
pop_data['Amer_In_AK_Nat_Pop'] = pop_data['Amer_In_AK_Nat_Male'] + pop_data['Amer_In_AK_Nat_Fem']
pop_data['Asian_Pop'] = pop_data['Asian_Male'] + pop_data['Asian_Fem']
pop_data['Nat_HI_Pac_Isl_Pop'] = pop_data['Nat_HI_Pac_Isl_Male'] + pop_data['Nat_HI_Pac_Isl_Fem']
In order to add our majority population column, we'll construct a list by iterating over each row in our dataframe. All we'll do is check which population is greatest for any given county and append that to our list.
The final list will be our majority population column, and will be added to our pop_data dataframe.
maj_pop = []
for index, row in pop_data.iterrows():
if (row['White_Pop'] > row['Black_Pop']
and row['White_Pop'] > row['Hispanic_Pop']
and row['White_Pop'] > row['Amer_In_AK_Nat_Pop']
and row['White_Pop'] > row['Asian_Pop']
and row['White_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
maj_pop.append('White')
elif (row['Black_Pop'] > row['Hispanic_Pop']
and row['Black_Pop'] > row['Amer_In_AK_Nat_Pop']
and row['Black_Pop'] > row['Asian_Pop']
and row['Black_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
maj_pop.append('Black')
elif (row['Hispanic_Pop'] > row['Amer_In_AK_Nat_Pop']
and row['Hispanic_Pop'] > row['Asian_Pop']
and row['Hispanic_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
maj_pop.append('Hispanic')
elif (row['Amer_In_AK_Nat_Pop'] > row['Asian_Pop']
and row['Amer_In_AK_Nat_Pop'] > row['Nat_HI_Pac_Isl_Pop']):
maj_pop.append('American Indian or Alaska Native')
elif row['Asian_Pop'] > row['Nat_HI_Pac_Isl_Pop']:
maj_pop.append('Asian')
else:
maj_pop.append('Native Hawaiian or Pacific Islander')
pop_data['Maj_Pop'] = maj_pop
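For reference, the same majority label can be computed without an explicit loop using idxmax, which returns, per row, the name of the column holding the largest value. A sketch on toy shares (three races shown for brevity); note that on exact ties idxmax takes the first column, which can differ from the elif chain's strict-inequality ordering:

```python
import pandas as pd

# toy population shares; column names match pop_data's
race_cols = {'White_Pop': 'White', 'Black_Pop': 'Black', 'Hispanic_Pop': 'Hispanic'}
df = pd.DataFrame({'White_Pop': [0.6, 0.2],
                   'Black_Pop': [0.3, 0.5],
                   'Hispanic_Pop': [0.1, 0.3]})
# pick the largest column per row, then map it to a readable label
df['Maj_Pop'] = df[list(race_cols)].idxmax(axis=1).map(race_cols)
print(df['Maj_Pop'].tolist())  # ['White', 'Black']
```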
Let's take a look at our dataframes before we merge. We can see, now, how each column in one dataframe corresponds to a column in the other. The counties shown don't totally line up, either because certain counties aren't present in one of the dataframes or simply because they aren't visible in these first few rows.
Still, we can see, from our brief look at the head of our population data, that demographics tend to split pretty evenly between male and female, with females tending to be the slightly larger of the two. Our police data (representing different counties, admittedly) skews largely male. Greater than 80 percent of police officers in each county shown in our sample are male. That's wild! We'll need to look deeper.
But first, take a look at Matanuska-Susitna Borough County in the pol_county_data. It is missing all racial demographic information. We could drop the row, as it will clearly be an outlier in our representations of race. However, its population is relatively small, and I would like to use its sex demographics. I am willing to let it skew our already skewed data a bit by letting it stay. Should I? Who knows?
pop_data.head()
pol_county_data.head()
We are all set to merge. Using the pandas merge method, we'll do an inner join on State and County. Any non-matched State/County pair will be dropped from our merged dataframe, but all matched data will remain. Neat.
This is where the fun begins (or continues).
merged_data = pd.merge(pop_data, pol_county_data, on=['State', 'County']) # merging on State and County columns
merged_data.head()
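To make the inner-join behavior concrete, here's a toy merge with entirely made-up values: the unmatched State/County pair drops out, and the matched row keeps columns from both sides.

```python
import pandas as pd

# hypothetical values purely for illustration
pop = pd.DataFrame({'State': ['MD', 'MD'],
                    'County': ['Montgomery County', "Prince George's County"],
                    'Total_Pop': [1_000_000, 900_000]})
pol = pd.DataFrame({'State': ['MD'],
                    'County': ['Montgomery County'],
                    'Total_Police': [1_200]})
merged = pd.merge(pop, pol, on=['State', 'County'])  # inner join by default
print(merged['County'].tolist())  # ['Montgomery County']
```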
We obviously have quite a few more columns now, but that's OK. We've named our columns such that it's easy enough to remember how they correspond to one another--adding a 'P' says it's police info.
Now that we've assembled our information as we'd like, let's move on to visualizing.
After, of course, we add another category.
Sex Difference is a measure of how much each demographic's police representation differs from its population percentage. To calculate it, we take the absolute value of the difference between the percentage of total population and the percentage of police officers and weight it by that demographic's percentage of the total population. We multiply by 100 to make it more readable.
Importantly, each demographic's difference is always positive, since we take the absolute value before summing (in this case, over male and female). If we allowed negative differences, positive and negative terms could cancel one another out.
Another way to think about this is just as the difference in percentage of population (total vs police) weighted by total population percentage.
# adding a sex difference category, which can be used to visualize representation by gender.
merged_data['Sex_Difference'] = ((100 * abs(merged_data['Total_Male'] - merged_data['Total_P_Male'])
* merged_data['Total_Male'])
+ (100 * abs(merged_data['Total_Fem'] - merged_data['Total_P_Fem'])
* merged_data['Total_Fem']))
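A quick worked example with made-up numbers: a county that is 50% male / 50% female, policed by a force that is 85% male / 15% female, scores 35.

```python
# made-up county: population is 50% male / 50% female,
# police force is 85% male / 15% female
total_male, total_fem = 0.50, 0.50
p_male, p_fem = 0.85, 0.15
sex_difference = (100 * abs(total_male - p_male) * total_male
                  + 100 * abs(total_fem - p_fem) * total_fem)
print(round(sex_difference, 6))  # 35.0
```

Each sex contributes 100 * 0.35 * 0.5 = 17.5, for a total of 35: a substantial gap, as we'll see is common below.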
Let's start with a scatterplot. We'll map points by FIPS county code and sex difference. This will be an extremely crowded x-axis. It is important to remember, too, that FIPS will be sorted by state & county. Points near each other on the x-axis are possibly in the same state.
We'll think more about our plot later.
First, let's plot.
We'll use seaborn to plot our scatterplot. I've imported seaborn as sea. I think it's typically imported as sns, but I read that as sneeze and prefer sea for that silly reason.
We'll set our x-axis as 'FIPS', y-axis as 'Sex_Difference' and hue as 'Maj_Pop' which we found earlier. By using it as a hue, we get a visual sense for how different majority populations are represented by their police using sex difference (and we'll do this again using race later).
# plotting sex difference by FIPS county code
sea.scatterplot(x = 'FIPS', y = 'Sex_Difference', hue='Maj_Pop',
data = merged_data).set(title = 'Sex Difference by County Code (FIPS)',
ylabel = 'Difference in Representation by Sex',
xlabel = 'FIPS County Code')
Let's do this again, but this time with 'State' as our x-axis instead of 'FIPS'. This will, hopefully, give us an easier-to-interpret plot.
You'll notice that points align pretty closely between plots, as state order and FIPS align.
# plotting sex_difference by state
sea.scatterplot(x = 'State', y = 'Sex_Difference', hue='Maj_Pop',
data = merged_data).set(title = 'Sex Difference by State',
ylabel = 'Difference in Representation by Sex',
xlabel = 'State')
Well, not too much, relationally. There's no clear relationship between a county's majority population and its sex difference. It seems that relatively few counties represent the sex demographics of their constituents well. More often than not, the difference is in the range of 30 to 50, demonstrating relatively poor representation.
This doesn't show us any information on things like population size or which demographic is being overly represented.
What we do know, though, is that this is pretty universal across racial demographics, at least in terms of majority population. We don't know how race impacts sex representation (e.g. are black women typically better represented than Hispanic women?), and we'll primarily be operating on sex and race distinctly. Still, it's important to keep in mind.
Let's see if we can find any clear relationship between population and sex data. Instead of plotting difference against state let's plot corresponding categories against one another and see if we can learn anything useful.
For this, we'll use a lmplot, again with seaborn.
For our first plot, let's plot male police officers against male population percentage. Setting logistic to True will fit a logistic regression model, giving us a decent visual indicator of any relationship.
# plotting police male percentage by total male percentage
sea.lmplot(y = 'Total_P_Male', x = 'Total_Male', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Male vs. Pop. Percentage Male',
ylabel = 'Police Percentage Male',
xlabel = 'Population Percentage Male')
Let's do the same as the above but with female population data for police and population.
# plotting police female percentage by total female percentage
sea.lmplot(y = 'Total_P_Fem', x = 'Total_Fem', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Female vs. Pop. Percentage Female',
ylabel = 'Police Percentage Female',
xlabel = 'Population Percentage Female')
This is much more useful than our above scatter plots. Both plots show a pretty clear relationship in which, as each demographic increases, police representation (i.e. the percentage of police officers of the same demographic) increases. Notably, the range is wildly different for each. Female representation rarely ventures above 20%; conversely, male representation rarely ventures below 80%.
What does this tell us, and why is this important? Well, it tells us that women are far less likely to be well-represented by their police force. More often than not, men will police women (and men) regardless of population dynamics.
This is not new information, as it turns out, but it is important. The information we have does not extend itself well to answering the why--why are men more likely to be police officers than women? It may have to do with on-the-job experiences, societal structures encouraging men to be authoritative and women to be subservient, or anything else (all of the above!). We do not have enough information here to make a reasonable assessment, but our finding is still important because it provides motive to research further.
If we know that a gap exists, it is reasonable to seek answers.
Unlike with sex demographics, we won't create scatter plots by FIPS or by State (we'll have a better visualization tool later on). Instead, let's break racial demographics apart and plot lmplots in much the same way we did for sex.
The goal of these plots is to simply give us a relationship between police representation as a function of demographic population.
# plotting white police percentage by white population percentage
sea.lmplot(y = 'P_White_Pop', x = 'White_Pop', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage White vs. Pop. Percentage White',
ylabel = 'Police Percentage White',
xlabel = 'Population Percentage White')
# plotting black police percentage by black population percentage
sea.lmplot(y = 'P_Black_Pop', x = 'Black_Pop', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Black vs. Pop. Percentage Black',
ylabel = 'Police Percentage Black',
xlabel = 'Population Percentage Black')
# plotting Hispanic police percentage by Hispanic population percentage
sea.lmplot(y = 'P_Hispanic_Pop', x = 'Hispanic_Pop', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Hispanic vs. Pop. Percentage Hispanic',
ylabel = 'Police Percentage Hispanic',
xlabel = 'Population Percentage Hispanic')
# plotting American Indian or Alaska Native police percentage by American Indian or Alaska Native population percentage
sea.lmplot(y = 'P_Amer_In_AK_Nat_Pop', x = 'Amer_In_AK_Nat_Pop', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage American Indian or Alaska Native vs. Pop. Percentage American Indian or Alaska Native',
ylabel = 'Police Percentage American Indian or Alaska Native',
xlabel = 'Population Percentage American Indian or Alaska Native')
# plotting Asian police percentage by Asian population percentage
sea.lmplot(y = 'P_Asian_Pop', x = 'Asian_Pop', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Asian vs. Pop. Percentage Asian',
ylabel = 'Police Percentage Asian',
xlabel = 'Population Percentage Asian')
# plotting Native Hawaiian or Other Pacific Island police percentage by Native Hawaiian or Other Pacific Island population percentage
sea.lmplot(y = 'P_Nat_HI_Pac_Isl_Pop', x = 'Nat_HI_Pac_Isl_Pop', logistic=True,
data = merged_data, height=10, aspect=2).set(title = 'Police Percentage Native Hawaiian or Other Pacific Island vs. Pop. Percentage Native Hawaiian or Other Pacific Island',
ylabel = 'Police Percentage Native Hawaiian or Other Pacific Island',
xlabel = 'Population Percentage Native Hawaiian or Other Pacific Island')
In a perfect representation, we would expect to see a one-to-one relationship between each police demographic and each population demographic. Instead, we see quite a few disparate representation relationships.
Starting with the relationship between white demographics, we can see that even when barely represented (if at all) in the total population, white people make up around 20% of the police force. In fact, as population increases, white police demographics tend to be significantly higher than their population representation, at least until the white population is a sizable majority. As with males in sex demographics, we can see that white people are disproportionately represented by the police.
In each other demographic, the relationship is pretty much the opposite. As their population increases, they are represented better by the police department, but the population percentage is almost always greater than the police percentage.
Black people are always underrepresented, sometimes significantly. Even in populations in which black people make up the vast majority of the population, they are underrepresented by their police forces.
Hispanic people, interestingly enough, have a near-linear relationship from around 60% total population on.
For American Indian or Alaska Native, Asian, and Native Hawaiian or Other Pacific Island people, we see mostly underrepresentation, with Asian people being the best represented.
This is not the one-to-one relationship we expected. In fact, unless you're white, you're likely to be underrepresented, sometimes significantly. This is a clear problem. Police departments have now been shown to underrepresent both non-male people and non-white people. In essence, if you are not a white male, you are almost certainly poorly represented by your police force.
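One way to make the over- and underrepresentation described above concrete is to compute, per group, the police share minus the population share. A minimal sketch on made-up numbers (the column names mirror merged_data; the values are hypothetical):

```python
import pandas as pd

# A toy stand-in for merged_data: population vs. police shares as fractions.
# Column names mirror the real DataFrame; the numbers here are made up.
toy = pd.DataFrame({
    'White_Pop':   [0.60, 0.30],
    'P_White_Pop': [0.80, 0.55],
    'Black_Pop':   [0.40, 0.70],
    'P_Black_Pop': [0.20, 0.45],
})

# Representation gap: police share minus population share.
# Positive -> overrepresented on the force; negative -> underrepresented.
for group in ['White', 'Black']:
    toy[f'{group}_Gap'] = toy[f'P_{group}_Pop'] - toy[f'{group}_Pop']

print(toy[['White_Gap', 'Black_Gap']])
```

A positive gap for the white columns and a negative gap for the black columns is exactly the pattern the lmplots above display.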
Yes, I lied. We are in fact going to plot race demographic differences in much the same way that we did sex demographic differences. We're doing this simply as a quick check--a way to see how much variation exists in representation before we move on to the more interesting visualization tools.
First, let's create a 'Race_Difference' column in our merged_data DataFrame. This will be done the same way that we computed our 'Sex_Difference' column, using race population data instead of sex population data.
merged_data['Race_Difference'] = ((100 * abs(merged_data['White_Pop'] - merged_data['P_White_Pop'])
* merged_data['White_Pop'])
+ (100 * abs(merged_data['Black_Pop'] - merged_data['P_Black_Pop'])
* merged_data['Black_Pop'])
+ (100 * abs(merged_data['Hispanic_Pop'] - merged_data['P_Hispanic_Pop'])
* merged_data['Hispanic_Pop'])
+ (100 * abs(merged_data['Amer_In_AK_Nat_Pop'] - merged_data['P_Amer_In_AK_Nat_Pop'])
* merged_data['Amer_In_AK_Nat_Pop'])
+ (100 * abs(merged_data['Asian_Pop'] - merged_data['P_Asian_Pop'])
* merged_data['Asian_Pop'])
+ (100 * abs(merged_data['Nat_HI_Pac_Isl_Pop'] - merged_data['P_Nat_HI_Pac_Isl_Pop'])
* merged_data['Nat_HI_Pac_Isl_Pop']))
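The long expression above is the same population-weighted absolute difference applied group by group. As a sanity check, here's the formula factored into a hypothetical helper function and verified on a tiny synthetic frame (the names mirror merged_data's columns; the data is made up):

```python
import pandas as pd

# Hypothetical helper capturing the formula above: each group's absolute
# representation gap, weighted by that group's population share, summed
# across groups.
GROUPS = ['White', 'Black', 'Hispanic',
          'Amer_In_AK_Nat', 'Asian', 'Nat_HI_Pac_Isl']

def race_difference(df, groups=GROUPS):
    total = 0
    for g in groups:
        total = total + 100 * abs(df[f'{g}_Pop'] - df[f'P_{g}_Pop']) * df[f'{g}_Pop']
    return total

# Tiny synthetic check: one county, two groups only.
# 100*|0.6-0.8|*0.6 + 100*|0.4-0.2|*0.4 = 12 + 8 = 20
toy = pd.DataFrame({'White_Pop': [0.6], 'P_White_Pop': [0.8],
                    'Black_Pop': [0.4], 'P_Black_Pop': [0.2]})
toy['Race_Difference'] = race_difference(toy, groups=['White', 'Black'])
```

The weighting means a gap in a large group moves the score more than the same gap in a small one.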
Now that we have that, let's take a moment to visualize race difference by FIPS and by state, as we did earlier.
# plotting race_difference by FIPS
sea.scatterplot(x = 'FIPS', y = 'Race_Difference', hue='Maj_Pop',
data = merged_data).set(title = 'Race Difference by FIPS',
ylabel = 'Difference in Representation by Race',
xlabel = 'FIPS')
# plotting race_difference by State
sea.scatterplot(x = 'State', y = 'Race_Difference', hue='Maj_Pop',
data = merged_data).set(title = 'Race Difference by State',
ylabel = 'Difference in Representation by Race',
xlabel = 'State')
While racial differences certainly exist, there is less of an outright glaring disparity. But, as before, we'll need to dig a little deeper before claiming too much.
Let's plot race difference by demographic populations. This will hopefully give us greater insight into how well different demographics are represented. To do this, we'll go back to our old friend the lmplot, using 'X_Pop' as our x-axis and 'Race_Difference' as our y-axis.
Remember our perfect representation world as we go through each plot. A positive correlation implies worsening representation as demographic population increases, a negative correlation the opposite.
sea.lmplot(x = 'Black_Pop', y = 'Race_Difference', data = merged_data,
height=10, aspect=2).set(title = 'Race Difference as Black Demographic Percentage Increases',
ylabel = 'Race Difference',
xlabel = 'Black Demographic Percentage of Total Pop.')
sea.lmplot(x = 'Hispanic_Pop', y = 'Race_Difference', data = merged_data,
height=10, aspect=2).set(title = 'Race Difference as Hispanic Demographic Percentage Increases',
ylabel = 'Race Difference',
xlabel = 'Hispanic Demographic Percentage of Total Pop.')
sea.lmplot(x = 'White_Pop', y = 'Race_Difference', data = merged_data,
height=10, aspect=2).set(title = 'Race Difference as White Demographic Percentage Increases',
ylabel = 'Race Difference',
xlabel = 'White Demographic Percentage of Total Pop.')
sea.lmplot(x = 'Amer_In_AK_Nat_Pop', y = 'Race_Difference', data = merged_data,
height=10, aspect=2).set(title = 'Race Difference as American Indian or Alaska Native Demographic Percentage Increases',
ylabel = 'Race Difference',
xlabel = 'American Indian or Alaska Native Demographic Percentage of Total Pop.')
sea.lmplot(x = 'Asian_Pop', y = 'Race_Difference', data = merged_data,
height=10, aspect=2).set(title = 'Race Difference as Asian Demographic Percentage Increases',
ylabel = 'Race Difference',
xlabel = 'Asian Demographic Percentage of Total Pop.')
sea.lmplot(x = 'Nat_HI_Pac_Isl_Pop', y = 'Race_Difference', data = merged_data,
height=10, aspect=2).set(title = 'Race Difference as Native Hawaiian or Other Pacific Island Demographic Percentage Increases',
ylabel = 'Race Difference',
xlabel = 'Native Hawaiian or Other Pacific Island Demographic Percentage of Total Pop.')
Some of the above graphs are not quite as useful (namely Native Hawaiian or Other Pacific Island and American Indian or Alaska Native, for which high-population data is scarce), so I won't go into their regression plots.
Let's dig a bit more deeply into our black, white, Hispanic and Asian plots. In all but our white plot, we can see that as the given demographic population increases, so too does racial disparity in police representation. The only demographic for which this is not the case is white people, for whom racial disparity in representation decreases as population increases.
This suggests that non-white demographics are poorly represented even in counties with higher concentration of any given non-white demographic. Due to our previous graphs, this is a bit expected. Poorly represented people continue to be shown to be poorly represented. What we've gleaned, however, is that this does not change as a demographic's population increases--representation (and it's important to mention here that race difference does not necessarily show the difference of the race it's plotted against) in fact seems to worsen for non-white demographics.
Why use a scatter plot with FIPS when we can map it by county! To make this map a bit clearer, we'll first compute the standard deviation and mean of the Race_Difference category. We'll then add a Z_score column for race using those two values.
We'll use plotly to create a choropleth, mapping z_score to each county.
# computing mean and standard deviation using the .mean() and .std() functions
race_dif_mean = merged_data['Race_Difference'].mean()
race_dif_stddev = merged_data['Race_Difference'].std()
# z_score = (val - mean) / stddev
# rounding, casting as int for use later on
merged_data['Z_Score_Race'] = round((merged_data['Race_Difference'] - race_dif_mean) / race_dif_stddev).astype(int)
# let's do the same thing for sex as well.
# we'll compute the mean and standard deviation for sex_difference
sex_dif_mean = merged_data['Sex_Difference'].mean()
sex_dif_stddev = merged_data['Sex_Difference'].std()
# z_score for sex_difference
merged_data['Z_Score_Sex'] = round((merged_data['Sex_Difference'] - sex_dif_mean) / sex_dif_stddev).astype(int)
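The standardization above is the usual z = (value - mean) / std. A quick self-contained check on toy values, showing that rounding to ints buckets rows by how many standard deviations they sit from the mean:

```python
import pandas as pd

# Standardizing a column by hand, as above: z = (value - mean) / std.
vals = pd.Series([2.0, 4.0, 6.0, 8.0])
z = (vals - vals.mean()) / vals.std()   # pandas .std() uses ddof=1 by default

# Standardized values have mean 0, and rounding to ints (as we do for the
# choropleth color scale) buckets rows by standard deviations from the mean.
z_int = round(z).astype(int)
```

Note that pandas' .std() is the sample standard deviation (ddof=1), which is what the code above computes as well.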
Luckily for us, plotly included a county fips GeoJSON in their choropleth tutorial. We can use this to map our FIPS county codes against their provided JSON.
We'll open the url and load the json.
# opening the counties-fips geojson for use in mapping our choropleths
from urllib.request import urlopen # for fetching the geojson
import plotly.express as px # for the choropleth maps

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
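One gotcha worth flagging: plotly matches each row's locations value against the id field of the GeoJSON features, and in this counties file the ids are 5-character, zero-padded strings. A tiny sketch (with a made-up one-feature GeoJSON) of why integer FIPS codes need padding before they'll match:

```python
# A minimal stand-in for the GeoJSON structure plotly expects: one feature
# with a 5-character, zero-padded string id, as in the real counties file.
counties_sketch = {
    'type': 'FeatureCollection',
    'features': [{'type': 'Feature', 'id': '01001', 'geometry': {}}],
}

# An integer FIPS code won't match the string id until it is zero-padded.
fips_as_int = 1001
fips_as_str = str(fips_as_int).zfill(5)   # '01001'

feature_ids = {f['id'] for f in counties_sketch['features']}
```

If counties silently fail to render on the map, a FIPS dtype mismatch like this is a likely culprit.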
# setting location to FIPS, color to our race z_score, color range -> std dev range
z_score_race_fig = px.choropleth(merged_data, geojson=counties, locations='FIPS', color='Z_Score_Race',
color_continuous_scale='OrRd',
range_color=(-3, 3),
scope='usa',
labels={'Z_Score_Race':'Z Score (Race)'}
)
z_score_race_fig.update_layout(title_text = 'Z Score for Race by County', margin={'r':0,'t':0,'l':0,'b':0})
z_score_race_fig.show()
# setting location to FIPS, color to our sex z_score, color range -> std dev range
z_score_sex_fig = px.choropleth(merged_data, geojson=counties, locations='FIPS', color='Z_Score_Sex',
color_continuous_scale='OrRd',
range_color=(-3, 3),
scope='usa',
labels={'Z_Score_Sex':'Z Score (Sex)'},
)
z_score_sex_fig.update_layout(title_text = 'Z Score for Sex by County', margin={'r':0,'t':0,'l':0,'b':0})
z_score_sex_fig.show()
Now that we can visualize Z score by county, we can visualize not only how representation differs but where!
One thing to keep in mind: our difference measures are bounded below by 0, so the lower the Z score, the better the representation.
Unfortunately, we're missing quite a bit of data, and that's no good. We can see that many counties are not represented--much of the map is empty. Let's predict their values, and map them.
A note: I could not add titles to my maps. I tried; I failed.
# y value for race predictions
y_race = round(merged_data['Z_Score_Race']).astype(int).values
# y value for sex predictions
y_sex = round(merged_data['Z_Score_Sex']).astype(int).values
# the same X will be used to predict both
X = merged_data[['Total_Pop', 'White_Male', 'White_Fem', 'Black_Male', 'Black_Fem',
'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male', 'Asian_Fem',
'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male',
'Hispanic_Fem']].values
We'll be using a RandomForestClassifier to predict missing data.
Something very important: I am going to do something that I likely should not.
Because so many counties are missing, I am not going to split my data into training and testing sets. Instead, the full dataset of known values will be fitted to the classifier. The classifier is not being cross-validated, so it's tough to say how accurate it actually is--and, indeed, whether it's worth using at all.
Prior to this, I did use a train/test split but could not get greater than ~60 percent accuracy. I can't test how accurate my current model is because it is fit to all known values.
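For what it's worth, the cross-validation this notebook skips is cheap to sketch: cross_val_score fits the classifier on k-1 folds and scores it on the held-out fold, k times. Here it runs on synthetic data standing in for X and y_race (the arrays and labels below are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the real feature matrix and z-score labels.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 5))
y_demo = (X_demo[:, 0] > 0.5).astype(int)  # a learnable toy label

# Score on held-out folds instead of the data the model was fit to.
clf = RandomForestClassifier(max_depth=50, random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print(scores.mean())  # mean held-out accuracy across the 5 folds
```

On the real data, the spread of the five fold scores would also hint at how stable that ~60 percent figure is.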
Anyway, let's dive in, keeping in mind that the predictions could very well be inaccurate.
# random forest classifier, fitting first with race z scores
rf_clf = RandomForestClassifier(max_depth=50, random_state=0)
rf_clf.fit(X, y_race)
Now, we need to know which counties are not present in our initial visualization.
To do that, we'll use isin--any FIPS present in both pop_data and merged_data will be dropped, and the resulting dataframe will be assigned to missing_data, our fancy new dataframe of unrepresented counties.
# constructing a dataframe of non-represented counties
missing_data = pop_data.drop(pop_data[pop_data['FIPS'].isin(merged_data['FIPS'])].index)
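The drop/isin combination above is an anti-join: keep the rows of one frame whose key is absent from the other. A minimal sketch on toy frames (the FIPS values are made up), including the equivalent boolean-mask form:

```python
import pandas as pd

# Keep rows of `left` whose FIPS does NOT appear in `right`.
left = pd.DataFrame({'FIPS': ['01001', '01003', '01005']})
right = pd.DataFrame({'FIPS': ['01003']})

# The drop/isin pattern used above...
missing = left.drop(left[left['FIPS'].isin(right['FIPS'])].index)
# ...and the equivalent, slightly more direct boolean-mask form.
missing_alt = left[~left['FIPS'].isin(right['FIPS'])]
```

Both forms give the same frame; the mask form just skips the round-trip through the index.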
Now, we need to predict z scores for both race and sex, and add them to our missing data dataframe.
To do this, we'll simply pass appropriate columns to rf_clf.predict(), and store the predicted z scores in an appropriate column.
# predicting z_score for race
missing_data['Z_Score_Race'] = rf_clf.predict(missing_data[['Total_Pop', 'White_Male', 'White_Fem', 'Black_Male', 'Black_Fem',
'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male', 'Asian_Fem',
'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male',
'Hispanic_Fem']].values)
rf_clf.fit(X, y_sex) # refitting our classifier to our sex z scores
# predicting z_score for sex
missing_data['Z_Score_Sex'] = rf_clf.predict(missing_data[['Total_Pop', 'White_Male', 'White_Fem', 'Black_Male', 'Black_Fem',
'Amer_In_AK_Nat_Male', 'Amer_In_AK_Nat_Fem', 'Asian_Male', 'Asian_Fem',
'Nat_HI_Pac_Isl_Male', 'Nat_HI_Pac_Isl_Fem', 'Hispanic_Male',
'Hispanic_Fem']].values)
# Now, let's add a predicted column to both merged_ and missing_data--false and true respectively.
# this will help us keep track of them once they're combined
merged_data['Predicted'] = 'false'
missing_data['Predicted'] = 'true'
missing_data.head()
Let's map our prediction maps. For this, we'll do exactly the same thing as we did above, using data from our missing_data dataframe in place of our merged_data dataframe.
One thing should be pretty obvious: some counties are darker than our scale. As it turns out, our classifier is not too hot. Because we're using population to predict z_score (as we lack adequate police data), populations with demographics not present in our merged_data dataframe are being predicted somewhat inaccurately.
Having said that, while the z_score may be wrong, we'll take the direction as correct. That is to say, a z score predicted to be positive will be assumed to be positive, and negative negative.
This is not quite as useful as a totally accurate z score prediction, but we can at least make an assumption about representation.
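Treating only the direction as trustworthy can be made explicit with numpy's sign function--a small sketch on made-up predicted scores:

```python
import numpy as np
import pandas as pd

# Keep only the direction of a predicted z score, as described above:
# +1 -> worse-than-average representation, -1 -> better, 0 -> about average.
pred = pd.Series([2.4, -0.7, 0.0, 5.1])
direction = np.sign(pred).astype(int)
```

Mapping the sign instead of the raw prediction would also keep the predicted counties inside the same color scale as the actual ones.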
# mapping predicted race z score the same way as the actual z score was mapped above
pred_z_score_race_fig = px.choropleth(missing_data, geojson=counties, locations='FIPS', color='Z_Score_Race',
color_continuous_scale='OrRd',
range_color=(-3, 3),
scope='usa',
labels={'Z_Score_Race':'Pred. Z Score (Race)'}
)
pred_z_score_race_fig.update_layout(title_text = 'Predicted Z Score for Race by County', margin={'r':0,'t':0,'l':0,'b':0})
pred_z_score_race_fig.show()
# mapping predicted sex z score the same way as the actual z score was mapped above
pred_z_score_sex_fig = px.choropleth(missing_data, geojson=counties, locations='FIPS', color='Z_Score_Sex',
color_continuous_scale='OrRd',
range_color=(-3, 3),
scope='usa',
labels={'Z_Score_Sex':'Pred. Z Score (Sex)'}
)
pred_z_score_sex_fig.update_layout(title_text = 'Predicted Z Score for Sex by County', margin={'r':0,'t':0,'l':0,'b':0})
pred_z_score_sex_fig.show()
Now that we have predicted results for both sex and race z scores, let's concatenate our actual results with our predicted results. This will allow us to visualize the two together easily.
All we'll do is use pandas' concat method to combine our two dataframes into one. We'll create an array of the dataframes we wish to use, limiting each to just the information we want: FIPS, our two z scores, majority population, and our predicted boolean.
# array of dataframes to be concatenated
concat = [missing_data[['FIPS', 'Z_Score_Race', 'Z_Score_Sex', 'Maj_Pop', 'Predicted']], merged_data[['FIPS', 'Z_Score_Race', 'Z_Score_Sex', 'Maj_Pop', 'Predicted']]]
concat_data = pd.concat(concat) # concatenated dataframe -> concat_data
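The concatenation itself is a one-liner; here's a toy sketch (made-up rows) of stacking the two frames. Passing ignore_index=True avoids carrying over duplicate index labels from the two sources, though the maps only key off FIPS:

```python
import pandas as pd

# Toy stand-ins for the actual and predicted frames, restricted to the
# shared columns, then stacked into one frame.
actual = pd.DataFrame({'FIPS': ['01001'], 'Z_Score_Race': [1], 'Predicted': ['false']})
predicted = pd.DataFrame({'FIPS': ['01003'], 'Z_Score_Race': [-1], 'Predicted': ['true']})

combined = pd.concat([predicted, actual], ignore_index=True)
```

Because both frames share the same columns, no NaN-filled columns appear in the result.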
For our final task, now that we have our dataframe, let's create a few more choropleths and a box plot.
When looking at these, it's important to remember that the z score range is inaccurate. Our predicted values skewed the range somewhat. Instead, try to think about how each county relates to its surrounding counties. Are all surrounding counties positive while it's negative? Vice versa? Does it match its surrounding counties?
From our z score maps, we can get a decent idea of where in the country population demographics are well reflected by their police forces.
Let's create, too, a choropleth using majority demographic. Does this line up at all with our z score maps? And does that make sense?
# mapping actual & predicted race z score as above
concat_z_score_race = px.choropleth(concat_data, geojson=counties, locations='FIPS', color='Z_Score_Race',
color_continuous_scale='OrRd',
range_color=(-3, 3),
scope='usa',
labels={'Z_Score_Race':'Z Score (Race): Pred. & Actual'}
)
concat_z_score_race.update_layout(title_text = 'Predicted & Actual Z Score for Race by County', margin={'r':0,'t':0,'l':0,'b':0})
concat_z_score_race.show()
# mapping actual & predicted sex z score as above
concat_z_score_sex = px.choropleth(concat_data, geojson=counties, locations='FIPS', color='Z_Score_Sex',
color_continuous_scale='OrRd',
range_color=(-3, 3),
scope='usa',
labels={'Z_Score_Sex':'Z Score (Sex): Pred. & Actual'}
)
concat_z_score_sex.update_layout(title_text = 'Predicted & Actual Z Score for Sex by County', margin={'r':0,'t':0,'l':0,'b':0})
concat_z_score_sex.show()
# mapping majority demographic
maj_pop = px.choropleth(concat_data, geojson=counties, locations='FIPS', color='Maj_Pop',
scope='usa',
labels={'Maj_Pop' : 'Majority Demographic Population'}
)
maj_pop.update_layout(title_text = 'Majority Demographic by County', margin={'r':0,'t':0,'l':0,'b':0})
maj_pop.show()