Copyedit data-into-choropleth-maps-with-python-and-folium #667
begin copyedits
copyedit to 282
Copyedit to 685
copyedit to end
Div boxes, headings, structural changes
Update DataFrame spelling
Very final edits
|
||
[Choropleth Maps](https://en.wikipedia.org/wiki/Choropleth_map) have become very familiar to us. They are commonly used to visualize information such as [Covid-19 infection/death rates](https://www.nytimes.com/interactive/2021/us/covid-cases.html#maps), [education spending per pupil](https://www.reddit.com/r/MapPorn/comments/bc9jwu/us_education_spending_map/), and other similar data. | ||
[Choropleth maps](https://en.wikipedia.org/wiki/Choropleth_map) are an excellent tool for discovering and demonstrating patterns in data that might be otherwise hard to discern. My grandfather, who worked at the [US Census Bureau](https://en.wikipedia.org/wiki/United_States_Census_Bureau), loved to pore over the tables of [The Statistical Abstract of the United States](https://www.census.gov/library/publications/time-series/statistical_abstracts.html). But tables are hard for people to understand: visualizations (like maps) are more helpful, as Alberto Cairo argues in _How Charts Lie_.[^1] |
You'll see that I moved up a few paragraphs from your Conclusion into the Introduction, because I felt that it's better to give readers a sense of what choropleth maps are, and what they can achieve, at the start of the lesson.
Your references to other materials also work well here at the beginning.
|
||
## Mapping lessons on *Programming Historian* | ||
### Lesson Goals |
I also moved the Lesson Goals section before the review of other Programming Historian lessons.
|
||
Using Pandas, these databases will be turned into **dataframes** (DF). For the computer to match records in the different DFs, there needs to be a common variable. Since the maps will be plotting county-level data, the common variable will be the **Federal Information Processing Standard** (FIPS) number. Many databases with county-level data include the FIPS number, but because the *Fatal Force* database does not, this lesson will walk through how to add it. | ||
The lesson uses data from the *[Washington Post](https://en.wikipedia.org/wiki/The_Washington_Post)*'s [Fatal Force database](https://github.com/washingtonpost/data-police-shootings), which is available to [download from _Programming Historian_'s GitHub repository](https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/data-into-choropleth-maps-with-python-and-folium/fatal-police-shootings-data.csv). The *Post* started the database in 2015, seeking to document every time a civilian encounter with a police officer ends in the death of the civilian. This data is neither reported nor collected systematically by any other body, so the *Post*'s work fills an important lacuna in understanding how police in the US interact with the people around them. The *Post* provides [documentation](https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/data-into-choropleth-maps-with-python-and-folium/fatal-force-database-README.md) about how this data is collected and recorded. |
I've regrouped the information that was scattered across the `## Getting Started` section into each subsection: the Fatal Force dataset, the Counties dataset, and Matching the two datasets. Please let me know if you think it reads well!
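To make the matching step concrete, here is a minimal sketch of why FIPS codes are best handled as zero-padded strings. County FIPS codes have five digits (two for the state, three for the county), so reading them as integers silently drops leading zeros and breaks the match. The helper name `make_fips` and the sample counts are hypothetical; this is not the lesson's own method for adding FIPS numbers to the Fatal Force data.

```python
def make_fips(state_code, county_code):
    """Build a 5-character county FIPS string from state and county codes."""
    return f"{int(state_code):02d}{int(county_code):03d}"


# Hypothetical lookup tables keyed by FIPS, standing in for the two DataFrames.
county_names = {"06037": "Los Angeles County, CA", "17031": "Cook County, IL"}
killings = {make_fips(6, 37): 3, make_fips("17", "031"): 2}  # made-up counts

# Match records from both tables on the shared FIPS key.
matched = {
    county_names[fips]: count
    for fips, count in killings.items()
    if fips in county_names
}
print(matched)
```

Keeping the key as a string in both tables means `"06037"` and `6037` can never be confused with each other.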
As mentioned above, the main software this lesson uses is [Folium](https://python-visualization.github.io/folium/), a Python library that automates creating Leaflet maps. | ||
|
||
Folium makes it easy to create a wide variety of maps. For basic maps, the user doesn't need to work with HTML, CSS, or JavaScript: everything can be done within the Python ecosystem. Users can specify a variety of different basemaps (terrain, street maps, different colors) and display data with different markers, such as pins or circles. These can use different colors or sizes based on the data. | ||
The main software you'll use in this lesson is [Folium](https://python-visualization.github.io/folium/), a Python library that makes it easy to create a wide variety of Leaflet maps. You won't need to work with HTML, CSS, or JavaScript: everything can be done within the Python ecosystem. Folium allows you to specify a variety of different basemaps (terrain, street maps, colors) and display data using various visual markers, such as pins or circles. The color and size of these markers can then be customized based on your data. Folium's advanced functions include creating cluster maps and heat maps. |
Do you have any useful links to explain cluster maps and heat maps?
* Line 2 (`geo_data =`) identifies the GeoJSON source of the geographic geometries to be plotted. This is the `counties` DataFrame downloaded from the US Census bureau. | ||
* Line 3 (`data =`) identifies the source of the data to be analyzed and plotted. This is the `map_df` DataFrame (counting the number of kills above 0 in each county), pulled from the Fatal Force DataFrame `ff_df`. | ||
* Line 4 (`key_on =`) identifies the field in the GeoJSON data that will be bound (or linked) to the data from the `map_df`. As noted earlier, Folium needs a common column between both DataFrames: here, the `FIPS` column. | ||
* Line 5 is required because the data source is a DataFrame. The `columns =` parameter tells Folium which columns to use. |
I changed `data` to `column`, as this seemed correct in the context.
One note: I noticed that some parameters use a space before the `=` sign and others do not. Is this on purpose, or could we select only one format for all?
* The first list element is the variable that should be matched with the `key_on=` value. | ||
* The second element is the variable to be used to draw the choropleth map's colors. | ||
* Line 6 (`bins =`) specifies how many [bins](https://en.wikipedia.org/wiki/Data_binning) to sort the data values into. (The maximum number is limited by the number of colors in the color palette selected, often 9.) | ||
* Line 7 (`fill_color=`) specifies the color palette to use. Folium's documentation identifies the following built-in palettes: ‘BuGn’, ‘BuPu’, ‘GnBu’, ‘OrRd’, ‘PuBu’, ‘PuBuGn’, ‘PuRd’, ‘RdPu’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, and ‘YlOrRd’. |
Could you supply a link to this specific documentation?
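For reference, the parameters walked through in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the lesson's exact code: the `key_on` path `feature.properties.FIPS`, the value column name `count`, the file name `counties.geojson`, and both helper function names are hypothetical. Note that Folium's keyword for the column list is `columns`.

```python
def choropleth_kwargs(geo_data, data, bins=9, fill_color="OrRd"):
    """Collect the keyword arguments described in the parameter list."""
    return {
        "geo_data": geo_data,                 # GeoJSON geometries (the counties)
        "data": data,                         # values to plot (e.g. map_df)
        "key_on": "feature.properties.FIPS",  # assumed location of FIPS in the GeoJSON
        "columns": ["FIPS", "count"],         # [key column, value column]; "count" is assumed
        "bins": bins,                         # number of bins for the color scale
        "fill_color": fill_color,             # one of the built-in palettes
    }


def add_choropleth(base_map, **kwargs):
    """Create the choropleth layer and attach it to an existing folium.Map."""
    import folium  # imported lazily so the helper above runs without Folium

    layer = folium.Choropleth(**kwargs)
    layer.add_to(base_map)
    return layer


# Placeholder inputs: a hypothetical file path and no data yet.
kwargs = choropleth_kwargs(geo_data="counties.geojson", data=None)
print(kwargs["columns"])
```

With real inputs, the layer would be added with something like `add_choropleth(base_map, **choropleth_kwargs(counties, map_df))`.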
1. 1,596 counties (out of the 3,142 in the USA) have reported at least one police killing. | ||
1. At least 75% of these counties have had 5 or fewer killings. | ||
Thus, there must be a few counties in the top quartile that have had many more killings. | ||
1. 1,596 counties (out of the 3,142) have reported at least one police killing. |
Funnily, these numbers are different from those at Line 366 (green side). Is this normal?
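The reasoning in the passage above (a 75th percentile of 5 implies a few very large values in the top quartile) can be checked with a small sketch. The list of counts is invented for illustration and is not the Fatal Force data; only the shape of the distribution is meant to match.

```python
import statistics

# Hypothetical per-county counts: mostly small values plus one large outlier.
counts = [1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 302]

# Quartile cut points, using linear interpolation between data points.
q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
mean = statistics.mean(counts)

print(f"median={median}, 75th percentile={q3}, mean={mean}, max={max(counts)}")
```

When the mean sits far above the 75th percentile, as it does here, a handful of extreme values are dominating the distribution.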
|
||
For this data, since most counties have fewer than 5 police killings, most counties will have a log value between 0 and 1. The biggest value (302) has a log value between 2 and 3 (that is, between $$10^2$$ and $$10^3$$). | ||
Since most counties have under 5 killings, their $$\log_{10}$$ value would be between 0 and 1. The highest values (up to 302) have a $$\log_{10}$$ value between 2 and 3 (that is, the original values are between $$10^2$$ and $$10^3$$). |
I saw 342 earlier. Which is correct?
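A quick way to check the $$\log_{10}$$ ranges quoted above is to compute them directly. This sketch uses only the standard library; the sample values are chosen for illustration, with 302 taken from the passage.

```python
import math

# Small counts fall between 0 and 1; the largest value falls between 2 and 3.
for killings in (1, 4, 10, 302):
    print(killings, round(math.log10(killings), 2))
```

Either candidate for the maximum (302 or 342) lands between 2 and 3, so the description of the scale holds whichever figure is correct. Counties with zero killings have no log value, since `math.log10(0)` raises a `ValueError`.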
@@ -558,63 +531,39 @@ baseMap | |||
|
|||
{% include figure.html filename="en-or-data-into-choropleth-maps-with-python-and-folium-04.png" alt="A choropleth map of the US showing how the Fisher-Jenks algorithm creates different bins of data" caption="Figure 4. The map colorized by the Fisher-Jenks algorithm." %} | |||
|
|||
This is an improvement: the map shows a better range of contrasts. You can see that there are a fair number of counties outside the Southwest where police have killed several people (Florida, the Northwest, etc.) | |||
This is already an improvement: the map shows a better range of contrasts. A higher number of counties outside the Southwest where police have killed several people (Florida, the Northwest, etc.) are now visible. However, the scale is almost impossible to read! The algorithm correctly found natural breaks – most of the values are less than 76 – but at the lower end of the scale, the numbers are illegible. |
I didn't completely understand where the number 76 comes from: on the scale, I see 46 and 70.
|
||
## Get county-level population statistics | ||
Choropleth maps are often more accurate when they visualize ratios rather than raw values: for example, the number of cases per 100,000 population. Converting the data from values to ratios is called 'normalizing' data. |
You use the term 'ratio' here, but switch to 'rate' below. Are they interchangeable?
|
||
# Add an Information Box to the Map | ||
Normalizing the data dramatically changes the appearance of the map. The initial visualization suggested that the problem of police killing civilians was limited to a few counties, generally those with large populations. But when the data is normalized, police killings of civilians seem far more widespread. The counties with the highest rates of killings are those with lower population numbers. Trying to illustrate this issue with charts or tables would not be nearly as effective. |
Could you tell me which figure number this 'initial visualization' refers to?
|
||
To use this method, you will need to look "under-the-hood" of Folium. When Folium creates a choropleth map, it generates data about each geographic region. To access it, you need to save the choropleth data to a variable. | ||
When Folium creates a choropleth map, it generates underlying GeoJSON data about each geographic region. You can see this data by saving it to a variable: |
I'm a bit confused by the `cp` variable here: is it the same as the one you used for the branca element above? If not, do you feel that it might be confusing to name two variables the same?
Hello @adamlporter and @nabsiddiqui, I've now prepared the copyedits for this lesson. I'd be grateful if you could review the adjustments and confirm that you are happy for me to merge these. You can see the details of my edits under the files changed tab! I'd like to bring your attention to the comments I have attached to specific lines. You can respond to any of my suggestions via the comments below, or click Resolve conversation if you're happy with them. If you want to make any changes at this stage, please let me know in the comments of this Pull Request, so we can work on them together. |
I've prepared my copyedits for en-or-data-into-choropleth-maps-with-python-and-folium.