Jenks #1634

Merged
merged 8 commits into from
Nov 17, 2022
191 changes: 191 additions & 0 deletions examples/Jenks.ipynb
@@ -0,0 +1,191 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "1bb0c3ba",
"metadata": {},
"outputs": [],
"source": [
"pip install jenkspy"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa1bb64d",
"metadata": {},
"outputs": [],
"source": [
"import folium\n",
"import numpy as np\n",
"import pandas as pd\n",
"import json\n",
"import requests\n",
"\n",
"import ssl\n",
"ssl._create_default_https_context = ssl._create_unverified_context"
]
},
{
"cell_type": "markdown",
"id": "c7fd1292",
"metadata": {},
"source": [
"# Integrating Jenks Natural Breaks Optimization\n",
"\n",
"Choropleths provide an easy way to see how data is distributed across geography. By default, folium uses the bin edges computed by numpy.histogram (np.histogram), which divide the range between the minimum and maximum values into evenly spaced, equal-width bins.\n",
"\n",
"This works well enough for evenly distributed data, but for unevenly distributed data, equal-width bins can obscure more than they show. To demonstrate this, I have created maps showing the labor force of each US state.\n",
"\n",
"The data was aggregated from county-level data. Since our geographic data does not have areas representing Puerto Rico or the United States as a whole, I removed those entries while keeping Washington, D.C. in our data set. Already, looking at the first five states alphabetically, we can see that Alaska (AK) has a labor force roughly 2% the size of California's (CA)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a199cc25",
"metadata": {},
"outputs": [],
"source": [
"url = (\n",
"    \"https://raw.githubusercontent.com/python-visualization/folium/main/examples/data\"\n",
")\n",
"us_states = f\"{url}/us-states.json\"\n",
"\n",
"geo_json_data = json.loads(requests.get(us_states).text)\n",
"\n",
"county_data = pd.read_csv(f\"{url}/us_county_data.csv\")\n",
"clf = 'Civilian_labor_force_2011'\n",
"labor_force = county_data[['State', clf]][\n",
"    (county_data[clf].str.strip()!='') & (~county_data['State'].isin(['PR', 'US']))\n",
"]\n",
"labor_force[clf] = labor_force[clf].astype(int)\n",
"labor_force = labor_force.groupby('State').sum().reset_index()\n",
"\n",
"labor_force.head()"
]
},
{
"cell_type": "markdown",
"id": "4b570c1f",
"metadata": {},
"source": [
"Using the default breaks, most states fall into the lowest bin. This is what we might expect if the labor force sizes of US states follow a power law or a Zipf distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b06b85d",
"metadata": {},
"outputs": [],
"source": [
"m = folium.Map(location=[38, -96], zoom_start=4)\n",
"\n",
"folium.Choropleth(\n",
"    geo_data=geo_json_data,\n",
"    data=labor_force,\n",
"    columns=['State', clf],\n",
"    key_on='id',\n",
"    fill_color='RdBu',\n",
").add_to(m)\n",
"\n",
"m"
]
},
{
"cell_type": "markdown",
"id": "4a36162e",
"metadata": {},
"source": [
"However, when using Jenks Natural Breaks Optimization, we see more granular detail at the bottom of the distribution, where most of our states are located. The upper western states (Idaho, Montana, Wyoming and the Dakotas) are distinguished from their Midwestern and Mountain West neighbors to the south. Gradations in the Deep South between Mississippi and Alabama provide more visual information than in the previous map. Overall, this is a richer representation of the data distribution.\n",
"\n",
"One notable drawback of this representation is the legend. Because the lower bins are smaller, the numerical values overlap, making them unreadable."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ac8ccec",
"metadata": {},
"outputs": [],
"source": [
"m = folium.Map(location=[38, -96], zoom_start=4)\n",
"\n",
"folium.Choropleth(\n",
"    geo_data=geo_json_data,\n",
"    data=labor_force,\n",
"    columns=['State', clf],\n",
"    key_on='id',\n",
"    use_jenks=True,\n",
"    fill_color='RdBu',\n",
").add_to(m)\n",
"\n",
"m"
]
},
{
"cell_type": "markdown",
"id": "b40d80a6",
"metadata": {},
"source": [
"Naturally, the use of Jenks Natural Breaks Optimization is incompatible with explicitly defined bins. If a user attempts to both use Jenks breaks and explicitly define their bins, a ValueError is raised."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "051d055a",
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    m = folium.Map(location=[38, -96], zoom_start=4)\n",
"\n",
"    folium.Choropleth(\n",
"        geo_data=geo_json_data,\n",
"        data=labor_force,\n",
"        columns=['State', clf],\n",
"        key_on='id',\n",
"        use_jenks=True,\n",
"        fill_color='RdBu',\n",
"        bins=[1_000_000, 5_000_000, 10_000_000]\n",
"    ).add_to(m)\n",
"\n",
"    m\n",
"except ValueError as value_error:\n",
"    print(value_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d17f7221",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
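The contrast the notebook describes between equal-width bins and natural breaks can be sketched on a tiny array. This is a minimal illustration, not folium's or jenkspy's code: the values are made up, and `natural_breaks` is a hypothetical pure-Python stand-in that minimizes the within-class sum of squared deviations by dynamic programming, which is the objective Jenks optimization targets.

```python
import numpy as np


def natural_breaks(values, n_classes):
    """Hypothetical stand-in for a Jenks-style break finder.

    Returns n_classes + 1 values: the minimum, then the upper bound of
    each class, chosen to minimise within-class squared deviation.
    """
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for x in data:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def ssd(i, j):
        """Sum of squared deviations from the mean of data[i:j]."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: best total deviation when data[:j] is split into k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i

    # Walk the split table backwards to recover each class's upper bound.
    breaks = [data[-1]]
    j = n
    for k in range(n_classes, 1, -1):
        j = split[k][j]
        breaks.append(data[j - 1])
    breaks.append(data[0])
    return breaks[::-1]


# Made-up, strongly clustered values (not the notebook's labor-force data).
values = [1, 2, 3, 10, 11, 12, 100, 101, 102]

counts, equal_width_edges = np.histogram(values, bins=3)
jenks_style_edges = natural_breaks(values, 3)

print("equal-width edges:", equal_width_edges)
print("values per equal-width bin:", counts)
print("natural-break edges:", jenks_style_edges)
```

With three bins, `np.histogram` puts six of the nine values into its first bin and none into the second, while the natural breaks separate the three clusters.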
10 changes: 9 additions & 1 deletion folium/features.py
@@ -1174,6 +1174,7 @@ def __init__(self, geo_data, data=None, columns=None, key_on=None,  # noqa
                  line_weight=1, line_opacity=1, name=None, legend_name='',
                  overlay=True, control=True, show=True,
                  topojson=None, smooth_factor=None, highlight=None,
+                 use_jenks=False,
                  **kwargs):
         super(Choropleth, self).__init__(name=name, overlay=overlay,
                                          control=control, show=show)
@@ -1212,7 +1213,14 @@ def __init__(self, geo_data, data=None, columns=None, key_on=None,  # noqa
         if color_data is not None and key_on is not None:
             real_values = np.array(list(color_data.values()))
             real_values = real_values[~np.isnan(real_values)]
-            _, bin_edges = np.histogram(real_values, bins=bins)
+            if use_jenks:
+                from jenkspy import jenks_breaks
+
+                if not isinstance(bins, int):
+                    raise ValueError(f'bins value must be an integer. Invalid value "{bins}" received.')
+                bin_edges = np.array(jenks_breaks(real_values, bins))
+            else:
+                _, bin_edges = np.histogram(real_values, bins=bins)
 
             bins_min, bins_max = min(bin_edges), max(bin_edges)
             if np.any((real_values < bins_min) | (real_values > bins_max)):
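The branch added in this hunk can be read as a small standalone function. The sketch below mirrors that logic under stated assumptions: `pick_bin_edges` and its `break_finder` parameter are hypothetical names introduced here so the example runs without jenkspy installed. Only `np.histogram` and the positional `jenks_breaks(values, bins)` call shape are taken from the diff.

```python
import numpy as np


def pick_bin_edges(real_values, bins=6, use_jenks=False, break_finder=None):
    """Hypothetical helper mirroring the bin-edge selection in the hunk above."""
    if use_jenks:
        # Jenks needs a class count, so explicit edge lists are rejected.
        if not isinstance(bins, int):
            raise ValueError(
                f'bins value must be an integer. Invalid value "{bins}" received.'
            )
        if break_finder is None:
            # Assumption: jenkspy is installed when no stand-in is supplied.
            from jenkspy import jenks_breaks as break_finder
        return np.array(break_finder(real_values, bins))

    # Default path: equal-width edges, or the explicit edges the caller passed.
    _, bin_edges = np.histogram(real_values, bins=bins)
    return bin_edges


sample = [1.0, 2.0, 9.0]

# Explicit edges are fine on the default path.
print(pick_bin_edges(sample, bins=[0, 5, 10]))

# Combining explicit edges with use_jenks raises ValueError.
try:
    pick_bin_edges(sample, bins=[0, 5, 10], use_jenks=True)
except ValueError as value_error:
    print(value_error)
```

Passing a callable as `break_finder` is only a convenience of this sketch for trying the control flow with a stub; the merged code imports `jenks_breaks` directly.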
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
 branca>=0.3.0
 jinja2>=2.9
 numpy
-requests
+requests