python-visualization · Conengmo · Nov 17, 2022 · Jan 26, 2022 · Oct 13, 2022 · Oct 13, 2022
diff --git a/examples/Jenks.ipynb b/examples/Jenks.ipynb
@@ -0,0 +1,191 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1bb0c3ba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install jenkspy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aa1bb64d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import folium\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import json\n",
+    "import requests\n",
+    "\n",
+    "import ssl\n",
+    "ssl._create_default_https_context = ssl._create_unverified_context"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c7fd1292",
+   "metadata": {},
+   "source": [
+    "# Integrating Jenks Natural Breaks Optimization\n",
+    "\n",
+    "Choropleths provide an easy way to see how data is distributed across geography. By default, folium uses the breaks created by numpy.histogram (np.histogram), which divides the range of the data into equal-width bins.\n",
+    "\n",
+    "This works well enough for evenly distributed data, but for unevenly distributed data, equal-width bins can obscure more than they show. To demonstrate this, I have created maps showing the labor force of each US state.\n",
+    "\n",
+    "The state figures were aggregated from county-level data. Since our geographic data does not have areas representing Puerto Rico or the United States as a whole, I removed those entries while keeping Washington, D.C. in our data set. Already, looking at the first five states alphabetically, we can see that Alaska (AK) has a labor force roughly 2% the size of California's (CA)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a199cc25",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "url = (\n",
+    "    \"https://raw.githubusercontent.com/python-visualization/folium/main/examples/data\"\n",
+    ")\n",
+    "us_states = f\"{url}/us-states.json\"\n",
+    "\n",
+    "geo_json_data = json.loads(requests.get(us_states).text)\n",
+    "\n",
+    "county_data = pd.read_csv(f\"{url}/us_county_data.csv\")\n",
+    "clf = 'Civilian_labor_force_2011'\n",
+    "labor_force = county_data[['State', clf]][\n",
+    "    (county_data[clf].str.strip() != '') & (~county_data['State'].isin(['PR', 'US']))\n",
+    "].copy()\n",
+    "labor_force[clf] = labor_force[clf].astype(int)\n",
+    "labor_force = labor_force.groupby('State').sum().reset_index()\n",
+    "\n",
+    "labor_force.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4b570c1f",
+   "metadata": {},
+   "source": [
+    "Using the default breaks, most states fall into the lowest bin. This is similar to what we might expect if state labor forces follow a power law or Zipf distribution."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b06b85d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "m = folium.Map(location=[38, -96], zoom_start=4)\n",
+    "\n",
+    "folium.Choropleth(\n",
+    "    geo_data=geo_json_data,\n",
+    "    data=labor_force,\n",
+    "    columns=['State', clf],\n",
+    "    key_on='id',\n",
+    "    fill_color='RdBu',\n",
+    ").add_to(m)\n",
+    "\n",
+    "m"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a36162e",
+   "metadata": {},
+   "source": [
+    "However, with Jenks Natural Breaks Optimization, we see more granular detail at the bottom of the distribution, where most of our states are located. The upper western states (Idaho, Montana, Wyoming and the Dakotas) are distinguished from their Midwestern and Mountain West neighbors to the south. Gradations in the Deep South between Mississippi and Alabama provide more visual information than in the previous map. Overall, this is a richer representation of the data distribution.\n",
+    "\n",
+    "One notable drawback of this representation is the legend. Because the lower bins are narrow, their numerical labels overlap and become unreadable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ac8ccec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "m = folium.Map(location=[38, -96], zoom_start=4)\n",
+    "\n",
+    "folium.Choropleth(\n",
+    "    geo_data=geo_json_data,\n",
+    "    data=labor_force,\n",
+    "    columns=['State', clf],\n",
+    "    key_on='id',\n",
+    "    use_jenks=True,\n",
+    "    fill_color='RdBu',\n",
+    ").add_to(m)\n",
+    "\n",
+    "m"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b40d80a6",
+   "metadata": {},
+   "source": [
+    "Naturally, the use of Jenks Natural Breaks Optimization is incompatible with explicitly defined bins. If a user attempts both to use Jenks breaks and to explicitly define their bins, a ValueError is raised."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "051d055a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "try:\n",
+    "    m = folium.Map(location=[38, -96], zoom_start=4)\n",
+    "\n",
+    "    folium.Choropleth(\n",
+    "        geo_data=geo_json_data,\n",
+    "        data=labor_force,\n",
+    "        columns=['State', clf],\n",
+    "        key_on='id',\n",
+    "        use_jenks=True,\n",
+    "        fill_color='RdBu',\n",
+    "        bins=[1_000_000, 5_000_000, 10_000_000]\n",
+    "    ).add_to(m)\n",
+    "\n",
+    "    m\n",
+    "except ValueError as value_error:\n",
+    "    print(value_error)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d17f7221",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
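
The notebook's point about equal-width bins can be illustrated without folium. The following is a minimal pure-Python sketch, assuming a small made-up list of skewed values (not the notebook's labor-force data) and two hypothetical helpers, `equal_width_edges` and `bin_counts`, that split the data range into equal-width bins in the way `np.histogram` does when `bins` is an integer.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Return n_bins + 1 edges that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Use hi itself as the last edge to avoid floating-point drift.
    return [lo + i * width for i in range(n_bins)] + [hi]


def bin_counts(values, edges):
    """Count values per bin; the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


# Hypothetical, heavily skewed sample (illustration only).
values = [1, 2, 2, 3, 3, 4, 5, 6, 40, 100]
edges = equal_width_edges(values, 4)
counts = bin_counts(values, edges)

print(edges)   # [1.0, 25.75, 50.5, 75.25, 100]
print(counts)  # [8, 1, 0, 1]
```

Eight of the ten values land in the first bin, which is the same effect the default choropleth shows when most states share the lowest color.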
diff --git a/folium/features.py b/folium/features.py
@@ -1174,6 +1174,7 @@ def __init__(self, geo_data, data=None, columns=None, key_on=None,  # noqa
                  line_weight=1, line_opacity=1, name=None, legend_name='',
                  overlay=True, control=True, show=True,
                  topojson=None, smooth_factor=None, highlight=None,
+                 use_jenks=False,
                  **kwargs):
         super(Choropleth, self).__init__(name=name, overlay=overlay,
                                          control=control, show=show)
@@ -1212,7 +1213,14 @@ def __init__(self, geo_data, data=None, columns=None, key_on=None,  # noqa
         if color_data is not None and key_on is not None:
             real_values = np.array(list(color_data.values()))
             real_values = real_values[~np.isnan(real_values)]
-            _, bin_edges = np.histogram(real_values, bins=bins)
+            if use_jenks:
+                from jenkspy import jenks_breaks
+
+                if not isinstance(bins, int):
+                    raise ValueError(f'bins value must be an integer. Invalid value "{bins}" received.')
+                bin_edges = np.array(jenks_breaks(real_values, bins))
+            else:
+                _, bin_edges = np.histogram(real_values, bins=bins)
 
             bins_min, bins_max = min(bin_edges), max(bin_edges)
             if np.any((real_values < bins_min) | (real_values > bins_max)):
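
For reviewers unfamiliar with what the `use_jenks` branch asks `jenkspy` to compute, here is a rough pure-Python sketch of natural-breaks classification: sort the values, then use dynamic programming to choose class boundaries that minimize the total within-class sum of squared deviations. The function name `jenks_breaks_sketch` is hypothetical, and the break convention (each inner break is the largest value of the class below it) is an assumption of this sketch; it is not `jenkspy`'s implementation and may differ from it in details.

```python
def jenks_breaks_sketch(values, n_classes):
    """Return n_classes + 1 breaks minimizing within-class squared deviation."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(data):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def ssd(i, j):
        """Sum of squared deviations from the mean for data[i:j]."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best total cost for splitting data[:j] into k classes.
    # split[k][j]: start index of the last class in that best split.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i

    # Walk back through the recorded split points, from the top class down.
    breaks = [data[-1]]
    j = n
    for k in range(n_classes, 1, -1):
        j = split[k][j]
        breaks.append(data[j - 1])  # largest value of the class below
    breaks.append(data[0])
    return breaks[::-1]


# Same hypothetical skewed sample as an equal-width comparison would use.
print(jenks_breaks_sketch([1, 2, 2, 3, 3, 4, 5, 6, 40, 100], 4))
# [1, 3, 6, 40, 100]
```

This sketch runs in O(k·n²) time, which is fine for a few dozen regions but is only meant to show the idea; the PR itself delegates the computation to `jenkspy.jenks_breaks`.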

diff --git a/requirements.txt b/requirements.txt
@@ -1,4 +1,4 @@
 branca>=0.3.0
 jinja2>=2.9
 numpy
-requests
+requests