Add landscape analysis #7
Draft: jarrodmillman wants to merge 2 commits into statistical-python:main from jarrodmillman:landscape
@@ -0,0 +1,76 @@
---
site:
  hide_toc: true
  hide_footer_links: true
---

# 2024 Landscape Analysis

Python is widely adopted in data science, and its use for statistics is expanding rapidly, particularly in education and applied research.
The statistical ecosystem in Python is currently anchored by four major libraries:

- [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), which provides a comprehensive suite of probability distributions, summary statistics, and basic statistical tests;
- [statsmodels](https://www.statsmodels.org/), which offers tools for econometrics, classical statistics, and statistical modeling, including linear and generalized linear models, time series analysis, and hypothesis testing;
- [scikit-learn](https://scikit-learn.org/), which is best known for machine learning but also supports some statistical modeling, offering a consistent API for predictive analytics and data preprocessing; and
- [seaborn](https://seaborn.pydata.org/), a library built on top of matplotlib that excels at creating informative and attractive statistical graphics, making it easier to visualize distributions, relationships, and trends in data.

These core libraries are generally well-tested, reliable, and uphold high software engineering standards, making them trusted foundations for research and application.
Libraries like scikit-learn are especially valued for their clean, consistent interfaces and their integration with the broader Python data stack, which streamlines workflows and enhances usability for both new and experienced users.

While there are many smaller, specialized packages available, the ecosystem remains dominated by these large, general-purpose libraries. This concentration of resources ensures stability and quality but can also limit the visibility and adoption of innovative or niche statistical tools.
As Python’s role in statistics continues to grow, fostering a more diverse and accessible ecosystem will be key to meeting the evolving needs of educators, researchers, and practitioners.

# Relationship to Other Languages

R remains the gold standard for statistics, with better branding, a more cohesive ecosystem, and more teaching resources.

Review comment: And they have contributions from the methods developers.

R's [tidyverse](https://www.tidyverse.org/) and [RStudio](https://posit.co/products/open-source/rstudio/) provide a smoother user experience for statistics, and CRAN offers a vast repository of statistical packages.

:::{table} Python vs. R for Statistics
:label: table
:align: center

| Aspect | Python (Scientific Python) | R (CRAN, tidyverse) |
| ------------------- | -------------------------- | ------------------- |
| Core Libraries | [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), [statsmodels](https://www.statsmodels.org/), [scikit-learn](https://scikit-learn.org/) | [base R](https://www.r-project.org/), [tidyverse](https://www.tidyverse.org/), many CRAN packages |
| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio |
| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly |
| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming |
| Package Development | High barriers, less modularity | Easy, many small packages, dev tools |
| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio |
| Branding | Data science/machine learning focus | Statistics-focused |

:::

**Interoperability**: While some users switch between Python and R in their workflows, true interoperability is limited.
Most projects use one language at a time, often leveraging R for data manipulation and Python for modeling or vice versa.

**Other Platforms**: Tools like GraphPad Prism remain popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates all applied domains.

# Weaknesses and Needs

Despite Python's strengths, several challenges remain.

- **Fragmentation**: The ecosystem is fragmented, with major libraries (e.g., statsmodels vs. scikit-learn) adopting incompatible APIs and workflows, leading to confusion for users and students.
- **User Experience**: There is no central landing place or unified entry point for statistics in Python, unlike R's [tidyverse](https://www.tidyverse.org/) or RStudio, making it harder for newcomers to get started.
- **Interoperability**: Data structures (such as those from [pandas](https://pandas.pydata.org/) and [NumPy](https://numpy.org/)) do not always work seamlessly across libraries, requiring conversions and leading to unpredictable function outputs compared to R's tidyverse pipelines.
- **Teaching Resources**: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community.
- **Contributor Barriers**: Contributing to core libraries can be difficult due to high standards and lack of modularity.
  Small, specialized packages exist but are less visible and less widely used than in R.
- **Statistical Methods Coverage**: Some advanced or niche statistical methods are missing or hard to find, especially compared to R's vast [CRAN](https://cran.r-project.org/) repository.
- **Community and Culture**: The Python statistics community is less cohesive and connected than R's, which benefits from a strong identity and established events.

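One concrete instance of the interoperability point above is that NumPy and pandas use different defaults for the standard deviation. The sketch below assumes both libraries are installed. The data values are made up for illustration. It is one example of the kind of inconsistency described, not a complete account of it.

```python
import numpy as np
import pandas as pd

# Illustrative data only (hypothetical, not taken from any study).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

arr = np.array(values)
ser = pd.Series(values)

# NumPy defaults to the population standard deviation (ddof=0).
np_std = np.std(arr)

# pandas defaults to the sample standard deviation (ddof=1).
pd_std = ser.std()

# Passing ddof explicitly makes the two libraries agree.
np_sample_std = np.std(arr, ddof=1)
pd_population_std = ser.std(ddof=0)

print(f"np.std(arr)          = {np_std:.4f}")
print(f"ser.std()            = {pd_std:.4f}")
print(f"np.std(arr, ddof=1)  = {np_sample_std:.4f}")
print(f"ser.std(ddof=0)      = {pd_population_std:.4f}")
```

The same data yields two different answers unless `ddof` is set explicitly, which is easy to miss when moving between an array and a `Series` in the same analysis.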
# Conclusion

Python's statistics ecosystem is powerful but fragmented, with significant opportunities for improvement in usability, interoperability, teaching resources, and community cohesion.
While R remains the default for statistics, Python is gaining ground, especially as data science and machine learning continue to grow in influence.
Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain.
In particular, Python needs:

- A unified, user-friendly interface for statistics.
- Improved interoperability between core data structures and libraries.
- More accessible teaching resources and case studies focused on statistics.
- Lower barriers for contributors and greater visibility for specialized statistical packages.
- Stronger community identity and central organization for statistics in Python.

The Statistical Python project seeks to address these needs by fostering collaboration, sharing best practices, and building a sustainable, inclusive community.
As a domain stack within the [Scientific Python project](https://scientific-python.org/), and with support from the NSF POSE Phase I grant, we are committed to making Python a premier platform for statistical computing, education, and research.

Review comment: It will also involve increasing developer participation in the core packages.