Releases: apache/beam
Beam 2.59.0 release
We are happy to present the new 2.59.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.59.0, check out the detailed release notes.
Highlights
- Added support for setting a configurable timeout when loading a model and performing inference in the RunInference transform using with_exception_handling (#32137)
- Initial experimental support for using Prism with the Java and Python SDKs
- Prism is presently targeting local testing usage, or other small scale execution.
- For Java, use 'PrismRunner', or 'TestPrismRunner' as an argument to the --runner flag.
- For Python, use 'PrismRunner' as an argument to the --runner flag.
- Go already uses Prism as the default local runner.
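As a minimal sketch of the Python case, the pipeline below selects Prism through the --runner flag. The helper name prism_args and the sample data are illustrative assumptions, not part of the release. Calling run() assumes Apache Beam 2.59.0 or newer is installed.

```python
def prism_args(extra_args=None):
    """Return pipeline arguments that select Prism via the --runner flag."""
    args = list(extra_args or [])
    # Respect a runner the caller already chose; otherwise default to Prism.
    if not any(a == "--runner" or a.startswith("--runner=") for a in args):
        args.insert(0, "--runner=PrismRunner")
    return args


def run(extra_args=None):
    """Run a tiny counting pipeline on Prism (requires apache_beam)."""
    # Imported lazily so prism_args() can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(prism_args(extra_args))
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["beam", "prism", "beam"])
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

For Java, the equivalent is to pass PrismRunner (or TestPrismRunner) as the value of the --runner flag on the command line.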
I/Os
- Improvements to the performance of BigQueryIO when using the withPropagateSuccessfulStorageApiWrites(true) method (Java) (#31840).
- [Managed Iceberg] Added support for writing to partitioned tables (#32102)
- Update ClickHouseIO to use the latest version of the ClickHouse JDBC driver (#32228).
- Add ClickHouseIO dedicated User-Agent (#32252).
New Features / Improvements
- The BigQuery endpoint can be overridden via PipelineOptions, which enables the use of BigQuery emulators (Java) (#28149).
- Go SDK Minimum Go Version updated to 1.21 (#32092).
- [BigQueryIO] Added support for withFormatRecordOnFailureFunction() for STORAGE_WRITE_API and STORAGE_API_AT_LEAST_ONCE methods (Java) (#31354).
- Updated Go protobuf package to new version (Go) (#21515).
- Added support for setting a configurable timeout when loading a model and performing inference in the RunInference transform using with_exception_handling (#32137)
- Adds OrderedListState support for Java SDK via FnApi.
- Initial support for using Prism from the Python and Java SDKs.
Bugfixes
- Fixed incorrect service account impersonation flow for Python pipelines using BigQuery IOs (#32030).
- Auto-disable broken and meaningless upload_graph feature when using Dataflow Runner V2 (#32159).
- (Python) Upgraded google-cloud-storage to version 2.18.2 to fix a data corruption issue (#32135).
- (Go) Fix corruption on State API writes. (#32245).
Known Issues
- Prism is under active development and does not yet support all pipelines. See #29650 for progress.
- In the 2.59.0 release, Prism passes most runner validation tests, with the exception of pipelines using the following features: OrderedListState, OnWindowExpiry (e.g. GroupIntoBatches), CustomWindows, MergingWindowFns, Trigger and WindowingStrategy associated features, Bundle Finalization, Looping Timers, and some Coder related issues such as with Python combiner packing, Java Schema transforms, and heterogeneous flatten coders. Processing Time timers do not yet have real time support.
- If your pipeline is having difficulty with the Python or Java direct runners, but runs well on Prism, please let us know.
For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md
List of Contributors
According to git shortlog, the following people contributed to the 2.59.0 release. Thank you to all contributors!
Ahmed Abualsaud,Ahmet Altay,Andrew Crites,atask-g,Axel Magnuson,Ayush Pandey,Bartosz Zablocki,Chamikara Jayalath,cutiepie-10,Damon,Danny McCormick,dependabot[bot],Eddie Phillips,Francis O'Hara,Hyeonho Kim,Israel Herraiz,Jack McCluskey,Jaehyeon Kim,Jan Lukavský,Jeff Kinard,Jeffrey Kinard,jonathan-lemos,jrmccluskey,Kirill Berezin,Kiruphasankaran Nataraj,lahariguduru,liferoad,lostluck,Maciej Szwaja,Manit Gupta,Mark Zitnik,martin trieu,Naireen Hussain,Prerit Chandok,Radosław Stankiewicz,Rebecca Szper,Robert Bradshaw,Robert Burke,ron-gal,Sam Whittle,Sergei Lilichenko,Shunping Huang,Svetak Sundhar,Thiago Nunes,Timothy Itodo,tvalentyn,twosom,Vatsal,Vitaly Terentyev,Vlado Djerek,Yifan Ye,Yi Hu
Beam 2.58.1 release
We are happy to present the new 2.58.1 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
New Features / Improvements
- Fixed issue where KafkaIO Records read with ReadFromKafkaViaSDF are redistributed and may contain duplicates regardless of the configuration. This affects Java pipelines with Dataflow v2 runner and xlang pipelines reading from Kafka (#32196).
Known Issues
- Large Dataflow graphs using runner v2, or pipelines explicitly enabling the upload_graph experiment, will fail at construction time (#32159).
- Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue (#32169). The issue will be fixed in 2.59.0 (#32135). To work around this, update the google-cloud-storage package to version 2.18.2 or newer.
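To check whether an environment already has the workaround for the GCS issue, a small script along these lines may help. It is only a sketch: the function names are hypothetical, it inspects just the installed google-cloud-storage version (not whether a pipeline actually reads from GCS), and it compares only the leading numeric X.Y.Z part, so a pre-release suffix is ignored.

```python
from importlib import metadata

# First google-cloud-storage version named as safe in the known issue above.
FIXED_VERSION = (2, 18, 2)


def parse_version(text):
    """Parse the leading numeric 'X.Y.Z' part of a version string into a tuple."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as "2.18" to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_upgrade(installed_version):
    """Return True if the given google-cloud-storage version predates 2.18.2."""
    return parse_version(installed_version) < FIXED_VERSION


def check_environment():
    """Describe the google-cloud-storage version installed in this environment."""
    try:
        installed = metadata.version("google-cloud-storage")
    except metadata.PackageNotFoundError:
        return "google-cloud-storage is not installed"
    if needs_upgrade(installed):
        return f"google-cloud-storage {installed}: upgrade to 2.18.2 or newer"
    return f"google-cloud-storage {installed}: OK"
```

Printing check_environment() in the same virtual environment that launches the pipeline gives a quick indication of whether the package needs updating.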
For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md
List of Contributors
According to git shortlog, the following people contributed to the 2.58.1 release. Thank you to all contributors!
Danny McCormick
Sam Whittle
Beam 2.58.0 release
We are happy to present the new 2.58.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information about changes in 2.58.0, check out the detailed release notes.
I/Os
New Features / Improvements
- Multiple RunInference instances can now share the same model instance by setting the model_identifier parameter (Python) (#31665).
- Added options to control the number of Storage API multiplexing connections (#31721)
- [BigQueryIO] Better handling for batch Storage Write API when it hits AppendRows throughput quota (#31837)
- [IcebergIO] All specified catalog properties are passed through to the connector (#31726)
- Removed a third-party LGPL dependency from the Go SDK (#31765).
- Support for MapState and SetState when using Dataflow Runner v1 with Streaming Engine (Java) (#18200)
Breaking Changes
- [IcebergIO] IcebergCatalogConfig was changed to support specifying catalog properties in a key-store fashion (#31726)
- [SpannerIO] Added validation that query and table cannot be specified at the same time for SpannerIO.read(). Previously, withQuery overrode withTable if both were set (#24956).
Bug fixes
- [BigQueryIO] Fixed a bug in batch Storage Write API that frequently exhausted concurrent connections quota (#31710)
List of Contributors
According to git shortlog, the following people contributed to the 2.58.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexandre Moueddene
Alexey Romanenko
Andrew Crites
Bartosz Zablocki
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Damon Douglass
Danny McCormick
Dilnaz Amanzholova
Florian Bernard
Francis O'Hara
George Ma
Israel Herraiz
Jack McCluskey
Jaehyeon Kim
James Roseman
Kenneth Knowles
Maciej Szwaja
Michel Davit
Minh Son Nguyen
Naireen
Niel Markwick
Oliver Cardoza
Robert Bradshaw
Robert Burke
Rohit Sinha
S. Veyrié
Sam Whittle
Shunping Huang
Svetak Sundhar
TongruiLi
Tony Tang
Valentyn Tymofieiev
Vitaly Terentyev
Yi Hu
Beam 2.57.0 Release
We are happy to present the new 2.57.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.57.0, check out the detailed release notes.
Highlights
I/Os
- Ensure that BigtableIO closes the reader streams (#31477).
New Features / Improvements
- Added Feast feature store handler for enrichment transform (Python) (#30957).
- BigQuery per-worker metrics are reported by default for Streaming Dataflow Jobs (Java) (#31015)
- Adds inMemory() variant of Java List and Map side inputs for more efficient lookups when the entire side input fits into memory.
- Beam YAML now supports the jinja templating syntax. Template variables can be passed with the (json-formatted) --jinja_variables flag.
- DataFrame API now supports pandas 2.1.x and adds 12 more string functions for Series (#31185).
- Added BigQuery handler for enrichment transform (Python) (#31295)
- Disable soft delete policy when creating the default bucket for a project (Java) (#31324).
- Added DoFn.SetupContextParam and DoFn.BundleContextParam which can be used as a python DoFn.process, Map, or FlatMap parameter to invoke a context manager per DoFn setup or bundle (analogous to using setup/teardown or start_bundle/finish_bundle respectively).
- Go SDK Prism Runner
- Pre-built Prism binaries are now part of the release and are available via the Github release page. (#29697).
- Some pipelines will work on Java and Python, but this is in part to prepare for real runner wrappers in 2.58.0
- ProcessingTime is now handled synthetically with TestStream pipelines and Non-TestStream pipelines, for fast test pipeline execution by default. (#30083).
- Prism does NOT yet support "real time" execution for this release.
- Improve processing for large elements to reduce the chances for exceeding 2GB protobuf limits (Python) (#31607).
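For the Beam YAML templating item above, the value passed to --jinja_variables is a JSON object, and quoting it correctly for a shell is the usual stumbling block. The sketch below builds such a command line. The helper names and the variable names are hypothetical, and the apache_beam.yaml.main entry point and its --yaml_pipeline_file flag are assumptions to verify against the Beam YAML documentation for your version.

```python
import json
import shlex


def jinja_variables_flag(variables):
    """Render a dict of template variables as a JSON-formatted --jinja_variables flag."""
    return "--jinja_variables=" + json.dumps(variables, sort_keys=True)


def yaml_command(pipeline_file, variables):
    """Build a shell-quoted command line for a templated Beam YAML pipeline."""
    argv = [
        "python", "-m", "apache_beam.yaml.main",   # assumed entry point
        "--yaml_pipeline_file=" + pipeline_file,   # assumed flag name
        jinja_variables_flag(variables),
    ]
    # shlex.quote protects the braces, quotes and spaces inside the JSON value.
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(yaml_command("pipeline.yaml", {"table": "events", "limit": 10}))
```

A pipeline file used this way would reference the variables with jinja syntax, for example {{ table }}.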
Breaking Changes
- Java's View.asList() side inputs are now optimized for iterating rather than indexing when in the global window. This new implementation still supports all (immutable) List methods as before, but some of the random access methods like get() and size() will be slower. To use the old implementation one can use View.asList().withRandomAccess().
- SchemaTransforms implemented with TypedSchemaTransformProvider now produce a configuration Schema with snake_case naming convention (#31374). This will make the following cases problematic:
- Running a pre-2.57.0 remote SDK pipeline containing a 2.57.0+ Java SchemaTransform, and vice versa:
- Running a 2.57.0+ remote SDK pipeline containing a pre-2.57.0 Java SchemaTransform
- All direct uses of Python's SchemaAwareExternalTransform should be updated to use new snake_case parameter names.
- Upgraded Jackson Databind to 2.15.4 (Java) (#26743).
jackson-2.15 has known breaking changes. An important one is it imposed a buffer limit for parser.
If your custom PTransform/DoFn are affected, refer to #31580 for mitigation.
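When updating direct uses of SchemaAwareExternalTransform to the snake_case parameter names mentioned above, a small conversion helper can help migrate existing keyword arguments. The following is a sketch only: to_snake_case and snake_case_kwargs are hypothetical helpers, not Beam APIs, the parameter names in the example are made up, and Beam's own conversion may treat edge cases (such as acronyms) differently.

```python
import re


def to_snake_case(name):
    """Convert a camelCase parameter name to snake_case."""
    # Insert an underscore before a capital that follows a lowercase letter or digit.
    partial = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Separate an acronym from a following capitalized word ("URLPath" -> "URL_Path").
    partial = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", partial)
    return partial.lower()


def snake_case_kwargs(kwargs):
    """Return a copy of kwargs with every key converted to snake_case."""
    return {to_snake_case(key): value for key, value in kwargs.items()}


if __name__ == "__main__":
    # Hypothetical pre-2.57.0 style configuration.
    old_config = {"bootstrapServers": "localhost:9092", "topic": "events"}
    print(snake_case_kwargs(old_config))
```

Names that are already snake_case pass through unchanged, so the helper can be applied to a mixed set of call sites.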
For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md
List of Contributors
According to git shortlog, the following people contributed to the 2.57.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Andrey Devyatkin
Anody Zhang
Arvind Ram
Ben Konz
Bruno Volpato
Celeste Zeng
Chamikara Jayalath
Claire McGinty
Colm O hEigeartaigh
Damon
Danny McCormick
Evan Galpin
Ferran Fernández Garrido
Florent Biville
Jack Dingilian
Jack McCluskey
Jan Lukavský
JayajP
Jeff Kinard
Jeffrey Kinard
John Casey
Justin Uang
Kenneth Knowles
Kevin Zhou
Liam Miller-Cushon
Maarten Vercruysse
Maciej Szwaja
Maja Kontrec Rönn
Marc hurabielle
Martin Trieu
Mattie Fu
Min Zhu
Naireen Hussain
Nick Anikin
Pablo Rodriguez Defino
Paul King
Priyans Desai
Radosław Stankiewicz
Rebecca Szper
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rodrigo Bozzolo
RyuSA
Sam Rohde
Sam Whittle
Sergei Lilichenko
Shahar Epstein
Shunping Huang
Svetak Sundhar
Tomo Suzuki
Tony Tang
Valentyn Tymofieiev
Vincent Stollenwerk
Vineet Kumar
Vitaly Terentyev
Vlado Djerek
XQ Hu
Yi Hu
akashorabek
bzablocki
kberezin
Beam 2.56.0 release
We are happy to present the new 2.56.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.56.0, check out the detailed release notes.
Highlights
- Added FlinkRunner for Flink 1.17, removed support for Flink 1.12 and 1.13. Previous version of Pipeline running on Flink 1.16 and below can be upgraded to 1.17, if the Pipeline is first updated to Beam 2.56.0 with the same Flink version. After Pipeline runs with Beam 2.56.0, it should be possible to upgrade to FlinkRunner with Flink 1.17. (#29939)
- New Managed I/O Java API (#30830).
- New Ordered Processing PTransform added for processing order-sensitive stateful data (#30735).
I/Os
- Upgraded Avro version to 1.11.3, kafka-avro-serializer and kafka-schema-registry-client versions to 7.6.0 (Java) (#30638). The newer Avro package is known to have breaking changes. If you are affected, you can keep pinned to older Avro versions which are also tested with Beam.
- Iceberg read/write support is available through the new Managed I/O Java API (#30830).
New Features / Improvements
- Profiling of Cythonized code has been disabled by default. This might improve performance for some Python pipelines (#30938).
- Bigtable enrichment handler now accepts a custom function to build a composite row key. (Python) (#30974).
Breaking Changes
- Default consumer polling timeout for KafkaIO.Read was increased from 1 second to 2 seconds. Use KafkaIO.read().withConsumerPollingTimeout(Duration duration) to configure this timeout value when necessary (#30870).
- Python Dataflow users no longer need to manually specify --streaming for pipelines using unbounded sources such as ReadFromPubSub.
Bugfixes
- Fixed locking issue when shutting down inactive bundle processors. Symptoms of this issue include slowness or stuckness in long-running jobs (Python) (#30679).
- Fixed a logging issue that silenced the pip output when installing dependencies provided in --requirements_file (Python).
List of Contributors
According to git shortlog, the following people contributed to the 2.56.0 release. Thank you to all contributors!
Abacn
Ahmed Abualsaud
Andrei Gurau
Andrey Devyatkin
Aravind Pedapudi
Arun Pandian
Arvind Ram
Bartosz Zablocki
Brachi Packter
Byron Ellis
Chamikara Jayalath
Clement DAL PALU
Damon
Danny McCormick
Daria Bezkorovaina
Dip Patel
Evan Burrell
Hai Joey Tran
Jack McCluskey
Jan Lukavský
JayajP
Jeff Kinard
Julien Tournay
Kenneth Knowles
Luís Bianchin
Maciej Szwaja
Melody Shen
Oleh Borysevych
Pablo Estrada
Rebecca Szper
Ritesh Ghorse
Robert Bradshaw
Sam Whittle
Sergei Lilichenko
Shahar Epstein
Shunping Huang
Svetak Sundhar
Timothy Itodo
Veronica Wasson
Vitaly Terentyev
Vlado Djerek
Yi Hu
akashorabek
bzablocki
clmccart
damccorm
dependabot[bot]
dmitryor
github-actions[bot]
liferoad
martin trieu
tvalentyn
xianhualiu
Beam 2.55.1 release
Bugfixes
- Fixed issue that broke WriteToJson in languages other than Java (X-lang) (#30776).
Beam 2.55.0 release
We are happy to present the new 2.55.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.55.0, check out the detailed release notes.
Highlights
- The Python SDK will now include automatically generated wrappers for external Java transforms! (#29834)
I/Os
- Added support for handling bad records to BigQueryIO (#30081).
- Full Support for Storage Read and Write APIs
- Partial Support for File Loads (Failures writing to files supported, failures loading files to BQ unsupported)
- No Support for Extract or Streaming Inserts
- Added support for handling bad records to PubSubIO (#30372).
- Support is not available for handling schema mismatches, and enabling error handling for writing to Pub/Sub topics with schemas is not recommended
- --enableBundling pipeline option for BigQueryIO DIRECT_READ is replaced by --enableStorageReadApiV2. Both were considered experimental and subject to change (Java) (#26354).
New Features / Improvements
- Allow writing clustered and not time-partitioned BigQuery tables (Java) (#30094).
- Redis cache support added to RequestResponseIO and Enrichment transform (Python) (#30307)
- Merged sdks/java/fn-execution and runners/core-construction-java into the main SDK. These artifacts were never meant for users, but noting that they no longer exist. These are steps to bring portability into the core SDK alongside all other core functionality.
- Added Vertex AI Feature Store handler for Enrichment transform (Python) (#30388)
Breaking Changes
- Arrow version was bumped to 15.0.0 from 5.0.0 (#30181).
- Go SDK users who build custom worker containers may run into issues with the move to distroless containers as a base (see Security Fixes).
- The issue stems from distroless containers lacking additional tools, which current custom container processes may rely on.
- See https://beam.apache.org/documentation/runtime/environments/#from-scratch-go for instructions on building and using a custom container.
- Python SDK has changed the default value for the --max_cache_memory_usage_mb pipeline option from 100 to 0. This option was first introduced in the 2.52.0 SDK version. This change restores the behavior of the 2.51.0 SDK, which does not use the state cache. If your pipeline uses iterable side input views, consider increasing the cache size by setting the option manually (#30360).
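For pipelines that relied on the earlier default, the cache can be re-enabled by setting the option explicitly. The sketch below adds the flag only when the caller has not already set it. The helper names are hypothetical, the value 100 simply mirrors the previous default mentioned above, and build_options() assumes apache_beam is installed.

```python
def with_state_cache(pipeline_args, cache_mb=100):
    """Return pipeline args with --max_cache_memory_usage_mb set, unless already present."""
    flag = "--max_cache_memory_usage_mb"
    args = list(pipeline_args)
    if any(a == flag or a.startswith(flag + "=") for a in args):
        return args  # keep the caller's explicit choice
    return args + [f"{flag}={cache_mb}"]


def build_options(pipeline_args, cache_mb=100):
    """Create PipelineOptions with the state cache size set (requires apache_beam)."""
    # Imported lazily so with_state_cache() works without Beam installed.
    from apache_beam.options.pipeline_options import PipelineOptions

    return PipelineOptions(with_state_cache(pipeline_args, cache_mb))
```

The right cache size depends on the workload, so treat 100 as a starting point rather than a recommendation.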
Deprecations
- N/A
Bug fixes
- Fixed SpannerIO.readChangeStream to support propagating credentials from pipeline options to the getDialect calls for authenticating with Spanner (Java) (#30361).
- Reduced the number of HTTP requests in GCSIO function calls (Python) (#30205)
Security Fixes
- Go SDK base container image moved to distroless/base-nossl-debian12, reducing vulnerable container surface to kernel and glibc (#30011).
Known Issues
- In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work. Symptoms of this issue include slowness or stuckness in long-running jobs. Fixed in 2.56.0 (#30679).
List of Contributors
According to git shortlog, the following people contributed to the 2.55.0 release. Thank you to all contributors!
Ahmed Abualsaud
Anand Inguva
Andrew Crites
Andrey Devyatkin
Arun Pandian
Arvind Ram
Chamikara Jayalath
Chris Gray
Claire McGinty
Damon Douglas
Dan Ellis
Danny McCormick
Daria Bezkorovaina
Dima I
Edward Cui
Ferran Fernández Garrido
GStravinsky
Jan Lukavský
Jason Mitchell
JayajP
Jeff Kinard
Jeffrey Kinard
Kenneth Knowles
Mattie Fu
Michel Davit
Oleh Borysevych
Ritesh Ghorse
Ritesh Tarway
Robert Bradshaw
Robert Burke
Sam Whittle
Scott Strong
Shunping Huang
Steven van Rossum
Svetak Sundhar
Talat UYARER
Ukjae Jeong (Jay)
Vitaly Terentyev
Vlado Djerek
Yi Hu
akashorabek
case-k
clmccart
dengwe1
dhruvdua
hardshah
johnjcasey
liferoad
martin trieu
tvalentyn
Beam 2.54.0 release
We are happy to present the new 2.54.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.54.0, check out the detailed release notes.
Highlights
- Enrichment Transform along with GCP BigTable handler added to Python SDK (#30001).
- Beam Java Batch pipelines run on Google Cloud Dataflow will default to the Portable Runner V2 (https://cloud.google.com/dataflow/docs/runner-v2) starting with this version. (All other languages are already on Runner V2.)
- This change is still rolling out to the Dataflow service. See the Runner V2 documentation (https://cloud.google.com/dataflow/docs/runner-v2) for how to enable or disable it intentionally.
I/Os
- Added support for writing to BigQuery dynamic destinations with Python's Storage Write API (#30045)
- Adding support for Tuples DataType in ClickHouse (Java) (#29715).
- Added support for handling bad records to FileIO, TextIO, AvroIO (#29670).
- Added support for handling bad records to BigtableIO (#29885).
New Features / Improvements
- Enrichment Transform along with GCP BigTable handler added to Python SDK (#30001).
Breaking Changes
- N/A
Deprecations
- N/A
Bugfixes
- Fixed a memory leak affecting some Go SDK pipelines since 2.46.0 (#28142).
Security Fixes
- N/A
Known Issues
- N/A
List of Contributors
According to git shortlog, the following people contributed to the 2.54.0 release. Thank you to all contributors!
Ahmed Abualsaud
Alexey Romanenko
Anand Inguva
Andrew Crites
Arun Pandian
Bruno Volpato
caneff
Chamikara Jayalath
Changyu Li
Cheskel Twersky
Claire McGinty
clmccart
Damon
Danny McCormick
dependabot[bot]
Edward Cheng
Ferran Fernández Garrido
Hai Joey Tran
hugo-syn
Issac
Jack McCluskey
Jan Lukavský
JayajP
Jeffrey Kinard
Jerry Wang
Jing
Joey Tran
johnjcasey
Kenneth Knowles
Knut Olav Løite
liferoad
Marc
Mark Zitnik
martin trieu
Mattie Fu
Naireen Hussain
Neeraj Bansal
Niel Markwick
Oleh Borysevych
pablo rodriguez defino
Rebecca Szper
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Sam Whittle
Shunping Huang
Svetak Sundhar
S. Veyrié
Talat UYARER
tvalentyn
Vlado Djerek
Yi Hu
Zechen Jian
Beam 2.53.0 release
We are happy to present the new 2.53.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.53.0, check out the detailed release notes.
Highlights
- Python streaming users that use 2.47.0 and newer versions of Beam should update to version 2.53.0, which fixes a known issue: (#27330).
I/Os
- TextIO now supports skipping multiple header lines (Java) (#17990).
- Python GCSIO is now implemented with GCP GCS Client instead of apitools (#25676)
- Adding support for LowCardinality DataType in ClickHouse (Java) (#29533).
- Added support for handling bad records to KafkaIO (Java) (#29546)
- Added support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models (#29564).
- NATS IO connector added (Go) (#29000).
New Features / Improvements
- The Python SDK now type checks collections.abc.Collections types properly. Some type hints that were erroneously allowed by the SDK may now fail. (#29272)
- Running multi-language pipelines locally no longer requires Docker. Instead, the same (generally auto-started) subprocess used to perform the expansion can also be used as the cross-language worker.
- Framework for adding Error Handlers to composite transforms added in Java (#29164).
- Python 3.11 images now include google-cloud-profiler (#29561).
Breaking Changes
- Upgraded to Go 1.21.5 to build, fixing CVE-2023-45285 and CVE-2023-39326
Deprecations
- Euphoria DSL is deprecated and will be removed in a future release (not before 2.56.0) (#29451)
Bugfixes
- (Python) Fixed sporadic crashes in streaming pipelines that affected some users of 2.47.0 and newer SDKs (#27330).
- (Python) Fixed a bug that caused MLTransform to drop identical elements in the output PCollection (#29600).
List of Contributors
According to git shortlog, the following people contributed to the 2.53.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Anand Inguva
Arun Pandian
Balázs Németh
Bruno Volpato
Byron Ellis
Calvin Swenson Jr
Chamikara Jayalath
Clay Johnson
Damon
Danny McCormick
Ferran Fernández Garrido
Georgii Zemlianyi
Israel Herraiz
Jack McCluskey
Jacob Tomlinson
Jan Lukavský
JayajP
Jeffrey Kinard
Johanna Öjeling
Julian Braha
Julien Tournay
Kenneth Knowles
Lawrence Qiu
Mark Zitnik
Mattie Fu
Michel Davit
Mike Williamson
Naireen
Naireen Hussain
Niel Markwick
Pablo Estrada
Radosław Stankiewicz
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Sam Rohde
Sam Whittle
Shunping Huang
Svetak Sundhar
Talat UYARER
Tom Stepp
Tony Tang
Vlado Djerek
Yi Hu
Zechen Jiang
clmccart
damccorm
darshan-sj
gabry.wu
johnjcasey
liferoad
lrakla
martin trieu
tvalentyn
Beam 2.52.0 release
We are happy to present the new 2.52.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.52.0, check out the detailed release notes.
Highlights
- Previously deprecated Avro-dependent code (Beam Release 2.46.0) has finally been removed from the Java SDK "core" package. Please use beam-sdks-java-extensions-avro instead. This makes it easy to update the Avro version in user code without potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions and should handle this (#25252).
- Publishing Java 21 SDK container images is now supported as part of the Apache Beam release process (#28120).
- Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.
New Features / Improvements
- Add UseDataStreamForBatch pipeline option to the Flink runner. When it is set to true, Flink runner will run batch jobs using the DataStream API. By default the option is set to false, so the batch jobs are still executed using the DataSet API.
- upload_graph as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK (PR#28621).
- State and side input cache has been enabled to a default of 100 MB. Use --max_cache_memory_usage_mb=X to provide cache size for the user state API and side inputs. (Python) (#28770).
- Beam YAML stable release. Beam pipelines can now be written using YAML and leverage the Beam YAML framework which includes a preliminary set of IO's and turnkey transforms. More information can be found in the YAML root folder and in the README.
Breaking Changes
- org.apache.beam.sdk.io.CountingSource.CounterMark uses custom CounterMarkCoder as a default coder since all Avro-dependent classes finally moved to extensions/avro. If it is still required to use AvroCoder for CounterMark, then, as a workaround, a copy of the "old" CountingSource class should be placed into the project code and used directly (#25252).
- Renamed host to firestoreHost in FirestoreOptions to avoid potential conflict of command line arguments (Java) (#29201).
Bugfixes
- Fixed "Desired bundle size 0 bytes must be greater than 0" in Java SDK's BigtableIO.BigtableSource when you have more cores than bytes to read (Java) #28793.
- watch_file_pattern arg of the RunInference arg had no effect prior to 2.52.0. To use the behavior of arg watch_file_pattern prior to 2.52.0, follow the documentation at https://beam.apache.org/documentation/ml/side-input-updates/ and use WatchFilePattern PTransform as a SideInput. (#28948)
- MLTransform doesn't output artifacts such as min, max and quantiles. Instead, MLTransform will add a feature to output these artifacts as human readable format - #29017. For now, to use the artifacts such as min and max that were produced by the earlier MLTransform, use read_artifact_location of MLTransform, which reads artifacts that were produced earlier in a different MLTransform (#29016)
- Fixed a memory leak, which affected some long-running Python pipelines: #28246.
Security Fixes
- Fixed CVE-2023-39325 (Java/Python/Go) (#29118).
- Mitigated CVE-2023-47248 (Python) #29392.
List of Contributors
According to git shortlog, the following people contributed to the 2.52.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrey Devyatkin
BjornPrime
Bruno Volpato
Bulat
Chamikara Jayalath
Damon
Danny McCormick
Devansh Modi
Dominik Dębowczyk
Ferran Fernández Garrido
Hai Joey Tran
Israel Herraiz
Jack McCluskey
Jan Lukavský
JayajP
Jeff Kinard
Jeffrey Kinard
Jiangjie Qin
Jing
Joar Wandborg
Johanna Öjeling
Julien Tournay
Kanishk Karanawat
Kenneth Knowles
Kerry Donny-Clark
Luís Bianchin
Minbo Bae
Pranav Bhandari
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Shunping Huang
Steven van Rossum
Svetak Sundhar
Tony Tang
Vitaly Terentyev
Vivek Sumanth
Vlado Djerek
Yi Hu
aku019
brucearctor
caneff
damccorm
ddebowczyk92
dependabot[bot]
dpcollins-google
edman124
gabry.wu
illoise
johnjcasey
jonathan-lemos
kennknowles
liferoad
magicgoody
martin trieu
nancyxu123
pablo rodriguez defino
tvalentyn