Monthly Python Data Engineering, April 2026
Monthly news from the Python Data Engineering world.
Hi and welcome to this new issue of the newsletter!
This is the first issue of the newsletter with significant additions and removals to the projects the newsletter covers. As the Python ecosystem shifts, I’m trying to keep the newsletter relevant to the kind of projects that data engineers care about: less visualization, more data munging. So this issue has a bit less of Shiny and Streamlit and a bit more of ADBC and SQLGlot. I’ll try my best to pick the news that is most interesting for data engineers and avoid cluttering your already overflowing inbox with unnecessary details.
Want to know more about me and why I curate this newsletter?
Check out my personal website at https://alessandro.molina.fyi/
Want to signal interesting libraries and frameworks for the newsletter?
Reply to the newsletter email at alessandromolina@substack.com
Key Highlights
This month, the biggest shifts were in query execution and dataframe scale.
Polars 1.40.0 materially expands its streaming engine with grouped as-of joins, more window and aggregation lowering, and a new lock-free memory manager with spill-to-disk, pushing larger lazy workloads further out of core. DataFusion 53.0.0 and Comet 0.15.0 also stand out for making Arrow- and Spark-based query stacks faster and more efficient, with deeper Parquet pruning, faster planning, a default native Iceberg reader, and a reported 2x TPC-H gain. On the GPU side, cuDF 26.04.00 pairs broader Python compatibility with major execution improvements, including dynamic planning by default and a 500x faster cudf.pandas attribute lookup path.
News
Apache Spark 4.2.0-preview5 is out as an early test build for the upcoming Spark 4.2.0 release, with the project explicitly warning that neither APIs nor functionality are stable yet. The announcement does not publish concrete SQL, runtime, or PySpark changes, but it does make the preview docs available and positions this cut, following Spark 4.2.0-preview4, as a chance for teams running Spark-based Python data pipelines to validate compatibility and report regressions before the final release.
sqlglot 30.6.0 makes a breaking internal change by compiling the Python generator with mypyc, which should matter to teams that generate or transpile a lot of SQL in Python, and it also fixes Postgres CREATE TRIGGER handling for dotted function calls. The recent April releases around it also tighten correctness and cross-dialect behavior in places data engineers hit often: sqlglot stops eliminating semi/anti joins, QUALIFY, and FULL OUTER JOIN in StarRocks, adds BigQuery LOAD DATA FROM FILES and typed AI scalar function nodes, and introduces a parser flag to control AST size when large query trees become expensive to handle.
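If you have never used sqlglot for cross-dialect work, here is a minimal sketch of the transpile path the StarRocks fixes affect; the query and dialect pair are my own example, and the exact output will depend on your sqlglot version:

```python
import sqlglot

# A query using QUALIFY, one of the constructs sqlglot no longer
# rewrites away when targeting StarRocks.
sql = """
SELECT id, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY amount DESC) = 1
"""

# Transpile from DuckDB syntax to StarRocks; transpile() returns a
# list of SQL statements rendered in the target dialect.
print(sqlglot.transpile(sql, read="duckdb", write="starrocks")[0])
```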
fsspec 2026.4.0 tightens a few filesystem edges that matter in data pipelines: HTTP pipe_file now encodes URLs correctly, WholeFileCache fixes cat_file and cat_ranges, and archive handling closes tar and zip contexts properly. It also expands protocol handling by allowing multiple local protocols and adding delete plus write_test for DirFS, which should make local and cached filesystem behavior more predictable in tests and operational code; there is also a new usage warning for HTTP FS and an ADL message update following that backend’s retirement.
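As a quick refresher on the methods named above, here is a small sketch against the in-memory filesystem, which exercises the same pipe_file/cat_file/cat_ranges code paths your pipeline would hit on real stores (the paths are made up):

```python
import fsspec

# The memory filesystem is handy for testing filesystem code paths
# without touching a real backend.
fs = fsspec.filesystem("memory")

# pipe_file writes raw bytes to a path; cat_file reads them back.
fs.pipe_file("/staging/part-0.csv", b"id,value\n1,10\n2,20\n")
assert fs.cat_file("/staging/part-0.csv").startswith(b"id,value")

# cat_ranges fetches multiple byte ranges in one call, which is what
# cached and HTTP filesystems optimize under the hood.
chunks = fs.cat_ranges(["/staging/part-0.csv"], starts=[0], ends=[8])
print(chunks)
```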
Velox Blog: FlatMapVector Adoption for Scaling High-Performance AI/ML Data Pre-Processing introduces a native in-memory flat-map encoding for Velox map types, aimed at warehouse tables with very wide feature maps where decoding whole maps is too expensive; Meta reports 5.1x to 17.3x faster table scans, 1.4x to 1.7x faster writes, and much lower memory use by avoiding map-to-row conversion overhead. The same release train also pushes Velox further into indexed and composable query infrastructure: Nimble’s new cluster index embeds a three-level index inside columnar files for O(log n) point lookups and range scans, with Presto index joins and a zero-decode projection path for ZippyDB, while Axiom builds a reusable C++ stack above Velox so the same optimizer, type system, and function registry can drive local, streaming, and batch runtimes with consistent semantics. There is also a useful correctness warning for engine builders: Velox shows why NULLIF cannot be safely rewritten as IF for non-deterministic expressions like rand(), because duplicate evaluation can produce impossible results.
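The NULLIF point is worth internalizing even outside Velox, so here is my own Python sketch of the failure mode, not code from the blog post:

```python
import random

def rand():
    return random.random()

# NULLIF(rand(), 0.5) evaluates rand() exactly once:
def nullif_once():
    v = rand()
    return None if v == 0.5 else v

# The naive rewrite IF(rand() = 0.5, NULL, rand()) evaluates rand()
# twice, so the branch condition and the returned value see different
# draws; the rewrite can even return exactly 0.5, a value the original
# NULLIF expression could never produce.
def if_rewrite():
    return None if rand() == 0.5 else rand()
```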
Apache DataFusion Comet 0.15.0 pushes Spark acceleration forward with a reported 2x TPC-H speedup at SF1000, lower JVM/native overhead in shuffle and broadcast paths, shared native memory pools to reduce OOMs, and cached object-store lookups plus faster range reads for file-heavy workloads. It also turns the native Iceberg reader on by default, bringing dynamic partition pruning, better classloader handling, and more stable Iceberg scans, while adding native support for functions such as get_json_object, sort_array, LEAD/LAG IGNORE NULLS, and aggregate FILTER clauses. Underneath, DataFusion 53.0.0 adds LIMIT-aware Parquet row-group pruning, deeper filter and nested-field pushdown, much faster planning for low-latency query services, direct JSON-array reads, and some breaking optimizer and physical-plan API changes, so Python teams building on Spark, Iceberg, and Arrow-backed query stacks get both better scan efficiency and a bit more upgrade surface to check.
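For PySpark users who have not tried Comet, enabling it is mostly a matter of Spark configuration. This is a sketch based on my reading of the Comet docs; the config keys can shift between releases, so verify against the 0.15.0 documentation before relying on it:

```python
from pyspark.sql import SparkSession

# Assumed config keys; the Comet jar must also be on the driver and
# executor classpath for the plugin to load.
spark = (
    SparkSession.builder
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)

# Accelerated operators show up as Comet* nodes in the plan.
spark.sql("SELECT 1").explain()
```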
Apache Arrow 24.0.0 pushes PyArrow forward in a few practical places: arrays and scalars now support arithmetic directly, the build moves to scikit-build-core, and Windows wheels can now include AzureFileSystem, which removes some platform friction for teams moving data through cloud object storage. On the storage side it adds Parquet bloom filter write support and encrypted bloom filter reads, plus an lz4_raw compression alias, while the release also fixes a pyarrow.compute.if_else segfault, UUID inference gaps, duplicate CSV headers when the first batch is empty, and a pandas 3.0 test breakage, so this one is as much about safer upgrades as new features.
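To make the arithmetic change concrete, here is a small before/after sketch; the pc.multiply spelling has been around for a long time, while the operator form is what the release notes describe as new, so treat the second line as version-dependent:

```python
import pyarrow as pa
import pyarrow.compute as pc

prices = pa.array([10.0, 12.5, 99.0])

# Long-standing spelling through the compute module:
with_tax = pc.multiply(prices, 1.2)

# With Arrow 24.0.0, arrays and scalars support arithmetic operators
# directly, so the same computation can be written as:
with_tax_op = prices * 1.2
```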
Narwhals v2.20.0 adds two concrete API pieces for cross-dataframe work: when/then chaining for building conditional expressions more naturally, and a new top-level struct function for composing structured values across supported backends. Together with v2.19.0, which added nw.corr and let str.contains accept other expressions or series on Polars and SQL-like backends, this keeps pushing Narwhals toward richer expression portability, with less backend-specific branching in Python data pipeline code.
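If you have not seen Narwhals conditionals before, this is the general shape of a when/then expression; a single-branch sketch with a Polars backend, my own example rather than release-note code, and the v2.20.0 chaining work extends this pattern to longer branch chains:

```python
import narwhals as nw
import polars as pl

df = nw.from_native(pl.DataFrame({"amount": [5, 50, 500]}))

# nw.when/.then/.otherwise builds a backend-agnostic conditional;
# the same expression code runs on pandas, Polars, and other backends.
result = df.with_columns(
    tier=nw.when(nw.col("amount") > 100)
    .then(nw.lit("large"))
    .otherwise(nw.lit("small"))
)
print(result.to_native())
```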
Polars 1.40.1 is a small follow-up, but it fixes a real GroupBy correctness issue around having predicates, tightens append(upcast=False) to fail instead of silently widening, and adds maintain_order to merge_sorted for more predictable merge behavior in ordered pipelines. The bigger 1.40.0 release is where the engine work lands: streaming grouped as-of joins, streaming interpolate, cov, corr, strptime, and more window and aggregation lowering, plus a lock-free memory manager with spill-to-disk and out-of-core multiplexing for larger workloads. It also adds streaming PyArrow dataset sources and pl.merge_sorted across multiple frames, while deprecating the dataframe interchange protocol, so teams using Arrow-heavy ingestion or large lazy queries get more headroom but should watch that compatibility change.
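The practical payoff is that more lazy queries like the one below can run end to end on the streaming engine instead of materializing in memory; the input path and column names here are hypothetical:

```python
import polars as pl

# A lazy query executed on the streaming engine; with 1.40.0 more of
# these operations (as-of joins, interpolate, cov/corr, strptime) can
# be lowered to streaming and spilled to disk when memory runs out.
lazy = (
    pl.scan_parquet("events/*.parquet")  # hypothetical input
    .with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
    .group_by("user_id")
    .agg(pl.col("value").sum())
)
df = lazy.collect(engine="streaming")
```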
marimo 0.23.4 sharpens the notebook layer around data apps: it updates Altair 6.1.0 and Vega-Lite 6.4.1 support, makes top-K and editable filter pills more consistent, and fixes a DuckDB INET type handling edge case that could break schema display in mixed backends. Across the recent 0.23 releases, marimo also hardened the runtime in ways Python data teams will notice in day-to-day work, with msgspec replacing pickle for IPC serialization, a new DataFusionFormatter, preserved column order and cleaner exports in table workflows, and a fix for the terminal WebSocket auth bypass in edit mode; if you use marimo for internal data tools, that mix means fewer rough edges in browser data exploration and less risk in deployed notebook environments.
apache/arrow-adbc 23 pushes the Python driver manager forward with GetStatistics support, profile-based connections via connect(profile="foo"), broader init type handling, and new profile discovery from venv/etc/adbc/profiles, which makes shared connection setup less ad hoc in deployed environments. Across the wider ADBC stack, the same release adds a bulk-ingest convenience API, a connection profile interface and profile/manifest consistency work in the driver managers, plus better PostgreSQL decimal-to-numeric conversion and null-parameter binding, so Python users should see more predictable cross-driver behavior when moving Arrow data into databases. There is some migration risk if you depend on Rust-backed integration points, since 0.23.0 changes the RecordBatchReader return type for more caller flexibility, but the release also fixes crash, deadlock, and search-path issues that matter for reliability in production pipelines.
zarr-python v3.2.0 adds several storage and array-model changes that matter for Python data systems: experimental rectilinear variable-sized chunks, support for structured and struct extension dtypes, and new cast_value and scale_offset codecs. It also speeds up full shard writes and oindex, adds Python 3.14 compatibility while raising the minimum to Python 3.12, and removes deprecated creation modules, array.create methods, group methods, and the zarr_version parameter, so this is both a performance release and a migration point. On the reliability side, it fixes FsspecStore path normalization and leading-slash handling, ZipStore auto-open behavior, Windows uint32 handling, and NumPy default NaT handling, which should reduce edge-case storage bugs in production pipelines.
dbt-core v1.11.8 tightens a few edges that matter in real dbt deployments: compile and test now correctly require catalog support for REST catalog-linked databases and custom catalog integrations, DBT_ENGINE-prefixed environment variables are now picked up by the CLI, and dbt also allows deferral for UDFs with better node descriptions in UDF logging. It also improves config and migration behavior by catching missing + prefixes and other invalid dbt_project.yml keys more cleanly, allowing meta and docs under macro config, and exposing a --sqlparse CLI option after unpinning sqlparse, which gives teams a bit more control when parser limits become an operational issue.
delta-rs v0.32.0 tightens the engine under Python-facing Delta Lake workflows with more work on the next DataFusion TableProvider, faster log parsing, and a new post_commithook_properties option on DeltaTable.restore. For Python users, the practical value is mostly correctness and storage reliability: this release fixes merge file pruning for string partition IN (...) predicates, schema handling in DeltaScan, hangs in to_pyarrow_dataset() with moto-backed S3 mocks, Azure az:// path mismatches that could trigger “version already exists” errors, and a vacuum-after-compaction bug that could remove files needed for time travel.
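The fixed to_pyarrow_dataset() and restore paths sit in the most common deltalake workflow; a minimal sketch with a hypothetical table URI:

```python
from deltalake import DeltaTable

# Open an existing Delta table; the URI here is made up.
dt = DeltaTable("s3://bucket/path/to/table")
print(dt.version())

# Read through Arrow; this is the path whose S3-mock hang was fixed.
dataset = dt.to_pyarrow_dataset()
table = dataset.to_table()

# Time travel: roll the table back to an earlier version.
# dt.restore(target=dt.version() - 1)
```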
cuDF v26.04.00 pushes the Python GPU dataframe stack forward on both compatibility and engine behavior: it adds Python 3.14 support, raises the floors to PyArrow 19 and NumPy 1.26, supports CuPy 14, and exposes more low-level functionality in Python including cudf.filter, column_nans_to_nulls, and named capture groups in string extraction. On the execution side, it restores multithreaded CSV reads, expands Parquet reader expression support with fixed-point predicate pushdown and decimal-width controls, enables dynamic planning by default with the rapidsmpf runtime, and reports a 500x faster attribute lookup path in cudf.pandas. There is some migration risk too: missing string values now render as None instead of <NA>, pandas nullable dtypes are preserved more aggressively, and several fixes land in joins, groupby math, chunked Parquet reads, multi-GPU device selection, and JIT kernels on non-default CUDA streams, which should make larger GPU ETL pipelines less fragile.
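If cudf.pandas is new to you, the attribute lookup speedup matters because the accelerator proxies the whole pandas API; this is the standard zero-code-change setup:

```python
# Install the cudf.pandas proxy before pandas is imported anywhere;
# supported operations then run on the GPU with transparent CPU
# fallback for the rest.
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now backed by cuDF where possible

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
print(df.groupby("key")["value"].sum())
```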
Lance v6.0.0-rc.2 is a storage and indexing-heavy release for Python data systems work: it adds segmented inverted index build and search, zonemap index segments, IVF_FLAT support for float16 and float64, vector partition search parallelism, and exposes batch_size_bytes, base-scoped store bindings, and has_stable_row_ids in Python-facing APIs. The release also changes internals in ways teams should notice before upgrading, including vendoring the tokenizer stack into Lance, generalized object-store credentials, direct credential vending without the Azure SDK or google-cloud-auth, planned blob reads with source-level coalescing, and cleanup of transaction files on failed commits. On the operational side it tightens correctness and throughput with serialized namespace manifest mutations, safer merge_insert batching and index-search filtering, JSON float64 detection fixes, lower manifest memory use, SIMD distance kernels, faster ARM RaBitQ distance, and less scheduler overhead on small reads.
LanceDB Python v0.31.0-beta.11 keeps pushing the new namespace model forward, most notably with manifest-enabled directory namespace mode, nested namespace operations when listing databases, and child-namespace support plus JSON serialization on LanceDBConnection. Across the v0.31.0-beta.x train, the Python package also tightened operational behavior with hostname verification enabled by default, removed the legacy Tantivy full-text search path, and fixed several namespace-table edge cases around schema-only creates, namespace-backed Rust connections, and materializing declared namespace tables. For Python data systems teams, this looks like a meaningful prerelease if you are testing multi-namespace layouts or managed deployments, but the namespace refactor that started in v0.31.0-beta.0 also signals some migration risk.
Pantab 5.3.0 is a Python compatibility update: it adds Python 3.14 support and drops Python 3.9 and 3.10. For teams moving data between pandas and Tableau Hyper files, that means you can bring pantab onto newer interpreter builds, but older 3.9/3.10-based jobs now need a Python upgrade before taking this release.
DuckDB 1.5.2 is a bugfix release, but it touches a lot of engine paths that matter in Python pipelines: it fixes WAL replay and recovery around empty checkpoint WAL files, row-group growth on repeated load-and-insert cycles for indexed tables, and several memory-safety problems including prepared statement reuse, ADBC races, and CSV buffer boundary reads. It also tightens query correctness with fixes for ASOF joins, TopN window elimination, common subplan optimization, DELETE RETURNING in the same transaction, and incorrect results from try inside if, while improving data interchange by inferring timezone-aware timestamps in read_json_auto, fixing multiple Arrow edge cases, and adding support for Snowflake-produced shredded VARIANT Parquet files plus partial VARIANT shredding on write. For teams moving to 1.5.2, the practical story is more reliable storage recovery, safer Arrow and ADBC integration, and fewer wrong-result cases in analytic queries without a headline API change.
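The read_json_auto and Arrow fixes sit on a very common Python path: querying a file straight into an Arrow table. A minimal sketch with a hypothetical input file:

```python
import duckdb

con = duckdb.connect()

# read_json_auto infers the schema (now including timezone-aware
# timestamps); fetch_arrow_table exercises the Arrow integration
# paths this release hardens.
tbl = con.execute(
    "SELECT * FROM read_json_auto('events.json')"  # hypothetical file
).fetch_arrow_table()
print(tbl.schema)
```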
datafusion-table-providers v0.11.0 pushes much harder on predicate and query pushdown: MongoDB gets broader pushdown support plus wider Arrow type coverage, and SQL providers now have more comprehensive pushdown support, with follow-up fixes for sort and limit pushdown. For Python users wiring DataFusion into external systems, this also restores MongoDB Python bindings with SRV support, updates ADBC pieces, and fixes a MySQL column-count mismatch that could panic BatchCoalescer, so remote scans and federated queries should be both more capable and less brittle.
pandera v0.31.1 mainly fixes a packaging problem so pandera[polars] can be imported without pulling in pandas, which matters if you use Pandera as a schema layer in Polars-first pipelines. The bigger shift came in v0.31.0: Pandera now validates xarray DataArray, Dataset, and DataTree objects with both schema and model APIs, adds first-class geopandas schema and model support, and expands IO serialization across supported backends, including direct serialization for DatasetModel and DataFrameModel. That release also tightened correctness around strict and ordered schema errors, preserved nullable and timezone-aware pandas metadata more reliably, and improved PySpark and Ibis error reporting, so teams using one validation layer across pandas, Polars, Spark, and array data get fewer backend-specific surprises.
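To show what the pandas-free Polars path looks like in practice, here is a minimal schema-as-model sketch; the schema and data are my own example:

```python
import pandera.polars as pa
import polars as pl

# A DataFrameModel for a Polars frame; with 0.31.1 this import no
# longer drags pandas into a Polars-only environment.
class Orders(pa.DataFrameModel):
    order_id: int
    amount: float = pa.Field(ge=0)

df = pl.DataFrame({"order_id": [1, 2], "amount": [10.0, 99.9]})
validated = Orders.validate(df)  # raises SchemaError on violations
```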

