Monthly Python Data Engineering, April 2026
Monthly news from the Python Data Engineering world.
Hi and welcome to this new issue of the newsletter!
This is the first issue of the newsletter with significant additions and removals to the projects the newsletter covers. As the Python ecosystem shifts, I’m trying to keep the newsletter relevant to the kind of projects that data engineers care about: less visualization, more data munging. So this issue has a bit less of Shiny and Streamlit and a bit more of ADBC and SQLGlot. I’ll try my best to pick the news that is most interesting for data engineers and avoid cluttering your already overflowing inbox with unnecessary details.
Want to know more about me and why I curate this newsletter?
Check out my personal website at https://alessandro.molina.fyi/
Want to signal interesting libraries and frameworks for the newsletter?
Reply to the newsletter email at alessandromolina@substack.com
Key Highlights
This month, the biggest shifts were in query execution and dataframe scale.
Polars 1.40.0 materially expands its streaming engine with grouped as-of joins, more window and aggregation lowering, and a new lock-free memory manager with spill-to-disk, pushing larger lazy workloads further out of core. DataFusion 53.0.0 and Comet 0.15.0 also stand out for making Arrow- and Spark-based query stacks faster and more efficient, with deeper Parquet pruning, faster planning, a default native Iceberg reader, and a reported 2x TPC-H gain. On the GPU side, cuDF 26.04.00 pairs broader Python compatibility with major execution improvements, including dynamic planning by default and a 500x faster cudf.pandas attribute lookup path.
News
Apache Spark 4.2.0-preview5 is out as an early test build for the upcoming Spark 4.2.0 release, with the project explicitly warning that neither APIs nor functionality are stable yet. The announcement does not publish concrete SQL, runtime, or PySpark changes, but it does make the preview docs available and positions this cut, following Spark 4.2.0-preview4, as a chance for teams running Spark-based Python data pipelines to validate compatibility and report regressions before the final release.
sqlglot 30.6.0 makes a breaking internal change by compiling the Python generator with mypyc, which should matter to teams that generate or transpile a lot of SQL in Python, and it also fixes Postgres CREATE TRIGGER handling for dotted function calls. The recent April releases around it also tighten correctness and cross-dialect behavior in places data engineers hit often: sqlglot stops eliminating semi/anti joins, QUALIFY, and FULL OUTER JOIN in StarRocks, adds BigQuery LOAD DATA FROM FILES and typed AI scalar function nodes, and introduces a parser flag to control AST size when large query trees become expensive to handle.
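If you have never used sqlglot for cross-dialect work, here is a minimal sketch of the transpile path the StarRocks fixes affect; the query and dialect pair are my own example, and the exact output will depend on your sqlglot version:

```python
import sqlglot

# A query using QUALIFY, one of the constructs sqlglot no longer
# rewrites away when targeting StarRocks.
sql = """
SELECT id, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY amount DESC) = 1
"""

# Transpile from DuckDB syntax to StarRocks; transpile() returns a
# list of SQL statements rendered in the target dialect.
print(sqlglot.transpile(sql, read="duckdb", write="starrocks")[0])
```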
fsspec 2026.4.0 tightens a few filesystem edges that matter in data pipelines: HTTP pipe_file now encodes URLs correctly, WholeFileCache fixes cat_file and cat_ranges, and archive handling closes tar and zip contexts properly. It also expands protocol handling by allowing multiple local protocols and adding delete plus write_test for DirFS, which should make local and cached filesystem behavior more predictable in tests and operational code; there is also a new usage warning for HTTP FS and an ADL message update following that backend’s retirement.
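As a quick refresher on the methods named above, here is a small sketch against the in-memory filesystem, which exercises the same pipe_file/cat_file/cat_ranges code paths your pipeline would hit on real stores (the paths are made up):

```python
import fsspec

# The memory filesystem is handy for testing filesystem code paths
# without touching a real backend.
fs = fsspec.filesystem("memory")

# pipe_file writes raw bytes to a path; cat_file reads them back.
fs.pipe_file("/staging/part-0.csv", b"id,value\n1,10\n2,20\n")
assert fs.cat_file("/staging/part-0.csv").startswith(b"id,value")

# cat_ranges fetches multiple byte ranges in one call, which is what
# cached and HTTP filesystems optimize under the hood.
chunks = fs.cat_ranges(["/staging/part-0.csv"], starts=[0], ends=[8])
print(chunks)
```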
Velox Blog: FlatMapVector Adoption for Scaling High-Performance AI/ML Data Pre-Processing introduces a native in-memory flat-map encoding for Velox map types, aimed at warehouse tables with very wide feature maps where decoding whole maps is too expensive; Meta reports 5.1x to 17.3x faster table scans, 1.4x to 1.7x faster writes, and much lower memory use by avoiding map-to-row conversion overhead. The same release train also pushes Velox further into indexed and composable query infrastructure: Nimble’s new cluster index embeds a three-level index inside columnar files for O(log n) point lookups and range scans, with Presto index joins and a zero-decode projection path for ZippyDB, while Axiom builds a reusable C++ stack above Velox so the same optimizer, type system, and function registry can drive local, streaming, and batch runtimes with consistent semantics. There is also a useful correctness warning for engine builders: Velox shows why NULLIF cannot be safely rewritten as IF for non-deterministic expressions like rand(), because duplicate evaluation can produce impossible results.
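The NULLIF point is worth internalizing even outside Velox, so here is my own Python sketch of the failure mode, not code from the blog post:

```python
import random

def rand():
    return random.random()

# NULLIF(rand(), 0.5) evaluates rand() exactly once:
def nullif_once():
    v = rand()
    return None if v == 0.5 else v

# The naive rewrite IF(rand() = 0.5, NULL, rand()) evaluates rand()
# twice, so the branch condition and the returned value see different
# draws; the rewrite can even return exactly 0.5, a value the original
# NULLIF expression could never produce.
def if_rewrite():
    return None if rand() == 0.5 else rand()
```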
Apache DataFusion Comet 0.15.0 pushes Spark acceleration forward with a reported 2x TPC-H speedup at SF1000, lower JVM/native overhead in shuffle and broadcast paths, shared native memory pools to reduce OOMs, and cached object-store lookups plus faster range reads for file-heavy workloads. It also turns the native Iceberg reader on by default, bringing dynamic partition pruning, better classloader handling, and more stable Iceberg scans, while adding native support for functions such as get_json_object, sort_array, LEAD/LAG IGNORE NULLS, and aggregate FILTER clauses. Underneath, DataFusion 53.0.0 adds LIMIT-aware Parquet row-group pruning, deeper filter and nested-field pushdown, much faster planning for low-latency query services, direct JSON-array reads, and some breaking optimizer and physical-plan API changes, so Python teams building on Spark, Iceberg, and Arrow-backed query stacks get both better scan efficiency and a bit more upgrade surface to check.
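For PySpark users who have not tried Comet, enabling it is mostly a matter of Spark configuration. This is a sketch based on my reading of the Comet docs; the config keys can shift between releases, so verify against the 0.15.0 documentation before relying on it:

```python
from pyspark.sql import SparkSession

# Assumed config keys; the Comet jar must also be on the driver and
# executor classpath for the plugin to load.
spark = (
    SparkSession.builder
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)

# Accelerated operators show up as Comet* nodes in the plan.
spark.sql("SELECT 1").explain()
```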
Apache Arrow 24.0.0 pushes PyArrow forward in a few practical places: arrays and scalars now support arithmetic directly, the build moves to scikit-build-core, and Windows wheels can now include AzureFileSystem, which removes some platform friction for teams moving data through cloud object storage. On the storage side it adds Parquet bloom filter write support and encrypted bloom filter reads, plus an lz4_raw compression alias, while the release also fixes a pyarrow.compute.if_else segfault, UUID inference gaps, duplicate CSV headers when the first batch is empty, and a pandas 3.0 test breakage, so this one is as much about safer upgrades as new features.
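To make the arithmetic change concrete, here is a small before/after sketch; the pc.multiply spelling has been around for a long time, while the operator form is what the release notes describe as new, so treat the second line as version-dependent:

```python
import pyarrow as pa
import pyarrow.compute as pc

prices = pa.array([10.0, 12.5, 99.0])

# Long-standing spelling through the compute module:
with_tax = pc.multiply(prices, 1.2)

# With Arrow 24.0.0, arrays and scalars support arithmetic operators
# directly, so the same computation can be written as:
with_tax_op = prices * 1.2
```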
Narwhals v2.20.0 adds two concrete API pieces for cross-dataframe work: when/then chaining for building conditional expressions more naturally, and a new top-level struct function for composing structured values across supported backends. Together with v2.19.0, which added nw.corr and let str.contains accept other expressions or series on Polars and SQL-like backends, this keeps pushing Narwhals toward richer expression portability, with less backend-specific branching in Python data pipeline code.
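If you have not seen Narwhals conditionals before, this is the general shape of a when/then expression; a single-branch sketch with a Polars backend, my own example rather than release-note code, and the v2.20.0 chaining work extends this pattern to longer branch chains:

```python
import narwhals as nw
import polars as pl

df = nw.from_native(pl.DataFrame({"amount": [5, 50, 500]}))

# nw.when/.then/.otherwise builds a backend-agnostic conditional;
# the same expression code runs on pandas, Polars, and other backends.
result = df.with_columns(
    tier=nw.when(nw.col("amount") > 100)
    .then(nw.lit("large"))
    .otherwise(nw.lit("small"))
)
print(result.to_native())
```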
Polars 1.40.1 is a small follow-up, but it fixes a real GroupBy correctness issue around having predicates, tightens append(upcast=False) to fail instead of silently widening, and adds maintain_order to merge_sorted for more predictable merge behavior in ordered pipelines. The bigger 1.40.0 release is where the engine work lands: streaming grouped as-of joins, streaming interpolate, cov, corr, strptime, and more window and aggregation lowering, plus a lock-free memory manager with spill-to-disk and out-of-core multiplexing for larger workloads. It also adds streaming PyArrow dataset sources and pl.merge_sorted across multiple frames, while deprecating the dataframe interchange protocol, so teams using Arrow-heavy ingestion or large lazy queries get more headroom but should watch that compatibility change.
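The practical payoff is that more lazy queries like the one below can run end to end on the streaming engine instead of materializing in memory; the input path and column names here are hypothetical:

```python
import polars as pl

# A lazy query executed on the streaming engine; with 1.40.0 more of
# these operations (as-of joins, interpolate, cov/corr, strptime) can
# be lowered to streaming and spilled to disk when memory runs out.
lazy = (
    pl.scan_parquet("events/*.parquet")  # hypothetical input
    .with_columns(pl.col("ts").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
    .group_by("user_id")
    .agg(pl.col("value").sum())
)
df = lazy.collect(engine="streaming")
```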
marimo 0.23.4 sharpens the notebook layer around data apps: it updates Altair 6.1.0 and Vega-Lite 6.4.1 support, makes top-K and editable filter pills more consistent, and fixes a DuckDB INET type handling edge case that could break schema display in mixed backends. Across the recent 0.23 releases, marimo also hardened the runtime in ways Python data teams will notice in day-to-day work, with msgspec replacing pickle for IPC serialization, a new DataFusionFormatter, preserved column order and cleaner exports in table workflows, and a fix for the terminal WebSocket auth bypass in edit mode; if you use marimo for internal data tools, that mix means fewer rough edges in browser data exploration and less risk in deployed notebook environments.
apache/arrow-adbc 23 pushes the Python driver manager forward with GetStatistics support, profile-based connections via connect(profile="foo"), broader init type handling, and new profile discovery from venv/etc/adbc/profiles, which makes shared connection setup less ad hoc in deployed environments. Across the wider ADBC stack, the same release adds a bulk-ingest convenience API, a connection profile interface and profile/manifest consistency work in the driver managers, plus better PostgreSQL decimal-to-numeric conversion and null-parameter binding, so Python users should see more predictable cross-driver behavior when moving Arrow data into databases. There is some migration risk if you depend on Rust-backed integration points, since 0.23.0 changes the RecordBatchReader return type for more caller flexibility, but the release also fixes crash, deadlock, and search-path issues that matter for reliability in production pipelines.
zarr-python v3.2.0 adds several storage and array-model changes that matter for Python data systems: experimental rectilinear variable-sized chunks, support for structured and struct extension dtypes, and new cast_value and scale_offset codecs. It also speeds up full shard writes and oindex, adds Python 3.14 compatibility while raising the minimum to Python 3.12, and removes deprecated creation modules, array.create methods, group methods, and the zarr_version parameter, so this is both a performance release and a migration point. On the reliability side, it fixes FsspecStore path normalization and leading-slash handling, ZipStore auto-open behavior, Windows uint32 handling, and NumPy default NaT handling, which should reduce edge-case storage bugs in production pipelines.
dbt-core v1.11.8 tightens a few edges that matter in real dbt deployments: compile and test now correctly require catalog support for REST catalog-linked databases and custom catalog integrations, DBT_ENGINE-prefixed environment variables are now picked up by the CLI, and dbt also allows deferral for UDFs with better node descriptions in UDF logging. It also improves config and migration behavior by catching missing + prefixes and other invalid dbt_project.yml keys more cleanly, allowing meta and docs under macro config, and exposing a --sqlparse CLI option after unpinning sqlparse, which gives teams a bit more control when parser limits become an operational issue.
delta-rs v0.32.0 tightens the engine under Python-facing Delta Lake workflows with more work on the next DataFusion TableProvider, faster log parsing, and a new post_commithook_properties option on DeltaTable.restore. For Python users, the practical value is mostly correctness and storage reliability: this release fixes merge file pruning for string partition IN (...) predicates, schema handling in DeltaScan, hangs in to_pyarrow_dataset() with moto-backed S3 mocks, Azure az:// path mismatches that could trigger “version already exists” errors, and a vacuum-after-compaction bug that could remove files needed for time travel.
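The fixed to_pyarrow_dataset() and restore paths sit in the most common deltalake workflow; a minimal sketch with a hypothetical table URI:

```python
from deltalake import DeltaTable

# Open an existing Delta table; the URI here is made up.
dt = DeltaTable("s3://bucket/path/to/table")
print(dt.version())

# Read through Arrow; this is the path whose S3-mock hang was fixed.
dataset = dt.to_pyarrow_dataset()
table = dataset.to_table()

# Time travel: roll the table back to an earlier version.
# dt.restore(target=dt.version() - 1)
```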
cuDF v26.04.00 pushes the Python GPU dataframe stack forward on both compatibility and engine behavior: it adds Python 3.14 support, raises the floors to PyArrow 19 and NumPy 1.26, supports CuPy 14, and exposes more low-level functionality in Python including cudf.filter, column_nans_to_nulls, and named capture groups in string extraction. On the execution side, it restores multithreaded CSV reads, expands Parquet reader expression support with fixed-point predicate pushdown and decimal-width controls, enables dynamic planning by default with the rapidsmpf runtime, and reports a 500x faster attribute lookup path in cudf.pandas. There is some migration risk too: missing string values now render as None instead of <NA>, pandas nullable dtypes are preserved more aggressively, and several fixes land in joins, groupby math, chunked Parquet reads, multi-GPU device selection, and JIT kernels on non-default CUDA streams, which should make larger GPU ETL pipelines less fragile.
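If cudf.pandas is new to you, the attribute lookup speedup matters because the accelerator proxies the whole pandas API; this is the standard zero-code-change setup:

```python
# Install the cudf.pandas proxy before pandas is imported anywhere;
# supported operations then run on the GPU with transparent CPU
# fallback for the rest.
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now backed by cuDF where possible

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
print(df.groupby("key")["value"].sum())
```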
Lance v6.0.0-rc.2 is a storage and indexing-heavy release for Python data systems work: it adds segmented inverted index build and search, zonemap index segments, IVF_FLAT support for float16 and float64, vector partition search parallelism, and exposes batch_size_bytes, base-scoped store bindings, and has_stable_row_ids in Python-facing APIs. The release also changes internals in ways teams should notice before upgrading, including vendoring the tokenizer stack into Lance, generalized object-store credentials, direct credential vending without the Azure SDK or google-cloud-auth, planned blob reads with source-level coalescing, and cleanup of transaction files on failed commits. On the operational side it tightens correctness and throughput with serialized namespace manifest mutations, safer merge_insert batching and index-search filtering, JSON float64 detection fixes, lower manifest memory use, SIMD distance kernels, faster ARM RaBitQ distance, and less scheduler overhead on small reads.
LanceDB Python v0.31.0-beta.11 keeps pushing the new namespace model forward, most notably with manifest-enabled directory namespace mode, nested namespace operations when listing databases, and child-namespace support plus JSON serialization on LanceDBConnection. Across the v0.31.0-beta.x train, the Python package also tightened operational behavior with hostname verification enabled by default, removed the legacy Tantivy full-text search path, and fixed several namespace-table edge cases around schema-only creates, namespace-backed Rust connections, and materializing declared namespace tables. For Python data systems teams, this looks like a meaningful prerelease if you are testing multi-namespace layouts or managed deployments, but the namespace refactor that started in v0.31.0-beta.0 also signals some migration risk.
Pantab 5.3.0 is a Python compatibility update: it adds Python 3.14 support and drops Python 3.9 and 3.10. For teams moving data between pandas and Tableau Hyper files, that means you can bring pantab onto newer interpreter builds, but older 3.9/3.10-based jobs now need a Python upgrade before taking this release.
DuckDB 1.5.2 is a bugfix release, but it touches a lot of engine paths that matter in Python pipelines: it fixes WAL replay and recovery around empty checkpoint WAL files, row-group growth on repeated load-and-insert cycles for indexed tables, and several memory-safety problems including prepared statement reuse, ADBC races, and CSV buffer boundary reads. It also tightens query correctness with fixes for ASOF joins, TopN window elimination, common subplan optimization, DELETE RETURNING in the same transaction, and incorrect results from try inside if, while improving data interchange by inferring timezone-aware timestamps in read_json_auto, fixing multiple Arrow edge cases, and adding support for Snowflake-produced shredded VARIANT Parquet files plus partial VARIANT shredding on write. For teams moving to 1.5.2, the practical story is more reliable storage recovery, safer Arrow and ADBC integration, and fewer wrong-result cases in analytic queries without a headline API change.
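The read_json_auto and Arrow fixes sit on a very common Python path: querying a file straight into an Arrow table. A minimal sketch with a hypothetical input file:

```python
import duckdb

con = duckdb.connect()

# read_json_auto infers the schema (now including timezone-aware
# timestamps); fetch_arrow_table exercises the Arrow integration
# paths this release hardens.
tbl = con.execute(
    "SELECT * FROM read_json_auto('events.json')"  # hypothetical file
).fetch_arrow_table()
print(tbl.schema)
```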
datafusion-table-providers v0.11.0 pushes much harder on predicate and query pushdown: MongoDB gets broader pushdown support plus wider Arrow type coverage, and SQL providers now have more comprehensive pushdown support, with follow-up fixes for sort and limit pushdown. For Python users wiring DataFusion into external systems, this also restores MongoDB Python bindings with SRV support, updates ADBC pieces, and fixes a MySQL column-count mismatch that could panic BatchCoalescer, so remote scans and federated queries should be both more capable and less brittle.
pandera v0.31.1 mainly fixes a packaging problem so pandera[polars] can be imported without pulling in pandas, which matters if you use Pandera as a schema layer in Polars-first pipelines. The bigger shift came in v0.31.0: Pandera now validates xarray DataArray, Dataset, and DataTree objects with both schema and model APIs, adds first-class geopandas schema and model support, and expands IO serialization across supported backends, including direct serialization for DatasetModel and DataFrameModel. That release also tightened correctness around strict and ordered schema errors, preserved nullable and timezone-aware pandas metadata more reliably, and improved PySpark and Ibis error reporting, so teams using one validation layer across pandas, Polars, Spark, and array data get fewer backend-specific surprises.
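To show what the pandas-free Polars path looks like in practice, here is a minimal schema-as-model sketch; the schema and data are my own example:

```python
import pandera.polars as pa
import polars as pl

# A DataFrameModel for a Polars frame; with 0.31.1 this import no
# longer drags pandas into a Polars-only environment.
class Orders(pa.DataFrameModel):
    order_id: int
    amount: float = pa.Field(ge=0)

df = pl.DataFrame({"order_id": [1, 2], "amount": [10.0, 99.9]})
validated = Orders.validate(df)  # raises SchemaError on violations
```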

