pkgsrc-WIP-changes archive


py-distributed: Update to 2024.4.2



Module Name:	pkgsrc-wip
Committed By:	Matthew Danielson <matthewd%fastmail.us@localhost>
Pushed By:	matthewd
Date:		Tue Apr 30 15:22:13 2024 -0700
Changeset:	b4199353a5f8906a8d126092d8bc1bcc94b04ecb

Modified Files:
	py-distributed/Makefile
	py-distributed/distinfo

Log Message:
py-distributed: Update to 2024.4.2

Changes for py-distributed are intermingled with those for py-dask.

2024.4.2
Highlights
Trivial Merge Implementation

The Query Optimizer will inspect queries to determine if a merge(...) or
groupby(...).apply(...) requires a shuffle. A shuffle can be avoided if
the DataFrame was shuffled on the same columns in a previous step
without any operations in between that change the partitioning layout or
the relevant values in each partition.

result = df.merge(df2, on="a")

result = result.merge(df3, on="a")

The Query Optimizer will identify that result was already shuffled on
"a" and will therefore shuffle only df3 in the second merge operation
before doing a blockwise merge.
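
As a rough illustration of why the second shuffle can be skipped, the
following pure-Python sketch hash-partitions rows on a key and then joins
partition by partition. It does not use Dask: hash_partition and
blockwise_merge are hypothetical helpers, the data is made up, and the
real optimizer works on expression graphs rather than lists of dicts.

```python
from collections import defaultdict

NPARTITIONS = 4  # arbitrary partition count chosen for this sketch


def hash_partition(rows, key, npartitions=NPARTITIONS):
    """Shuffle: place each row in the partition chosen by hashing its key."""
    parts = [[] for _ in range(npartitions)]
    for row in rows:
        parts[hash(row[key]) % npartitions].append(row)
    return parts


def blockwise_merge(left_parts, right_parts, key):
    """Inner-join partition i of the left with partition i of the right only."""
    out = []
    for left, right in zip(left_parts, right_parts):
        index = defaultdict(list)
        for rrow in right:
            index[rrow[key]].append(rrow)
        out.append([{**lrow, **rrow} for lrow in left for rrow in index[lrow[key]]])
    return out


# Hypothetical stand-ins for df, df2 and df3 (integer keys hash deterministically).
df = [{"a": i, "x": i * 10} for i in range(8)]
df2 = [{"a": i, "y": i * 100} for i in range(0, 8, 2)]
df3 = [{"a": i, "z": -i} for i in range(0, 8, 4)]

# First merge: both inputs are shuffled on "a".
result = blockwise_merge(hash_partition(df, "a"), hash_partition(df2, "a"), "a")

# Second merge: result is still partitioned on "a", so only df3 is shuffled.
result = blockwise_merge(result, hash_partition(df3, "a"), "a")

rows = sorted((r for part in result for r in part), key=lambda r: r["a"])
print(rows)
```

The join preserves the key of every row, so each output partition still
holds exactly the keys that hash to it. That invariant is what makes the
blockwise second merge valid without reshuffling result.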
Auto-partitioning in read_parquet

The Query Optimizer will automatically repartition datasets read from
Parquet files if individual partitions are too small. This will reduce
the number of partitions and consequently also the size of the task
graph.

The Optimizer aims to produce partitions of at least 75MB and will
combine multiple files together if necessary to reach this threshold.
The value can be configured by using

import dask

dask.config.set({"dataframe.parquet.minimum-partition-size": 100_000_000})

The value is given in bytes. The default threshold is relatively
conservative to avoid memory issues on worker nodes with a relatively
small amount of memory per thread.
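
To make the effect of the threshold concrete, here is a minimal sketch of
combining small files until a minimum size is reached. It is a plain
greedy grouping written for illustration, not Dask's actual algorithm;
combine_files and the file sizes are hypothetical.

```python
MIN_PARTITION_SIZE = 75_000_000  # bytes; the default threshold quoted above


def combine_files(file_sizes, minimum=MIN_PARTITION_SIZE):
    """Greedily group consecutive files until each group reaches `minimum` bytes."""
    partitions, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        current.append(index)
        current_size += size
        if current_size >= minimum:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        # Leftover files that never reached the threshold form a final, smaller group.
        partitions.append(current)
    return partitions


sizes = [20_000_000] * 10  # ten hypothetical 20 MB Parquet files
print(combine_files(sizes))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(combine_files(sizes, minimum=100_000_000))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

Raising the threshold from 75 MB to 100 MB turns three partitions into
two in this toy case, which is the kind of task-graph reduction the
release notes describe.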
2024.4.1

This is a minor bugfix release that fixes an error when importing
dask.dataframe with Python 3.11.9.

See GH#11035 and GH#11039 from Richard (Rick) Zamora for details.
2024.4.0
Highlights
Query planning fixes

This release contains a variety of bugfixes in Dask DataFrame’s new
query planner.
GPU metric dashboard fixes

GPU memory and utilization dashboard functionality has been restored.
Previously these plots were unintentionally left blank.

See GH#8572 from Benjamin Zaitlen for details.
2024.3.1

This is a minor release that primarily demotes an exception to a warning
if dask-expr is not installed when upgrading.
2024.3.0

Released on March 11, 2024
Highlights
Query planning

This release enables query planning by default for all users of
dask.dataframe.

The query planning functionality represents a rewrite of the DataFrame
implementation using dask-expr. This is a drop-in replacement and we
expect that most users will not have to adjust any of their code. Any
feedback can be reported on the Dask issue tracker or on the query
planning feedback issue.

If you encounter any issues, you can still opt out by setting

import dask

dask.config.set({'dataframe.query-planning': False})

Sunset of Pandas 1.X support

The new query planning backend requires at least pandas 2.0. This
pandas version will automatically be installed if you install from
conda or if you install dask[complete] or dask[dataframe] from pip.

The legacy DataFrame implementation still supports pandas 1.X if
you install dask without extras.
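
A quick way to see which side of that requirement an environment falls on
is to compare the installed pandas major version against 2. The helper
below is a hypothetical sketch that uses only the standard library; it is
not part of Dask and only looks at the major version number.

```python
from importlib.metadata import PackageNotFoundError, version


def supports_query_planning(pandas_version: str) -> bool:
    """Return True if the pandas major version is at least 2."""
    major = int(pandas_version.split(".")[0])
    return major >= 2


try:
    installed = version("pandas")
except PackageNotFoundError:
    installed = None  # pandas is not installed in this environment

if installed is None:
    print("pandas is not installed")
elif supports_query_planning(installed):
    print(f"pandas {installed}: meets the pandas 2.0 requirement")
else:
    print(f"pandas {installed}: only the legacy DataFrame implementation applies")
```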

To see a diff of this commit:
https://wip.pkgsrc.org/cgi-bin/gitweb.cgi?p=pkgsrc-wip.git;a=commitdiff;h=b4199353a5f8906a8d126092d8bc1bcc94b04ecb

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

diffstat:
 py-distributed/Makefile | 2 +-
 py-distributed/distinfo | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diffs:
diff --git a/py-distributed/Makefile b/py-distributed/Makefile
index d1e5a88d22..3d07b06df5 100644
--- a/py-distributed/Makefile
+++ b/py-distributed/Makefile
@@ -1,6 +1,6 @@
 # $NetBSD$
 
-DISTNAME=	distributed-2024.2.1
+DISTNAME=	distributed-2024.4.2
 PKGNAME=	${PYPKGPREFIX}-${DISTNAME}
 CATEGORIES=	devel net
 GITHUB_PROJECT=	distributed
diff --git a/py-distributed/distinfo b/py-distributed/distinfo
index 92d4d8b9fe..e61b045141 100644
--- a/py-distributed/distinfo
+++ b/py-distributed/distinfo
@@ -1,5 +1,5 @@
 $NetBSD$
 
-BLAKE2s (distributed-2024.2.1.tar.gz) = dad26e99ca0836b07c623a61f218bfc0213a0f830b44a0e692d0ab713cee463c
-SHA512 (distributed-2024.2.1.tar.gz) = 9f51d49a8e627e35a70889e25d2f61d9e7ca8906da9191c2691399e52af4e1fe3a9794b097a8d7c3b46fd7466b091c9be0170984b3e3a4bcb4f90902424b6b50
-Size (distributed-2024.2.1.tar.gz) = 2543868 bytes
+BLAKE2s (distributed-2024.4.2.tar.gz) = d42ea6e4acec1cafd4d40ca64fb05f5bee42998548b2364b1652c66018eccf02
+SHA512 (distributed-2024.4.2.tar.gz) = d0ef80466ddec9c8e5e1216e2efa42fb7b839d1b3ffc266a758624ff24bf0d5304bcb63b9a566e808a42819a481a71d052ca34b456685b044560baf225863136
+Size (distributed-2024.4.2.tar.gz) = 2552748 bytes

