Merk
Tilgang til denne siden krever autorisasjon. Du kan prøve å logge på eller endre kataloger.
Tilgang til denne siden krever autorisasjon. Du kan prøve å endre kataloger.
Clusters the output by the given columns. Records with similar values on the clustering columns are grouped together in the same file. Clustering improves query efficiency by allowing queries with predicates on the clustering columns to skip unnecessary data. Unlike partitioning, clustering can be used on high-cardinality columns.
Syntax
clusterBy(*cols)
Parameters
| Parameter | Type | Description |
|---|---|---|
*cols |
str or list | Names of the columns to cluster by. |
Returns
DataStreamWriter
Examples
df = spark.readStream.format("rate").load()
df.writeStream.clusterBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>
Cluster a Rate source stream by timestamp and write to Parquet:
import tempfile
import time
with tempfile.TemporaryDirectory(prefix="clusterBy1") as d:
with tempfile.TemporaryDirectory(prefix="clusterBy2") as cp:
df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
q = df.writeStream.clusterBy(
"timestamp").format("parquet").option("checkpointLocation", cp).start(d)
time.sleep(5)
q.stop()
spark.read.schema(df.schema).parquet(d).show()