Class KeyValueGroupedDataset<K,V>
- All Implemented Interfaces:
Serializable
Dataset
has been logically grouped by a user specified grouping key. Users should not
construct a KeyValueGroupedDataset
directly, but should instead call groupByKey
on
an existing Dataset
.
- Since:
- 2.0.0
- See Also:
-
Method Summary
Modifier and TypeMethodDescriptionagg
(TypedColumn<V, U1> col1) Computes the given aggregation, returning aDataset
of tuples for each unique key and the result of computing this aggregation over all elements in the group.agg
(TypedColumn<V, U1> col1, TypedColumn<V, U2> col2) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.agg
(TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.agg
(TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.agg
(TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.agg
(TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5, TypedColumn<V, U6> col6) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.agg
(TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5, TypedColumn<V, U6> col6, TypedColumn<V, U7> col7) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.agg
(TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5, TypedColumn<V, U6> col6, TypedColumn<V, U7> col7, TypedColumn<V, U8> col8) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.<U,
R> Dataset<R> cogroup
(KeyValueGroupedDataset<K, U> other, CoGroupFunction<K, V, U, R> f, Encoder<R> encoder) (Java-specific) Applies the given function to each cogrouped data.<U,
R> Dataset<R> cogroup
(KeyValueGroupedDataset<K, U> other, scala.Function3<K, scala.collection.Iterator<V>, scala.collection.Iterator<U>, scala.collection.IterableOnce<R>> f, Encoder<R> evidence$29) (Scala-specific) Applies the given function to each cogrouped data.<U,
R> Dataset<R> cogroupSorted
(KeyValueGroupedDataset<K, U> other, Column[] thisSortExprs, Column[] otherSortExprs, CoGroupFunction<K, V, U, R> f, Encoder<R> encoder) (Java-specific) Applies the given function to each sorted cogrouped data.<U,
R> Dataset<R> cogroupSorted
(KeyValueGroupedDataset<K, U> other, scala.collection.immutable.Seq<Column> thisSortExprs, scala.collection.immutable.Seq<Column> otherSortExprs, scala.Function3<K, scala.collection.Iterator<V>, scala.collection.Iterator<U>, scala.collection.IterableOnce<R>> f, Encoder<R> evidence$30) (Scala-specific) Applies the given function to each sorted cogrouped data.count()
Returns aDataset
that contains a tuple with each key and the number of items present for that key.<U> Dataset<U>
flatMapGroups
(FlatMapGroupsFunction<K, V, U> f, Encoder<U> encoder) (Java-specific) Applies the given function to each group of data.<U> Dataset<U>
flatMapGroups
(scala.Function2<K, scala.collection.Iterator<V>, scala.collection.IterableOnce<U>> f, Encoder<U> evidence$3) (Scala-specific) Applies the given function to each group of data.<S,
U> Dataset<U> flatMapGroupsWithState
(FlatMapGroupsWithStateFunction<K, V, S, U> func, OutputMode outputMode, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> flatMapGroupsWithState
(FlatMapGroupsWithStateFunction<K, V, S, U> func, OutputMode outputMode, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> flatMapGroupsWithState
(OutputMode outputMode, GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, scala.collection.Iterator<U>> func, Encoder<S> evidence$14, Encoder<U> evidence$15) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> flatMapGroupsWithState
(OutputMode outputMode, GroupStateTimeout timeoutConf, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, scala.collection.Iterator<U>> func, Encoder<S> evidence$12, Encoder<U> evidence$13) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<U> Dataset<U>
flatMapSortedGroups
(Column[] SortExprs, FlatMapGroupsFunction<K, V, U> f, Encoder<U> encoder) (Java-specific) Applies the given function to each group of data.<U> Dataset<U>
flatMapSortedGroups
(scala.collection.immutable.Seq<Column> sortExprs, scala.Function2<K, scala.collection.Iterator<V>, scala.collection.IterableOnce<U>> f, Encoder<U> evidence$4) (Scala-specific) Applies the given function to each group of data.<L> KeyValueGroupedDataset<L,
V> Returns a newKeyValueGroupedDataset
where the type of the key has been mapped to the specified type.keys()
Returns aDataset
that contains each unique key.<U> Dataset<U>
mapGroups
(MapGroupsFunction<K, V, U> f, Encoder<U> encoder) (Java-specific) Applies the given function to each group of data.<U> Dataset<U>
(Scala-specific) Applies the given function to each group of data.<S,
U> Dataset<U> mapGroupsWithState
(MapGroupsWithStateFunction<K, V, S, U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> mapGroupsWithState
(MapGroupsWithStateFunction<K, V, S, U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> mapGroupsWithState
(MapGroupsWithStateFunction<K, V, S, U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> mapGroupsWithState
(GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, U> func, Encoder<S> evidence$10, Encoder<U> evidence$11) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> mapGroupsWithState
(GroupStateTimeout timeoutConf, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, U> func, Encoder<S> evidence$8, Encoder<U> evidence$9) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<S,
U> Dataset<U> mapGroupsWithState
(scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, U> func, Encoder<S> evidence$6, Encoder<U> evidence$7) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.<W> KeyValueGroupedDataset<K,
W> mapValues
(MapFunction<V, W> func, Encoder<W> encoder) Returns a newKeyValueGroupedDataset
where the given functionfunc
has been applied to the data.<W> KeyValueGroupedDataset<K,
W> Returns a newKeyValueGroupedDataset
where the given functionfunc
has been applied to the data.org.apache.spark.sql.execution.QueryExecution
(Java-specific) Reduces the elements of each group of data using the specified binary function.reduceGroups
(scala.Function2<V, V, V> f) (Scala-specific) Reduces the elements of each group of data using the specified binary function.toString()
-
Method Details
-
agg
Computes the given aggregation, returning aDataset
of tuples for each unique key and the result of computing this aggregation over all elements in the group.- Parameters:
col1
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
agg
Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.- Parameters:
col1
- (undocumented)col2
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
agg
public <U1,U2, Dataset<scala.Tuple4<K,U3> U1, aggU2, U3>> (TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.- Parameters:
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
agg
public <U1,U2, Dataset<scala.Tuple5<K,U3, U4> U1, aggU2, U3, U4>> (TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.- Parameters:
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
agg
public <U1,U2, Dataset<scala.Tuple6<K,U3, U4, U5> U1, aggU2, U3, U4, U5>> (TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.- Parameters:
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.0.0
-
agg
public <U1,U2, Dataset<scala.Tuple7<K,U3, U4, U5, U6> U1, aggU2, U3, U4, U5, U6>> (TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5, TypedColumn<V, U6> col6) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.- Parameters:
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)col6
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.0.0
-
agg
public <U1,U2, Dataset<scala.Tuple8<K,U3, U4, U5, U6, U7> U1, aggU2, U3, U4, U5, U6, U7>> (TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5, TypedColumn<V, U6> col6, TypedColumn<V, U7> col7) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.- Parameters:
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)col6
- (undocumented)col7
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.0.0
-
agg
public <U1,U2, Dataset<scala.Tuple9<K,U3, U4, U5, U6, U7, U8> U1, aggU2, U3, U4, U5, U6, U7, U8>> (TypedColumn<V, U1> col1, TypedColumn<V, U2> col2, TypedColumn<V, U3> col3, TypedColumn<V, U4> col4, TypedColumn<V, U5> col5, TypedColumn<V, U6> col6, TypedColumn<V, U7> col7, TypedColumn<V, U8> col8) Computes the given aggregations, returning aDataset
of tuples for each unique key and the result of computing these aggregations over all elements in the group.- Parameters:
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)col6
- (undocumented)col7
- (undocumented)col8
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.0.0
-
cogroup
public <U,R> Dataset<R> cogroup(KeyValueGroupedDataset<K, U> other, scala.Function3<K, scala.collection.Iterator<V>, scala.collection.Iterator<U>, scala.collection.IterableOnce<R>> f, Encoder<R> evidence$29) (Scala-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and 2 iterators containing all elements in the group fromDataset
this
andother
. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.- Parameters:
other
- (undocumented)f
- (undocumented)evidence$29
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
cogroup
public <U,R> Dataset<R> cogroup(KeyValueGroupedDataset<K, U> other, CoGroupFunction<K, V, U, R> f, Encoder<R> encoder) (Java-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and 2 iterators containing all elements in the group fromDataset
this
andother
. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.- Parameters:
other
- (undocumented)f
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
cogroupSorted
public <U,R> Dataset<R> cogroupSorted(KeyValueGroupedDataset<K, U> other, scala.collection.immutable.Seq<Column> thisSortExprs, scala.collection.immutable.Seq<Column> otherSortExprs, scala.Function3<K, scala.collection.Iterator<V>, scala.collection.Iterator<U>, scala.collection.IterableOnce<R>> f, Encoder<R> evidence$30) (Scala-specific) Applies the given function to each sorted cogrouped data. For each unique group, the function will be passed the grouping key and 2 sorted iterators containing all elements in the group fromDataset
this
andother
. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.This is equivalent to
cogroup(org.apache.spark.sql.KeyValueGroupedDataset<K, U>, scala.Function3<K, scala.collection.Iterator<V>, scala.collection.Iterator<U>, scala.collection.IterableOnce<R>>, org.apache.spark.sql.Encoder<R>)
, except for the iterators to be sorted according to the given sort expressions. That sorting does not add computational complexity.- Parameters:
other
- (undocumented)thisSortExprs
- (undocumented)otherSortExprs
- (undocumented)f
- (undocumented)evidence$30
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
- See Also:
-
cogroupSorted
public <U,R> Dataset<R> cogroupSorted(KeyValueGroupedDataset<K, U> other, Column[] thisSortExprs, Column[] otherSortExprs, CoGroupFunction<K, V, U, R> f, Encoder<R> encoder) (Java-specific) Applies the given function to each sorted cogrouped data. For each unique group, the function will be passed the grouping key and 2 sorted iterators containing all elements in the group fromDataset
this
andother
. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.This is equivalent to
cogroup(org.apache.spark.sql.KeyValueGroupedDataset<K, U>, scala.Function3<K, scala.collection.Iterator<V>, scala.collection.Iterator<U>, scala.collection.IterableOnce<R>>, org.apache.spark.sql.Encoder<R>)
, except for the iterators to be sorted according to the given sort expressions. That sorting does not add computational complexity.- Parameters:
other
- (undocumented)thisSortExprs
- (undocumented)otherSortExprs
- (undocumented)f
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
- See Also:
-
count
Returns aDataset
that contains a tuple with each key and the number of items present for that key.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
flatMapGroups
public <U> Dataset<U> flatMapGroups(scala.Function2<K, scala.collection.Iterator<V>, scala.collection.IterableOnce<U>> f, Encoder<U> evidence$3) (Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.This function does not support partial aggregation, and as a result requires shuffling all the data in the
Dataset
. If an application intends to perform an aggregation over each key, it is best to use the reduce function or anorg.apache.spark.sql.expressions#Aggregator
.Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling
toList
) unless they are sure that this is possible given the memory constraints of their cluster.- Parameters:
f
- (undocumented)evidence$3
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
flatMapGroups
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.This function does not support partial aggregation, and as a result requires shuffling all the data in the
Dataset
. If an application intends to perform an aggregation over each key, it is best to use the reduce function or anorg.apache.spark.sql.expressions#Aggregator
.Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling
toList
) unless they are sure that this is possible given the memory constraints of their cluster.- Parameters:
f
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
flatMapGroupsWithState
public <S,U> Dataset<U> flatMapGroupsWithState(OutputMode outputMode, GroupStateTimeout timeoutConf, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, scala.collection.Iterator<U>> func, Encoder<S> evidence$12, Encoder<U> evidence$13) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.outputMode
- The output mode of the function.timeoutConf
- Timeout configuration for groups that do not receive data for a while.See
Encoder
for more details on what types are encodable to Spark SQL.evidence$12
- (undocumented)evidence$13
- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.2.0
-
flatMapGroupsWithState
public <S,U> Dataset<U> flatMapGroupsWithState(OutputMode outputMode, GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, scala.collection.Iterator<U>> func, Encoder<S> evidence$14, Encoder<U> evidence$15) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.outputMode
- The output mode of the function.timeoutConf
- Timeout configuration for groups that do not receive data for a while.initialState
- The user provided state that will be initialized when the first batch of data is processed in the streaming query. The user defined function will be called on the state data even if there are no other values in the group. To covert a Datasetds
of type of typeDataset[(K, S)]
to aKeyValueGroupedDataset[K, S]
, use
See {@link Encoder} for more details on what types are encodable to Spark SQL. @since 3.2.0ds.groupByKey(x => x._1).mapValues(_._2)
evidence$14
- (undocumented)evidence$15
- (undocumented)- Returns:
- (undocumented)
-
flatMapGroupsWithState
public <S,U> Dataset<U> flatMapGroupsWithState(FlatMapGroupsWithStateFunction<K, V, S, U> func, OutputMode outputMode, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.outputMode
- The output mode of the function.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.timeoutConf
- Timeout configuration for groups that do not receive data for a while.See
Encoder
for more details on what types are encodable to Spark SQL.- Returns:
- (undocumented)
- Since:
- 2.2.0
-
flatMapGroupsWithState
public <S,U> Dataset<U> flatMapGroupsWithState(FlatMapGroupsWithStateFunction<K, V, S, U> func, OutputMode outputMode, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.outputMode
- The output mode of the function.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.timeoutConf
- Timeout configuration for groups that do not receive data for a while.initialState
- The user provided state that will be initialized when the first batch of data is processed in the streaming query. The user defined function will be called on the state data even if there are no other values in the group. To covert a Datasetds
of type of typeDataset[(K, S)]
to aKeyValueGroupedDataset[K, S]
, use
See {@link Encoder} for more details on what types are encodable to Spark SQL. @since 3.2.0ds.groupByKey(x => x._1).mapValues(_._2)
- Returns:
- (undocumented)
-
flatMapSortedGroups
public <U> Dataset<U> flatMapSortedGroups(scala.collection.immutable.Seq<Column> sortExprs, scala.Function2<K, scala.collection.Iterator<V>, scala.collection.IterableOnce<U>> f, Encoder<U> evidence$4) (Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and a sorted iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.This function does not support partial aggregation, and as a result requires shuffling all the data in the
Dataset
. If an application intends to perform an aggregation over each key, it is best to use the reduce function or anorg.apache.spark.sql.expressions#Aggregator
.Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling
toList
) unless they are sure that this is possible given the memory constraints of their cluster.This is equivalent to
flatMapGroups(scala.Function2<K, scala.collection.Iterator<V>, scala.collection.IterableOnce<U>>, org.apache.spark.sql.Encoder<U>)
, except for the iterator to be sorted according to the given sort expressions. That sorting does not add computational complexity.- Parameters:
sortExprs
- (undocumented)f
- (undocumented)evidence$4
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
- See Also:
-
flatMapSortedGroups
public <U> Dataset<U> flatMapSortedGroups(Column[] SortExprs, FlatMapGroupsFunction<K, V, U> f, Encoder<U> encoder) (Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and a sorted iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a newDataset
.This function does not support partial aggregation, and as a result requires shuffling all the data in the
Dataset
. If an application intends to perform an aggregation over each key, it is best to use the reduce function or anorg.apache.spark.sql.expressions#Aggregator
.Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling
toList
) unless they are sure that this is possible given the memory constraints of their cluster.This is equivalent to
flatMapGroups(scala.Function2<K, scala.collection.Iterator<V>, scala.collection.IterableOnce<U>>, org.apache.spark.sql.Encoder<U>)
, except for the iterator to be sorted according to the given sort expressions. That sorting does not add computational complexity.- Parameters:
SortExprs
- (undocumented)f
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
- See Also:
-
keyAs
Returns a newKeyValueGroupedDataset
where the type of the key has been mapped to the specified type. The mapping of key columns to the type follows the same rules asas
onDataset
.- Parameters:
evidence$1
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
keys
Returns aDataset
that contains each unique key. This is equivalent to doing mapping over the Dataset to extract the keys and then running a distinct operation on those.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
mapGroups
public <U> Dataset<U> mapGroups(scala.Function2<K, scala.collection.Iterator<V>, U> f, Encoder<U> evidence$5) (Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a newDataset
.This function does not support partial aggregation, and as a result requires shuffling all the data in the
Dataset
. If an application intends to perform an aggregation over each key, it is best to use the reduce function or anorg.apache.spark.sql.expressions#Aggregator
.Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling
toList
) unless they are sure that this is possible given the memory constraints of their cluster.- Parameters:
f
- (undocumented)evidence$5
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
mapGroups
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a newDataset
.This function does not support partial aggregation, and as a result requires shuffling all the data in the
Dataset
. If an application intends to perform an aggregation over each key, it is best to use the reduce function or anorg.apache.spark.sql.expressions#Aggregator
.Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling
toList
) unless they are sure that this is possible given the memory constraints of their cluster.- Parameters:
f
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
mapGroupsWithState
public <S,U> Dataset<U> mapGroupsWithState(scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, U> func, Encoder<S> evidence$6, Encoder<U> evidence$7) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.See
Encoder
for more details on what types are encodable to Spark SQL.evidence$6
- (undocumented)evidence$7
- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.2.0
-
mapGroupsWithState
public <S,U> Dataset<U> mapGroupsWithState(GroupStateTimeout timeoutConf, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, U> func, Encoder<S> evidence$8, Encoder<U> evidence$9) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.timeoutConf
- Timeout configuration for groups that do not receive data for a while.See
Encoder
for more details on what types are encodable to Spark SQL.evidence$8
- (undocumented)evidence$9
- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.2.0
-
mapGroupsWithState
public <S,U> Dataset<U> mapGroupsWithState(GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState, scala.Function3<K, scala.collection.Iterator<V>, GroupState<S>, U> func, Encoder<S> evidence$10, Encoder<U> evidence$11) (Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.timeoutConf
- Timeout Conf, see GroupStateTimeout for more detailsinitialState
- The user provided state that will be initialized when the first batch of data is processed in the streaming query. The user defined function will be called on the state data even if there are no other values in the group. To convert a Dataset ds of type Dataset[(K, S)] to a KeyValueGroupedDataset[K, S] do
See {@link Encoder} for more details on what types are encodable to Spark SQL. @since 3.2.0ds.groupByKey(x => x._1).mapValues(_._2)
evidence$10
- (undocumented)evidence$11
- (undocumented)- Returns:
- (undocumented)
-
mapGroupsWithState
public <S,U> Dataset<U> mapGroupsWithState(MapGroupsWithStateFunction<K, V, S, U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.See
Encoder
for more details on what types are encodable to Spark SQL.- Returns:
- (undocumented)
- Since:
- 2.2.0
-
mapGroupsWithState
public <S,U> Dataset<U> mapGroupsWithState(MapGroupsWithStateFunction<K, V, S, U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.timeoutConf
- Timeout configuration for groups that do not receive data for a while.See
Encoder
for more details on what types are encodable to Spark SQL.- Returns:
- (undocumented)
- Since:
- 2.2.0
-
mapGroupsWithState
public <S,U> Dataset<U> mapGroupsWithState(MapGroupsWithStateFunction<K, V, S, U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf, KeyValueGroupedDataset<K, S> initialState) (Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. SeeGroupState
for more details.- Parameters:
func
- Function to be called on every group.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.timeoutConf
- Timeout configuration for groups that do not receive data for a while.initialState
- The user provided state that will be initialized when the first batch of data is processed in the streaming query. The user defined function will be called on the state data even if there are no other values in the group.See
Encoder
for more details on what types are encodable to Spark SQL.- Returns:
- (undocumented)
- Since:
- 3.2.0
-
mapValues
Returns a newKeyValueGroupedDataset
where the given functionfunc
has been applied to the data. The grouping key is unchanged by this.// Create values grouped by key from a Dataset[(K, V)] ds.groupByKey(_._1).mapValues(_._2) // Scala
- Parameters:
func
- (undocumented)evidence$2
- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.1.0
-
mapValues
Returns a newKeyValueGroupedDataset
where the given functionfunc
has been applied to the data. The grouping key is unchanged by this.// Create Integer values grouped by String key from a Dataset<Tuple2<String, Integer>> Dataset<Tuple2<String, Integer>> ds = ...; KeyValueGroupedDataset<String, Integer> grouped = ds.groupByKey(t -> t._1, Encoders.STRING()).mapValues(t -> t._2, Encoders.INT());
- Parameters:
func
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.1.0
-
queryExecution
public org.apache.spark.sql.execution.QueryExecution queryExecution() -
reduceGroups
(Scala-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.- Parameters:
f
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
reduceGroups
(Java-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.- Parameters:
f
- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
toString
-