
Conversation

@jiangzho (Contributor) commented on Aug 29, 2024

What changes were proposed in this pull request?

This PR includes Operator docs under docs/ for configuration, architecture, operations, and metrics.

Why are the changes needed?

Operator docs are necessary for users to understand the design and to get started with the operator installation.

Does this PR introduce any user-facing change?

No - new release

How was this patch tested?

CIs

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun (Member) left a comment:

Could you make CI happy?


dependencies {
implementation project(":spark-operator")
implementation("org.projectlombok:lombok:$lombokVersion")
Member comment:

To @jiangzho, it seems that your repository is a little outdated.

Apache Spark Kubernetes Operator follows the Gradle Version Catalog. Please rebase your repository and refer to the following commit.

@dongjoon-hyun (Member):

Gentle ping, @jiangzho.

@dongjoon-hyun changed the title from "[SPARK-49464] Add docs for operator" to "[SPARK-49464] Add documentations" on Sep 5, 2024

- JDK17
- Operator used fabric8 which assumes to be compatible with available k8s versions. However for using status subresource, please use k8s version 1.14 or above.
- Spark versions 3.4 or above
Member comment:

Apache Spark 3.4.x reaches the end-of-life very soon (2024-10-13).


### Compatibility

- JDK17
Member comment:

Java 17 and 21.

### Compatibility

- JDK17
- Operator used fabric8 which assumes to be compatible with available k8s versions. However for using status subresource, please use k8s version 1.14 or above.
@dongjoon-hyun (Member) commented on Sep 5, 2024:

As I pinged you already, the K8s ecosystem is moving fast in the public environment.

Just FYI, in the community, please don't claim what you didn't test explicitly.

- Operator used fabric8 which assumes to be compatible with available k8s versions. However for using status subresource, please use k8s version 1.14 or above.
- Spark versions 3.4 or above

## Manage Your Spark Operator
Member comment:

Remove this section because it's duplicated.

| operatorRbac.role.create | Whether to create Role for operator to use. At least one of `clusterRole.create` or `role.create` should be enabled | true |
| operatorRbac.roleBinding.create | Whether to create RoleBinding for operator to use. At least one of `clusterRoleBinding.create` or `roleBinding.create` should be enabled | true |
| operatorRbac.clusterRole.configManagement.roleName | Role name for operator configuration management (hot property loading and leader election) | `spark-operator-config-role` |
| appResources.namespaces.create | Whether to create dedicated namespaces for Spark apps. | `spark-operator-config-role-binding` |
Member comment:

Shall we add clusterResources first before adding this document? It looks a little weird because the document is missing one part while Apache Spark Operator supports both the SparkApp CRD and the SparkCluster CRD.

@jiangzho (Contributor, author):

Yep! Added a short field in the Spark Custom Resources page to start with. Also created SPARK-49528 to better document the template support for clusters.

@jiangzho (Contributor, author):

Actually +1 for the point - appResources can be a bit misleading, since it may serve both SparkApp and SparkCluster. It was introduced to indicate that these resources are for running Spark workloads (as opposed to the resources created for the operator deployment itself).

I shall fix this in SPARK-49623.

settings.gradle (outdated):
include 'spark-operator-api'
include 'spark-submission-worker'
include 'spark-operator'
include 'spark-operator-docs'
Member comment:

Shall we move this into the build-tools directory? In addition, spark-operator-docs sounds like a bit of an overclaim because it has only ConfOptionDocGenerator while the documentation has more content.

@jiangzho (Contributor, author):

Refactored to SPARK-49527

@dongjoon-hyun (Member):

Thank you for updating this.

@jiangzho marked this pull request as ready for review on September 13, 2024 at 00:44
commandLine "java", "-classpath", sourceSets.main.runtimeClasspath.getAsPath(), javaMainClass, docsPath
}

build.finalizedBy(generateConfPropsDoc)
@jiangzho (Contributor, author):

This ensures the generated doc is updated on every Gradle build when a new conf is introduced, if any.

# Design & Architecture

**Spark-Kubernetes-Operator** (Operator) acts as a control plane to manage the complete
deployment lifecycle of Spark applications. The Operator can be installed on a Kubernetes
Member comment:

This was correct, but not any more because we added the SparkCluster CRD.

I guess we need to revise README.md, too.

@jiangzho (Contributor, author):

Thanks a lot for the review and sorry for the late response!

I updated this, making a best effort to cover SparkCluster as well.

namespace and controls Spark deployments in one or more managed namespaces. The custom resource
definition (CRD) that describes the schema of a SparkApplication is a cluster wide resource.
For a CRD, the declaration must be registered before any resources of that CRDs kind(s) can be
used, and the registration process sometimes takes a few seconds.
Member comment:

Let's remove the CRD details. They distract from the Apache Spark K8s Operator explanation. We had better link to the K8s CRD document.

Users can interact with the operator using the kubectl or k8s API. The Operator continuously
Member comment:

Let's not assume that users are unaware of kubectl and helm. They are very common these days, even in ASF projects.

Users can interact with the operator using the kubectl or k8s API.

tracks cluster events relating to the SparkApplication custom resources. When the operator
receives a new resource update, it will take action to adjust the Kubernetes cluster to the
desired state as part of its reconciliation loop. The initial loop consists of the following
high-level steps:
Member comment:

Please try to rewrite the above paragraph into a single sentence.


* User submits a SparkApplication custom resource(CR) using kubectl / API
Member comment:

Let's be clear that this is one of two cases: SparkApplication and SparkCluster.
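For readers following this thread, a minimal SparkApplication submission might look like the sketch below. Only the `runtimeVersions` block is taken from an excerpt quoted later in this review; the API group/version, metadata, and the remaining spec fields are illustrative assumptions rather than the operator's confirmed schema.

```yaml
# spark-pi.yaml -- illustrative sketch only; apart from runtimeVersions,
# the field names and the API group/version are assumptions, not the verified CRD.
apiVersion: spark.apache.org/v1alpha1   # assumed group/version
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-apps                 # assumed operator-managed namespace
spec:
  mainClass: org.apache.spark.examples.SparkPi   # assumed field name
  runtimeVersions:
    scalaVersion: "2.13"
    sparkVersion: "4.0.0-preview1"
# Submit with: kubectl apply -f spark-pi.yaml
```

An equivalent SparkCluster resource would be submitted the same way, differing only in `kind` and its spec fields.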

desired state until the
current state becomes the desired state. All lifecycle management operations are realized
using this very simple
principle in the Operator.
Member comment:

Please remove "All lifecycle management operations are realized using this very simple principle in the Operator."


## State Transition

[<img src="resources/state.png">](resources/state.png)
Member comment:

PNG is not editable. When we need to update this in the future, how can we do that?

@jiangzho (Contributor, author):

Diagrams are created with draw.io, which allows importing and updating PNGs for simple diagrams. Would you suggest we add this to the docs as well? It's not user facing but may help future work.

[<img src="resources/state.png">](resources/state.png)

* Spark application are expected to run from submitted to succeeded before releasing resources
* User may configure the app CR to time-out after given threshold of time
Member comment:

Is this in the diagram?

@jiangzho (Contributor, author):

The application diagram tried to cover timeout blocks as well

* Spark application are expected to run from submitted to succeeded before releasing resources
* User may configure the app CR to time-out after given threshold of time
* In addition, user may configure the app CR to skip releasing resources after terminated. This is
typically used at dev phase: pods / configmaps. etc would be kept for debugging. They have
@dongjoon-hyun (Member) commented on Sep 17, 2024:

"pods / configmaps. etc"? Although the meaning is clear, could you revise this grammatically?

To enable hot properties loading, update the **helm chart values file** with

```
Member comment:

Redundant empty line.

# ... all other config overides...
dynamicConfig:
create: true
Member comment:

Redundant empty line.
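Assembling the fragments quoted above, and dropping the empty lines flagged in the review, the hot-properties override might read as the sketch below; the indentation and surrounding comment are reconstructed, and only `dynamicConfig.create` comes from the quoted diff.

```yaml
# Helm chart values override enabling hot properties loading (sketch).
# Only dynamicConfig.create appears in the quoted diff; the rest is placeholder context.
# ... all other config overrides ...
dynamicConfig:
  create: true
```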


## Config Metrics Publishing Behavior

Spark Operator uses the same source & sink interface as Apache Spark. You may
Member comment:

source & sink -> source and sink

under the License.
-->

# Metrics
Member comment:

Why do we have a new file for this? I'd recommend including this content in configuration.md.

under the License.
-->

# Operator Probes
Member comment:

Please remove this section.

* operator runtimeInfo health state
* Sentinel resources health state

### Operator Sentinel Resource
Member comment:

Please move this section into operations.md only and remove operator_probes.md completely.

runtimeVersions:
scalaVersion: "2.13"
sparkVersion: "4.0.0-preview1"
Member comment:

Redundant empty line.


## Config Metrics Publishing Behavior

Spark Operator uses the same source & sink interface as Apache Spark. You may
Member comment:

If Spark has a corresponding doc, it is better to add a hyperlink here.

the [Dropwizard Metrics Library](https://metrics.dropwizard.io/4.2.25/). Note that Spark Operator
does not have Spark UI, MetricsServlet
and PrometheusServlet from org.apache.spark.metrics.sink package are not supported. If you are
interested in Prometheus metrics exporting, please take a look at below section `Forward Metrics to Prometheus`
Member comment:

Add a hyperlink?


## Forward Metrics to Prometheus

In this section, we will show you how to forward spark operator metrics
Member comment:

Suggested change:
- In this section, we will show you how to forward spark operator metrics
+ In this section, we will show you how to forward Spark Operator metrics

Comment on lines 64 to 65
* Modify the
build-tools/helm/spark-kubernetes-operator/values.yaml file' s metrics properties section:
Member comment:

Suggested change:
- * Modify the
-   build-tools/helm/spark-kubernetes-operator/values.yaml file' s metrics properties section:
+ * Modify the metrics properties section in the file
+   `build-tools/helm/spark-kubernetes-operator/values.yaml`:

sink.PrometheusPullModelSink
```

* Install the Spark Operator
Member comment:

Suggested change:
- * Install the Spark Operator
+ * Install Spark Operator
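Putting the quoted pieces together, the metrics section of `build-tools/helm/spark-kubernetes-operator/values.yaml` might resemble the sketch below before installing the chart; the key path and the sink package are assumptions inferred from this thread, and only the `PrometheusPullModelSink` name comes from the quoted excerpt.

```yaml
# values.yaml metrics override (sketch). Key names are assumptions; replace
# <sink-package> with the operator's actual sink package.
metrics:
  properties: |
    *.sink.prometheus.class=<sink-package>.sink.PrometheusPullModelSink
```

With an override like this in place, installing the chart would configure the operator to expose metrics for Prometheus to pull, as described in the "Forward Metrics to Prometheus" section under review.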

@@ -0,0 +1,203 @@
## Spark Operator API
Member comment:

License header?

@jiangzho (Contributor, author):

Thanks for the catch!

We have disabled the license header check for markdown files, but I added the header back for files under /docs for consistency.

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Let's merge this as the initial draft.

@dongjoon-hyun (Member):

Thank you, @jiangzho and @viirya.

jiangzho added a commit to jiangzho/spark-kubernetes-operator that referenced this pull request Jul 17, 2025
bump internal version to 0.4.0.1