docs/troubleshooting: add sections to cover all post-create phases #1329

Merged · 1 commit · Feb 28, 2019
235 changes: 235 additions & 0 deletions docs/user/troubleshooting.md
@@ -99,6 +99,237 @@ This is safe to ignore and merely indicates that the etcd bootstrapping is still in progress.

The easiest way to get more debugging information from the installer is to check the log file (`.openshift_install.log`) in the install directory. Regardless of the logging level specified, the installer will write its logs in case they need to be inspected retroactively.
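
For example, to follow the log while the installer runs, or to surface only error-level entries afterwards (a sketch, assuming the default logrus text format of the log file):

```console
$ tail -f ${INSTALL_DIR}/.openshift_install.log
$ grep 'level=error' ${INSTALL_DIR}/.openshift_install.log
```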

### Installer Fails to Initialize the Cluster

The installer uses the [cluster-version-operator] to create all the components of an OpenShift cluster. When the installer fails to initialize the cluster, the most important information can be fetched by looking at the [ClusterVersion][clusterversion] and [ClusterOperator][clusteroperator] objects:

1. Inspecting the `ClusterVersion` object.

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusterversion -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
creationTimestamp: 2019-02-27T22:24:21Z
generation: 1
name: version
resourceVersion: "19927"
selfLink: /apis/config.openshift.io/v1/clusterversions/version
uid: 6e0f4cf8-3ade-11e9-9034-0a923b47ded4
spec:
channel: stable-4.0
clusterID: 5ec312f9-f729-429d-a454-61d4906896ca
upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
availableUpdates: null
conditions:
- lastTransitionTime: 2019-02-27T22:50:30Z
message: Done applying 4.0.0-0.alpha-2019-02-27-131049
status: "True"
type: Available
- lastTransitionTime: 2019-02-27T22:50:30Z
status: "False"
type: Failing
- lastTransitionTime: 2019-02-27T22:50:30Z
message: Cluster version is 4.0.0-0.alpha-2019-02-27-131049
status: "False"
type: Progressing
- lastTransitionTime: 2019-02-27T22:24:31Z
message: 'Unable to retrieve available updates: unknown version 4.0.0-0.alpha-2019-02-27-131049'
reason: RemoteFailed
status: "False"
type: RetrievedUpdates
desired:
image: registry.svc.ci.openshift.org/openshift/origin-release@sha256:91e6f754975963e7db1a9958075eb609ad226968623939d262d1cf45e9dbc39a
version: 4.0.0-0.alpha-2019-02-27-131049
history:
- completionTime: 2019-02-27T22:50:30Z
image: registry.svc.ci.openshift.org/openshift/origin-release@sha256:91e6f754975963e7db1a9958075eb609ad226968623939d262d1cf45e9dbc39a
startedTime: 2019-02-27T22:24:31Z
state: Completed
version: 4.0.0-0.alpha-2019-02-27-131049
observedGeneration: 1
versionHash: Wa7as_ik1qE=
```

Some of the most important [conditions][cluster-operator-conditions] to take note of are `Failing`, `Available`, and `Progressing`. You can look at the conditions using:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusterversion version -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
Available True Done applying 4.0.0-0.alpha-2019-02-26-194020
Failing False
Progressing False Cluster version is 4.0.0-0.alpha-2019-02-26-194020
RetrievedUpdates False Unable to retrieve available updates: unknown version 4.0.0-0.alpha-2019-02-26-194020
```
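
If the installation is still converging, you can also block until the `Available` condition is set (a sketch; requires an `oc` build that includes the `wait` subcommand):

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig wait clusterversion/version --for=condition=Available --timeout=30m
```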

2. Inspecting the `ClusterOperator` object.

You can get the status of all the cluster operators:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator
NAME VERSION AVAILABLE PROGRESSING FAILING SINCE
cluster-autoscaler True False False 17m
cluster-storage-operator True False False 10m
console True False False 7m21s
dns True False False 31m
image-registry True False False 9m58s
ingress True False False 10m
kube-apiserver True False False 28m
kube-controller-manager True False False 21m
kube-scheduler True False False 25m
machine-api True False False 17m
machine-config True False False 17m
marketplace-operator True False False 10m
monitoring True False False 8m23s
network True False False 13m
node-tuning True False False 11m
openshift-apiserver True False False 15m
openshift-authentication True False False 20m
openshift-cloud-credential-operator True False False 18m
openshift-controller-manager True False False 10m
openshift-samples True False False 8m42s
operator-lifecycle-manager True False False 17m
service-ca True False False 30m
```
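
With this many operators, it can be quicker to print just the `Available` condition of each one and scan for `False` entries (a sketch using a jsonpath filter expression):

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator -o=jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}'
```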

To get detailed information on why an individual cluster operator is `Failing` or not yet `Available`, you can check the status of that individual operator, for example `monitoring`:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator monitoring -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
creationTimestamp: 2019-02-27T22:47:04Z
generation: 1
name: monitoring
resourceVersion: "24677"
selfLink: /apis/config.openshift.io/v1/clusteroperators/monitoring
uid: 9a6a5ef9-3ae1-11e9-bad4-0a97b6ba9358
spec: {}
status:
conditions:
- lastTransitionTime: 2019-02-27T22:49:10Z
message: Successfully rolled out the stack.
status: "True"
type: Available
- lastTransitionTime: 2019-02-27T22:49:10Z
status: "False"
type: Progressing
- lastTransitionTime: 2019-02-27T22:49:10Z
status: "False"
type: Failing
extension: null
relatedObjects: null
version: ""
```

Again, each cluster operator publishes [conditions][cluster-operator-conditions] like `Failing`, `Available`, and `Progressing` that describe the current state of the operator:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator monitoring -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
Available True Successfully rolled out the stack
Progressing False
Failing False
```
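
If an operator reports `Failing` or never becomes `Available`, the pods in its namespace are the next place to look. A sketch for the monitoring stack, assuming it runs in the `openshift-monitoring` namespace (`<failing-pod>` is a placeholder):

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get pods -n openshift-monitoring
$ oc --config=${INSTALL_DIR}/auth/kubeconfig describe pod <failing-pod> -n openshift-monitoring
```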

Each `ClusterOperator` also publishes the list of objects it owns. To get that information:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator kube-apiserver -o=jsonpath='{.status.relatedObjects}'
[map[resource:kubeapiservers group:operator.openshift.io name:cluster] map[group: name:openshift-config resource:namespaces] map[group: name:openshift-config-managed resource:namespaces] map[group: name:openshift-kube-apiserver-operator resource:namespaces] map[group: name:openshift-kube-apiserver resource:namespaces]]
```
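
The raw jsonpath output above is dense; if `jq` is available on your workstation, the same list is easier to read as JSON:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator kube-apiserver -o json | jq '.status.relatedObjects'
```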

**NOTE:** Failure to initialize the cluster is usually not fatal to cluster creation: you can use the `ClusterOperator` statuses to debug the failing operator and take corrective action, which may allow the `cluster-version-operator` to make progress.

### Installer Fails to Fetch Console URL

The installer fetches the URL for the OpenShift console using the [route][route-object] in the `openshift-console` namespace. If the installer fails to fetch the console URL:

1. Check whether the console operator is `Available` or `Failing`:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator console -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
creationTimestamp: 2019-02-27T22:46:57Z
generation: 1
name: console
resourceVersion: "19682"
selfLink: /apis/config.openshift.io/v1/clusteroperators/console
uid: 960364aa-3ae1-11e9-bad4-0a97b6ba9358
spec: {}
status:
conditions:
- lastTransitionTime: 2019-02-27T22:46:58Z
status: "False"
type: Failing
- lastTransitionTime: 2019-02-27T22:50:12Z
status: "False"
type: Progressing
- lastTransitionTime: 2019-02-27T22:50:12Z
status: "True"
type: Available
- lastTransitionTime: 2019-02-27T22:46:57Z
status: "True"
type: Upgradeable
extension: null
relatedObjects:
- group: operator.openshift.io
name: cluster
resource: consoles
- group: config.openshift.io
name: cluster
resource: consoles
- group: oauth.openshift.io
name: console
resource: oauthclients
- group: ""
name: openshift-console-operator
resource: namespaces
- group: ""
name: openshift-console
resource: namespaces
versions: null
```

2. Manually get the URL for the `console` route:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get route console -n openshift-console -o=jsonpath='{.spec.host}'
console-openshift-console.apps.adahiya-1.devcluster.openshift.com
```
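
To verify that the route actually serves the console, you can request it directly (a sketch; `-k` skips TLS verification because your client may not trust the router CA yet):

```console
$ curl -skI "https://$(oc --config=${INSTALL_DIR}/auth/kubeconfig get route console -n openshift-console -o=jsonpath='{.spec.host}')"
```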

### Installer Fails to Add Route CA to Kubeconfig

The installer appends the router's CA certificate to the list of trusted certificate authorities in `${INSTALL_DIR}/auth/kubeconfig`. If the installer fails to add the router CA to the `kubeconfig`, you can fetch the router CA from the cluster using:

```console
$ oc --config=${INSTALL_DIR}/auth/kubeconfig get configmaps router-ca -n openshift-config-managed -o=jsonpath='{.data.ca-bundle\.crt}'
-----BEGIN CERTIFICATE-----
MIIC/TCCAeWgAwIBAgIBATANBgkqhkiG9w0BAQsFADAuMSwwKgYDVQQDDCNjbHVz
dGVyLWluZ3Jlc3Mtb3BlcmF0b3JAMTU1MTMwNzU4OTAeFw0xOTAyMjcyMjQ2Mjha
Fw0yMTAyMjYyMjQ2MjlaMC4xLDAqBgNVBAMMI2NsdXN0ZXItaW5ncmVzcy1vcGVy
YXRvckAxNTUxMzA3NTg5MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
uCA4fQ+2YXoXSUL4h/mcvJfrgpBfKBW5hfB8NcgXeCYiQPnCKblH1sEQnI3VC5Pk
2OfNCF3PUlfm4i8CHC95a7nCkRjmJNg1gVrWCvS/ohLgnO0BvszSiRLxIpuo3C4S
EVqqvxValHcbdAXWgZLQoYZXV7RMz8yZjl5CfhDaaItyBFj3GtIJkXgUwp/5sUfI
LDXW8MM6AXfuG+kweLdLCMm3g8WLLfLBLvVBKB+4IhIH7ll0buOz04RKhnYN+Ebw
tcvFi55vwuUCWMnGhWHGEQ8sWm/wLnNlOwsUz7S1/sW8nj87GFHzgkaVM9EOnoNI
gKhMBK9ItNzjrP6dgiKBCQIDAQABoyYwJDAOBgNVHQ8BAf8EBAMCAqQwEgYDVR0T
AQH/BAgwBgEB/wIBADANBgkqhkiG9w0BAQsFAAOCAQEAq+vi0sFKudaZ9aUQMMha
CeWx9CZvZBblnAWT/61UdpZKpFi4eJ2d33lGcfKwHOi2NP/iSKQBebfG0iNLVVPz
vwLbSG1i9R9GLdAbnHpPT9UG6fLaDIoKpnKiBfGENfxeiq5vTln2bAgivxrVlyiq
+MdDXFAWb6V4u2xh6RChI7akNsS3oU9PZ9YOs5e8vJp2YAEphht05X0swA+X8V8T
C278FFifpo0h3Q0Dbv8Rfn4UpBEtN4KkLeS+JeT+0o2XOsFZp7Uhr9yFIodRsnNo
H/Uwmab28ocNrGNiEVaVH6eTTQeeZuOdoQzUbClElpVmkrNGY0M42K0PvOQ/e7+y
AQ==
-----END CERTIFICATE-----
```

You can then **prepend** that certificate to the `certificate-authority-data` field in your `${INSTALL_DIR}/auth/kubeconfig`. Note that the field is base64-encoded, so you must decode the existing value, prepend the router CA, and re-encode the result.
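
A minimal sketch of that round-trip, assuming GNU `base64` and a kubeconfig with a single cluster entry:

```console
$ ROUTER_CA="$(oc --config=${INSTALL_DIR}/auth/kubeconfig get configmaps router-ca -n openshift-config-managed -o=jsonpath='{.data.ca-bundle\.crt}')"
$ EXISTING_CA="$(oc --config=${INSTALL_DIR}/auth/kubeconfig config view --raw -o=jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d)"
$ printf '%s\n%s\n' "${ROUTER_CA}" "${EXISTING_CA}" | base64 -w0
```

Paste the resulting string back into the `certificate-authority-data` field of the cluster entry.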

## Generic Troubleshooting

Here are some ideas if none of the [common failures](#common-failures) match your symptoms.
@@ -263,3 +494,7 @@ If appropriate, file a [network operator](https://github.com/openshift/cluster-network-operator) issue.
[aws-key-pairs]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html
[kubernetes-debug]: https://kubernetes.io/docs/tasks/debug-application-cluster/
[machine-config-daemon-ssh-keys]: https://github.com/openshift/machine-config-operator/blob/master/docs/Update-SSHKeys.md
[cluster-version-operator]: https://github.com/openshift/cluster-version-operator/blob/master/README.md
[clusterversion]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusterversion.md
[clusteroperator]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md
[cluster-operator-conditions]: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions