[mellanox]: Backport patches to increase critical threshold for ASIC and validate transceiver temperature (#185)

Backport new patches to increase the ASIC critical threshold from 110C to 140C, and validate the transceiver critical threshold temperature:

1. 0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch   torvalds/linux@b06ca3d
2. 0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch  torvalds/linux@57726eb

This change has been verified on all Mellanox devices based on the Spectrum-1, Spectrum-2, and Spectrum-3 ASICs.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
keboliu committed Jan 12, 2021
1 parent a7c1af7 commit 548e8e0
Showing 3 changed files with 98 additions and 0 deletions.
@@ -0,0 +1,42 @@
From f79a25e99568d19ed2cd39de4650ced66de4ab5d Mon Sep 17 00:00:00 2001
From: Vadim Pasternak <vadimp@nvidia.com>
Date: Thu, 31 Dec 2020 19:27:02 +0200
Subject: [PATCH mlxsw/net-next 1/1] mlxsw: core: Increase critical threshold
for ASIC thermal zone

Increase critical threshold for ASIC thermal zone from 110C to 140C
according to the system hardware requirements. All the supported ASICs
(SX, Spectrum1, Spectrum2, Spectrum3) can still operate with an
ASIC temperature below 140C.

According to the system requirements, software thermal
protection is the second level of protection, while the
first level of protection should be performed by
firmware. Firmware can therefore decide to perform a
system thermal shutdown while the temperature is still
below 140C. If firmware does not do so and the ASIC
temperature reaches 140C, the second level of thermal
protection is performed by software.

Fixes: 41e760841d26 ("mlxsw: core: Replace thermal temperature trips with defines")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 141e3655e211..d575aa469517 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -19,7 +19,7 @@
#define MLXSW_THERMAL_ASIC_TEMP_NORM 75000 /* 75C */
#define MLXSW_THERMAL_ASIC_TEMP_HIGH 85000 /* 85C */
#define MLXSW_THERMAL_ASIC_TEMP_HOT 105000 /* 105C */
-#define MLXSW_THERMAL_ASIC_TEMP_CRIT 110000 /* 110C */
+#define MLXSW_THERMAL_ASIC_TEMP_CRIT 140000 /* 140C */
#define MLXSW_THERMAL_MODULE_TEMP_NORM 60000 /* 60C */
#define MLXSW_THERMAL_MODULE_TEMP_HIGH 70000 /* 70C */
#define MLXSW_THERMAL_MODULE_TEMP_HOT 80000 /* 80C */
--
2.11.0
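
The effect of the threshold change above can be sketched in plain C. This is a minimal userspace illustration, not driver code: the four `MLXSW_THERMAL_ASIC_TEMP_*` values are copied from the diff, while the `asic_trip` enum, the `asic_trip_for_temp()` helper, and the "greater than or equal" comparison are assumptions made for the example.

```c
#include <assert.h>

/* Trip temperatures in millidegrees Celsius, copied from the diff above. */
#define MLXSW_THERMAL_ASIC_TEMP_NORM	75000	/* 75C */
#define MLXSW_THERMAL_ASIC_TEMP_HIGH	85000	/* 85C */
#define MLXSW_THERMAL_ASIC_TEMP_HOT	105000	/* 105C */
#define MLXSW_THERMAL_ASIC_TEMP_CRIT	140000	/* 140C, previously 110C */

/* The trip points only make sense if they are strictly increasing. */
static_assert(MLXSW_THERMAL_ASIC_TEMP_HOT < MLXSW_THERMAL_ASIC_TEMP_CRIT,
	      "critical trip must sit above the hot trip");

/* Hypothetical enum for illustration; it does not exist in the driver. */
enum asic_trip {
	ASIC_TRIP_NONE,
	ASIC_TRIP_NORM,
	ASIC_TRIP_HIGH,
	ASIC_TRIP_HOT,
	ASIC_TRIP_CRIT,
};

/*
 * Hypothetical helper: return the highest trip point that temp_mc
 * (millidegrees Celsius) has reached. Treating "reached" as ">=" is an
 * assumption of this sketch.
 */
static enum asic_trip asic_trip_for_temp(int temp_mc)
{
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_CRIT)
		return ASIC_TRIP_CRIT;
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_HOT)
		return ASIC_TRIP_HOT;
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_HIGH)
		return ASIC_TRIP_HIGH;
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_NORM)
		return ASIC_TRIP_NORM;
	return ASIC_TRIP_NONE;
}
```

With the old 110C value, a reading of 110000 would already count as critical. With the new value the same reading stays at the hot trip, which leaves the range up to 140C to the firmware as the first level of protection.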

@@ -0,0 +1,54 @@
From 8845c46138eca323a3fe9cd1332190f092f35888 Mon Sep 17 00:00:00 2001
From: Vadim Pasternak <vadimp@nvidia.com>
Date: Thu, 7 Jan 2021 12:56:21 +0200
Subject: [PATCH mlxsw/backport 2/2] mlxsw: core: Add validation of transceiver
temperature thresholds

Validate thresholds to avoid a single failure due to transceiver
unreliability. Ignore the last readouts when the warning temperature is
above the alarm temperature, since this can cause an unexpected thermal
shutdown. Keep the previous values and refresh the thresholds on the
next iteration.

This is a rare scenario, but it has been observed once at a customer
site.

Fixes: 6a79507cfe94 ("mlxsw: core: Extend thermal module with per QSFP module thermal zones")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 54d0e8b8d..477c3ed53 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -183,6 +183,12 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
	if (err)
		return err;

+	if (crit_temp > emerg_temp) {
+		dev_warn(dev, "%s : Critical threshold %d is above emergency threshold %d\n",
+			 tz->tzdev->type, crit_temp, emerg_temp);
+		return 0;
+	}
+
	/* According to the system thermal requirements, the thermal zones are
	 * defined with four trip points. The critical and emergency
	 * temperature thresholds, provided by QSFP module are set as "active"
@@ -197,11 +203,8 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
	tz->trips[MLXSW_THERMAL_TEMP_TRIP_NORM].temp = crit_temp;
	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HIGH].temp = crit_temp;
	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HOT].temp = emerg_temp;
-	if (emerg_temp > crit_temp)
-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
+	tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
					MLXSW_THERMAL_MODULE_TEMP_SHIFT;
-	else
-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp;

	return 0;
}
--
2.11.0
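
The validation added by the patch above can be sketched as a standalone C function. This is a simplified illustration under stated assumptions: `struct module_trips`, the `TRIP_*` indices, and `module_trips_update()` are hypothetical stand-ins for the driver's thermal zone structures, and the `MODULE_TEMP_SHIFT` value of 10000 is a placeholder, since the real `MLXSW_THERMAL_MODULE_TEMP_SHIFT` definition is not shown in this diff. Unlike the driver function, which returns 0 in both cases, the sketch returns a boolean so that a rejected readout is visible to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder value in millidegrees Celsius; the driver's value is not shown here. */
#define MODULE_TEMP_SHIFT 10000

/* Hypothetical trip indices mirroring the NORM/HIGH/HOT/CRIT order in the diff. */
enum { TRIP_NORM, TRIP_HIGH, TRIP_HOT, TRIP_CRIT, TRIP_COUNT };

/* Hypothetical, simplified stand-in for a per-module thermal zone. */
struct module_trips {
	int temp[TRIP_COUNT];
};

/*
 * Update the trip points from a fresh transceiver readout.
 * Returns true if the trips were updated, false if the readout was
 * rejected and the previous values were kept.
 */
static bool module_trips_update(struct module_trips *trips,
				int crit_temp, int emerg_temp)
{
	assert(trips != NULL);

	/* Inconsistent readout: warning threshold above alarm threshold. */
	if (crit_temp > emerg_temp)
		return false;

	trips->temp[TRIP_NORM] = crit_temp;
	trips->temp[TRIP_HIGH] = crit_temp;
	trips->temp[TRIP_HOT] = emerg_temp;
	/* After the patch the shift is applied unconditionally. */
	trips->temp[TRIP_CRIT] = emerg_temp + MODULE_TEMP_SHIFT;
	return true;
}
```

A caller polling the module would simply invoke the function again on the next iteration. A rejected readout leaves every trip point at its last valid value, so one bad sample cannot lower the critical trip and cause a shutdown.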

2 changes: 2 additions & 0 deletions patch/series
@@ -70,6 +70,8 @@ driver-ixgbe-external-phy.patch
0019-mlxsw-i2c-Allow-flexible-setting-of-I2C-transactions.patch
0020-mlxsw-core-Set-different-thermal-polling-time-based.patch
0021-platform-x86-mlx-platform-Remove-PSU-EEPROM-configur.patch
0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch
0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch
############################################################
#
# Internal patches will be added below (placeholder)