-
-
Notifications
You must be signed in to change notification settings - Fork 245
PCP: PMDA: Introduce new PMDA for RDS #2230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
PCP: PMDA: Introduce new PMDA for RDS #2230
Conversation
This commit adds a new PMDA (Performance Metrics Domain Agent) for Reliable Datagram Sockets (RDS). It exports key metrics including connection information, socket and connection statistics, and details of send, receive, and retransmit queues for performance analysis using Performance Co-Pilot (PCP). This PMDA is intended to aid in diagnosing network-related issues on systems using RDS over Infiniband or TCP. Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Distinct lack of QA tests for the PMDA but also for the helper scripts - please add. And
if we can use a standard module or command rather than using libc directly and adding the python ping implementation(s), that'd be better. Other comments inline.
@@ -161,7 +161,8 @@ QA_TEST_PERL 252 | |||
SIMPLE 253 | |||
MEMORY_PYTHON 254 | |||
# end QA section | |||
### MORE FREE SLOTS 255..510 ### | |||
RDS 256 | |||
### MORE FREE SLOTS 257..510 ### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened to 255?
os.symlink(filename, pyf) | ||
os.chdir(PMDADIR) | ||
os.chdir(CWD) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what is this script doing? It definitely shouldn't be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me recheck this
|
||
sys.path.append("/var/lib/pcp/pmdas/rds") | ||
sys.path.append("/var/lib/pcp/pmdas/rds") | ||
from modules.rds_ping import rds_ping_all_avlbl_dest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove the hard-coded use of /var/lib/pcp/pmdas - this should be extracted via pmGetConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in my dev env, i was facing trouble without these lines, can you point me to some examples on how to use pmGetConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example:
PMDA_DIR = PCP.pmGetConfig('PCP_PMDAS_DIR') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
daemon_opt=false | ||
|
||
# Prepare PMDA Python files (needed for easier packaging) | ||
$PCP_PYTHON_PROG $PCP_PMDAS_DIR/rds/pyprep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this - see notes later regarding this script.
""" | ||
# Centralized metric mapping | ||
METRICS = { | ||
# CLUSTER_CONN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General notes here apply to the metric specifications below. All of the units seem to be set to zero - seems unlikely? All of the metrics seem to be hard-coded as PM_SEM_INSTANT - are there really no counters? That'd be very unusual and based on the names, alot of them look like counters. There's no help text for any of the metrics, so I can't tell what they are. The vast majority seem to be specified as 32 bit values too - this is also very unusual these days, most things would be 64bit. The metrics specified with "8k" and "1m" are duplicated - these should be specified once and have instances of 8k/1m instead of a null instance domain.
import time | ||
import json | ||
from time import perf_counter | ||
sys.path.append("/var/lib/pcp/pmdas/rds/modules") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard-coded PCP path here - use pmGetConfig.
sock_fd = socket.socket(socket.PF_RDS, socket.SOCK_SEQPACKET, 0) | ||
return sock_fd | ||
except socket.error as err: | ||
print("socket creation failed with error %s" % (err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of print (-to-stdout) - these look like errors for stderr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll fix it
print() | ||
|
||
def main(argv): | ||
parser = argparse.ArgumentParser(description='python version of rds-ping utility.', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this code come from? All authored for this PR (and GPL2+) or is there another author/license involved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is authored for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but there a C implementation of it available - https://github.com/oracle/rds-tools/blob/rdma-vos/rds-tools-2/rds-ping.c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its OK to stick with python over C, particularly since the rest of the PMDA is python code. How do we know this code works though? (there's no tests, not even indirect tests using the agent to run the code, nor even pylint static checks yet - let's get these in-place asap). Its a lot of code to add to the project for everyone to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Let me work on the test scripts.
|
||
# If the user requests help or an invalid option is passed | ||
if option == "--help" or option not in infos: | ||
print("Usage: python script.py <option>") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
script.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, let me fix it
if __name__ == "__main__": | ||
# Ensure an argument is passed | ||
if len(sys.argv) != 2: | ||
print("Usage: python rds_info.py <option>") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this could share a usage message with the above script.py code?
There were multiple reasons behind choosing the libc-based approach instead of relying on standard tools. We wanted to eliminate dependencies on native utilities like ping and rds-ping, especially since our use case required functionality beyond what these tools offer. Our implementation provides an enhanced version of rds-ping, which supports sending pings to multiple connections using a single socket. This socket can be bound to a specific source address and ToS (Type of Service) value, significantly reducing overhead and simplifying connection management. Additionally, the Python implementation of getsockopt allocates a fixed 1024-byte buffer when retrieving connection information from the kernel. While this is typically sufficient for generic socket information, it may be inadequate for extracting RDS protocol-specific data. Although CPython internally invokes the C getsockopt system call, it ignores the return value, which is critical for parsing the data bytes returned by the kernel. This omission can lead to incorrect parsing or missed information. In contrast, the direct C implementation using libc respects the return value and imposes no such buffer limitation, allowing for more accurate and flexible data extraction. Considering all these factors, we implemented the required tools directly using libc to gain better control, accuracy, and efficiency for our specific needs. |
This commit adds a new PMDA (Performance Metrics Domain Agent) for Reliable Datagram Sockets (RDS). It exports key metrics including connection information, socket and connection statistics, and details of send, receive, and retransmit queues for performance analysis using Performance Co-Pilot (PCP).
This PMDA is intended to aid in diagnosing network-related issues on systems using RDS over Infiniband or TCP.