Skip to content

PCP: PMDA: Introduce new PMDA for RDS #2230

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mohith-kumar-thummaluru
Copy link
Contributor

This commit adds a new PMDA (Performance Metrics Domain Agent) for Reliable Datagram Sockets (RDS). It exports key metrics including connection information, socket and connection statistics, and details of send, receive, and retransmit queues for performance analysis using Performance Co-Pilot (PCP).

This PMDA is intended to aid in diagnosing network-related issues on systems using RDS over Infiniband or TCP.

This commit adds a new PMDA (Performance Metrics Domain Agent) for
Reliable Datagram Sockets (RDS). It exports key metrics including
connection information, socket and connection statistics, and details
of send, receive, and retransmit queues for performance analysis using
Performance Co-Pilot (PCP).

This PMDA is intended to aid in diagnosing network-related issues
on systems using RDS over Infiniband or TCP.

Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
Copy link
Member

@natoscott natoscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinct lack of QA tests for the PMDA but also for the helper scripts - please add. And
if we can use a standard module or command rather than using libc directly and adding the python ping implementation(s), that'd be better. Other comments inline.

@@ -161,7 +161,8 @@ QA_TEST_PERL 252
SIMPLE 253
MEMORY_PYTHON 254
# end QA section
### MORE FREE SLOTS 255..510 ###
RDS 256
### MORE FREE SLOTS 257..510 ###
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened to 255?

os.symlink(filename, pyf)
os.chdir(PMDADIR)
os.chdir(CWD)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what is this script doing? It definitely shouldn't be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me recheck this


sys.path.append("/var/lib/pcp/pmdas/rds")
sys.path.append("/var/lib/pcp/pmdas/rds")
from modules.rds_ping import rds_ping_all_avlbl_dest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to remove the hard-coded use of /var/lib/pcp/pmdas - this should be extracted via pmGetConfig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my dev env, i was facing trouble without these lines, can you point me to some examples on how to use pmGetConfig

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example:

PMDA_DIR = PCP.pmGetConfig('PCP_PMDAS_DIR')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

daemon_opt=false

# Prepare PMDA Python files (needed for easier packaging)
$PCP_PYTHON_PROG $PCP_PMDAS_DIR/rds/pyprep
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this - see notes later regarding this script.

"""
# Centralized metric mapping
METRICS = {
# CLUSTER_CONN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General notes here apply to the metric specifications below. All of the units seem to be set to zero - seems unlikely? All of the metrics seem to be hard-coded as PM_SEM_INSTANT - are there really no counters? That'd be very unusual and based on the names, alot of them look like counters. There's no help text for any of the metrics, so I can't tell what they are. The vast majority seem to be specified as 32 bit values too - this is also very unusual these days, most things would be 64bit. The metrics specified with "8k" and "1m" are duplicated - these should be specified once and have instances of 8k/1m instead of a null instance domain.

import time
import json
from time import perf_counter
sys.path.append("/var/lib/pcp/pmdas/rds/modules")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coded PCP path here - use pmGetConfig.

sock_fd = socket.socket(socket.PF_RDS, socket.SOCK_SEQPACKET, 0)
return sock_fd
except socket.error as err:
print("socket creation failed with error %s" % (err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of print (-to-stdout) - these look like errors for stderr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll fix it

print()

def main(argv):
parser = argparse.ArgumentParser(description='python version of rds-ping utility.',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did this code come from? All authored for this PR (and GPL2+) or is there another author/license involved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is authored for this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its OK to stick with python over C, particularly since the rest of the PMDA is python code. How do we know this code works though? (there's no tests, not even indirect tests using the agent to run the code, nor even pylint static checks yet - let's get these in-place asap). Its a lot of code to add to the project for everyone to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let me work on the test scripts.


# If the user requests help or an invalid option is passed
if option == "--help" or option not in infos:
print("Usage: python script.py <option>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

script.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, let me fix it

if __name__ == "__main__":
# Ensure an argument is passed
if len(sys.argv) != 2:
print("Usage: python rds_info.py <option>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this could share a usage message with the above script.py code?

@mohith-kumar-thummaluru
Copy link
Contributor Author

if we can use a standard module or command rather than using libc directly and adding the python ping implementation(s), that'd be better.

There were multiple reasons behind choosing the libc-based approach instead of relying on standard tools.

We wanted to eliminate dependencies on native utilities like ping and rds-ping, especially since our use case required functionality beyond what these tools offer. Our implementation provides an enhanced version of rds-ping, which supports sending pings to multiple connections using a single socket. This socket can be bound to a specific source address and ToS (Type of Service) value, significantly reducing overhead and simplifying connection management.

Additionally, the Python implementation of getsockopt allocates a fixed 1024-byte buffer when retrieving connection information from the kernel. While this is typically sufficient for generic socket information, it may be inadequate for extracting RDS protocol-specific data. Although CPython internally invokes the C getsockopt system call, it ignores the return value, which is critical for parsing the data bytes returned by the kernel. This omission can lead to incorrect parsing or missed information. In contrast, the direct C implementation using libc respects the return value and imposes no such buffer limitation, allowing for more accurate and flexible data extraction.

Considering all these factors, we implemented the required tools directly using libc to gain better control, accuracy, and efficiency for our specific needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants