Skip to content

adding pre import support for a CDB domain

craigmcchesney edited this page Jul 3, 2020 · 10 revisions

Table of Contents

overview

The Python pre-import script transforms custom Excel spreadsheet formats to the standard CDB import formats for CDB domains. It was developed to support the cable data collection process at APS, but might serve as a starting point for similar projects elsewhere.

The pre-import process reads the input spreadsheet one row at a time. A row dictionary is created for each input row, with values for each column of interest. A domain-specific helper class is used to customize the pre-import process. The helper creates a domain-specific output object from the row dictionary created by the parser. After parsing completes, we have a list of output objects, one for each row in the input spreadsheet. Finally, we iterate through the list of output objects and write a row to the output spreadsheet for each, using a domain-specific mapping generated by the helper to produce the spreadsheet columns with values from the output object.

The framework provides base classes that are overridden to provide pre-import support for a particular domain, PreImportHelper and OutputObject. The steps to add support for a domain are:

1. create PreImportHelper subclass

2. create OutputObject subclass

Each is described in more detail below.

1. create PreImportHelper subclass

You must create a subclass of PreImportHelper. This is the key framework class for handling a domain object. There are required abstract methods whose implementation is required, and optional methods that can be overridden to customize behavior. This will be discussed further below.

required abstract method overrides

tag()

Returns the name of the domain class that the helper is registered for. This must match the command-line argument "--type".

    @staticmethod
    def tag():
        return "Source"

num_input_columns()

Returns the number of input columns expected in the spreadsheet. This is used for validation, to check that the specified spreadsheet contains the expected number of columns. The actual input_column_list() might not handle each of the input columns, it might choose to ignore some.

    @classmethod
    def num_input_cols(cls):
        return 17

input_column_list()

Returns a list of InputColumnModel objects, each specifying a column index, dictionary key name, and optionally a flag indicating whether or not the column is required. In parsing the input spreadsheet, a row value dictionary is created with keys and values for each InputColumnModel in this list. The list should include an InputColumnModel for each column of interest from the custom input spreadsheet format.

    @classmethod
    def input_column_list(cls):
        column_list = [
            InputColumnModel(col_index=0, key=CABLE_TYPE_NAME_KEY, required=True),
            InputColumnModel(col_index=1, key=CABLE_TYPE_DESCRIPTION_KEY),
            InputColumnModel(col_index=2, key=CABLE_TYPE_MANUFACTURER_KEY),
            InputColumnModel(col_index=3, key=CABLE_TYPE_PART_NUMBER_KEY),
        ]
        return column_list

output_column_list()

Returns a list of OutputColumnModel objects, each specifying a column index, method, and label. As the input spreadsheet is parsed, an instance of the OutputObject (described in more detail below) is created for each row of the input and added to a list. The list should contain an OutputColumnModel for each column in the corresponding official CDB import format for the domain. To produce the output spreadsheet, the OutputColumnModel list is used to process each item in the OutputObject list, invoking the specified method on the OutputObject to obtain a value that is written to the column with specified index. The specified label is used in the header row to label that column.

    @staticmethod
    def output_column_list():
        column_list = [
            OutputColumnModel(col_index=0, method="get_name", label="Name"),
            OutputColumnModel(col_index=1, method="get_description", label="Description"),
            OutputColumnModel(col_index=2, method="get_contact_info", label="Contact Info"),
            OutputColumnModel(col_index=3, method="get_url", label="URL"),
        ]
        return column_list

get_output_object()

The row dictionary created during parsing for a row from the input spreadsheet is passed to this method. The method returns an instance of the OutputObject subclass (discussed below). It might also perform custom validation on the row dictionary. Here are two examples, one minimal implementation and the other customized.

    def get_output_object(self, input_dict):

        logging.debug("adding output object for: %s" % input_dict[CABLE_TYPE_NAME_KEY])
        return CableDesignOutputObject(helper=self, input_dict=input_dict)

In the following example, each row specifies a manufacturer. If we've already encountered a manufacturer in a previous row, we return "None", which indicates that the new row should be ignored.

    def get_output_object(self, input_dict):

        manufacturer = input_dict[CABLE_TYPE_MANUFACTURER_KEY]
        if len(manufacturer) == 0:
            return None

        if manufacturer not in self.manufacturers:
            # check to see if manufacturer exists as a CDB Source
            try:
                mfr_source = self.api.getSourceApi().get_source_by_name(manufacturer)
            except ApiException as ex:
                if "ObjectNotFound" not in ex.body:
                    print("exception retrieving source for manufacturer: %s - %s" % (manufacturer, ex.body))
                mfr_source = None
            if mfr_source:
                logging.debug("source already exists for manufacturer: %s, skipping" % manufacturer)
                return None
            else:
                logging.debug("adding output object for unique manufacturer: %s" % manufacturer)
                self.manufacturers.add(manufacturer)
                return SourceOutputObject(helper=self, input_dict=input_dict)

        else:
            logging.debug("ignoring duplicate manufacturer: %s" % manufacturer)
            return None

optional method overrides

add_parser_args()

Allows the subclass to specify additional domain-specific command line arguments.

    @staticmethod
    def add_parser_args(parser):
        parser.add_argument("--ownerId", help="CDB technical system ID for owner", required=True)

set_args()

Allows subclass to print information about domain-specific command line arguments, and capture their values to local helper instance variables.

    def set_args(self, args):
        super().set_args(args)
        print("top-level parent machine design node name: %s" % args.mdRoot)
        print("debugging info xlsx file: %s" % args.infoFile)
        self.md_root = args.mdRoot
        self.info_file = args.infoFile

input_row_is_empty_custom()

Allows subclass to determine if a row can be treated as blank in a domain-specific way. By default, a row is blank if all the cells are empty. But in some cases, we want to allow a blank row to contain a value in a single column that contains a sequential row identifier as shown below:

    # Treat a row that contains a single non-empty value in the "import id" column as an empty row.
    def input_row_is_empty_custom(self, input_dict, row_num):
        non_empty_count = sum([1 for val in input_dict.values() if len(str(val)) > 0])
        if non_empty_count == 1 and len(str(input_dict[CABLE_DESIGN_IMPORT_ID_KEY])) > 0:
            logging.debug("skipping empty row with non-empty import id: %s row: %d" %
                          (input_dict[CABLE_DESIGN_IMPORT_ID_KEY], row_num))
            return True

input_row_is_valid_custom()

Allows a subclass to perform custom validation on the input row dictionary. By default, a row is valid if it contains values for all InputColumnModel objects that are marked as required.

close()

Allows a subclass to take action at the end of the pre-import process, such as to write a specialized log file as shown below.

    def close(self):
        if len(self.missing_cable_types) > 0 or len(self.missing_endpoints) > 0 or len(self.nonunique_endpoints) > 0:
            output_book = xlsxwriter.Workbook(self.info_file)
            output_sheet = output_book.add_worksheet()

            output_sheet.write(0, 0, "missing cable types")
            output_sheet.write(0, 1, "missing endpoints")
            output_sheet.write(0, 2, "non-unique endpoints")

            row_index = 1
            for cable_type_name in self.missing_cable_types:
                output_sheet.write(row_index, 0, cable_type_name)
                row_index = row_index + 1

            row_index = 1
            for endpoint_name in self.missing_endpoints:
                output_sheet.write(row_index, 1, endpoint_name)
                row_index = row_index + 1

            row_index = 1
            for endpoint_name in self.nonunique_endpoints:
                output_sheet.write(row_index, 2, endpoint_name)
                row_index = row_index + 1

            output_book.close()

2. create OutputObject subclass

As described above, an instance of the domain-specific OutputObject subclass is created for each row in the input spreadsheet. It's main purpose is to provide getter methods for each column in the output spreadsheet, as determined by the output_column_list() method (discussed previously). Each OutputColumnModel object in the list returned by that method specifies a getter method name that must be implemented by the domain-specific OutputObject subclass.

The row dictionary created by the parser is a parameter of the OutputObject constructor, and saved as an instance variable. In some cases, the getter methods just return values from that dictionary to simply pass them through from the input spreadsheet to the output spreadsheet. In other cases, a CDB API query is performed using the input value to produce an output value. For example, we might need to lookup the CDB internal object ID for an item from its name.

There are no required abstract methods or optional overrides for OutputObject subclasses. They are purely domain-specific.

Here is an example of a simple, pass-through getter method that returns a value from the input row dictionary:

    def get_voltage(self):
        return self.input_dict[CABLE_DESIGN_VOLTAGE_KEY]

The second example is a more complex case where we perform a CDB API query using the cable type name to find the ID of the corresponding cable catalog object. If no object with that name is found, we want to mark the row as invalid but continue processing so that we can produce a log file that identifies all the problems in the spreadsheet instead of failing for an individual row. It sets the global flag "isValid" to False and adds the name to a list of missing cable types for use in the log file.

    def get_cable_type_id(self):

        global isValid

        cable_type_name = self.input_dict[CABLE_DESIGN_TYPE_KEY]

        if cable_type_name == "" or cable_type_name is None:
            return ""
        
        if self.helper.has_cable_type(cable_type_name):
            return self.helper.get_id_for_cable_type(cable_type_name)
        else:
            # check to see if cable type exists in CDB by name
            cable_type_object = None
            try:
                cable_type_object = self.helper.api.getCableCatalogItemApi().get_cable_catalog_item_by_name(cable_type_name)
            except ApiException as ex:
                if "ObjectNotFound" not in ex.body:
                    error_msg = "exception retrieving cable catalog item: %s - %s" % (cable_type_name, ex.body)
                    logging.error(error_msg)
                    sys.exit(msg)
                else:
                    isValid = False
                    self.helper.add_missing_cable_type(cable_type_name)
                    logging.error("ObjectNotFound exception for cable type with name: %s" % cable_type_name)
                    return None

            if cable_type_object:
                cable_type_id = cable_type_object.id
                logging.debug("found cable type with name: %s, id: %s" % (cable_type_name, cable_type_id))
                self.helper.set_id_for_cable_type(cable_type_name, cable_type_id)
                return cable_type_id
            else:
                isValid = False
                self.helper.add_missing_cable_type(cable_type_name)
                logging.error("cable_type_object from API result is None for name: %s" % cable_type_name)
                return None
Clone this wiki locally