Skip to content

Latest commit

 

History

History
165 lines (154 loc) · 7.94 KB

quality-tracking-metrics.md

File metadata and controls

165 lines (154 loc) · 7.94 KB

The current metric of measuring quality data onboarding / potential abuse consists of:

  • Data stored in deals where the CID is shared across client DataCap applications
    • $(n-1)*\text{size of CID}$, where n is the number of duplicates
    • If two clients use the same dataset, but preparation is different, then we are counting them as different CIDs.
    • This likely means that the data is being duplicated many times and the client is storing it with the same SP
  • Data stored in deals where the CID has been stored more than 2 times with the same SP ID
    • $(n-1)*\text{size of CID}$, where n is the number of duplicates
    • This likely means that the same data is being shared and stored by different clients
  • All data stored by a client address where the client address has used between 10% and 90% of their DataCap allocation, has not made any deals in the last 6 weeks, and has stored everything in unique CIDs
    • This likely means that this not being duplicated at all and therefore not being distributed enough

The following is the a query specifying how we will tag deals under these three flags and its explanation:

CREATE OR REPLACE FUNCTION epoch_to_timestamp(epoch INTEGER)
    RETURNS INTEGER AS $$
BEGIN
    RETURN epoch * 30 + 1598306400;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION tag_over_replicated_deals()
    RETURNS VOID AS $$
BEGIN
    INSERT INTO deal_tags (deal_id, cid_overreplicated, sector_start, piece_size)
    SELECT
        cs.deal_id,
        cs.row_number > 1 AS cid_overreplicated,
        TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)),
        cs.piece_size
    FROM (
             SELECT
                 *,
                 ROW_NUMBER() OVER (
                     PARTITION BY piece_cid, provider
                     ORDER BY sector_start_epoch
                     ) AS row_number
             FROM
                 current_state
             WHERE
                     verified_deal = true
               AND sector_start_epoch > 0
         ) AS cs
    ON CONFLICT (deal_id)
        DO UPDATE SET cid_overreplicated = EXCLUDED.cid_overreplicated,
                      sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION tag_cid_sharing_deals()
    RETURNS VOID AS $$
BEGIN
    WITH owner_count AS (
        SELECT
            piece_cid,
            COUNT(DISTINCT owner) AS owner_count
        FROM (
                 SELECT
                     current_state.piece_cid,
                     COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner
                 FROM
                     current_state
                         LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address
                 WHERE
                         verified_deal = true
                   AND sector_start_epoch > 0
             ) AS distinct_owners
        GROUP BY piece_cid
    )
    INSERT INTO deal_tags (deal_id, cid_shared, sector_start, piece_size)
    SELECT
        cs.deal_id,
        (cs.row_number > 1 AND oc.owner_count > 1) AS cid_shared,
        TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)),
        cs.piece_size
    FROM (
             SELECT
                 cs_with_owner.*,
                 ROW_NUMBER() OVER (
                     PARTITION BY piece_cid, owner
                     ORDER BY sector_start_epoch
                     ) AS row_number
             FROM (
                      SELECT
                          current_state.*,
                          COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner
                      FROM
                          current_state
                              LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address
                      WHERE
                              verified_deal = true
                        AND sector_start_epoch > 0
                  ) AS cs_with_owner
         ) AS cs
             JOIN owner_count oc ON cs.piece_cid = oc.piece_cid
    ON CONFLICT (deal_id)
        DO UPDATE SET cid_shared = EXCLUDED.cid_shared,
                      sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size;
END;
$$ LANGUAGE plpgsql;


CREATE OR REPLACE FUNCTION tag_cid_unique_deals()
    RETURNS VOID AS $$
BEGIN
    WITH eligible_clients AS (
        SELECT
            COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client,
            SUM(cs.piece_size) AS total_storage,
            MAX(cs.sector_start_epoch) AS max_sector_start_epoch
        FROM
            current_state cs
                LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address
        WHERE
                cs.verified_deal = true
          AND cs.sector_start_epoch > 0
        GROUP BY
            grouped_client
        HAVING
                SUM(cs.piece_size) > (1::BIGINT * 1024 * 1024 * 1024 * 1024)
           AND (TO_TIMESTAMP(epoch_to_timestamp(MAX(cs.sector_start_epoch))) + interval '6 weeks') <= NOW()
    ),
         unique_deals AS (
             SELECT
                 cs.*,
                 COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client,
                 ROW_NUMBER() OVER (
                     PARTITION BY COALESCE(gcm.unique_id::TEXT, cs.client), cs.piece_cid
                     ORDER BY cs.sector_start_epoch
                     ) AS row_number
             FROM
                 current_state cs
                     LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address
             WHERE
                     cs.verified_deal = true
               AND cs.sector_start_epoch > 0
         )
    INSERT INTO deal_tags (deal_id, cid_unique, sector_start, piece_size)
    SELECT
        ud.deal_id,
        (ud.row_number = 1 AND ec.grouped_client IS NOT NULL) AS cid_unique,
        TO_TIMESTAMP(epoch_to_timestamp(ud.sector_start_epoch)),
        ud.piece_size
    FROM
        unique_deals ud
            LEFT JOIN eligible_clients ec ON ud.grouped_client = ec.grouped_client
    ON CONFLICT (deal_id)
        DO UPDATE SET
                      cid_unique = EXCLUDED.cid_unique,
                      sector_start = EXCLUDED.sector_start,
                      piece_size = EXCLUDED.piece_size;
END;
$$ LANGUAGE plpgsql;

This code defines four PostgreSQL functions using PL/pgSQL, a procedural language for PostgreSQL. These functions are used to tag and process deal data related to file storage.

epoch_to_timestamp(epoch INTEGER): This function takes an integer epoch as input and returns an integer. It converts the given epoch value to a timestamp by multiplying it by 30 and then adding the constant 1598306400. The returned value represents Unix time.

tag_over_replicated_deals(): This function tags deals that are over-replicated. It does so by inserting or updating deal tags in the deal_tags table based on the calculated cid_overreplicated value, which indicates whether the deal is over-replicated or not. The function also sets sector_start and piece_size columns for the corresponding deal.

tag_cid_sharing_deals(): This function tags deals that share content identifiers (CIDs) across multiple clients. It calculates the cid_shared value, which indicates whether the deal shares a CID with other deals, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal.

tag_cid_unique_deals(): This function tags deals with unique content identifiers (CIDs) for clients that have stored more than 1 TiB of data and have not added any new storage within the last 6 weeks. It calculates the cid_unique value, which indicates whether the deal has a unique CID or not, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal.

In summary, these four functions are designed to process and tag deal data in a PostgreSQL database, which can be useful for tracking and analyzing data storage deals.