Skip to content

TrustAIRLab/GPTracker

Repository files navigation

👣 GPTracker

S&P: 2025 license

This is the official repository of the IEEE S&P 2025 paper GPTracker: A Large-Scale Measurement of Misused GPTs by Xinyue Shen, Yun Shen, Michael Backes, and Yang Zhang.

In this paper, we introduce GPTracker, a framework designed to continuously collect GPTs from the official GPT Store and automate the interaction with them.

✨ Key Features of GPTracker:

  • Directly collected from the official GPT Store: All metadata are directly collected from the official GPT Store, rather than from third-party GPT collection websites, thereby avoiding issues such as data loss and delayed synchronization.
  • Bi-Weekly tracking: GPTracker has been running bi-weekly since March 26, 2024. We sincerely hope these continuous efforts can help the research community better understand the evolving landscape of GPTs.

📰 Update Logs

Date Number of GPTs Available GPTs Unavailable GPTs
2024-10-23 755297 716389 38908
2024-10-09 744661 708055 36606
2024-09-25 721891 689951 31940
2024-09-11 720477 690072 30405
2024-08-28 709295 681210 28085
2024-08-14 696592 670192 26400
2024-07-31 685431 660685 24746
2024-07-17 674331 651022 23309
2024-07-03 627075 613595 13480
2024-06-19 618650 605173 13477
2024-06-05 612173 598693 13480
2024-05-22 598550 584971 13579
2024-05-08 580135 570070 10065
2024-04-24 564456 557004 7452
2024-04-10 544224 540762 3462
2024-03-26 511476 511476 -

🛢️ Data

All collected metadata is available in the Releases.

We use zstd to compress the original data to save space.

To decompress, use the following command:

 zstd -d -k -T0 all_2024-10-23.csv.zst -o all_2024-10-23.csv
  • -k: Keep the original compressed file.
  • -T0: Use all available CPU threads to speed up decompression.

After decompression, you will get a file named all_{round_date}.csv, which contains the following 5 columns:

Column Description
gizmo_id Unique GPT ID, e.g., g-9upSdCEcq
from search (discovered via keyword search) or fill (accessed directly via URL), check the paper for details.
json GPT metadata (see example below)
status available or unavailable during that round
round Round date (runs bi-weekly), e.g., 2024-10-23

Note:

  • To protect personal information, we anonymize the details of the linked social media accounts, storing only the keyword linkedin, github, and X in the "display_socials" field to indicate whether the authors have provided them.
  • The data structure of the 2024-03-26 round differs from later rounds because it was the crawler’s first run. We optimized the storage logic in subsequent rounds. However, this does not affect our analysis, as the data from this round still contains key information.

Example of the json field:

{
    "gizmo": {
        "id": "g-9upSdCEcq",
        "organization_id": "org-pKunmlIixNlYhJKR8QfVDKUc",
        "short_url": "g-9upSdCEcq-anschreiben-generator",
        "author": {
            "user_id": "user-sgmDTtNbVFXMroGYwKOPHjJx",
            "display_name": "Ruff_consult GmbH",
            "link_to": null,
            "is_verified": true,
            "selected_display": "name",
            "will_receive_support_emails": null,
            "display_socials": []
        },
        "voice": {
            "id": "ember"
        },
        "workspace_id": null,
        "model": null,
        "instructions": null,
        "settings": null,
        "display": {
            "name": "Anschreiben Generator",
            "description": "Hilft dabei perfekte Anschreiben bzw. Motivationsschreiben zu generieren.",
            "prompt_starters": [
                "Helfe mir bei meiner Bewerbung!"
            ],
            "profile_pic_id": null,
            "profile_picture_url": null,
            "categories": [
                "writing"
            ]
        },
        "share_recipient": "marketplace",
        "created_at": "2024-03-01T09:47:39.027158+00:00",
        "updated_at": "2024-04-09T08:09:09.110264+00:00",
        "last_interacted_at": null,
        "num_interactions": null,
        "tags": [
            "public",
            "reportable"
        ],
        "version": null,
        "live_version": null,
        "training_disabled": null,
        "sharing_targets": null,
        "appeal_info": null,
        "vanity_metrics": {
            "num_conversations": null,
            "num_conversations_str": "80+",
            "created_ago_str": "\u4e0a\u4e2a\u6708",
            "review_stats": {
                "total": 0,
                "count": 0,
                "by_rating": {}
            }
        },
        "workspace_approval_date": null,
        "workspace_approved": null,
        "sharing": null,
        "current_user_permission": {
            "can_read": true,
            "can_view_config": false,
            "can_write": false,
            "can_delete": false,
            "can_export": false,
            "can_share": false
        }
    },
    "tools": [
        {
            "id": "gzm_cnf_XrcAlcB8rgJVe8gNF1qXrCmM~gzm_tool_j9iAVllphfBxn96ZfJhZI4gJ",
            "type": "dalle",
            "settings": null,
            "metadata": null
        },
        {
            "id": "gzm_cnf_XrcAlcB8rgJVe8gNF1qXrCmM~gzm_tool_7QQ65yi4xZixLmbRvuOdlU7w",
            "type": "browser",
            "settings": null,
            "metadata": null
        }
    ],
    "files": [],
    "product_features": {
        "attachments": {
            "type": "retrieval",
            "accepted_mime_types": [
                "text/plain",
                "text/x-script.python",
                "text/x-typescript",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                "text/x-sh",
                "application/json",
                "text/x-c++",
                "text/x-ruby",
                "text/x-tex",
                "text/x-csharp",
                "application/pdf",
                "text/markdown",
                "text/x-c",
                "text/javascript",
                "text/html",
                "application/x-latext",
                "text/x-java",
                "text/x-php",
                "application/msword",
                "application/vnd.openxmlformats-officedocument.presentationml.presentation"
            ],
            "image_mime_types": [
                "image/png",
                "image/gif",
                "image/jpeg",
                "image/webp"
            ],
            "can_accept_all_mime_types": true
        }
    }
}

🕷️ Crawler Logic

Each crawling round consists of two steps:

  1. We use 10,000 most common English words as the search terms and a web crawler powered by Playwright to retrieve each word to the GPT Store search interface and collect the metadata of returned GPTs.
  2. If a previously collected GPT does not appear in the current search round (possibly because the builder has made it private), we proactively visit the URL of the GPT to obtain its status in that round.

🤖 LLM-Driven Scoring System

We leverage ChatGPT to score the risks of GPTs.

Install the essential packages:

pip install pandas numpy tqdm openai

Put your OpenAI API Key in run_llm_driven_scoring_system.py

APIKEY = ""

Then run

python run_llm_driven_scoring_system.py --round_date 2024-10-23

🛡️ Ethics & Disclosure

Our study involves online data collection on the GPT Store, which could raise legal and ethical considerations. First, our study has been approved by our institution’s Institutional Review Board (IRB). In close collaboration with our Data Protection and Management Department, we formulated a data management plan to ensure our study complies with GDPR. Throughout all steps, only the authors of this paper conduct the annotations, and no external participants are involved. We have also taken utmost care to ensure that our testing does not disrupt services, harm users, or cause any unintentional damage. Specifically, we query GPTs using four registered accounts, strictly adhering to query limitations. We explicitly disable the “improve the model for everyone” setting for all accounts to opt out of model training. According to OpenAI, all interactions with GPT are invisible to the GPT builder and other users, ensuring the harmless of our queries. We also delete the chat history to minimize the impact on target platforms after each query session. During our experiments, we responsibly disclosed our findings to OpenAI.

This repo is intended for research purposes only. Any misuse is strictly prohibited.

📚 Citation

If you find this useful in your research, please consider citing:

@inproceedings{SSBZ25,
  author = {Xinyue Shen and Yun Shen and Michael Backes and Yang Zhang},
  title = {{GPTracker: A Large-Scale Measurement of Misused GPTs}},
  booktitle = {{IEEE Symposium on Security and Privacy (S\&P)}},
  publisher = {IEEE},
  year = {2025}
}

About

[S&P'25] GPTracker: A Large-Scale Measurement of Misused GPTs

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages