BERTopic Loading Issue #1764

Closed
isiktopcu opened this issue Jan 20, 2024 · 7 comments · Fixed by #1960
@isiktopcu

Hi, I used dimensionality reduction techniques and saved the model. I can load it, but it doesn't predict topics for a new dataset:

```python
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine')

hdbscan_model = HDBSCAN(min_cluster_size=15,
                        metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True)

topic_model = BERTopic(umap_model=umap_model,
                    hdbscan_model=hdbscan_model,
                    calculate_probabilities=True,
                    language=language,
                    nr_topics=50)

model_save_path = os.path.join(repo_path, location, f'{location}_model')
topic_model.save(model_save_path)
print(f"Model saved to {model_save_path}")

model_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/istanbul/istanbul_model"
model = BERTopic.load(model_path)

data_to_predict_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/stratified_sample_20K.json"
df_to_predict = pd.read_json(data_to_predict_path, orient="records", lines=True)
docs_to_predict = df_to_predict['Başvuru Açıklaması'].tolist() # Replace with your actual column name
topics, probabilities = model.transform(docs_to_predict)
```

AttributeError: 'tuple' object has no attribute 'shape'

@MaartenGr (Owner)

Could you also share your imports and clean up the formatting a bit? As it is, it is difficult to read without the ```python tags. Also, which version of BERTopic are you using?
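For reference, you can check the installed version directly in the session:

```python
# Print the BERTopic version installed in the current (Colab) environment.
import bertopic
print(bertopic.__version__)
```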

@isiktopcu (Author)

Hi Maarten!

Sorry, I was in a hurry, which is why I typed so fast.

BERTopic version: 0.16.0

My whole code, with the imports, is below.
Thank you in advance.

```python
!pip install zeyrek ratelimit unicode_tr unidecode
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
!pip install cudf-cu11==23.04 cuml-cu11==23.04 cugraph-cu11==23.04 cucim --extra-index-url=https://pypi.nvidia.com
!pip install -U  kaleido
!pip install cupy-cuda12x
# Imports (inferred from usage below); NOTE: UMAP and HDBSCAN are assumed to come
# from cuML here, since the RAPIDS packages are installed above -- use umap-learn /
# hdbscan instead when running on CPU.
import os
import json
import html
import requests
import zeyrek
import pandas as pd
import geopandas as gpd
import tensorflow as tf
import matplotlib.pyplot as plt
from unidecode import unidecode
from unicode_tr import unicode_tr
from ratelimit import limits, sleep_and_retry
from bertopic import BERTopic
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN

from google.colab import drive
drive.mount('/content/gdrive')

#GEOJSON DOWNLOAD
osm = "https://www.openstreetmap.org/api/0.6/relation/{id}"
nominatim = "https://nominatim.openstreetmap.org/lookup"
mahalle_geojson = {
    "type": "FeatureCollection",
    "name": "İstanbul İlçeleri",
    "licence": "Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright",
    "features": []
}

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
@sleep_and_retry
@limits(calls=1, period=1)
def get_full(rel_id):
    r = requests.get(osm.format(id=rel_id), headers={"Accept": "application/json"})
    if r.status_code == 200:
        data = json.loads(r.content)
        return data
def get_subareas(elements):
    return [member["ref"] for member in elements["members"] if member["role"] == "subarea"]
@sleep_and_retry
@limits(calls=1, period=1)
def get_geojson(rel_ids):
    r = requests.get(nominatim, params={
        "osm_ids": ",".join(["R" + str(id_) for id_ in rel_ids]),
        "format": "geojson",
        "polygon_geojson": 1
    })
    if r.status_code != 200:
        raise Exception('API response: {}'.format(r.status_code))
    return json.loads(r.content)

def write_to_file(filename, obj):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)

def correct_turkish_characters(text):
    return unicode_tr(text).lower()

analyzer = zeyrek.MorphAnalyzer()
def lemmatize_and_correct_characters(keyword, analyzer):
    lemmas = analyzer.lemmatize(keyword)
    lemma = keyword
    if lemmas:
        if isinstance(lemmas[0], tuple) and len(lemmas[0]) > 1:
            lemma = lemmas[0][1][0]
    corrected_lemma = unicode_tr(lemma).lower()
    return corrected_lemma
def preprocess_documents(docs, turkish_stopwords):
    processed_docs = []
    for doc in docs:
        if pd.isna(doc):
            doc = " "
        elif isinstance(doc, str):
            doc = html.unescape(doc)
            doc = correct_turkish_characters(doc)
            doc = ' '.join([word for word in doc.split() if word not in turkish_stopwords])
        else:
            doc = str(doc)
        processed_docs.append(doc)
    return processed_docs
def read_data(file_path, file_type):
    if file_type == "csv":
        return pd.read_csv(file_path)
    elif file_type == "json":
        return pd.read_json(file_path,orient="records",lines=True)
    elif file_type == "excel":
        return pd.read_excel(file_path)
    else:
        raise ValueError(f"Unsupported file type: {file_type}")
def save_topics_and_keywords(topic_model, repo_path, location):
    topics = topic_model.get_topics()
    data_for_csv = []

    for topic_number, topic in topics.items():
        top_keywords = [word for word, _ in topic[:20]]
        data_for_csv.append({
            'Topic Number': topic_number,
            'Top 20 Keywords': ', '.join(top_keywords)})

    df_topics_keywords = pd.DataFrame(data_for_csv)
    csv_path = os.path.join(repo_path, location, 'topics_and_keywords.csv')
    df_topics_keywords.to_csv(csv_path, index=False)
    print(f"Topics and keywords saved to {csv_path}")

def main(repo_path, location, data_type, file_name, language,relation):
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                #tf.config.experimental.set_memory_growth(gpu, True)
                tf.config.experimental.set_virtual_device_configuration(
                    gpu,
                    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
        except RuntimeError as e:
            print(e)
    output_dir = os.path.join(repo_path, location)
    os.makedirs(output_dir, exist_ok=True)

    pendik_full = get_full(relation)
    mahalleler = get_subareas(pendik_full["elements"][0])

    for chunk in chunks(mahalleler, 50):
        mahalle_geojson["features"].extend(get_geojson(chunk)["features"])

    write_to_file(f"/content/gdrive/MyDrive/istanbul-crm-topic-modeling/{location}/{location}_neighborhoods.geojson", mahalle_geojson)
    print("Neighborhood geojson created successfully!")
    gdf = gpd.read_file(f"/content/gdrive/MyDrive/istanbul-crm-topic-modeling/{location}/{location}_neighborhoods.geojson")
    gdf['name'] = gdf['name'].apply(lambda x: unidecode(x).upper())
    name_mapping = {'SILE': 'ISTANBUL SILE'}

    print("Neighborhood geojson loaded successfully!")
    stopwords_file_path = os.path.join('/content/gdrive/MyDrive/', 'turkce-stop-words.txt')
    with open(stopwords_file_path, 'r', encoding='utf-8') as file:
        turkish_stopwords = set(file.read().splitlines())

    data_path = os.path.join(repo_path, file_name)
    try:
        df = read_data(data_path, data_type)
        original_docs = df['Başvuru Açıklaması'].tolist()
        df['Başvuru İlçesi'] = df['Başvuru İlçesi'].str.lower()
        neighborhood_list = df['Başvuru İlçesi'].tolist()
        preprocessed_docs = preprocess_documents(original_docs, turkish_stopwords)
        print("Data loaded and preprocessed successfully!")
    except Exception as e:
        print(f"Error: {e}")
        return

#MODEL

    umap_model = UMAP(n_neighbors=15,
                    n_components=5,
                    min_dist=0.0,
                    metric='cosine')

    hdbscan_model = HDBSCAN(min_cluster_size=15,
                            metric='euclidean',
                            cluster_selection_method='eom',
                            prediction_data=True)

    topic_model = BERTopic(umap_model=umap_model,
                        hdbscan_model=hdbscan_model,
                        calculate_probabilities=True,
                        language=language,
                        nr_topics=30)
    df['Başvuru Açıklaması'] = df['Başvuru Açıklaması'].astype(str)
    topics, probs = topic_model.fit_transform(preprocessed_docs)
    save_topics_and_keywords(topic_model, repo_path, location)
    top_topics_per_doc = []
    for doc_probs in probs:
        sorted_topics = sorted(range(len(doc_probs)), key=lambda i: doc_probs[i], reverse=True)
        top_topics_per_doc.append(sorted_topics[:4])
    unique_topics = len(set(topics) - {-1})
    print(f"Number of unique topics found (excluding -1 for irrelevant/outlier topics): {unique_topics}")

    df['politus_best_topic'] = topics
    df['politus_second_topic'] = [doc_topics[1] if len(doc_topics) > 1 else None for doc_topics in top_topics_per_doc]
    df['politus_third_topic'] = [doc_topics[2] if len(doc_topics) > 2 else None for doc_topics in top_topics_per_doc]
    df['politus_fourth_topic'] = [doc_topics[3] if len(doc_topics) > 3 else None for doc_topics in top_topics_per_doc]
    output_file_path = os.path.join(repo_path, location, f'{location}_topic_assignments.csv')
    df.to_csv(output_file_path, index=False)
    print(f"Topic assignments saved to {output_file_path}")

    probs_df = pd.DataFrame(probs, columns=[f'topic_{i}' for i in range(len(probs[0]))])
    probs_df['neighborhood'] = neighborhood_list
    probs_df['neighborhood'] = probs_df['neighborhood'].apply(lambda x: unidecode(x).upper())

    topic_proportions = probs_df.groupby('neighborhood').mean()
    topic_proportions = topic_proportions.reset_index()

    # Hierarchy
    fig = topic_model.visualize_hierarchy(top_n_topics=50)
    hierarchy_plot_path = os.path.join(repo_path, location, 'hierarchy_plot.png')
    fig.write_image(hierarchy_plot_path)
    print(f"Hierarchy saved to {hierarchy_plot_path}")

    # Words
    fig2 = topic_model.visualize_barchart(top_n_topics=50)
    words_plot_path = os.path.join(repo_path, location, 'words_plot.png')
    fig2.write_image(words_plot_path)
    print(f"Words saved to {words_plot_path}")

    # Heatmap
    fig3 = topic_model.visualize_heatmap(n_clusters=2, width=1000, height=1000)
    heatmap_plot_path = os.path.join(repo_path, location, 'heatmap_plot.png')
    fig3.write_image(heatmap_plot_path)
    print(f"Heatmap saved to {heatmap_plot_path}")

    # Term Rank
    fig4 = topic_model.visualize_term_rank()
    term_rank_plot_path = os.path.join(repo_path, location, 'term_rank_plot.png')
    fig4.write_image(term_rank_plot_path)
    print(f"Term Rank saved to {term_rank_plot_path}")
    #Topic Importances
    topic_counts = df['politus_best_topic'].value_counts()
    topic_importance = pd.DataFrame({
        'Topic Number': topic_counts.index,
        'Importance': topic_counts.values
    })

    # Save the Topic Importance DataFrame
    topic_importance_path = os.path.join(repo_path, location, 'topic_importance.csv')
    topic_importance.to_csv(topic_importance_path, index=False)
    print(f"Topic importance saved to {topic_importance_path}")

    #Save
    model_save_path = os.path.join(repo_path, location, f'{location}_model')
    topic_model.save(model_save_path,serialization="pickle")
    print(f"Model saved to {model_save_path}")

    topic_proportions_long = topic_proportions.reset_index().melt(id_vars=['neighborhood'], var_name='topic', value_name='proportion')
    joint_gdf = gdf.merge(topic_proportions_long, left_on='name', right_on='neighborhood')
    probs_df = pd.DataFrame(probs, columns=[f'topic_{i}' for i in range(len(probs[0]))])
    probs_df['neighborhood'] = neighborhood_list
    probs_df['neighborhood'] = probs_df['neighborhood'].apply(lambda x: unidecode(x).upper())
    probs_df['neighborhood'] = probs_df['neighborhood'].replace(name_mapping)

    topic_proportions = probs_df.groupby('neighborhood').mean()
    topic_proportions_long = topic_proportions.reset_index().melt(id_vars=['neighborhood'], var_name='topic', value_name='proportion')
    joint_gdf = gdf.merge(topic_proportions_long, left_on='name', right_on='neighborhood')
    # Get unique neighborhood names
    unique_neighborhoods_probs_df = set(probs_df['neighborhood'].unique())
    unique_neighborhoods_gdf = set(gdf['name'].unique())
    # Find unmatched neighborhoods
    unmatched_in_probs_df = unique_neighborhoods_probs_df - unique_neighborhoods_gdf
    unmatched_in_gdf = unique_neighborhoods_gdf - unique_neighborhoods_probs_df

    print("Unmatched neighborhoods in probs_df:", unmatched_in_probs_df)
    print("Unmatched neighborhoods in gdf:", unmatched_in_gdf)
    top_keywords = {}
    for topic_id in topic_model.get_topics():
        top_3_keywords = sorted(topic_model.get_topics()[topic_id], key=lambda x: x[1], reverse=True)[:5]
        lemmatized_keywords = set()
        for word, _ in top_3_keywords:
            corrected_lemma = lemmatize_and_correct_characters(word, analyzer)
            lemmatized_keywords.add(corrected_lemma)
        top_keywords[f'topic_{topic_id}'] = list(lemmatized_keywords)[:5]

def plot_topic_maps_with_labels(gdf, joint_gdf, top_keywords, n_topics, repo_path, location, dpi=1000, batch_size=25):
    # NOTE: batch_size is an assumed value; n_batches is derived from it.
    cmap = plt.cm.RdPu
    n_batches = (n_topics + batch_size - 1) // batch_size
    for batch in range(n_batches):
        start_topic = batch * batch_size
        end_topic = min(start_topic + batch_size, n_topics)
        n_cols = 5
        n_rows = (end_topic - start_topic) // n_cols + ((end_topic - start_topic) % n_cols > 0)
        fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(n_cols * 6, n_rows * 6))

        for i, ax in enumerate(axes.flatten()):
            topic_number = start_topic + i
            if topic_number < end_topic:
                formatted_keywords = ', '.join(top_keywords.get(f'topic_{topic_number}', []))
                data_to_plot = joint_gdf[joint_gdf['topic'] == f'topic_{topic_number}']
                if not data_to_plot.empty:
                    gdf.boundary.plot(ax=ax, edgecolor='gray')
                    data_to_plot.plot(column='proportion', cmap=cmap, linewidth=0.8, ax=ax)

                    # Add labels for each neighborhood
                    for idx, row in gdf.iterrows():
                        ax.text(row['geometry'].centroid.x, row['geometry'].centroid.y, row['name'], fontsize=8, ha='center', va='center')

                    ax.set_title(f'Topic {topic_number}: {formatted_keywords}')
                ax.axis('off')
            else:
                ax.axis('off')

        plt.tight_layout()
        plt.savefig(f"{repo_path}/{location}/topics_distribution_topic_{topic_number}.png", dpi=dpi)
        plt.close(fig)

if __name__ == "__main__":
    repo_path = '/content/gdrive/MyDrive/istanbul-crm-topic-modeling/istanbul/'
    location = 'istanbul'
    data_type = 'json'
    file_name = 'stratified_sample_20K.json'
    language = 'multilingual'
    relation = '223474'

    main(repo_path, location, data_type, file_name, language, relation)
```
------------------- 

```python
model_save_path = os.path.join(repo_path, location, f'{location}_model')
topic_model.save(model_save_path)
print(f"Model saved to {model_save_path}")

model_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/istanbul/istanbul_model"
model = BERTopic.load(model_path)

data_to_predict_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/stratified_sample_20K.json"
df_to_predict = pd.read_json(data_to_predict_path, orient="records", lines=True)
docs_to_predict = df_to_predict['Başvuru Açıklaması'].tolist() # Replace with your actual column name
topics, probabilities = model.transform(docs_to_predict)
```

AttributeError: 'tuple' object has no attribute 'shape'

--------------------
@MaartenGr (Owner)

I believe this might be related to cuML. Have you checked your error message against the open and closed issues? I remember there being one or more about this which state that simply upgrading your cuML version might fix things.
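For reference, the upgrade in Colab would look roughly like the sketch below; the package name and index follow the install lines in your script, but the exact build (cu11 vs. cu12) is an assumption on my side.

```python
# Upgrade the CUDA 11 build of cuML from NVIDIA's package index, then restart
# the runtime so the new version is loaded before re-running .transform.
!pip install -U cuml-cu11 --extra-index-url=https://pypi.nvidia.com
```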

@isiktopcu (Author)

Hi Maarten! Thank you for your input. The problem can be solved by adding:

```python
probabilities = probabilities[1]
```

at the beginning of this part of the script. But it's hard to patch the source code every time, so we would greatly appreciate it if you could change the source code.

```python
        # Map array of probabilities (probability for assigned topic per document)
        probabilities = probabilities[1]
        if probabilities is not None:
            if len(probabilities.shape) == 2:
                mapped_probabilities = np.zeros((probabilities.shape[0],
                                                 len(set(mappings.values())) - self._outliers))
                for from_topic, to_topic in mappings.items():
                    if to_topic != -1 and from_topic != -1:
                        mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

                return mapped_probabilities

        return probabilities
```

@MaartenGr (Owner)

Thanks for the suggestion, but as I said, I believe this is fixed with a newer version of cuML. What you suggest might actually break most other applications, since `probabilities = probabilities[1]` is generally not needed. For instance, if we added this and ran without cuML, you would run into errors. Therefore, I would highly suggest first trying the advice in those issues, as this change does not solve the underlying problem.
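To illustrate with a toy array (just a stand-in for the real output, following the 2D shape checked in the fragment you quoted):

```python
import numpy as np

# With plain (CPU) HDBSCAN and calculate_probabilities=True, transform returns a
# 2D matrix of shape (n_docs, n_topics). Indexing it with [1] does not unwrap a
# tuple -- it silently selects the probability row of the second document only.
probabilities = np.random.rand(100, 30)  # stand-in for the real probability matrix
print(probabilities.shape)     # (100, 30)
print(probabilities[1].shape)  # (30,) -> wrong shape for the mapping code above
```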

@isiktopcu (Author)

I'm still facing the same issue with cuML 23.10.0 and 23.08.0, and BERTopic 0.16.0.

@MaartenGr (Owner)

It seems this is related to #1317, which indeed mentions that users are still experiencing this even though the newest versions of cuML should fix it. It might be worthwhile to also open an issue on the cuML repo. Other than that, I can imagine two fixes: either you try the fix in #1324 (although I'm not sure whether that resolves all of the issues), or you use the following snippet before running .transform:

```python
from bertopic.cluster import BaseCluster
topic_model.hdbscan_model = BaseCluster()
```

This way, inference is done purely from the embeddings, without the dimensionality reduction or clustering models. It should also speed up inference quite a bit.
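Wired into the loading script you posted, that would look roughly like this (paths and the column name are taken from your snippet; adjust as needed):

```python
import pandas as pd
from bertopic import BERTopic
from bertopic.cluster import BaseCluster

# Load the saved model, then swap out the clusterer so .transform assigns
# topics directly from embeddings instead of running UMAP + HDBSCAN.
model_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/istanbul/istanbul_model"
model = BERTopic.load(model_path)
model.hdbscan_model = BaseCluster()

data_to_predict_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/stratified_sample_20K.json"
df_to_predict = pd.read_json(data_to_predict_path, orient="records", lines=True)
docs_to_predict = df_to_predict['Başvuru Açıklaması'].tolist()

topics, probabilities = model.transform(docs_to_predict)
```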
