BERTopic Loading Issue #1764

Closed
isiktopcu opened this issue Jan 20, 2024 · 7 comments · Fixed by #1960
@isiktopcu

Hi, I used dimensionality reduction techniques and saved the model. I can load it, but it doesn't predict topics for a new dataset:

```python
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine')

hdbscan_model = HDBSCAN(min_cluster_size=15,
                        metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True)

topic_model = BERTopic(umap_model=umap_model,
                    hdbscan_model=hdbscan_model,
                    calculate_probabilities=True,
                    language=language,
                    nr_topics=50)

model_save_path = os.path.join(repo_path, location, f'{location}_model')
topic_model.save(model_save_path)
print(f"Model saved to {model_save_path}")

model_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/istanbul/istanbul_model"
model = BERTopic.load(model_path)

data_to_predict_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/stratified_sample_20K.json"
df_to_predict = pd.read_json(data_to_predict_path, orient="records", lines=True)
docs_to_predict = df_to_predict['Başvuru Açıklaması'].tolist() # Replace with your actual column name
topics, probabilities = model.transform(docs_to_predict)
```

AttributeError: 'tuple' object has no attribute 'shape'

@MaartenGr (Owner)

Could you also share your imports and clean up the formatting a bit? As it is, it is difficult to read without the ```python tags. Also, which version of BERTopic are you using?
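For reference, you can check the installed version directly in the session:

```python
# Print the BERTopic version installed in the current (Colab) environment.
import bertopic
print(bertopic.__version__)
```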

@isiktopcu (Author)

Hi Maarten!

Sorry, I was in a hurry, which is why I typed so fast.

BERTopic version: 0.16.0

My whole code, with the imports, is below.
Thank you in advance.

```python
!pip install zeyrek ratelimit unicode_tr unidecode
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
!pip install cudf-cu11==23.04 cuml-cu11==23.04 cugraph-cu11==23.04 cucim --extra-index-url=https://pypi.nvidia.com
!pip install -U  kaleido
!pip install cupy-cuda12x
# Imports (inferred from usage below); NOTE: UMAP and HDBSCAN are assumed to come
# from cuML here, since the RAPIDS packages are installed above -- use umap-learn /
# hdbscan instead when running on CPU.
import os
import json
import html
import requests
import zeyrek
import pandas as pd
import geopandas as gpd
import tensorflow as tf
import matplotlib.pyplot as plt
from unidecode import unidecode
from unicode_tr import unicode_tr
from ratelimit import limits, sleep_and_retry
from bertopic import BERTopic
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN

from google.colab import drive
drive.mount('/content/gdrive')

#GEOJSON DOWNLOAD
osm = "https://www.openstreetmap.org/api/0.6/relation/{id}"
nominatim = "https://nominatim.openstreetmap.org/lookup"
mahalle_geojson = {
    "type": "FeatureCollection",
    "name": "İstanbul İlçeleri",
    "licence": "Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright",
    "features": []
}

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
@sleep_and_retry
@limits(calls=1, period=1)
def get_full(rel_id):
    r = requests.get(osm.format(id=rel_id), headers={"Accept": "application/json"})
    if r.status_code == 200:
        data = json.loads(r.content)
        return data
def get_subareas(elements):
    return [member["ref"] for member in elements["members"] if member["role"] == "subarea"]
@sleep_and_retry
@limits(calls=1, period=1)
def get_geojson(rel_ids):
    r = requests.get(nominatim, params={
        "osm_ids": ",".join(["R" + str(id_) for id_ in rel_ids]),
        "format": "geojson",
        "polygon_geojson": 1
    })
    if r.status_code != 200:
        raise Exception('API response: {}'.format(r.status_code))
    return json.loads(r.content)

def write_to_file(filename, obj):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)

def correct_turkish_characters(text):
    return unicode_tr(text).lower()

analyzer = zeyrek.MorphAnalyzer()
def lemmatize_and_correct_characters(keyword, analyzer):
    lemmas = analyzer.lemmatize(keyword)
    lemma = keyword
    if lemmas:
        if isinstance(lemmas[0], tuple) and len(lemmas[0]) > 1:
            lemma = lemmas[0][1][0]
    corrected_lemma = unicode_tr(lemma).lower()
    return corrected_lemma
def preprocess_documents(docs, turkish_stopwords):
    processed_docs = []
    for doc in docs:
        if pd.isna(doc):
            doc = " "
        elif isinstance(doc, str):
            doc = html.unescape(doc)
            doc = correct_turkish_characters(doc)
            doc = ' '.join([word for word in doc.split() if word not in turkish_stopwords])
        else:
            doc = str(doc)
        processed_docs.append(doc)
    return processed_docs
def read_data(file_path, file_type):
    if file_type == "csv":
        return pd.read_csv(file_path)
    elif file_type == "json":
        return pd.read_json(file_path,orient="records",lines=True)
    elif file_type == "excel":
        return pd.read_excel(file_path)
    else:
        raise ValueError(f"Unsupported file type: {file_type}")
def save_topics_and_keywords(topic_model, repo_path, location):
    topics = topic_model.get_topics()
    data_for_csv = []

    for topic_number, topic in topics.items():
        top_keywords = [word for word, _ in topic[:20]]
        data_for_csv.append({
            'Topic Number': topic_number,
            'Top 20 Keywords': ', '.join(top_keywords)})

    df_topics_keywords = pd.DataFrame(data_for_csv)
    csv_path = os.path.join(repo_path, location, 'topics_and_keywords.csv')
    df_topics_keywords.to_csv(csv_path, index=False)
    print(f"Topics and keywords saved to {csv_path}")

def main(repo_path, location, data_type, file_name, language,relation):
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                #tf.config.experimental.set_memory_growth(gpu, True)
                tf.config.experimental.set_virtual_device_configuration(
                    gpu,
                    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
        except RuntimeError as e:
            print(e)
    output_dir = os.path.join(repo_path, location)
    os.makedirs(output_dir, exist_ok=True)

    pendik_full = get_full(relation)
    mahalleler = get_subareas(pendik_full["elements"][0])

    for chunk in chunks(mahalleler, 50):
        mahalle_geojson["features"].extend(get_geojson(chunk)["features"])

    write_to_file(f"/content/gdrive/MyDrive/istanbul-crm-topic-modeling/{location}/{location}_neighborhoods.geojson", mahalle_geojson)
    print("Neighborhood geojson created successfully!")
    gdf = gpd.read_file(f"/content/gdrive/MyDrive/istanbul-crm-topic-modeling/{location}/{location}_neighborhoods.geojson")
    gdf['name'] = gdf['name'].apply(lambda x: unidecode(x).upper())
    name_mapping = {'SILE': 'ISTANBUL SILE'}

    print("Neighborhood geojson loaded successfully!")
    stopwords_file_path = os.path.join('/content/gdrive/MyDrive/', 'turkce-stop-words.txt')
    with open(stopwords_file_path, 'r', encoding='utf-8') as file:
        turkish_stopwords = set(file.read().splitlines())

    data_path = os.path.join(repo_path, file_name)
    try:
        df = read_data(data_path, data_type)
        original_docs = df['Başvuru Açıklaması'].tolist()
        df['Başvuru İlçesi'] = df['Başvuru İlçesi'].str.lower()
        neighborhood_list = df['Başvuru İlçesi'].tolist()
        preprocessed_docs = preprocess_documents(original_docs, turkish_stopwords)
        print("Data loaded and preprocessed successfully!")
    except Exception as e:
        print(f"Error: {e}")
        return

#MODEL

    umap_model = UMAP(n_neighbors=15,
                    n_components=5,
                    min_dist=0.0,
                    metric='cosine')

    hdbscan_model = HDBSCAN(min_cluster_size=15,
                            metric='euclidean',
                            cluster_selection_method='eom',
                            prediction_data=True)

    topic_model = BERTopic(umap_model=umap_model,
                        hdbscan_model=hdbscan_model,
                        calculate_probabilities=True,
                        language=language,
                        nr_topics=30)
    df['Başvuru Açıklaması'] = df['Başvuru Açıklaması'].astype(str)
    topics, probs = topic_model.fit_transform(preprocessed_docs)
    save_topics_and_keywords(topic_model, repo_path, location)
    top_topics_per_doc = []
    for doc_probs in probs:
        sorted_topics = sorted(range(len(doc_probs)), key=lambda i: doc_probs[i], reverse=True)
        top_topics_per_doc.append(sorted_topics[:4])
    unique_topics = len(set(topics) - {-1})
    print(f"Number of unique topics found (excluding -1 for irrelevant/outlier topics): {unique_topics}")

    df['politus_best_topic'] = topics
    df['politus_second_topic'] = [doc_topics[1] if len(doc_topics) > 1 else None for doc_topics in top_topics_per_doc]
    df['politus_third_topic'] = [doc_topics[2] if len(doc_topics) > 2 else None for doc_topics in top_topics_per_doc]
    df['politus_fourth_topic'] = [doc_topics[3] if len(doc_topics) > 3 else None for doc_topics in top_topics_per_doc]
    output_file_path = os.path.join(repo_path, location, f'{location}_topic_assignments.csv')
    df.to_csv(output_file_path, index=False)
    print(f"Topic assignments saved to {output_file_path}")

    probs_df = pd.DataFrame(probs, columns=[f'topic_{i}' for i in range(len(probs[0]))])
    probs_df['neighborhood'] = neighborhood_list
    probs_df['neighborhood'] = probs_df['neighborhood'].apply(lambda x: unidecode(x).upper())

    topic_proportions = probs_df.groupby('neighborhood').mean()
    topic_proportions = topic_proportions.reset_index()

    # Hierarchy
    fig = topic_model.visualize_hierarchy(top_n_topics=50)
    hierarchy_plot_path = os.path.join(repo_path, location, 'hierarchy_plot.png')
    fig.write_image(hierarchy_plot_path)
    print(f"Hierarchy saved to {hierarchy_plot_path}")

    # Words
    fig2 = topic_model.visualize_barchart(top_n_topics=50)
    words_plot_path = os.path.join(repo_path, location, 'words_plot.png')
    fig2.write_image(words_plot_path)
    print(f"Words saved to {words_plot_path}")

    # Heatmap
    fig3 = topic_model.visualize_heatmap(n_clusters=2, width=1000, height=1000)
    heatmap_plot_path = os.path.join(repo_path, location, 'heatmap_plot.png')
    fig3.write_image(heatmap_plot_path)
    print(f"Heatmap saved to {heatmap_plot_path}")

    # Term Rank
    fig4 = topic_model.visualize_term_rank()
    term_rank_plot_path = os.path.join(repo_path, location, 'term_rank_plot.png')
    fig4.write_image(term_rank_plot_path)
    print(f"Term Rank saved to {term_rank_plot_path}")
    #Topic Importances
    topic_counts = df['politus_best_topic'].value_counts()
    topic_importance = pd.DataFrame({
        'Topic Number': topic_counts.index,
        'Importance': topic_counts.values
    })

    # Save the Topic Importance DataFrame
    topic_importance_path = os.path.join(repo_path, location, 'topic_importance.csv')
    topic_importance.to_csv(topic_importance_path, index=False)
    print(f"Topic importance saved to {topic_importance_path}")

    #Save
    model_save_path = os.path.join(repo_path, location, f'{location}_model')
    topic_model.save(model_save_path,serialization="pickle")
    print(f"Model saved to {model_save_path}")

    topic_proportions_long = topic_proportions.reset_index().melt(id_vars=['neighborhood'], var_name='topic', value_name='proportion')
    joint_gdf = gdf.merge(topic_proportions_long, left_on='name', right_on='neighborhood')
    probs_df = pd.DataFrame(probs, columns=[f'topic_{i}' for i in range(len(probs[0]))])
    probs_df['neighborhood'] = neighborhood_list
    probs_df['neighborhood'] = probs_df['neighborhood'].apply(lambda x: unidecode(x).upper())
    probs_df['neighborhood'] = probs_df['neighborhood'].replace(name_mapping)

    topic_proportions = probs_df.groupby('neighborhood').mean()
    topic_proportions_long = topic_proportions.reset_index().melt(id_vars=['neighborhood'], var_name='topic', value_name='proportion')
    joint_gdf = gdf.merge(topic_proportions_long, left_on='name', right_on='neighborhood')
    # Get unique neighborhood names
    unique_neighborhoods_probs_df = set(probs_df['neighborhood'].unique())
    unique_neighborhoods_gdf = set(gdf['name'].unique())
    # Find unmatched neighborhoods
    unmatched_in_probs_df = unique_neighborhoods_probs_df - unique_neighborhoods_gdf
    unmatched_in_gdf = unique_neighborhoods_gdf - unique_neighborhoods_probs_df

    print("Unmatched neighborhoods in probs_df:", unmatched_in_probs_df)
    print("Unmatched neighborhoods in gdf:", unmatched_in_gdf)
    top_keywords = {}
    for topic_id in topic_model.get_topics():
        top_3_keywords = sorted(topic_model.get_topics()[topic_id], key=lambda x: x[1], reverse=True)[:5]
        lemmatized_keywords = set()
        for word, _ in top_3_keywords:
            corrected_lemma = lemmatize_and_correct_characters(word, analyzer)
            lemmatized_keywords.add(corrected_lemma)
        top_keywords[f'topic_{topic_id}'] = list(lemmatized_keywords)[:5]

def plot_topic_maps_with_labels(gdf, joint_gdf, top_keywords, n_topics, repo_path, location, dpi=1000, batch_size=25):
    # NOTE: batch_size is an assumed value; n_batches is derived from it.
    cmap = plt.cm.RdPu
    n_batches = (n_topics + batch_size - 1) // batch_size
    for batch in range(n_batches):
        start_topic = batch * batch_size
        end_topic = min(start_topic + batch_size, n_topics)
        n_cols = 5
        n_rows = (end_topic - start_topic) // n_cols + ((end_topic - start_topic) % n_cols > 0)
        fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(n_cols * 6, n_rows * 6))

        for i, ax in enumerate(axes.flatten()):
            topic_number = start_topic + i
            if topic_number < end_topic:
                formatted_keywords = ', '.join(top_keywords.get(f'topic_{topic_number}', []))
                data_to_plot = joint_gdf[joint_gdf['topic'] == f'topic_{topic_number}']
                if not data_to_plot.empty:
                    gdf.boundary.plot(ax=ax, edgecolor='gray')
                    data_to_plot.plot(column='proportion', cmap=cmap, linewidth=0.8, ax=ax)

                    # Add labels for each neighborhood
                    for idx, row in gdf.iterrows():
                        ax.text(row['geometry'].centroid.x, row['geometry'].centroid.y, row['name'], fontsize=8, ha='center', va='center')

                    ax.set_title(f'Topic {topic_number}: {formatted_keywords}')
                ax.axis('off')
            else:
                ax.axis('off')

        plt.tight_layout()
        plt.savefig(f"{repo_path}/{location}/topics_distribution_topic_{topic_number}.png", dpi=dpi)
        plt.close(fig)

if __name__ == "__main__":
    repo_path = '/content/gdrive/MyDrive/istanbul-crm-topic-modeling/istanbul/'
    location = 'istanbul'
    data_type = 'json'
    file_name = 'stratified_sample_20K.json'
    language = 'multilingual'
    relation = '223474'

    main(repo_path, location, data_type, file_name, language, relation)
```
------------------- 

```python
model_save_path = os.path.join(repo_path, location, f'{location}_model')
topic_model.save(model_save_path)
print(f"Model saved to {model_save_path}")

model_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/istanbul/istanbul_model"
model = BERTopic.load(model_path)

data_to_predict_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/stratified_sample_20K.json"
df_to_predict = pd.read_json(data_to_predict_path, orient="records", lines=True)
docs_to_predict = df_to_predict['Başvuru Açıklaması'].tolist() # Replace with your actual column name
topics, probabilities = model.transform(docs_to_predict)
```

AttributeError: 'tuple' object has no attribute 'shape'

--------------------
@MaartenGr (Owner)

I believe this might be related to cuML. Have you checked your error message against the open and closed issues? I remember there being one or more about this which state that simply upgrading your cuML version might fix things.
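For reference, the upgrade in Colab would look roughly like the sketch below; the package name and index follow the install lines in your script, but the exact build (cu11 vs. cu12) is an assumption on my side.

```python
# Upgrade the CUDA 11 build of cuML from NVIDIA's package index, then restart
# the runtime so the new version is loaded before re-running .transform.
!pip install -U cuml-cu11 --extra-index-url=https://pypi.nvidia.com
```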

@isiktopcu (Author)

Hi Maarten! Thank you for your input. The problem can be solved by adding:

```python
probabilities = probabilities[1]
```

at the beginning of this part of the script. But it's hard to patch the source code every time, so we would greatly appreciate it if you could change the source code.

```python
        # Map array of probabilities (probability for assigned topic per document)
        probabilities = probabilities[1]
        if probabilities is not None:
            if len(probabilities.shape) == 2:
                mapped_probabilities = np.zeros((probabilities.shape[0],
                                                 len(set(mappings.values())) - self._outliers))
                for from_topic, to_topic in mappings.items():
                    if to_topic != -1 and from_topic != -1:
                        mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

                return mapped_probabilities

        return probabilities
```

@MaartenGr (Owner)

Thanks for the suggestion, but as I said, I believe this is fixed with a newer version of cuML. What you suggest might actually break most other applications, since `probabilities = probabilities[1]` is generally not needed. For instance, if we added this and ran without cuML, you would run into errors. Therefore, I would highly suggest first trying the advice in those issues, as this change does not solve the underlying problem.
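To illustrate with a toy array (just a stand-in for the real output, following the 2D shape checked in the fragment you quoted):

```python
import numpy as np

# With plain (CPU) HDBSCAN and calculate_probabilities=True, transform returns a
# 2D matrix of shape (n_docs, n_topics). Indexing it with [1] does not unwrap a
# tuple -- it silently selects the probability row of the second document only.
probabilities = np.random.rand(100, 30)  # stand-in for the real probability matrix
print(probabilities.shape)     # (100, 30)
print(probabilities[1].shape)  # (30,) -> wrong shape for the mapping code above
```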

@isiktopcu (Author)

I'm still facing the same issue with cuML 23.10.0 and 23.08.0, and BERTopic 0.16.0.

@MaartenGr (Owner)

It seems this is related to #1317, which indeed mentions that users are still experiencing this even though the newest versions of cuML should fix it. It might be worthwhile to also open an issue on the cuML repo. Other than that, I can imagine two fixes: either you try the fix in #1324 (although I'm not sure whether that resolves all of the issues), or you use the following snippet before running .transform:

```python
from bertopic.cluster import BaseCluster
topic_model.hdbscan_model = BaseCluster()
```

This way, inference is done purely from the embeddings, without the dimensionality reduction or clustering models. It should also speed up inference quite a bit.
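Wired into the loading script you posted, that would look roughly like this (paths and the column name are taken from your snippet; adjust as needed):

```python
import pandas as pd
from bertopic import BERTopic
from bertopic.cluster import BaseCluster

# Load the saved model, then swap out the clusterer so .transform assigns
# topics directly from embeddings instead of running UMAP + HDBSCAN.
model_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/istanbul/istanbul_model"
model = BERTopic.load(model_path)
model.hdbscan_model = BaseCluster()

data_to_predict_path = "/content/drive/MyDrive/istanbul-crm-topic-modeling/istanbul/stratified_sample_20K.json"
df_to_predict = pd.read_json(data_to_predict_path, orient="records", lines=True)
docs_to_predict = df_to_predict['Başvuru Açıklaması'].tolist()

topics, probabilities = model.transform(docs_to_predict)
```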
