By Yoav Vollansky (ver. 0.1)
Commit history is not available as the code was developed in a private repository and later uploaded to GitHub.
[TOC]
SearchQueriesClustering is a class that performs clustering of search queries. The class supports clustering of tabular data in which one column holds the search queries. A Gensim Doc2Vec representation of that column undergoes unsupervised clustering, and each cluster gets a representation (in the form of an inferred textual name). A score is then calculated for each cluster, based on the consistency of its queries, which can stand for cluster "purity". A typical further analysis would be to run this process over some span of time and compare cluster scores over days, weeks or months.
- Using Conda or Pip, use requirements.txt in a new environment to install all required packages.
- Copy the csv files you intend to work with to the `data` directory (default: `./data/`).
The class expects tabular data with the following columns: `search_date`, `search_query`, `number_of_searches`, `searches_with_clicks`, `total_number_of_search_results`, `listings_with_orders`, `orders`, `most_common_sub_category_name`, and `most_common_category_name`. For this version, some of the columns are aggregated within the script (in `SearchQueriesClustering.aggregate`) while others are filtered out.
An example of a Google BigQuery query that would produce suitable input for the class:
```sql
select search_date, search_query,
    sum(number_of_searches) number_of_searches,
    sum(searches_with_clicks) searches_with_clicks,
    avg(total_number_of_search_results) total_number_of_search_results,
    sum(listings_with_orders) listings_with_orders,
    sum(orders) orders,
    sum(revenue),
    most_common_sub_category_name,
    most_common_category_name
from `itc-data-science.dwh.search_keywords`
where search_query is not null
  and search_date > '2020-01-01'
group by search_date, search_query, most_common_sub_category_name, most_common_category_name
order by 2
```
Note: change `search_date > ...` according to the required time span.
There are two ways to work with the class:
- General query clustering: working on a single data frame, clustering it as it is, whether it has multiple days or not. This is done by using `make_clusters()`.
- Daily query clustering: working on a data frame but splitting it by date into many smaller data frames in the process. This happens automatically in the pipeline when using `make_daily_clusters()`.
Read below how to use both ways.
Clustering itself is done using K-Means, Affinity Propagation or HDBScan. At this point the safest way to work is with HDBScan, as it is the method that was tested the most (especially with daily clustering).
When working with general query clustering and `make_clusters()`, we get one high-level SearchQueriesClustering instance that contains only a single model and DataFrame to deal with. When working with daily query clustering and `make_daily_clusters()`, our instance will contain additional attributes holding per-day information. One of those attributes, for example, is `daily_clusters`, which is a dictionary of SearchQueriesClustering instances, one for each day.
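For example, after a daily run the per-day instances can be reached through that dictionary. A minimal sketch, assuming the dictionary keys are date strings and the csv filename is hypothetical:

```python
# Hypothetical sketch: inspect per-day instances after a daily run.
# 'search_keywords.csv' is an illustrative filename; keys are assumed to be
# date strings such as '2020-01-01'.
sqc = SearchQueriesClustering("search_keywords.csv")
sqc.make_daily_clusters()

for day, daily_sqc in sqc.daily_clusters.items():
    # each value is itself a SearchQueriesClustering instance for that day
    print(day, len(daily_sqc.df), "rows")
```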
The core idea is that when working with daily clustering, we are in essence carrying out a separate process for each daily segment of the data. For each day, the process performs cleaning, aggregating, doc2vec embedding, clustering, naming and scoring.
Note: in the case of daily clustering, preprocessing and cleaning were placed in the pipeline after splitting into days, since the splitting requires the date column. If the date column does not exist when splitting, an error will occur. If you intend to split into days manually, make sure to follow this practice.
Most of the configuration is controlled in the `config.yaml` file. The preferences are as follows:
In the root of the config, you can define the most useful columns as they appear in the imported csv file: `queries_col` is the queries column, `date_col` is the date column and `num_search_col` is the number-of-searches column.
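For reference, a minimal sketch of reading these root keys, assuming a plain `yaml.safe_load` of `config.yaml` (the module itself may load the config differently):

```python
import yaml

# Hypothetical sketch: load config.yaml and read the root column names.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

print(config["queries_col"], config["date_col"], config["num_search_col"])
```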
- `global` preferences:
  - `always_load_files` will load pre-saved doc2vec models and clustering pickle files if they exist in the `data_dir` directory and fit the current parameters, as defined in `doc2vec: def_params` or in the clustering-specific parameters in `k_means: def_params`, `aff_prop: def_params` and `hdbscan: def_params`. This is helpful for fast pipeline testing and rerunning of pre-saved models.
  - `always_save_files` will overwrite existing files with similar parameters as mentioned above.
  - `drastic_cleanup` will delete all models and pickles. (This is not automated; it can be used as a conditional in `main()`, for example, in conjunction with the class's static method `drastic_cleanup`.)
  - `verbose` will emit extra output when running some functions.
  - `always_plot` will plot some initial EDA visualisations after loading a csv file. (Note: only use for debugging; definitely disable in automated use cases as it requires user interaction.)
  - `data_dir` is where input csv files reside.
  - `saved_models_dir` is where doc2vec .model files and clustering object .pickle files are saved to.
  - `limit_datafreme` is the sample size of the input data frame. Best used only for development and testing, as models do require large, complete data to work optimally. Set to `None` to leave the data frame as is.
  - `dataframes_prefs` will set Pandas output according to the next two properties, `max_rows` and `max_cols`.
- `agg` settings: during preprocessing, `aggregate()` should be used, with columns set to `sum` being accumulated, columns set to `mean` returning their average, and columns set to `first` simply returning the first value in the column, all with respect to each specific date. (A short aggregation sketch is given after this list.)
- TF-IDF related preferences are specified in `tf_idf` (initial development, suspended at the moment).
- `doc2vec` settings:
  - `def_params` are the parameters for the model. Read here for details. Other parameters can be passed as `**kwargs` to the training function (or should be hard coded and set in `config.yaml` if used often).
  - `mode_filename_postfix` is added at the end of the saved model filename.
  - `tagged_documents_features` are other features in the input csv that will be used to tag the documents. Read more about TaggedDocument here.
- `k_means`, `aff_prop` and `hdbscan` hold clustering-algorithm-specific parameters.
- `infer_name` holds preferences related to the way a cluster is represented, i.e. which name it gets after clustering is performed:
  - `max` is the maximum number of words in the title of a cluster.
  - `n_sim` is the number of similarities to each of the inferred names, based on word2vec similarity.
  - `method` can be one of the following: `magnitude`, `half` or `lda`. See `infer_cluster_name()` below for more details.
- `daily` controls preferences for `filter_daily()` as follows:
  - `skip_start` skips the first dates in the data (so if our first day in the data is `2020-01-01` and we would like to start processing from `2020-01-04`, we would set the value to `3`).
  - `max_total_days` is the number of days we want in our daily process. It is possible to get a smaller number of days if the dataset spans fewer days than we set.
  - `interval` indicates the gap between the days we take out of a data frame in a daily process. If it is set to `7` (and `skip_start=0`), then our first day would be `2020-01-01`, the second `2020-01-08`, the third `2020-01-15`, etc.
- `scoring` controls the settings for `score_cluster()`. It can be set to either `query_proportion` or `word_proportion`. More details below.
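As referenced in the `agg` item above, here is a minimal pandas sketch of the per-date aggregation behaviour. The column-to-function mapping and the grouping keys shown here are illustrative assumptions, not the exact contents of `config.yaml`:

```python
import pandas as pd

# Hypothetical sketch of the per-date aggregation performed by aggregate().
# The mapping of columns to 'sum' / 'mean' / 'first' and the grouping keys are
# illustrative only; the real mapping lives under the 'agg' key of config.yaml.
agg_map = {
    "number_of_searches": "sum",
    "searches_with_clicks": "sum",
    "total_number_of_search_results": "mean",
    "most_common_category_name": "first",
}

def aggregate_per_date(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.groupby(["search_date", "search_query"], as_index=False)
          .agg(agg_map)
    )
```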
Most of the functionality of the class is built into a pipeline and is executed automatically via wrapper functions. The individual functions can also be used manually by calling them from the main module if pipeline customisation is needed (although currently this is still prone to breakage, so make sure to debug the process).
The specification of the class is thoroughly documented in this section, both for the sake of any future development efforts and for users of the class who wish to customise the pipeline.
- `csv_file_name` is the name of the csv file that the data is loaded from. Mainly used for unique file naming when saving trained doc2vec models and clustering object pickle files to drive. It is set to `__session_dataframe__` if the data was loaded from memory and not from a csv file.
- `data` is the argument as passed to the class constructor; it can be either a pd.DataFrame or a file name.
- `current_date_df` is the date of the data associated with the class instance when working in daily clustering mode (i.e. if an original DataFrame was split into days).
- `df` is the DataFrame in the current instance (it can be either daily or general). It uses the `_loader()` function to determine whether the argument that was passed to the constructor, and now resides in `data`, is a filename or a DataFrame in memory, and loads the data accordingly.
- `queries_col` is the default name for the queries column in `df`, and can be changed in `config.yaml`.
- `doc2vec_model` is the trained and fit model, saved at the instance level for convenient checking of parameters etc. Helpful for avoiding the need to retrain a model in case we test a few clustering approaches in the same session.
- `similarity_arr` is the similarity array resulting from the TF-IDF process (not yet thoroughly supported in this version).
- `doc2vec_arr` is the raw matrix produced by the doc2vec model. Created along with the `doc2vec` attribute upon training and fitting.
- `unique_search_queries` holds the unique search queries for an instance's `df`.
- `tokenized_queries` holds the tokenized queries of an instance's `df` queries column.
- `clustering_model` holds the model that was used to cluster a `doc2vec_arr`. Useful for quick lookup of parameters.
- `results` holds all the results for a given DataFrame, if many clustering approaches were tested in a session for comparison. For example, if for some DataFrame we tried HDBscan with some set of parameters, then changed those parameters and clustered again, and then changed to Affinity Propagation, which is a different algorithm altogether, all the results are saved to this dictionary for quick and convenient inspection and comparison.
- `daily_dfs` holds all the separated per-day DataFrames once split by `filter_daily()`.
- `daily_clusters`, similarly to `daily_dfs`, is a dictionary of the per-day SearchQueriesClustering instances produced by `make_daily_clusters()`.
- `daily_scores` is a dictionary of scores as calculated by `score_all_clusters()`.
- `daily_scores_table` is a tabular version of `daily_scores`.
- `all_cluster_names` remembers the names of all clusters, for all days, which is essential for building the `daily_scores_table` in `create_daily_scores_table()`.
`def _loader(self)` is an internal method that adds flexibility to the data loading. Since we can give the class constructor either a string indicating a csv filename or an actual DataFrame from memory, `_loader` determines which of the two was passed and uses the appropriate way to load the data.
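A minimal sketch of the kind of dispatch `_loader` performs, under the assumptions described above (illustrative only; the real method delegates csv loading to `load_data()`):

```python
import pandas as pd

# Hypothetical sketch of the _loader dispatch logic.
def _loader(self):
    if isinstance(self.data, pd.DataFrame):
        # data was passed from memory
        self.csv_file_name = "__session_dataframe__"
        self.df = self.data
    else:
        # data is assumed to be a csv filename inside data_dir
        self.csv_file_name = self.data
        self.df = pd.read_csv(self.data)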
`def load_data(csv_file, remove_na=True)` simply loads the data in case `_loader` has determined we are loading a csv file.
`def drastic_cleanup(remove_models=True, remove_pickles=True)` removes pre-saved files from the working directory. It is not part of the internal pipeline at the moment, but can be conveniently used as a conditional in the main module.
`def remove_na(df)` removes NaN rows from the DataFrame. It is performed earliest in the preprocessing pipeline and is usually called by `load_data()`.
`def remove_i_will(self)` will remove any query that starts with "I will", as we would like to try and filter out non-buyer-related queries.
`def clean_search_query(query)` cleans individual queries by removing non-alphanumeric characters, stripping surrounding whitespace and lowercasing.
`def clean_search_queries(self, clean=True, remove_numeric=True)` is a wrapper function around `clean_search_query` that runs over all queries in the DataFrame and also drops any purely numeric queries if `remove_numeric=True`.
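A minimal sketch of the per-query cleaning described above (a plausible implementation under those assumptions, not necessarily the exact regex used in the class):

```python
import re

# Hypothetical sketch of clean_search_query: keep alphanumerics and spaces,
# strip surrounding whitespace and lowercase the result.
def clean_search_query(query: str) -> str:
    cleaned = re.sub(r"[^0-9a-zA-Z ]+", " ", query)
    return re.sub(r"\s+", " ", cleaned).strip().lower()

print(clean_search_query("  Logo Design!!! (24h) "))  # -> "logo design 24h"
```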
Functions in this section are currently not utilised in the pipeline and were not thoroughly tested for this version.
def create_tfidf_similarity(self, queries_series, max_samples=config['tf_idf']['max_samples']):
creates a similarity matrix based on TF-IDF of a corpus.
def find_unique_queries(self):
gets all uniques from a DataFrame.
`def show_n_similar_queries(self, query, n_similar)`: once a tf-idf similarity array is created, this method returns the n most similar queries to `query`.
`def find_all_non_zero_score_similar_queries(self, query)` is the same as above, only instead of showing n similarities, it shows all non-zero-score similarities.
Many of the methods in this section (as well as some of the clustering-related methods) check for dependencies in terms of required instance attributes, and automatically call other methods in case these attributes are not populated with a value (usually they are set to `None` upon instantiation). This enables a one-shot call for wrapper methods, such as `make_daily_clusters()`, without having to worry about running methods that are required to execute before others.
`def tokenize_queries(self, pre_process=True)` tokenises the search queries column in the instance DataFrame and saves it as an attribute.
```python
def create_tagged_documents(self,
                            tag_sequential=True,
                            tagging_features=config['doc2vec']['tagged_documents_features'])
```
creates a TaggedDocument object, which is required for Gensim's Doc2Vec fitting. `tag_sequential=True` is the straightforward way to go when doing the tagging, and all it essentially does is tag each document with an increasing index (0, 1, 2, etc.). A more experimental approach would be to tag the documents using other metadata features from the dataset, such as number of searches or sub-category. These other features can either be passed to `tagging_features` manually, or set in `config.yaml` when running the entire pipeline as a whole using `make_clusters()` or `make_daily_clusters()`.
See TaggedDocument for more details.
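A minimal Gensim sketch of sequential tagging as described above (illustrative; the class may build tags differently when `tagging_features` are used):

```python
from gensim.models.doc2vec import TaggedDocument

# Hypothetical sketch of sequential tagging: each tokenized query becomes a
# TaggedDocument whose single tag is its running index.
tokenized_queries = [["logo", "design"], ["resume", "writing"], ["dropshipping"]]
tagged_docs = [
    TaggedDocument(words=tokens, tags=[i])
    for i, tokens in enumerate(tokenized_queries)
]
```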
```python
def fit_doc2vec(self,
                filename=None,
                model_params=config['doc2vec']['def_params'])
```
fits the doc2vec model with default parameters as specified in `config.yaml`. `filename` can be explicitly specified, but if not, it is inferred from the csv filename, the date of the queries (if daily clustering is performed) and the model parameters, in order to create a unique filename to save the model to.
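A minimal sketch of fitting Doc2Vec on tagged documents and collecting the embedding matrix (the parameter values here are illustrative, not the defaults from `config.yaml`):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical sketch: train Doc2Vec and build the doc2vec_arr matrix.
tagged_docs = [TaggedDocument(words=["logo", "design"], tags=[0]),
               TaggedDocument(words=["resume", "writing"], tags=[1])]

model = Doc2Vec(vector_size=75, min_count=1, epochs=20)  # illustrative params
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)

# one row per document, in tag order (model.dv on gensim 4.x; model.docvecs on 3.x)
doc2vec_arr = np.vstack([model.dv[i] for i in range(len(tagged_docs))])
```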
`def load_doc2vec_model(self, filename=None)` loads a pre-saved doc2vec model from the specified file. If `filename=None`, the file name is inferred from the csv file name, the date of the queries (if daily clustering is performed) and the model parameters. Note that the `always_load_files` setting in `config.yaml` should be set to `True` in order for files to load (this condition is checked in `make_clusters()`).
`def marry_queries_to_labels(queries, cluster_labels)` is a helper function that takes the queries column and concatenates it with the (numeric) cluster labels resulting from clustering. It outputs a Pandas DataFrame.
`def assert_doc2vec_model_exists(self)` is a helper function that checks whether the instance attributes `doc2vec_model` and `doc2vec_arr` have both been populated. This is used down the pipeline within each clustering method, since the clustering process relies on doc2vec being available.
The class currently supports three clustering algorithms: K-Means, Affinity Propagation and HDBScan. Future versions may test more approaches.
```python
def k_means_clustering(self,
                       k=config['k_means']['def_params']['k'],
                       load_file=True,
                       **kwargs)
```

```python
def aff_prop_clustering(self,
                        damping=config['aff_prop']['def_params']['damping'],
                        affinity=config['aff_prop']['def_params']['affinity'],
                        load_file=True,
                        **kwargs)
```

```python
def hdbscan_clustering(self,
                       cluster_selection_epsilon=config['hdbscan']['def_params']['cluster_selection_epsilon'],
                       min_samples=config['hdbscan']['def_params']['min_samples'],
                       min_cluster_size=config['hdbscan']['def_params']['min_cluster_size'],
                       allow_single_cluster=config['hdbscan']['def_params']['allow_single_cluster'],
                       cluster_selection_method=config['hdbscan']['def_params']['cluster_selection_method'],
                       leaf_size=config['hdbscan']['def_params']['leaf_size'],
                       load_file=config['global']['always_load_files'],
                       **kwargs)
```
At the moment, parameters should be set in `config.yaml`. Parameters that are not set in the config file can be passed manually via `**kwargs`, and if found particularly important should be hard coded into the above methods and set in the config.
Note that each of the clustering methods above also checks for the existence of a previously saved clustering result (as a pickle file) and loads it if `load_file=True`.
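As a rough illustration of what the HDBScan path does with the doc2vec matrix, here is a sketch using the `hdbscan` library (the parameter values match the example output shown further below, not necessarily your config defaults):

```python
import hdbscan
import numpy as np

# Hypothetical sketch of the HDBScan clustering step over the doc2vec matrix.
doc2vec_arr = np.random.rand(1000, 75)  # placeholder for the real embeddings

clusterer = hdbscan.HDBSCAN(min_cluster_size=2,      # illustrative values taken
                            min_samples=1,           # from the example output below
                            cluster_selection_method='leaf')
cluster_labels = clusterer.fit_predict(doc2vec_arr)  # -1 marks noise points
```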
```python
def make_clusters(self,
                  algo='hdbscan',
                  save_results=config['global']['always_save_files'],
                  **kwargs)
```
is a wrapper function around the clustering algorithms, with hdbscan as the default choice as it was tested the most in this version. If `save_results=True`, clustering object files will be saved, overwriting previously saved pickle files for data from the same csv file and with the same clustering-algorithm-specific preferences (since these factors, put together in a string, are how the filename is formed).
`def make_daily_clusters(self, score=True, **kwargs)` is a wrapper method that runs the entire pipeline for daily clustering.
First it loads the entire csv file into a higher-level SearchQueriesClustering instance, and then splits the input data into days and creates SearchQueriesClustering instances for each of the daily DataFrames, which are stored inside the `daily_clusters` attribute dictionary. In turn, for each day it aggregates the data, cleans it, and uses `make_clusters()` on each of the daily instances. If `score=True`, it also scores the clusters and stores the results in the higher-level SearchQueriesClustering instance's `daily_scores` attribute.
Note that analysis methods that operate on a single DataFrame must be executed within a daily SearchQueriesClustering instance, since if daily clustering was performed (as opposed to the higher-level, general clustering way), results can only be found within such instances. In other words, some analysis can only be performed per day, and if that is the case, it should be executed within one of the `daily_clusters` dictionary's SearchQueriesClustering instances. Executing the methods on a higher-level SearchQueriesClustering will not work.
```python
def most_common_words_in_cluster(self, cluster_number,
                                 labeled_queries=None,
                                 n_common=5,
                                 display_relevant_queries='top',
                                 verbosity=25,
                                 show_non_relevant_examples=True)
```
displays the n most common individual words in a `labeled_queries` DataFrame, which is the result of any of the clustering algorithms. A cluster number should be passed to the method, where the possible numbers are the labels in `labeled_queries`.
If a `labeled_queries` DataFrame is not specified, the last results from the instance `results` dictionary will be used. If `display_relevant_queries` is set (currently only the 'top' option is implemented), queries that contain the most common word will also be shown in the output.
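A minimal sketch of the word counting idea behind this method (illustrative; the real method also handles the display of relevant queries):

```python
from collections import Counter

# Hypothetical sketch: count individual words across all queries of a cluster.
cluster_queries = ["logo design", "minimalist logo", "logo for business"]
word_counts = Counter(word for query in cluster_queries for word in query.split())
print(word_counts.most_common(5))  # e.g. [('logo', 3), ('design', 1), ...]
```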
`def get_cluster(self, cluster_number, labeled_queries=None)` is a helper function to get a cluster. If a `labeled_queries` DataFrame is not specified, the last results from the instance `results` dictionary will be used. `cluster_number` should be passed explicitly as a keyword argument. Returns a DataFrame.
`def show_cluster(self, **kwargs)` shows a cluster's number, its name (calls `infer_cluster_name()`) and the unique queries in the cluster. If a `labeled_queries` DataFrame is not specified, the last results from the instance `results` dictionary will be used. Uses `get_cluster()` to fetch the cluster data.
`def show_some_clusters(self, labeled_queries=None)` is a wrapper around `show_cluster()` that displays random clusters for analysis purposes. If a `labeled_queries` DataFrame is not specified, the last results from the instance `results` dictionary will be used.
`def get_cluster_as_dict(self, cluster_number, include_similarities=False, *args, **kwargs)` will return the cluster in dictionary format, including word2vec similarities if `include_similarities=True`. Used in `name_all_clusters()`.
```python
def infer_cluster_name(self,
                       cluster,
                       method=config['infer_name']['method'],
                       max_n_names=config['infer_name']['max'],
                       n_similarities=config['infer_name']['n_sim'])
```
gets a cluster as produced by `get_cluster()` and names it according to a naming method (the default is set in `config.yaml`).
`max_n_names` is the maximum number of words in the name (e.g. logo_design is a two-word name while dropshipping is a one-word name).
`method` can be:

- `magnitude`: counts the number of individual words (not complete search queries) in the cluster and selects words that appear in the same order of magnitude. For example, if the topmost counted word appears 1500 times, the second 1200 and the third 300, then only the first two words will be selected since they are in the thousands.
- `half`: keeps including words that appear at least half as many times as the previously added word. For example, if the topmost word appears 1200 times and the second topmost appears only 500 (less than 1200/2=600), then it would not be part of the name.
- `lda`: according to an LDA algorithm. Still being tested in this version.
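A small sketch of the `half` heuristic as described above (a hypothetical helper; the actual naming code may differ in details such as word ordering and the joining character):

```python
from collections import Counter

# Hypothetical sketch of the 'half' naming heuristic: keep adding the next most
# common word as long as it appears at least half as often as the previous one.
def infer_name_half(cluster_queries, max_n_names=3):
    counts = Counter(w for q in cluster_queries for w in q.split()).most_common()
    name_words = [counts[0][0]]
    for (word, n), (_, prev_n) in zip(counts[1:], counts):
        if len(name_words) >= max_n_names or n < prev_n / 2:
            break
        name_words.append(word)
    return "_".join(name_words)

queries = ["logo design", "logo design", "minimalist logo design", "logo", "logo maker"]
print(infer_name_half(queries))  # -> "logo_design"
```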
`def name_all_clusters(self, labeled_queries=None, include_similarities=True, return_pandas=False, dump_csv=True)` is a wrapper function that names all clusters and returns the results either as a dictionary or a Pandas DataFrame. Note that the method calls `get_cluster_as_dict()`, which in turn calls `infer_cluster_name()` (i.e. it does not call the latter directly).
`def score_cluster(self, cluster_number, method=config['scoring']['method'])` will score a given cluster in terms of 'purity' or 'consistency'. A "stable cluster" would be one that keeps a relatively high score over time.
`method`'s default value is set in `config.yaml` and can be one of the following:
- `query_proportion` scores the cluster by calculating the proportion of the topmost repeating query out of all queries.
- `word_proportion` scores the cluster by calculating the proportion of the topmost repeating word out of all words in all the queries in the cluster.
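A minimal sketch of the two scoring methods as described (hypothetical helper functions; the real `score_cluster()` works on the instance's labeled queries):

```python
from collections import Counter

# Hypothetical sketches of the two cluster scoring methods.
def query_proportion(cluster_queries):
    # share of the most frequent complete query among all queries
    counts = Counter(cluster_queries)
    return counts.most_common(1)[0][1] / len(cluster_queries)

def word_proportion(cluster_queries):
    # share of the most frequent word among all words in the cluster's queries
    words = [w for q in cluster_queries for w in q.split()]
    return Counter(words).most_common(1)[0][1] / len(words)

queries = ["logo design", "logo design", "logo maker"]
print(query_proportion(queries))  # 2/3
print(word_proportion(queries))   # 'logo' appears 3 times out of 6 words -> 0.5
```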
`def score_all_clusters(self, labeled_queries=None, dump_csv=False, return_pandas=False)` will score all clusters in a `labeled_queries` DataFrame (or, if `None`, the DataFrame in the last instance `results` entry) and return either a dictionary or a DataFrame according to `return_pandas`. This function is used within the `make_daily_clusters()` pipeline, which then saves the results to the instance attribute `daily_scores`.
`def create_daily_scores_table(self)`, based on the instance attribute `daily_scores` (which is populated in the `make_daily_clusters()` pipeline), creates a scores table for all days. Stores the result in the instance attribute `daily_scores_table`.
`def get_cluster_daily_scores(self, cluster_name: str)` gets scores over days for a cluster based on its name representation, if one exists.
`def get_cluster_daily_queries(self, cluster_name: str)` gets queries over days for a cluster based on its name representation, if one exists.
`def get_cluster_daily_queries_and_scores(self, cluster_name)` is a unification of the two functions above. Not yet implemented in this version.
`def dump_csv(df, filename)` saves clustering output to a csv file.
```python
def filter_daily(self,
                 df=None,
                 date_column=config['date_col'],
                 interval=config['daily']['interval'],
                 max_total_days=config['daily']['max_total_days'],
                 skip_start=config['daily']['skip_start'])
```
filters a multi-day data frame into separate days, saving the results to the instance attribute `daily_dfs`.
- `skip_start` skips the first dates in the data (so if our first day in the data is `2020-01-01` and we would like to start processing from `2020-01-04`, we would set the value to `3`).
- `max_total_days` is the number of days we want in our daily process. It is possible to get a smaller number of days if the dataset spans fewer days than we set.
- `interval` indicates the gap between the days we take out of a data frame in a daily process. If it is set to `7` (and `skip_start=0`), then our first day would be `2020-01-01`, the second `2020-01-08`, the third `2020-01-15`, etc.
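A minimal sketch of which dates these three settings select (illustrative; the real method also slices the DataFrame per selected date):

```python
# Hypothetical sketch of the date selection done by filter_daily().
def select_dates(all_dates, skip_start=0, interval=7, max_total_days=4):
    # all_dates: sorted list of unique dates present in the data
    return all_dates[skip_start::interval][:max_total_days]

dates = [f"2020-01-{d:02d}" for d in range(1, 32)]
print(select_dates(dates, skip_start=0, interval=7, max_total_days=4))
# -> ['2020-01-01', '2020-01-08', '2020-01-15', '2020-01-22']
```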
def filter_one_day(sqc_obj)
is a helper function that returns one random day from a multi-day DataFrame.
`def document_process_details(self, filename=None)` documents the details of the process or pipeline (saves it to a text file named according to the parameters etc., or simply to the string passed to `filename`). Example of the file:
```
csv file: some_data.csv
doc2vec model: Doc2Vec(dm/m,d75,n5,w2,mc2,s0.001,t3)
clustering model: HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, cluster_selection_epsilon=0,
    cluster_selection_method='leaf', core_dist_n_jobs=4,
    gen_min_span_tree=False, leaf_size=5000,
    match_reference_implementation=False, memory=Memory(location=None),
    metric='euclidean', min_cluster_size=2, min_samples=1, p=None,
    prediction_data=False)
```
This method is called from inside the make_daily_clusters()
pipeline.
`def get_available_cluster_names(self)` returns all available cluster names in a daily clustering. (This value is currently set inside `create_daily_scores_table`, but naturally should not be; to be improved in the next version.)
def plot_stats(self, **kwargs)
is useful for EDA, right after loading the data.
`def plot_daily_scores(self, cluster_names: list)` plots a line plot of cluster scores for the clusters in `cluster_names`. Make sure that clusters are named and scored.
`def plot_n_random_scores(self, n=5)` plots n random clusters' scores based on `plot_daily_scores()`.
There are two ways the class can be used: general clustering and daily clustering (see General Clustering vs. Daily Clustering above). Following is a demonstration of how to use both.
General clustering requires the user to manually create the pipeline, and would usually be best for a one-shot DataFrame clustering, whether that DataFrame consists of one day or more. This would be good for debugging or interactive mode data exploration.
An example of general clustering would look as follows:
```python
# cleaning directory on startup (use carefully)
if config['global']['drastic_cleanup']:
    SearchQueriesClustering.drastic_cleanup()

# creating an instance and loading the data
sqc = SearchQueriesClustering(DATA_FILE)

# aggregation
sqc.aggregate_df()

# cleanup
sqc.clean_search_queries()
sqc.remove_i_will()

# creating a doc2vec embedding and performing HDBScan clustering
labeled_queries, clustering_obj = sqc.make_clusters()

# saving model details to file
sqc.document_process_details()

# showing some clusters
sqc.show_some_clusters()

# fetching a cluster as a dict
sqc.get_cluster_as_dict(5)

# manual inference of cluster name
some_cluster = sqc.get_cluster(1)
cluster_name, similarities = sqc.infer_cluster_name(some_cluster)

# naming all clusters and dumping results to a csv file based on model and clustering parameters
sqc.name_all_clusters(dump_csv=True)

# save a DataFrame of scored clusters and also dump to csv
scored = sqc.score_all_clusters(dump_csv=True)
```
The most probable scenario is one where data of multiple days is passed to the class constructor as a csv file. In that case, `make_daily_clusters()` is used, and it conveniently includes all the preprocessing methods as well as the clustering itself for each day separately.
Inside `make_daily_clusters()` the following methods will be executed automatically:

- `aggregate_df()`
- `clean_search_queries()`
- `remove_i_will()`
- `make_clusters()`

The following will be executed if the `make_daily_clusters()` argument `score=True`, which is the default:

- `score_all_clusters()`
Note: the naming method name_all_clusters()
is called from inside score_all_clusters()
.
Once `make_daily_clusters()` is done, there are a few handy out-of-the-box functions that can be used to analyse the results:

- Use `create_daily_scores_table()` to create a DataFrame in which you can view scores across time.
- Use `plot_daily_scores(['query a', 'query b', 'query c'])` to show a line plot of cluster scores over days for some specified queries.
- Use `plot_n_random_scores(n=5)` to show 5 random clusters' scores over days.
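Putting it together, a hedged sketch of a daily clustering session (the csv filename and the cluster names passed to the plotting call are hypothetical; the method calls follow the pipeline described above):

```python
# Hypothetical end-to-end daily clustering session.
# 'multi_day_searches.csv' is an illustrative filename under data_dir.
sqc = SearchQueriesClustering('multi_day_searches.csv')

# split into days, then aggregate, clean, embed, cluster, name and score each day
sqc.make_daily_clusters(score=True)

# tabular view of cluster scores across days
sqc.create_daily_scores_table()

# plot scores over days for specific clusters, or for a few random ones
sqc.plot_daily_scores(['logo design', 'resume writing'])
sqc.plot_n_random_scores(n=5)
```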
- Creating a Cluster class that would handle all cluster methods and properties (at the moment, everything is handled by SearchQueriesClustering, the clusterer class).
- Pythonize the module (private methods to be preceded with a _, etc.).
- Addition of other clustering algorithms (DBScan and Agglomerative Clustering, to name a couple).
- Optimal parameters for Doc2Vec and the clustering algorithm (probably HDBScan).
- Testing different TaggedDocument approaches.
- Understanding the interaction between the doc2vec embedding matrix's size and content and how it affects the different clustering algorithms.
- Further improving both cluster naming and scoring in order to obtain more stable clusters over days.