Customer segmentation using Machine Learning K-Means Clustering

by Rajshekhar Bodhale | Nov 17, 2017 | Machine Learning

Most platforms built on information technology generate huge amounts of data. This data is called Big Data, and it carries a great deal of business intelligence. It crosses boundaries to serve different goals and opportunities, and it gives us an opportunity to apply Machine Learning to create value for clients.

Problems

We have Big Data based platforms in the Accounting and IoT domains that keep generating customer behavior and device monitoring data.
Identifying the targeted customer base, or deriving patterns along different dimensions, is key and gives these platforms a real edge.

Idea

Imagine you have thousands of customers using your platform and a vast amount of data that keeps being generated. Any insight into that data adds real value.

As part of the Machine Learning initiatives and innovations that the Patterns7 team keeps trying, we experimented with K-Means Clustering, and the value it brings to our clients is awesome.

Solution

Clustering is the process of partitioning a group of data points into a small number of clusters. In this part, you will learn how K-Means Clustering works and how to implement it.

K-Means Clustering

K-means clustering is a method commonly used to automatically partition a data set into k groups. It is an unsupervised learning algorithm, so it needs no labeled examples.

K-Means Objective

The objective of k-means is to minimize the total sum of the squared distances from every point to its corresponding cluster centroid. Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where µi is the mean of the points in Si.
The k-means algorithm is guaranteed to converge to a local optimum, which is not necessarily the global one.
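
To make the objective concrete, here is a minimal sketch that computes the within-cluster sum of squares for a given assignment of points to clusters. The function name `wcss` and the four toy points are made up for illustration; they are not part of the customer dataset used later.

```python
import numpy as np

def wcss(points, labels, k):
    """Within-cluster sum of squares for a given assignment of points to k clusters."""
    total = 0.0
    for i in range(k):
        members = points[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        centroid = members.mean(axis=0)  # mu_i, the mean of the points in S_i
        total += ((members - centroid) ** 2).sum()
    return total

# Toy data (made up): four 2-D points that form two obvious groups.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([0, 0, 1, 1])  # groups nearby points together
bad = np.array([0, 1, 0, 1])   # pairs up distant points

wcss_good = wcss(points, good, k=2)
wcss_bad = wcss(points, bad, k=2)
print(wcss_good, wcss_bad)
```

The assignment that groups nearby points gives a far smaller value, which is exactly what k-means searches for.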

Business Uses

This is a versatile algorithm that can be used for many kinds of grouping. Some examples of use cases are:

Behavioral segmentation: segment by purchase history; segment by activities on an application, website, or platform.
Inventory categorization: group inventory by sales activity.
Sorting sensor measurements: detect activity types in motion sensors; group images.
Detecting bots or anomalies: separate valid activity groups from bots.

K-Means Clustering Algorithm

Step 1: Choose the number of clusters, K.
Step 2: Select K points at random as the initial centroids (not necessarily from your dataset).
Step 3: Assign each data point to the closest centroid. This forms K clusters.
Step 4: Compute and place the new centroid of each cluster.
Step 5: Reassign each data point to the new closest centroid. If any reassignment took place, go back to Step 4; otherwise the algorithm has finished.
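
The steps above can be sketched in a few lines of NumPy. This is an illustrative from-scratch version, not the scikit-learn implementation used later in this post. As an assumption, it picks the initial centroids from the data points themselves, and the function name `kmeans` and the toy data are made up.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K distinct data points at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no reassignment took place, so we are done
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[i] = members.mean(axis=0)
    return labels, centroids

# Toy data (made up): two pairs of points far apart from each other.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which cluster ends up with index 0 or 1 depends on the random initialization, so only the grouping itself is meaningful, not the index.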

Example: Applying K-Means Clustering to Customer Expenses and Invoices Data in Python

For Python I am using the Spyder editor. As an example, we'll show how the K-means algorithm works with customer expenses and invoices data. We have data for 500 customers, and we'll look at two customer features: Customer Invoices and Customer Expenses. In general, this algorithm can be used with any number of features, so long as the number of data samples is much greater than the number of features.

Step 1: Clean and Transform Your Data

For this example, we've already cleaned the data and completed some simple transformations. A sample of the data as a pandas DataFrame is shown below. First import the following Python libraries:

numpy: mathematical tools for any numeric work in our code.
matplotlib.pyplot: helps to plot nice charts.
pandas: imports and manages the dataset.

Step 2: We want to apply clustering to Total Expenses and Total Invoices, so select those columns into X.

The chart below shows the dataset for 500 customers, with the Total Invoices on the x-axis and Total Expenses on the y-axis.

Step 3: Choose K and Run the Algorithm

Choosing K

The algorithm described above finds the clusters and the data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but a reasonable estimate can be obtained using the following technique.

One of the metrics commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since adding clusters always reduces the distance from data points to their nearest centroid, increasing K always decreases this metric, to the extreme of reaching zero when K equals the number of data points. Thus, this metric cannot be used as the sole target. Instead, the mean distance to the centroid is plotted as a function of K, and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.
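
The behavior of this metric can be checked exactly on a tiny example. The sketch below uses the within-cluster sum of squares as the metric and, instead of running K-means, brute-forces the best possible split for every K. This is only feasible because the toy data is one-dimensional, where the best clusters are always contiguous runs of the sorted values. The data and function names are made up for illustration.

```python
from itertools import combinations

def segment_cost(values):
    """Sum of squared distances from each value to the mean of the segment."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_wcss(sorted_values, k):
    """Smallest WCSS over all splits of sorted 1-D values into k contiguous groups."""
    n = len(sorted_values)
    best = float("inf")
    # Choose k-1 cut positions between neighbouring values.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        cost = sum(segment_cost(sorted_values[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, cost)
    return best

# Toy data (made up): three tight groups around 1, 5 and 9.
values = sorted([0.8, 1.0, 1.2, 4.9, 5.0, 5.1, 8.8, 9.0, 9.2])
curve = {k: best_wcss(values, k) for k in range(1, len(values) + 1)}
for k, cost in curve.items():
    print(k, round(cost, 3))
```

The printed curve drops steeply up to K=3, flattens afterwards, and reaches zero when every point is its own cluster, so the elbow sits at the true number of groups.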

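Using the elbow method, we find that the optimal number of clusters is K=3. For this example we use the Python package scikit-learn for the computations, as shown in the code listing further below. The listing uses the `inertia_` attribute of the fitted model, the within-cluster sum of squares, as the distance metric.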

Step 4: Review the Results

The chart below shows the results. Visually, you can see that the K-means algorithm splits the customers into three groups based on the invoice feature. Each cluster centroid is marked with a yellow circle. The customers are now divided into three segments:

“Careful”: customers whose income is low and who also spend less.
“Standard”: customers whose income is average and who spend less.
“Target”: customers whose income is high and who spend more.
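
K-means numbers its clusters arbitrarily, so the index that corresponds to each segment can change between runs or datasets. One way to attach the segment names reliably is to rank the clusters by the invoice coordinate of their centroids. The sketch below assumes that approach; the centroid values and labels are invented stand-ins for `kmeans.cluster_centers_` and the output of `fit_predict`.

```python
import numpy as np

# Hypothetical centroids as (Total Invoices, Total Expenses); made-up numbers
# standing in for kmeans.cluster_centers_.
centroids = np.array([[20.0, 15.0], [90.0, 80.0], [55.0, 20.0]])
# Hypothetical cluster index per customer, standing in for fit_predict output.
labels = np.array([0, 2, 1, 1, 0, 2])

segment_names = ["Careful", "Standard", "Target"]  # low, medium, high invoices

# Rank the clusters by their invoice coordinate (column 0), lowest first.
order = np.argsort(centroids[:, 0])
name_of_cluster = {int(cluster): segment_names[rank] for rank, cluster in enumerate(order)}

customer_segments = [name_of_cluster[int(label)] for label in labels]
print(name_of_cluster)
print(customer_segments)
```

With a mapping like this, the plot legend and any downstream report follow the centroid positions rather than hard-coded cluster indices.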

# K-Means Clustering

# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Load the customer Expenses/Invoices dataset with pandas
dataset = pd.read_csv('Expense_Invoice.csv')
# Column 3 is plotted as Total Invoices (x-axis), column 2 as Total Expenses (y-axis)
X = dataset.iloc[:, [3, 2]].values

# Use the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares (WCSS)')
plt.show()

# Apply k-means to the customer dataset with K=3
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualize the clusters. Cluster indices are arbitrary, so the index-to-name
# mapping below holds for this dataset and random_state only.
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Careful(c1)')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Standard(c2)')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Target(c3)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, c='yellow',
            label='Centroids')
plt.title('Clusters of customer Invoices & Expenses')
plt.xlabel('Total Invoices')
plt.ylabel('Total Expenses')
plt.legend()
plt.show()