
The 5 Basic Statistics Concepts Data Scientists Need to Know

Statistics can be a powerful tool when performing the art of Data Science (DS). From a high-level view, statistics is the use of mathematics to perform technical analysis of data. A basic visualisation such as a bar chart might give you some high-level information, but with statistics we get to operate on the data in a much more information-driven and targeted way. The math involved helps us form concrete conclusions about our data rather than just guesstimating.

Using statistics, we can gain deeper and more fine-grained insights into how exactly our data is structured and, based on that structure, how we can optimally apply other data science techniques to get even more information. Today, we’re going to look at 5 basic statistics concepts that data scientists need to know and how they can be applied most effectively!

Statistical Features

Statistical features are probably the most used statistics concept in data science. They’re often the first stats technique you would apply when exploring a dataset and include things like bias, variance, mean, median, percentiles, and many others. It’s all fairly easy to understand and implement in code! Check out the graphic below for an illustration.

A basic box plot

The line in the middle is the median value of the data. Median is used over the mean since it is more robust to outlier values. The first quartile is essentially the 25th percentile; i.e. 25% of the points in the data fall below that value. The third quartile is the 75th percentile; i.e. 75% of the points in the data fall below that value. The min and max values represent the lower and upper ends of our data range.

A box plot perfectly illustrates what we can do with basic statistical features:

  • When the box plot is short it implies that many of your data points are similar, since there are many values in a small range

  • When the box plot is tall it implies that many of your data points are quite different, since the values are spread over a wide range

  • If the median value is closer to the bottom then we know that most of the data has lower values. If the median value is closer to the top then we know that most of the data has higher values. Basically, if the median line is not in the middle of the box then it is an indication of skewed data.

  • Are the whiskers very long? That means your data has a high standard deviation and variance, i.e. the values are spread out and highly varying. If you have long whiskers on one side of the box but not the other, then your data may be highly varying only in one direction.

All of that information from a few simple statistical features that are easy to calculate! Try these out whenever you need a quick yet informative view of your data.
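As a quick illustration, here’s a minimal sketch in Python (assuming NumPy and Matplotlib are installed; the data is made up for demonstration) that computes these statistical features and draws the box plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 1000 samples from a skewed distribution
rng = np.random.default_rng(seed=0)
data = rng.exponential(scale=2.0, size=1000)

# The statistical features a box plot summarises
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
print(f"median={median:.2f}, Q1={q1:.2f}, Q3={q3:.2f}")
print(f"mean={data.mean():.2f}, std={data.std():.2f}, var={data.var():.2f}")

# Draw the box plot itself
plt.boxplot(data)
plt.title("A basic box plot")
plt.show()
```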


Probability Distributions

We can define probability as the chance that some event will occur. In data science this is commonly quantified in the range of 0 to 1, where 0 means we are certain the event will not occur and 1 means we are certain it will occur. A probability distribution is then a function which represents the probabilities of all possible values in the experiment. Check out the graphic below for an illustration.

Common Probability Distributions. Uniform (left), Normal (middle), Poisson (right)

  • A Uniform Distribution is the most basic of the 3 we show here. It has a single value which only occurs within a certain range, while anything outside that range is just 0. It’s very much an “on or off” distribution. We can also think of it as an indication of a categorical variable with 2 categories: 0 or the value. Your categorical variable might have multiple values other than 0, but we can still visualize it in the same way as a piecewise function of multiple uniform distributions.

  • A Normal Distribution, commonly referred to as a Gaussian Distribution, is specifically defined by its mean and standard deviation. The mean value shifts the distribution spatially and the standard deviation controls the spread. The important distinction from other distributions (e.g. Poisson) is that the standard deviation is the same in all directions. Thus with a Gaussian distribution we know the average value of our dataset as well as the spread of the data, i.e. whether it is spread over a wide range or highly concentrated around a few values.

  • A Poisson Distribution is similar to the Normal but with an added factor of skewness. With a low value for the skewness a Poisson distribution will have relatively uniform spread in all directions, just like the Normal. But when the skewness value is high in magnitude, the spread of our data will be different in different directions; in one direction it will be very spread out and in the other it will be highly concentrated.

概率分布

我们可以将概率定义为某个事件发生的几率。 在数据科学中,这通常在0到1的范围内量化,其中0表示我们确定这不会发生,1表示我们确定它将发生。 然后,概率分布是表示实验中所有可能值的概率的函数。 请查看下面的图形以获取说明。

常见的概率分布。 均匀分布(左),正态分布(中),泊松分布(右)

       1. 均匀分布是我们在这里展示的3浮屠中最基本的。 它有一个值只出现在一定范围内,而超出该范围的任何东西只有0。这是一个“开关”分布。 我们还可以将其视为具有2个类别的分类变量的指示:0或特定值。 你的分类变量可能具有除0以外的多个值,但我们仍然可以将其视为多个均匀分布的分段函数。

       2. 正态分布,通常称为高斯分布,由其均值和标准差定义。 平均值控制在空间上的位置,标准差控制分布的宽窄。 与其他分布(例如泊松)的重要区别在于标准差在所有方向上是相同的。 因此,利用高斯分布,我们知道数据集的平均值以及数据的扩展,即它是在很大范围内扩展还是高度集中在几个值附近。

       3. 泊松分布类似于正态分布但具有附加的偏斜因子。 对于低的偏斜值,泊松分布将在所有方向上具有相对均匀的扩展,就像nn正态分布一样。 但是当nn偏斜值的幅度较大时,我们的数据在不同方向上的传播会有所不同; 在一个方向上它将非常分散,而在另一个方向上它将高度集中。

There are many more distributions that you can dive deep into, but those 3 already give us a lot of value. We can quickly see and interpret our categorical variables with a Uniform Distribution. If we see a Gaussian Distribution we know that there are many algorithms that by default will perform well specifically with Gaussian data, so we should go for those. And with Poisson we’ll see that we have to take special care and choose an algorithm that is robust to the variations in the spatial spread.
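As a minimal sketch (again assuming NumPy and Matplotlib; the parameters are illustrative choices, not from the original figure), we can draw samples from each of these three distributions and compare their shapes:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Draw 10,000 samples from each distribution
uniform = rng.uniform(low=-1.0, high=1.0, size=10_000)
normal = rng.normal(loc=0.0, scale=1.0, size=10_000)  # mean=0, std=1
poisson = rng.poisson(lam=3.0, size=10_000)           # right-skewed for small lambda

# Histogram each one side by side, as in the graphic above
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, samples, title in zip(axes, [uniform, normal, poisson],
                              ["Uniform", "Normal", "Poisson"]):
    ax.hist(samples, bins=30)
    ax.set_title(title)
plt.show()
```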

Dimensionality Reduction

The term Dimensionality Reduction is quite intuitive to understand. We have a dataset and we would like to reduce the number of dimensions it has. In data science this is the number of feature variables. Check out the graphic below for an illustration.

Dimensionality Reduction

The cube represents our dataset and it has 3 dimensions with a total of 1000 points. Now with today’s computing 1000 points is easy to process, but at a larger scale we would run into problems. However, just by looking at our data from a 2-dimensional point of view, such as from one side of the cube, we can see that it’s quite easy to divide all of the colours from that angle. With dimensionality reduction we would then project the 3D data onto a 2D plane. This effectively reduces the number of points we need to compute on to 100 (a 10×10×10 grid of points collapses onto a 10×10 face), a big computational saving!

Another way we can do dimensionality reduction is through feature pruning. With feature pruning we basically want to remove any features we see will be unimportant to our analysis. For example, after exploring a dataset we may find that out of the 10 features, 7 of them have a high correlation with the output but the other 3 have very low correlation. Then those 3 low correlation features probably aren’t worth the compute and we might just be able to remove them from our analysis without hurting the output.

The most common stats technique used for dimensionality reduction is Principal Component Analysis (PCA), which essentially creates vector representations of features showing how important they are to the output, i.e. their correlation. PCA can be used to do both of the dimensionality reduction styles discussed above. Read more about it in this tutorial.
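Here’s a minimal sketch of the projection style of dimensionality reduction using scikit-learn’s PCA (the dataset is hypothetical; in practice you would pass your own feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 1000 points with 3 feature dimensions
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 3))

# Project the 3D data down onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (1000, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```

For feature pruning, the explained variance ratios (or feature-output correlations) give a quantitative basis for deciding which dimensions are worth keeping.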


Over and Under Sampling

Over and Under Sampling are techniques used for classification problems. Sometimes, our classification dataset might be too heavily tipped to one side. For example, we have 2000 examples for class 1, but only 200 for class 2. That’ll throw off a lot of the Machine Learning techniques we try to use to model the data and make predictions! Over and Under Sampling can combat that. Check out the graphic below for an illustration.

Under and Over Sampling

On both the left and right sides of the image above, our blue class has far more samples than the orange class. In this case, we have 2 pre-processing options which can help in the training of our Machine Learning models.

Undersampling means we will select only some of the data from the majority class, using only as many examples as the minority class has. This selection should be done to maintain the probability distribution of the class. That was easy! We just evened out our dataset by taking fewer samples!

Oversampling means that we will create copies of our minority class in order to have the same number of examples as the majority class has. The copies will be made such that the distribution of the minority class is maintained. We just evened out our dataset without getting any more data!
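A minimal sketch of both options, assuming scikit-learn’s `resample` utility (the two-class dataset here is made up to mirror the 2000-vs-200 example above):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(seed=0)

# Hypothetical imbalanced dataset: 2000 majority samples, 200 minority samples
X_major = rng.normal(size=(2000, 2))
X_minor = rng.normal(loc=3.0, size=(200, 2))

# Undersampling: keep only as many majority samples as the minority class has
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor),
                        random_state=0)

# Oversampling: copy minority samples (with replacement) up to the majority count
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major),
                      random_state=0)

print(X_major_down.shape)  # (200, 2)
print(X_minor_up.shape)    # (2000, 2)
```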


Bayesian Statistics

Fully understanding why we use Bayesian Statistics requires us to first understand where Frequency Statistics fails. Frequency Statistics is the type of stats that most people think about when they hear the word “probability”. It involves applying math to analyze the probability of some event occurring, where specifically the only data we compute on is prior data.

Let’s look at an example. Suppose I gave you a die and asked you what were the chances of you rolling a 6. Well, most people would just say that it’s 1 in 6. Indeed, if we were to do a frequency analysis we would look at some data where someone rolled a die 10,000 times and compute the frequency of each number rolled; it would roughly come out to 1 in 6!

But what if someone were to tell you that the specific die that was given to you was loaded to always land on 6? Since frequency analysis only takes into account prior data, the evidence that was given to you about the die being loaded is not being taken into account.

Bayesian Statistics does take into account this evidence. We can illustrate this by taking a look at Bayes’ theorem:

Bayes’ Theorem
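For reference, the formula shown in the image is:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where H is our hypothesis (e.g. “the roll will be a 6”) and E is the new evidence.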

The probability P(H) in our equation is basically our frequency analysis; given our prior data, what is the probability of our event occurring. The P(E|H) in our equation is called the likelihood and is essentially the probability that our evidence is correct, given the information from our frequency analysis. For example, if you were to roll the die 10,000 times and the first 1000 rolls all came up 6, you’d start to get pretty confident that the die is loaded! The P(E) is the probability that the actual evidence is true. If I told you the die is loaded, can you trust me and say it’s actually loaded, or do you think it’s a trick?!

If our frequency analysis is very good then it’ll have some weight in saying that yes, our guess of 6 is true. At the same time, we take into account our evidence of the loaded die, judging whether it’s true or not based on both its own prior and the frequency analysis. As you can see from the layout of the equation, Bayesian statistics takes everything into account. Use it whenever you feel that your prior data will not be a good representation of your future data and results.
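As a sketch of the loaded-die example in Python (the prior and likelihood numbers here are made up for illustration, not from the original article):

```python
# Bayes' theorem for the loaded-die example: P(H|E) = P(E|H) * P(H) / P(E)
# H = "the die is loaded", E = "we observed a long run of 6s"

p_h = 0.01                # prior: how plausible "the die is loaded" seems up front
p_e_given_h = 0.95        # likelihood: chance of that run of 6s IF it is loaded
p_e_given_not_h = 0.0001  # chance of that run of 6s with a fair die

# Total probability of the evidence, P(E), via the law of total probability
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior: our updated belief that the die is loaded, given the evidence
p_h_given_e = p_e_given_h * p_h / p_e
print(f"P(loaded | evidence) = {p_h_given_e:.4f}")  # ~0.9897
```

Even with a skeptical 1% prior, strong evidence pushes the posterior close to certainty; this is exactly the interplay between the prior and the likelihood described above.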


A Data Science book

Want to learn more about data science? This book is seriously the best I’ve seen on the subject, offering practical and easy-to-understand lessons.

Like to learn?

Follow me on Twitter where I post all about the latest and greatest AI, Technology, and Science!

