As a data scientist, you need to know how reliable your results are. The data science pipeline is a planned process with defined parameters, which lets you evaluate each stage and how it contributed to the final output.
What exactly is probability?
Probability is a measure of how likely an event is to occur. It is an important component of predictive analysis because it lets you examine the math behind your results.
Consider tossing a coin and seeing if it comes up heads (H) or tails (T). When all outcomes are equally likely, the number of ways an event can occur divided by the total number of possible outcomes gives you its probability.
- If we wish to calculate the probability of heads, we can do so as 1 (Heads) / 2 (Heads and Tails) = 0.5.
- If we wish to calculate the probability of tails, we can do so as 1 (Tails) / 2 (Heads and Tails) = 0.5.
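The coin calculation above can be sketched in a few lines of Python. This is only an illustration, assuming every outcome in the sample space is equally likely; the function name `classical_probability` is made up for this example.

```python
from fractions import Fraction


def classical_probability(favorable, sample_space):
    """Return favorable outcomes / all outcomes, assuming equally likely outcomes."""
    favorable = set(favorable)
    sample_space = set(sample_space)
    if not sample_space:
        raise ValueError("sample space must not be empty")
    if not favorable <= sample_space:
        raise ValueError("favorable outcomes must belong to the sample space")
    return Fraction(len(favorable), len(sample_space))


coin = {"H", "T"}
p_heads = classical_probability({"H"}, coin)
p_tails = classical_probability({"T"}, coin)
print(float(p_heads), float(p_tails))  # 0.5 0.5
```

The same function works for any finite sample space, for example the six faces of a die.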
But there is a distinction to be made between likelihood and probability. Probability measures how likely a given outcome is when the underlying model or hypothesis is held fixed. Likelihood works the other way around: you hold the observed data fixed and ask how well a given hypothesis explains it.
To simplify, probability is concerned with possible outcomes, whereas likelihood is concerned with hypotheses.
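To make the distinction concrete, here is a minimal sketch using coin flips. The observed data (6 heads in 10 tosses) and the list of candidate values for the coin's bias are assumptions chosen for illustration only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability: the hypothesis is fixed (a fair coin), the outcome varies.
prob_6_heads = binomial_pmf(6, 10, 0.5)
print(prob_6_heads)  # about 0.205

# Likelihood: the data is fixed (6 heads in 10 tosses), the hypothesis varies.
candidates = [0.3, 0.5, 0.6, 0.8]
likelihoods = {p: binomial_pmf(6, 10, p) for p in candidates}
best = max(likelihoods, key=likelihoods.get)
print(best)  # 0.6, the candidate that best explains the observed data
```

The same formula is evaluated in both cases. What changes is which quantity is treated as known and which one is being asked about.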
“Mutually exclusive events” is another term to be familiar with. These are events that cannot occur at the same time. You cannot, for example, go right and left at the same moment. If we flip a coin, we can only get heads or tails, not both.
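One practical consequence is that the probabilities of mutually exclusive events can simply be added. The sketch below checks this on a six-sided die; the event names `low`, `high`, and `even` are arbitrary choices for the example.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event on a fair die (equally likely faces)."""
    return Fraction(len(event & die), len(die))


low = {1, 2}
high = {5, 6}
even = {2, 4, 6}

# low and high share no outcomes, so they are mutually exclusive
# and their probabilities add up directly.
print(low.isdisjoint(high))                          # True
print(prob(low | high) == prob(low) + prob(high))    # True

# low and even overlap (both contain 2), so plain addition overcounts
# and the shared part has to be subtracted once.
print(prob(low | even) == prob(low) + prob(even) - prob(low & even))  # True
```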
Types of Probability
- Theoretical Probability: This is based on reasoning rather than experiment and focuses on how likely an event is to occur in theory. It is the result you expect before running any trials. In the case of heads and tails, the theoretical probability of landing on heads is 0.5, or 50%.
- Experimental Probability: This focuses on how frequently an event occurs during the course of an experiment. Considering the heads and tails example, if we toss a coin ten times and it lands on heads six times, the experimental probability of the coin landing on heads is 6/10, or 60%.
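The gap between the two types tends to shrink as the number of trials grows. The simulation below is a rough sketch of that idea, assuming a fair coin and a fixed random seed so the run is repeatable; the function name is hypothetical.

```python
import random


def experimental_probability(n_tosses, seed=0):
    """Simulate n_tosses fair coin flips and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# With few tosses the result can be far from 0.5;
# with many tosses it settles close to the theoretical value.
for n in (10, 1_000, 100_000):
    print(n, experimental_probability(n))
```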
Probability with Conditions
Conditional probability is the chance of an event or outcome occurring given that another event or outcome has already occurred. For example, if you work for an insurance company, you may wish to determine the likelihood of a person being able to pay for their insurance given that they have taken out a mortgage.
By using other factors in the dataset, conditional probability helps data scientists produce more accurate models and outputs.
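As a rough sketch of the insurance example, conditional probability can be estimated from counts. The records below are invented purely for illustration, and the function name is hypothetical.

```python
# Hypothetical customer records: (has_mortgage, pays_insurance).
records = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, True),
    (True, True), (False, False),
]


def p_pays_given_mortgage(rows):
    """Estimate P(pays | mortgage) = count(pays and mortgage) / count(mortgage)."""
    with_mortgage = [pays for has_mortgage, pays in rows if has_mortgage]
    if not with_mortgage:
        raise ValueError("no customers with a mortgage in the data")
    return sum(with_mortgage) / len(with_mortgage)


p_pays = sum(pays for _, pays in records) / len(records)
print(p_pays)                          # 0.6 across all customers
print(p_pays_given_mortgage(records))  # 0.8 among customers with a mortgage
```

In this made-up data, knowing that a customer has a mortgage changes the estimate from 0.6 to 0.8, which is the kind of extra information a model can use.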
Distribution
A probability distribution is a statistical function that describes the possible values of a random variable and the probability of each. The values fall between a minimum and a maximum, and where they sit on a distribution graph depends on the data and can be checked with statistical tests.
You can determine what sort of distribution you are using based on the type of data used in the project. I’ll divide them into two groups: discrete distribution and continuous distribution.
Discrete Distribution
When data has a discrete distribution, it can only take on particular values or has a restricted number of outcomes. For example, if you roll a die, your only options are 1, 2, 3, 4, 5, and 6.
There are various kinds of discrete distribution. As an example:
- When all outcomes are equally likely, the distribution is said to be discrete uniform. When we roll a six-sided die, there is an equal chance, 1/6, that it will land on 1, 2, 3, 4, 5, or 6. The drawback of the discrete uniform distribution, however, is that it gives data scientists little useful information to apply.
- Another sort of discrete distribution is the Bernoulli distribution, which has only two possible outcomes: yes or no, 1 or 0, true or false. When flipping a coin, this can be used to model whether it lands on heads or tails. The Bernoulli distribution gives us the probability of one of the outcomes (p), and we can subtract it from the total probability (1) to get the probability of the other outcome, which is represented as 1 - p.
- The Binomial Distribution describes the number of successes in a sequence of independent Bernoulli trials, each with only two possible outcomes: success or failure. When flipping a fair coin, the probability of heads is always 0.5, or 1/2, in every trial.
- The Poisson Distribution describes how many times an event is expected to occur over a given time or distance. It focuses on the frequency of an event occurring in a given interval rather than the occurrence itself. For example, if an average of 12 cars go along a specific road at 11 a.m. every day, we can use the Poisson distribution to calculate the probability of seeing any given number of cars on that road at 11 a.m. on a particular day.
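The four discrete distributions above each have a simple probability mass function. The sketch below writes them out with the standard library only; the function names are illustrative rather than taken from any particular package, and the Poisson rate of 12 is assumed to be an average.

```python
from math import comb, exp, factorial


def uniform_pmf(n_outcomes):
    """Discrete uniform: each of n_outcomes values is equally likely."""
    return 1 / n_outcomes


def bernoulli_pmf(x, p):
    """Bernoulli: P(X = 1) = p and P(X = 0) = 1 - p."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p if x == 1 else 1 - p


def binomial_pmf(k, n, p):
    """Binomial: probability of k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson: probability of k events when lam events are expected on average."""
    return lam**k * exp(-lam) / factorial(k)


print(uniform_pmf(6))            # one face of a fair die, 1/6
print(bernoulli_pmf(0, 0.5))     # tails on a fair coin, 0.5
print(binomial_pmf(3, 10, 0.5))  # exactly 3 heads in 10 tosses, 0.1171875
print(poisson_pmf(12, 12))       # exactly 12 cars when 12 are expected, about 0.114
```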
Continuous Distribution
Continuous distributions, as opposed to discrete distributions, describe variables that can take any value within a range. Because the data is continuous, these distributions usually appear as a curve or a line on a graph.
- The Normal Distribution is the most commonly used, and you may have heard of it. Values are distributed symmetrically around the mean, with no skew. When the data is plotted, it takes the shape of a bell, with the peak at the mean. Characteristics such as height and IQ scores, for example, approximately follow a normal distribution.
- The T-Distribution is a sort of continuous distribution that is employed when the population standard deviation (σ) is unknown and the sample size is small (n < 30). Its bell curve has a similar shape to the normal distribution, but with heavier tails. For instance, if we want to estimate average chocolate bar sales from a large number of days, we would use the normal distribution. But if we only have a small sample of days to work from, we would use the t-distribution.
- The exponential distribution is a sort of continuous probability distribution that focuses on the time it takes for an event to occur. For example, if we want to investigate earthquakes, we can use the exponential distribution to model the period of time between now and the next earthquake. Its probability density decays exponentially, so it appears on a graph as a downward-sloping curve.
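Two of these continuous distributions can be explored with Python's standard library alone. The sketch below assumes IQ scores scaled to a mean of 100 and a standard deviation of 15, and a purely hypothetical rate of one earthquake every 10 years. The t-distribution is left out because the standard library does not provide one.

```python
from math import exp
from statistics import NormalDist

# Normal distribution: IQ scores, assuming mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)
p_within_one_sd = iq.cdf(115) - iq.cdf(85)
print(p_within_one_sd)  # about 0.68


def exponential_cdf(t, rate):
    """P(waiting time <= t) when events occur at a constant average rate."""
    if t < 0:
        return 0.0
    return 1 - exp(-rate * t)


# Exponential distribution: hypothetical rate of 0.1 earthquakes per year.
p_quake_within_5_years = exponential_cdf(5, rate=0.1)
print(p_quake_within_5_years)  # about 0.39
```

For a continuous variable, probabilities are read off as areas under the curve over an interval, which is why the example uses the cumulative distribution function rather than the probability of a single exact value.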
Conclusion
You can see from the examples above how data scientists can use probability to learn more about data and answer questions. Knowing and understanding the likelihood of an event occurring is incredibly useful for data scientists, and it can greatly improve decision-making.
You will be working with data all the time, and you must learn more about it before starting any type of analysis. Looking at the data distribution can provide you with a wealth of information that you can use to tailor your workflow, process, and model to the data.
This saves you time in understanding the data, creates a more efficient workflow, and results in more accurate outputs.
Several data science principles are founded on the fundamentals of probability.