DATA- refers to facts and statistics collected together for reference or analysis.
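1)Qualitative-describes categories or qualities; can't be measured numerically
a)Nominal-categories with no natural order, e.g. male, female
b)Ordinal-categories with a natural order, e.g. bad, good, best
2)Quantitative-numeric; can be measured or counted
a)Discrete-countable values, e.g. number of students in a class
b)Continuous-any value within a range, so infinitely many possible values, e.g. height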
STATISTICS- area of applied math concerned with data collection, analysis, interpretation and presentation.
Types Of Statistics
1)Descriptive
a)Measures of central tendency(mean,median,mode)
b)Measures of variability(range,variance,standard deviation,interquartile range)
2)Inferential
Mean: sum of all values/no. of items
Median: middle value when the data is sorted in ascending order (average of the two middle values when the count is even)
Mode: most frequently occurring item
Range: measure of how spread apart the values in a dataset are.
range=max-min
Interquartile range: measure of variability based on dividing a sorted dataset into quartiles; IQR=Q3-Q1
Variance: average squared deviation of the values from their mean; shows how much a random variable differs from its expected value.
Standard Deviation: square root of the variance; measures the dispersion of a set of data from its mean.
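The descriptive measures above can be computed with Python's standard `statistics` module. This is a minimal sketch: the sample values are hypothetical, and the population forms of variance and standard deviation (divide by N) are assumed; the sample forms (`statistics.variance`, `statistics.stdev`) divide by N-1 instead.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)          # sum of values / number of values
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequently occurring value
value_range = max(data) - min(data)   # range = max - min

# With n=4, quantiles() returns the three cut points [Q1, Q2, Q3].
# The "inclusive" method treats the data as the whole population.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # interquartile range = Q3 - Q1

# Population variance and standard deviation (assumption: divide by N).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(mean, median, mode, value_range, iqr, variance, std_dev)
```

Other quartile conventions exist, so the IQR of a small dataset can differ slightly between tools.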
INFORMATION GAIN AND ENTROPY
Entropy: measure of uncertainty or impurity in the data
Information gain: measure of how much information a particular feature or variable gives about the final outcome; the reduction in entropy after splitting the data on that feature.
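A small sketch of both quantities, assuming Shannon entropy in bits and information gain defined as the label entropy minus the weighted entropy of the groups produced by a feature. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: outcome "play" and two candidate features.
play = ["yes", "yes", "no", "no"]
weather = ["sunny", "sunny", "rainy", "rainy"]   # separates the outcome perfectly
weekday = ["mon", "tue", "mon", "tue"]           # tells us nothing about the outcome

print(entropy(play))
print(information_gain(play, weather))
print(information_gain(play, weekday))
```

A feature with higher information gain is the better choice for a split, which is how decision trees typically use it.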
CONFUSION MATRIX
Matrix used to describe the performance of a classification model on a set of test data for which the true values are known; it tabulates predicted classes against actual classes.
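For a two-class problem the matrix has four cells: true positives, false positives, false negatives and true negatives. The sketch below counts them by hand; the function name, the labels and the choice of "yes" as the positive class are assumptions made for illustration.

```python
def confusion_matrix(actual, predicted, positive="yes"):
    """Count TP, FP, FN and TN for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical true values and model predictions on six test rows.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)   # share of correct predictions

print(cm, accuracy)
```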
PROBABILITY
Ratio of desired outcomes to total outcomes, when all outcomes are equally likely; always between 0 and 1
PROBABILITY DISTRIBUTION
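1)Probability density function: equation describing a continuous probability distribution; the area under the curve between two values gives the probability of falling in that range
2)Normal distribution: symmetric, bell-shaped continuous probability distribution described by its mean and standard deviation; it associates each value of the normal random variable X with a cumulative probability
3)Central limit theorem: the sampling distribution of the mean of independent random variables will be normal or nearly normal if the sample size is large enough.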
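The three ideas above can be explored with the standard library (`statistics.NormalDist` needs Python 3.8+). This is a sketch under stated assumptions: the uniform(0, 1) source distribution, the sample size of 30, the 2000 repetitions and the seed are all arbitrary choices for illustration.

```python
import random
import statistics
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)
density_at_mean = standard_normal.pdf(0)     # height of the density curve at the mean
prob_below_mean = standard_normal.cdf(0)     # cumulative probability P(X <= 0)
# Area under the density between -1 and 1, i.e. P(-1 <= X <= 1).
prob_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# Central limit theorem: means of samples drawn from a non-normal
# distribution (here uniform on [0, 1)) are themselves close to normal.
random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n independent uniform(0, 1) draws."""
    return statistics.mean([random.random() for _ in range(n)])

sample_means = [sample_mean(30) for _ in range(2000)]
centre = statistics.mean(sample_means)       # should be close to 0.5
spread = statistics.stdev(sample_means)

# For a normal distribution roughly 68% of values lie within one
# standard deviation of the mean; the sample means should behave similarly.
share_within_one_sd = sum(
    abs(m - centre) <= spread for m in sample_means
) / len(sample_means)

print(prob_within_one_sd, centre, spread, share_within_one_sd)
```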
TYPES OF PROBABILITY
1)Marginal probability: probability of occurrence of a single event, P(A)
2)Joint probability: measure of two events happening at the same time, P(A and B)
3)Conditional probability: probability of an event given that a previous event or outcome has occurred, P(A|B)=P(A and B)/P(B)
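The three types can be read off a table of counts. The customer counts below are made up for illustration, and `probability` is a hypothetical helper; `Fraction` is used so the results stay exact.

```python
from fractions import Fraction

# Hypothetical counts of 100 customers by (membership, purchase).
counts = {
    ("member", "buys"): 30,
    ("member", "no"): 10,
    ("guest", "buys"): 20,
    ("guest", "no"): 40,
}
total = sum(counts.values())

def probability(condition):
    """Share of all customers whose (membership, purchase) pair satisfies condition."""
    matching = sum(c for key, c in counts.items() if condition(key))
    return Fraction(matching, total)

# Marginal: one event on its own.
p_member = probability(lambda key: key[0] == "member")
p_buys = probability(lambda key: key[1] == "buys")

# Joint: both events together.
p_member_and_buys = probability(lambda key: key == ("member", "buys"))

# Conditional: P(buys | member) = P(member and buys) / P(member).
p_buys_given_member = p_member_and_buys / p_member

print(p_member, p_buys, p_member_and_buys, p_buys_given_member)
```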
BAYES' THEOREM(USED IN NAIVE BAYES)
Shows the relation between a conditional probability and its inverse: P(A|B)=P(B|A)*P(A)/P(B)
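A worked example of the theorem, using a spam-filter setting. Every number here is an assumption made for illustration, and `bayes` is a hypothetical helper name.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical inputs.
p_spam = 0.2               # P(spam): prior share of spam mail
p_word_given_spam = 0.5    # P(word | spam)
p_word_given_ham = 0.05    # P(word | not spam)

# Total probability of seeing the word in any mail.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Inverse conditional: P(spam | word).
p_spam_given_word = bayes(p_word_given_spam, p_spam, p_word)

print(round(p_spam_given_word, 3))
```

Seeing the word raises the probability of spam from the prior 0.2 to about 0.71 under these assumed numbers.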
INFERENTIAL STATISTICS
POINT ESTIMATION
Uses a sample of data from the total population to give an approximate value of a population parameter
a)Point estimate: a single value used as the estimate, e.g. the sample mean
b)Interval estimate: a range of values used as the estimate
Margin Of Error
It is the greatest possible distance between the point estimate and the value of the parameter it is estimating, at a chosen confidence level.
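A sketch tying the point estimate, the interval estimate and the margin of error together for a sample mean. It assumes a 95% confidence level with the normal-approximation multiplier z=1.96; for a sample this small a t-distribution multiplier would usually be preferred. The sample values are hypothetical.

```python
import math
import statistics

# Hypothetical sample drawn from a larger population.
sample = [48, 52, 50, 47, 53, 49, 51, 50]
n = len(sample)

# Point estimate: a single value for the population mean.
point_estimate = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
std_err = statistics.stdev(sample) / math.sqrt(n)

# Margin of error at ~95% confidence (assumption: z = 1.96).
z = 1.96
margin_of_error = z * std_err

# Interval estimate: point estimate plus or minus the margin of error.
interval = (point_estimate - margin_of_error, point_estimate + margin_of_error)

print(point_estimate, round(margin_of_error, 3), interval)
```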
HYPOTHESIS TESTING
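Formal procedure used to check whether sample data gives enough evidence to reject a hypothesis.
Null hypothesis: the default assumption (no effect or no difference); it is kept unless the data contradicts it
Alternate hypothesis: the claim that contradicts the null hypothesis; it is supported when the null hypothesis is rejected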
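A minimal sketch of one test, assuming a coin-toss setting: the null hypothesis is that the coin is fair, the alternate hypothesis is that it favours heads, and the significance level is 0.05. The observed data (9 heads in 10 tosses) and the function name are hypothetical.

```python
from math import comb

def binomial_p_value(heads, tosses, p=0.5):
    """One-sided p-value: probability of at least `heads` heads if the null hypothesis holds."""
    return sum(
        comb(tosses, k) * p**k * (1 - p) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )

# Hypothetical observation: 9 heads out of 10 tosses.
p_value = binomial_p_value(9, 10)

alpha = 0.05                     # assumed significance level
reject_null = p_value < alpha    # True means the data contradicts the fair-coin assumption

print(p_value, reject_null)
```

A result this extreme would occur in about 1% of experiments with a fair coin, so under these assumptions the null hypothesis is rejected.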
BASICS OF MACHINE LEARNING
Need for ML
Increase in data generation
Improve decision making
Uncover patterns & trends in data
Solve complex problems
WHAT IS ML?
A computer program is said to learn from Experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with Experience E.
-provides machines the ability to learn automatically & improve without being explicitly programmed.
Algorithm: A set of rules and statistical techniques used to learn patterns from data
Model: the result of training an ML algorithm on data; it maps inputs to predictions
Predictor variable: feature of the data that can be used to predict the output
Response variable: variable that needs to be predicted
Training data: data used to build (train) the model
Testing data: data held back from training and used to evaluate the model
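The terms above fit together as in the sketch below. Everything in it is an assumption for illustration: the data (hours studied as predictor, exam score as response, generated from score=10*hours+20), the 80/20 split, and the choice of least-squares line fitting as the algorithm.

```python
import random

random.seed(1)  # fixed seed so the split is repeatable

# Hypothetical data: predictor variable (hours) and response variable (score).
hours = list(range(1, 11))
scores = [10 * h + 20 for h in hours]

# Shuffle the rows, then keep 80% for training and 20% for testing.
rows = list(zip(hours, scores))
random.shuffle(rows)
split = int(0.8 * len(rows))
training_data, testing_data = rows[:split], rows[split:]

def fit_line(data):
    """Algorithm: ordinary least squares for one predictor. Returns (slope, intercept)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Model: the parameters learned from the training data.
slope, intercept = fit_line(training_data)

def predict(x):
    """Use the trained model to predict the response for a predictor value."""
    return slope * x + intercept

# Performance measure: mean absolute error on the unseen testing data.
mean_abs_error = sum(abs(predict(x) - y) for x, y in testing_data) / len(testing_data)

print(slope, intercept, mean_abs_error)
```

Because the toy data is exactly linear, the test error is essentially zero; real data would leave a nonzero error.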
ML PROCESS
building a predictive model that can be used to find a solution