## Missing Data and Outliers Analysis

- Describe the missing data and provide a summary similar to the analysis in Chapter 2 (Table 3): count and percent of missing data per variable, type of missing data (NA, null), and total percent of missingness per dataset. [20pts]
  - If you find that there is no missing data, you still have to report your findings. Your reader does not know your data, and you have to show how reliable your data is.
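The per-variable summary described above could be built along these lines. This is a minimal base R sketch, not a required implementation; the data frame `df` and its columns are hypothetical stand-ins for your own data.

```r
# Hypothetical example data; replace `df` with your own data frame.
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(50, 62, NA, 48, 55),
  group  = c("a", "b", NA, "a", "b"),
  stringsAsFactors = FALSE
)

# Count and percent of missing values per variable.
# Note: is.na() catches NA and NaN; empty strings or codes such as -999
# are NOT counted and would need to be recoded to NA first.
missing_count   <- colSums(is.na(df))
missing_percent <- round(100 * missing_count / nrow(df), 1)

missing_summary <- data.frame(
  variable    = names(df),
  n_missing   = as.vector(missing_count),
  pct_missing = as.vector(missing_percent)
)
print(missing_summary)

# Total percent of missing cells across the whole dataset.
total_pct_missing <- round(100 * sum(is.na(df)) / (nrow(df) * ncol(df)), 1)
print(total_pct_missing)
```

If every count is zero, the same table still documents that finding for your reader.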

- Plot a visualization of the missing data pattern. [20pts]
  - If you do not have missing data, you still need to plot it to show it to your reader.
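One possible way to draw the missing data pattern with base R graphics is sketched below. The data frame `df` is hypothetical, and dedicated packages offer richer plots; this version only assumes base R.

```r
# Hypothetical example data; replace `df` with your own data frame.
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(50, 62, NA, 48, 55),
  score  = c(1.2, 3.4, 2.2, NA, 0.7)
)

# 1 where a cell is missing, 0 otherwise
# (rows = observations, columns = variables).
miss <- is.na(df) * 1

# One column per variable, one row per observation.
# zlim is fixed so the colours stay correct even when nothing is missing.
image(
  x = seq_len(ncol(miss)),
  y = seq_len(nrow(miss)),
  z = t(miss),
  zlim = c(0, 1),
  col = c("grey90", "firebrick"),
  axes = FALSE,
  xlab = "Variable",
  ylab = "Observation",
  main = "Missing data pattern"
)
axis(1, at = seq_len(ncol(miss)), labels = colnames(miss))
axis(2, at = seq_len(nrow(miss)))
box()
legend("topright", legend = c("observed", "missing"),
       fill = c("grey90", "firebrick"), bg = "white")
```

With no missing values the plot is a single uniform colour, which is itself the evidence your reader needs.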

- Describe what type of missing data you have observed: MCAR, MAR, MNAR, or no missing data. [10pts]
  - If you do not have missing data, simply state that you did not observe any missing values.

- Select the imputation method that you will perform and explain why [20pts]:
  - Choose among list-wise/pair-wise deletion, mean imputation, regression imputation, etc. [10pts]
  - Perform the imputation and provide data statistics. For example, if you perform list-wise deletion, report how many observations you will use for subsequent analysis. If you perform regression imputation, provide a statistical summary. [10pts]
  - If you do not have any missing data: instead of imputation, perform data normalization. Describe which methods you will use and why, then perform the normalization. Provide a statistical summary of the data after normalization.
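As an illustration of these options, the sketch below shows list-wise deletion, mean imputation, and z-score normalization in base R. The data frame `df` is hypothetical, and the choice of `scale()` for normalization is only one of several reasonable methods; pick and justify the one that suits your data.

```r
# Hypothetical example data; replace `df` with your own data frame.
df <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(10, NA, 30, 40, 50)
)

# Option A: list-wise deletion keeps only rows with no missing values.
df_complete <- na.omit(df)
print(nrow(df_complete))  # observations left for subsequent analysis

# Option B: mean imputation replaces each NA with its column mean.
df_mean <- df
for (v in names(df_mean)) {
  if (is.numeric(df_mean[[v]])) {
    col_mean <- mean(df_mean[[v]], na.rm = TRUE)
    df_mean[[v]][is.na(df_mean[[v]])] <- col_mean
  }
}
print(summary(df_mean))

# No missing data: z-score normalization (centre to mean 0, scale to sd 1).
# Applied to the complete rows here purely for illustration.
df_scaled <- as.data.frame(scale(df_complete))
print(summary(df_scaled))
```

Mean imputation preserves the column means but shrinks the variance, which is worth mentioning when you explain your choice.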

- Outlier analysis:

Option 1.

__Continuous Y:__ You will perform outlier detection using the Z-score, a parametric outlier detection method in one-dimensional space. [30pts]

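- Z-Score Formula: $z = \dfrac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ is the standard deviation of the variable.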

- If the absolute Z-score of a data point exceeds 3 (|z| > 3), it is an extreme value (outlier).

- You can use the Z-score from the `outliers` library or build your own function:

`(Measure - mean(Measure)) / sd(Measure)`

- You can test for outliers only for your dependent variable Y (continuous). How many outliers do you have?
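A minimal sketch of the "build your own function" route follows. The vector `Measure` holds made-up values, and `z_score` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical dependent variable Y; replace with your own column.
Measure <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
             12, 10, 11, 13, 12, 11, 10, 12, 11, 60)

# Z-score: distance from the mean in units of standard deviation.
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z <- z_score(Measure)

# Flag observations whose absolute Z-score exceeds 3.
outlier_idx <- which(abs(z) > 3)
n_outliers  <- length(outlier_idx)

print(outlier_idx)           # positions of the flagged values
print(Measure[outlier_idx])  # the flagged values themselves
print(n_outliers)            # how many outliers were found
```

Because the mean and standard deviation are themselves pulled by extreme values, very small samples can hide outliers under this rule; report the sample size alongside the count.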

Option 2.

- See Chapter 13.3, Checking for Outliers (http://daviddalpiaz.github.io/appliedstats/model-diagnostics.html#). Compute Cook’s distance and identify how many outliers are present.
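Cook’s distance can be computed from a fitted linear model with the base R function `cooks.distance()`. The sketch below uses made-up data and the common `4 / n` cutoff as a heuristic; both the data and the cutoff are assumptions, so check which threshold your course material expects.

```r
# Hypothetical data: a clean linear trend with one influential point.
x <- 1:20
y <- 2 * x + rep(c(-0.5, 0.5), times = 10)
y[20] <- 5  # deliberately corrupt the last observation

fit <- lm(y ~ x)

# Cook's distance measures how much the fitted values change
# when each observation is left out.
cooks_d <- cooks.distance(fit)

# Heuristic cutoff (assumption): flag points with D > 4 / n.
threshold <- 4 / length(cooks_d)
flagged   <- which(cooks_d > threshold)

print(flagged)          # indices of influential observations
print(length(flagged))  # how many were flagged

# Visual check: one spike per observation, dashed line at the cutoff.
plot(cooks_d, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = threshold, lty = 2)
```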

Option 3.

__Categorical Y:__ You cannot perform outlier analysis on categorical data, but you can check the balance of your data. For example, for a binary male/female variable, report the percentage of each value and reflect on how an imbalance could bias your future analyses.
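Checking the balance of a categorical variable takes only a frequency table. The vector `sex` below is hypothetical example data.

```r
# Hypothetical categorical variable; replace with your own column.
sex <- c("male", "female", "male", "male", "female",
         "male", "male", "male", "female", "male")

# Counts per category, then percentages of the total.
counts <- table(sex)
pct    <- round(100 * prop.table(counts), 1)

print(counts)
print(pct)
```

A split far from even, such as the 70/30 split in this made-up sample, is worth discussing, since a model can look accurate simply by favouring the majority class.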

**Visual Exploration of Your Data**

1. Create a **univariate analysis** for your variable of interest (your Y variable). Calculate **skewness** and **kurtosis** and **describe the results**. [histogram, skewness values, kurtosis values, description - 10pts]
2. Create a bivariate **box plot** for your Y variable and one other important metric (your X). Describe the figure. [box plot, description - 10pts]
3. If your variables are continuous, create a **scatter plot** between your Y and your X. If your variables are categorical, create a **bar plot**. Describe the figure. [plot, description - 10pts]
4. Create a **multivariate plot**: use the same plot as in 3 but add another important variable using colored symbols. Describe the figure. Make sure to add a legend. [scatterplot, description - 10pts]
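The four plots can be sketched in base R as follows. The built-in `mtcars` dataset stands in for your data, with `mpg` as Y, `wt` as X, and `cyl` as the additional variable; these choices are assumptions for illustration only. The helpers `sample_skewness` and `sample_kurtosis` are hypothetical names implementing the moment-based definitions, under which a normal distribution has kurtosis 3.

```r
# Stand-in data (assumption): built-in mtcars, Y = mpg, X = wt, group = cyl.
y   <- mtcars$mpg
x   <- mtcars$wt
grp <- factor(mtcars$cyl)

# Moment-based skewness: third central moment over variance^(3/2).
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Moment-based kurtosis (not excess): fourth central moment over variance^2.
sample_kurtosis <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2
}

# 1. Univariate: histogram plus skewness and kurtosis of Y.
hist(y, main = "Distribution of Y", xlab = "mpg")
y_skew <- sample_skewness(y)
y_kurt <- sample_kurtosis(y)
print(c(skewness = y_skew, kurtosis = y_kurt))

# 2. Bivariate box plot: Y split by a grouping variable.
boxplot(y ~ grp, xlab = "cyl", ylab = "mpg", main = "Y by group")

# 3. Scatter plot of Y against a continuous X.
plot(x, y, xlab = "wt", ylab = "mpg", main = "Y vs X")

# 4. Multivariate: same scatter plot, coloured by a third variable, with legend.
cols <- seq_along(levels(grp))
plot(x, y, col = cols[as.integer(grp)], pch = 19,
     xlab = "wt", ylab = "mpg", main = "Y vs X by group")
legend("topright", legend = levels(grp), col = cols, pch = 19, title = "cyl")
```

Positive skewness indicates a longer right tail, and kurtosis above 3 indicates heavier tails than a normal distribution; state which applies when you describe the histogram.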