
Exploratory Data Analysis (EDA)

The main aspects to consider are:

- Summarize the main characteristics of the data.

- Uncover the relationships between the variables.

- Understand the dataset.

- Extract the most important variables.

- Finally, try to figure out which feature has the greatest impact. For example, with a car-price dataset the first question should be: "Which characteristic has the greatest impact on the car price?"


To cover all these aspects in EDA we need to take a look at:

  • Descriptive Statistics

- Describes the basic features of the dataset.

The Code: 

* df.describe() 

which gives you the count, mean, standard deviation, and the distribution (min, quartiles, max) of each numeric variable. It skips NaN values, so if this method is applied to a DataFrame with NaN values, those values are excluded.

* category_count = df["name"].value_counts().to_frame()

* category_count.rename(columns={'name':'name_count'}, inplace=True)
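Taken together, the descriptive-statistics steps above can be sketched on a toy DataFrame (the column names and values here are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical mini car dataset (made-up values, for illustration)
df = pd.DataFrame({
    "drive_wheels": ["fwd", "rwd", "fwd", "4wd", "fwd"],
    "price": [13950, 16500, 13495, 17450, 15250],
})

# Count, mean, std, min, quartiles, max for numeric columns; NaNs are skipped
summary = df.describe()
print(summary)

# How many rows fall into each drive_wheels category
category_count = df["drive_wheels"].value_counts().to_frame()
print(category_count)
```

Note that the actual pandas method is `value_counts()` (plural), and `to_frame()` must be called with parentheses.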

To visualize the result we can use a box plot.



Box Plot
Code: 

sns.boxplot(x="drive_wheel", y="price", data=df)


Scatter plot: 

x-axis: predictor/independent variable

y-axis: target/dependent variable

Code:

y = df["price"]

x = df["engin_size"]

plt.scatter(x,y)

plt.title("ScatterPlot engin_size vs price")

plt.ylabel("price")

plt.xlabel("engin_size")

- Gives a short summary of the sample and the measures of the dataset.

- Group By

Applied to categorical variables to group the data into categories, using a single variable or multiple variables.

The Code:

df_name = df[['col1', 'col2', 'col3']]

df_group = df_name.groupby(['col1', 'col2'], as_index=False).mean()

df_group
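As a concrete sketch of the group-by step (the column names `drive_wheels`, `body_style`, and `price` are assumed here, mirroring a typical car-price dataset):

```python
import pandas as pd

# Hypothetical subset of a car-price dataset
df_name = pd.DataFrame({
    "drive_wheels": ["fwd", "rwd", "fwd", "rwd"],
    "body_style":   ["sedan", "sedan", "hatchback", "convertible"],
    "price":        [10000.0, 22000.0, 9000.0, 30000.0],
})

# Average price for every (drive_wheels, body_style) combination;
# as_index=False keeps the group keys as ordinary columns
df_group = df_name.groupby(["drive_wheels", "body_style"], as_index=False).mean()
print(df_group)
```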


* Pivot()

One variable is displayed along the columns and the other along the rows.

df_pivot = df_group.pivot(index='col1', columns='col2')
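A minimal sketch of the pivot step, reshaping a grouped result so one categorical variable runs down the rows and the other across the columns (names and values are hypothetical):

```python
import pandas as pd

# Grouped means, as produced by a groupby(..., as_index=False).mean() call
df_group = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body_style":   ["hatchback", "sedan", "convertible", "sedan"],
    "price":        [9000.0, 10000.0, 30000.0, 22000.0],
})

# drive_wheels along the rows, body_style along the columns
df_pivot = df_group.pivot(index="drive_wheels", columns="body_style")
print(df_pivot)
```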

* Heatmap

Plots the target variable over multiple variables.

The Code:

plt.pcolor(df_pivot, cmap='RdBu')

plt.colorbar()

plt.show()

The heatmap summarizes the target variable by category in a color-coded figure.

Correlation

Measures to what extent different variables are interdependent: if one variable changes, how does that impact the other variable?

Examples: (lung cancer, smoking), (rain, umbrella)

- Positive linear relationship: the slope of the fitted line is positive.

sns.regplot(x="col1", y="col2", data=dataframe_name)

plt.ylim(0, )

- Negative linear relationship: the slope is negative.


Correlation Statistics

Pearson correlation: measures the strength of the correlation between two features.

     - Correlation coefficient: close to +1, large positive relationship; close to -1, large negative relationship; 0, no relationship.

     - P-value: p-value < 0.001, strong certainty in the result; < 0.05, moderate certainty; < 0.1, weak certainty; > 0.1, no certainty in the result.

A strong correlation means a correlation coefficient close to +1 or -1 and a p-value < 0.001.

The Code:

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

Pearson correlation: 0.81

p-value: 9.35e-48, so the relationship is strong.

In a correlation heatmap, the diagonal of the figure is all the same color, since each variable is perfectly correlated with itself.
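The Pearson calculation can be sketched on synthetic data (the strong linear relationship is built in on purpose, so the coefficient comes out close to +1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 250, size=100)
# Synthetic prices: a strong linear dependence on horsepower plus noise
price = 100 * horsepower + rng.normal(0, 2000, size=100)

pearson_coef, p_value = stats.pearsonr(horsepower, price)
# Expect a coefficient near +1 and a p-value far below 0.001
print(pearson_coef, p_value)
```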


Association between two categorical variables (test for association): Chi-Square

The Code: 

scipy.stats.chi2_contingency(cont_table, correction=True)

Besides the test statistic and p-value, it also returns the expected frequencies under the independence assumption.

The null hypothesis in the Chi-square test is that the two categorical variables are independent.
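A short sketch of the Chi-square test on a made-up contingency table (the counts are invented to show a clear association):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = smoker / non-smoker,
# columns = disease / no disease (invented counts)
cont_table = np.array([[30, 70],
                       [10, 90]])

chi2, p, dof, expected = chi2_contingency(cont_table, correction=True)
# A small p-value lets us reject the null hypothesis of independence
print(chi2, p, dof)
print(expected)  # expected frequencies under independence
```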



About Inas AL-Kamachy
