
Exploratory Data Analysis (EDA)

The main aspects to consider are:

- Summarize the main characteristics of the data.

- Uncover the relationships between the variables.

- Understand the dataset.

- Extract the most important variables.

- Finally, try to figure out which feature has the greatest impact. For example, with a car-price dataset the first question should be: "Which characteristic has the greatest impact on the car price?"


To cover all these aspects in EDA we need to take a look at:

  • Descriptive Statistics

- Describes the basic features of the dataset.

The Code: 

* df.describe() 

which gives you the count, mean, standard deviation, and the distribution (min, quartiles, max) of each numeric variable. It skips NaN values, so if this method is applied to a DataFrame with NaN values, those values are excluded.

* category_count = df["name"].value_counts().to_frame()

* category_count.rename(columns={'name':'name_count'}, inplace=True)
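Taken together, the descriptive-statistics steps above can be sketched on a toy DataFrame (the column names and values here are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical mini car dataset (made-up values, for illustration)
df = pd.DataFrame({
    "drive_wheels": ["fwd", "rwd", "fwd", "4wd", "fwd"],
    "price": [13950, 16500, 13495, 17450, 15250],
})

# Count, mean, std, min, quartiles, max for numeric columns; NaNs are skipped
summary = df.describe()
print(summary)

# How many rows fall into each drive_wheels category
category_count = df["drive_wheels"].value_counts().to_frame()
print(category_count)
```

Note that the actual pandas method is `value_counts()` (plural), and `to_frame()` must be called with parentheses.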

To visualize the result we can use a box plot.



Box Plot
Code: 

sns.boxplot(x="drive_wheel", y="price", data=df)


Scatter plot: 

x-axis: predictor/independent variable

y-axis: target/dependent variable

Code:

y = df["price"]

x = df["engin_size"]

plt.scatter(x,y)

plt.title("ScatterPlot engin_size vs price")

plt.ylabel("price")

plt.xlabel("engin_size")

- Gives a short summary of the sample and the measures of the dataset.

- Group By

Applied to categorical variables to group the data into categories, using a single variable or multiple variables.

The Code:

df_name = df[['col1', 'col2', 'col3']]

df_group = df_name.groupby(['col1', 'col2'], as_index=False).mean()

df_group
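As a concrete sketch of the group-by step (the column names `drive_wheels`, `body_style`, and `price` are assumed here, mirroring a typical car-price dataset):

```python
import pandas as pd

# Hypothetical subset of a car-price dataset
df_name = pd.DataFrame({
    "drive_wheels": ["fwd", "rwd", "fwd", "rwd"],
    "body_style":   ["sedan", "sedan", "hatchback", "convertible"],
    "price":        [10000.0, 22000.0, 9000.0, 30000.0],
})

# Average price for every (drive_wheels, body_style) combination;
# as_index=False keeps the group keys as ordinary columns
df_group = df_name.groupby(["drive_wheels", "body_style"], as_index=False).mean()
print(df_group)
```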


* Pivot()

One variable is displayed along the columns and the other along the rows.

df_pivot = df_group.pivot(index='col1', columns='col2')
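A minimal sketch of the pivot step, reshaping a grouped result so one categorical variable runs down the rows and the other across the columns (names and values are hypothetical):

```python
import pandas as pd

# Grouped means, as produced by a groupby(..., as_index=False).mean() call
df_group = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body_style":   ["hatchback", "sedan", "convertible", "sedan"],
    "price":        [9000.0, 10000.0, 30000.0, 22000.0],
})

# drive_wheels along the rows, body_style along the columns
df_pivot = df_group.pivot(index="drive_wheels", columns="body_style")
print(df_pivot)
```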

* Heatmap

Plots the target variable over multiple variables.

The Code:

plt.pcolor(df_pivot, cmap='RdBu')

plt.colorbar()

plt.show()

The heatmap summarizes the target variable by category in a color-coded figure.

Correlation

Measures to what extent different variables are interdependent: if one variable changes, how does that impact the other variable?

Examples: (lung cancer, smoking), (rain, umbrella)

- Positive linear relationship: the slope of the fitted line is positive.

sns.regplot(x="col1", y="col2", data=dataframe_name)

plt.ylim(0, )

- Negative linear relationship: the slope is negative.


Correlation Statistics

Pearson correlation: measures the strength of the correlation between two features.

     - Correlation coefficient: close to +1, large positive relationship; close to -1, large negative relationship; 0, no relationship.

     - P-value: p-value < 0.001, strong certainty in the result; < 0.05, moderate certainty; < 0.1, weak certainty; > 0.1, no certainty in the result.

A strong correlation means a correlation coefficient close to +1 or -1 and a p-value < 0.001.

The Code:

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

Pearson correlation: 0.81

p-value: 9.35e-48, so the relationship is strong.

In a correlation heatmap, the diagonal of the figure is all the same color, since each variable is perfectly correlated with itself.
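The Pearson calculation can be sketched on synthetic data (the strong linear relationship is built in on purpose, so the coefficient comes out close to +1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 250, size=100)
# Synthetic prices: a strong linear dependence on horsepower plus noise
price = 100 * horsepower + rng.normal(0, 2000, size=100)

pearson_coef, p_value = stats.pearsonr(horsepower, price)
# Expect a coefficient near +1 and a p-value far below 0.001
print(pearson_coef, p_value)
```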


Association between two categorical variables (test for association): Chi-Square

The Code: 

scipy.stats.chi2_contingency(cont_table, correction=True)

Besides the test statistic and p-value, it also returns the expected frequencies under the independence assumption.

The null hypothesis in the Chi-square test is that the two categorical variables are independent.
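A short sketch of the Chi-square test on a made-up contingency table (the counts are invented to show a clear association):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = smoker / non-smoker,
# columns = disease / no disease (invented counts)
cont_table = np.array([[30, 70],
                       [10, 90]])

chi2, p, dof, expected = chi2_contingency(cont_table, correction=True)
# A small p-value lets us reject the null hypothesis of independence
print(chi2, p, dof)
print(expected)  # expected frequencies under independence
```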



About Inas AL-Kamachy
