The main aspects that must be considered are:
- Summarize the main characteristics of the data.
- Uncover the relationships between the variables.
- Understand the dataset.
- Extract the most important variables.
- Finally, try to figure out which feature has the main impact; for example, if we have a car-price dataset, the first question should be: "Which characteristic has the main impact on the car price?"
To cover all these aspects of EDA, we must take a look at:
- Descriptive Statistics
- Describes the basic features of the dataset
The Code:
* df.describe()
which gives you the count, mean, standard deviation, min/max, and quartiles of each numeric variable. It skips NaN values, so if this method is applied to a DataFrame containing NaN values, those values are automatically excluded.
* category_count = df["name"].value_counts().to_frame()
* category_count.rename(columns={'name': 'name_count'}, inplace=True)
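As a runnable sketch of both snippets (using a small hypothetical DataFrame in place of the car dataset; the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small hypothetical DataFrame standing in for the car dataset
df = pd.DataFrame({
    "price":        [10000.0, 12000.0, np.nan, 20000.0],
    "drive_wheels": ["fwd", "fwd", "rwd", "4wd"],
})

# describe() summarizes the numeric columns and skips NaN values:
# the count for "price" is 3, not 4
summary = df.describe()
print(summary)

# value_counts() tallies how often each category appears;
# to_frame() turns the resulting Series into a DataFrame
category_count = df["drive_wheels"].value_counts().to_frame()
print(category_count)
```

Note that the method is `value_counts()` (with a final "s") and `to_frame()` must be called with parentheses, otherwise you get the method object instead of a DataFrame.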
To visualize the result we use a box plot.
Code:
sns.boxplot(x="drive_wheel", y="price", data=df)
Scatter plot:
x-axis: predictor/independent variable
y-axis: target/dependent variable
Code:
y = df["price"]
x = df["engin_size"]
plt.scatter(x,y)
plt.title("ScatterPlot engin_size vs price")
plt.ylabel("price")
plt.xlabel("engin_size")
- Gives a short summary of the sample and the measures of the dataset.
- Group BY
It is applied to categorical variables to group the data into categories, using a single variable or multiple variables.
The Code:
df_name = df[['col1', 'col2', 'col3']]
df_group = df_name.groupby(['col1', 'col2'], as_index=False).mean()
df_group
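A minimal runnable version of the group-by step, with hypothetical column names and values standing in for the car dataset:

```python
import pandas as pd

# Hypothetical car data for illustration
df = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body_style":   ["sedan", "sedan", "sedan", "hatchback"],
    "price":        [10000.0, 12000.0, 20000.0, 18000.0],
})

# Group by two categorical variables and average the price in each
# group; as_index=False keeps the grouping columns as regular columns
df_group = df.groupby(["drive_wheels", "body_style"], as_index=False).mean()
print(df_group)
```

Each row of `df_group` is one (drive_wheels, body_style) combination with the mean price of that group.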
*Pivot()
One variable is displayed along the columns and the other variable along the rows.
df_pivot = df_group.pivot(index='col1', columns='col2')
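A runnable sketch of the pivot step, continuing the hypothetical grouped result from above (note the keyword is `columns`, not `column`):

```python
import pandas as pd

# Hypothetical grouped result with one row per category pair
df_group = pd.DataFrame({
    "drive_wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body_style":   ["sedan", "hatchback", "sedan", "hatchback"],
    "price":        [11000.0, 9000.0, 20000.0, 18000.0],
})

# pivot() spreads one categorical variable along the rows (index)
# and the other along the columns
df_pivot = df_group.pivot(index="drive_wheels", columns="body_style")
print(df_pivot)
```

The result is a rectangular table of prices, one row per drive-wheels category and one column per body style, which is the shape the heatmap below expects.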
*Heatmap
Plots the target variable over multiple variables.
The Code:
plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
It summarizes the categorical data at a glance.
Correlation
Measures to what extent different variables are interdependent: if one variable changes, how does that impact another variable?
Examples: (lung cancer, smoking), (rain, umbrella).
- Positive Linear Relationship: the slope of the fitted line is positive
sns.regplot(x="col1", y="col2", data=dataframe_name)
plt.ylim(0,)
- Negative Linear Relationship: the slope of the fitted line is negative
Correlation Statistics
Pearson Correlation: measures the strength of the correlation between two features.
- Correlation coefficient: close to +1, large positive relationship; close to -1, large negative relationship; 0, no relationship.
- P-value: p_value < 0.001, strong certainty in the result; < 0.05, moderate certainty; < 0.1, weak certainty; > 0.1, no certainty in the result.
Strong correlation: correlation coefficient close to +1 or -1, and p_value < 0.001.
The Code:
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
Pearson correlation: 0.81
p-value: 9.35e-48, so the relationship is strong.
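A self-contained sketch of the same test on synthetic data (the horsepower/price values here are generated, not the real dataset, and are constructed so that price rises with horsepower):

```python
import numpy as np
from scipy import stats

# Synthetic horsepower values and a price that grows linearly
# with horsepower plus a little random noise
rng = np.random.default_rng(0)
horsepower = np.linspace(50, 250, 30)
price = 100 * horsepower + rng.normal(0, 500, 30)

# pearsonr returns the correlation coefficient and the p-value
pearson_coef, p_value = stats.pearsonr(horsepower, price)
print(pearson_coef, p_value)
```

Because the synthetic relationship is almost perfectly linear, the coefficient comes out close to +1 and the p-value is far below 0.001, i.e. a strong positive correlation by the criteria above.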
When correlation is visualized as a heatmap, the figure shows a diagonal line of same-colored cells, since each variable is perfectly correlated with itself.
Association between two categorical variables (test for association): Chi-Square
The Code:
scipy.stats.chi2_contingency(cont_table, correction=True)
It returns the test statistic, the p-value, the degrees of freedom, and the table of expected counts under the assumption of independence.
The null hypothesis of the Chi-square test is that the two categorical variables are independent.
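A runnable sketch with a hypothetical 2x2 contingency table (the counts below are invented for illustration, e.g. drive wheels vs. body style):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table of observed counts for two
# categorical variables (rows = one variable, columns = the other)
cont_table = np.array([[30, 10],
                       [15, 45]])

# chi2_contingency returns the test statistic, the p-value, the
# degrees of freedom, and the expected counts under independence
chi2, p, dof, expected = chi2_contingency(cont_table, correction=True)
print(chi2, p, dof)
print(expected)
```

A small p-value (here well below 0.001) rejects the null hypothesis, i.e. the two categorical variables appear to be associated.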