读入csv文件,一般用于返回~
供快速参考。
获取总行数、每个属性的类型、非空值的数量。
housing["ocean_proximity"].value_counts()
# 输出
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
设置最大输出行数
pd.set_option('max_rows', 7)
例如:总数、平均值、标准差、最大值最小值、25%/50%/75%值
hist()方法依赖于Matplotlib,而Matplotlib又依赖于一个用户指定的图形后端去在你的屏幕上绘制。在Jupyter notebook中可用“%matplotlib inline”告诉Jupyter安装 Matplotlib使用时,使用Jupyter拥有的后端。
Jupyter中,show()方法是可选的,因为如果有图形需要输出,它会自动绘制。
%matplotlib inline # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots")
plt.show()
strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
- cond 如果为True则保持原始值,若为False则使用第二个参数other替换值。
- other 替换的目标值
- inplace 是否在数据上执行操作
具有标签轴,且算术运算在行和列标签上对齐。
compare_props = pd.DataFrame({
"Overall": income_cat_proportions(housing),
"Stratified": income_cat_proportions(strat_test_set),
"Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
返回删除了请求轴标签的新对象。
for set_ in (strat_train_set, strat_test_set):
set_.drop("income_cat", axis=1, inplace=True)
- labels 索引或列标签
- axis 从索引(0),还是列(1)中删除
- inplace 若为True则在原数据执行操作
- method: 可选 {‘pearson’, ‘kendall’, ‘spearman’}
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- min_periods: Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
# 计算标准相关系数
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
#输出:
# median_house_value 1.000000
# median_income 0.687160
# total_rooms 0.135097
# housing_median_age 0.114110
# households 0.064506
# total_bedrooms 0.047689
# population -0.026920
# longitude -0.047432
# latitude -0.142724
# Name: median_house_value, dtype: float64
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")
Return object with labels on given axis omitted where alternately any or all of the data are missing
sample_incomplete_rows.dropna(subset=["total_bedrooms"])
# 用中位数填充
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)
housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 输出
# 17606 <1H OCEAN
# 18632 <1H OCEAN
# 14650 NEAR OCEAN
# 3230 INLAND
# 3555 <1H OCEAN
# 19480 INLAND
# 8879 <1H OCEAN
# 13685 INLAND
# 4937 <1H OCEAN
# 4861 <1H OCEAN
# Name: ocean_proximity, dtype: object
housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# 输出
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)