特征处理和生成

数值特征

scaling

基于树的模型不依赖scaling，非基于树的模型恰恰相反
当两个属性数量级的差距很大时，原来微小的距离，将变的很大，这对KNN、linear models有很大影响。
梯度下降法在没有适当放缩的情况下会变的很糟糕，由于这个原因，神经网络在特征预处理上与线性模型相似。
标准化不影响分布
在MinMaxScaling或StandardScaling转换之后，特性对非树模型的影响大致相同。

outliers离群点

离群点既可以出现在特征值X里，也可以在目标值y中，这会对模型产生影响
我们可以将特征值控制在两个设定的下界和上界之间，例如第一百分位数和99百分位数之间。

rank

例子：

rank([-100, 0, 1e5])  =>  [0,1,2]
rank([1000, 1, 10])   =>  [2,0,1]

这个转换可能比MinMaxScaler更好，因为秩转换将使异常值更接近其他对象
如果我们没有时间手动处理异常值，线性模型、KNN和神经网络可以从这种转换中获益
需要注意的是，它也需要被应用在测试集上，你可以合并后一起处理。
可以在scipy.stats.rankdata中找到

转换

Log transform:
np.log(1 + x)
Raising to the power < 1:
np.sqrt(x + 2/3)
这两种转换都是有用的，因为它会使大的值更接近特征的平均值，使接近零的值更容易区分。
有时候，在不同预处理产生的连接数据帧上训练模型，或者在混合模型上训练不同的预处理数据是有益的。
它能帮助非基于树的模型，例如线性模型、KNN特别是神经网络。

Feature generation

他是用关于特征的知识和任务来生成新特征，它让模型更简单有效。简单来说就是用先验知识、逻辑推理、直觉来创建新的特征。

房价上，知道面积和房价之后，可以创建'每平米'的价钱
在Forest Cover Type Prediction dataset上，可以对当前点到水源地建立不同的距离特征
还可以提取价格的小数部分，这可以区分消费概念。甚至可以借着这个区分是否为机器生成的异常数据，例如小数部分是0.212895..很长一串

Categorical and ordinal features（标签和顺序特征）

区别标签：无顺序上区别的，例如男、女
顺序特征：在顺序上有特别意义，例如小学、初中、大学，这是有递增关系的
将标签映射为数字
- Alphabetical(sorted):[S,C,Q] -> [2,1,3]
  sklearn.preprocessing.LabelEncoder
- Order od appearance:[S,C,Q] -> [1,2,3]
  Pandas.factorize

将标签替换为其频率，这代表了值的分布信息，可用于树模型和线性模型
[S,C,Q] -> [0.5,0.3,0.2]

encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Label and Frequency编码常用于树模型
One-hot编码常用于非树模型
标签的相互联结可以帮助线性模型和KNN

pclass	sex	pclass_sex
3	male	3male
1	femal	1femal
3	femal	3fmeal
1	femal	1femal

Datetime and coordinates

Datetime

Periodicity 周期性
在周、月、季节、年中的天数；秒、分、小时
Time since row-independent/dependent event
- 行无关时刻：since 00:00:00 UTC,1 January 1970
- 行相关重要时刻：离下一个假期天数、上一个假期过去天数

Data	week day	daynumber_since_year_2014	is_holiday	days_till_holidays	sales
01.01.14	5	0	True	0	1213
02.01.14	6	1	False	3	938
03.01.14	0	2	False	2	2448
04.01.14	1	3	False	1	1744
05.01.14	2	4	True	0	1732
06.01.14	3	5	False	9	1022

Difference between dates
datetime_feature_1 - datetime_feature_2

user_id	registration_date	last_purchase_date	last_call_date	date_diff	churn
14	10.02.2016	21.04.2016	26.04.2016	5	0
15	10.02.2016	03.06.2016	01.06.2016	-2	1
16	11.02.2016	11.01.2017	11.01.2017	1	1
20	12.02.2016	06.11.2016	08.02.2017	94	0

Coordinates

Interesting places from train/test data or additional data
使用各种距离构造特征
Centers of clusters
利用到聚类中心的距离
Aggregated statistic
计算周边对象的汇总统计信息
trick
利用决策树根据经纬度信息，将地区分为两个部分

处理缺失值

根据具体情况决定填充缺失值
通常用-999、-1、mean、median替换缺失值
Missing values already can be replaced with something by organizers.
二值特征"isnull"也是有用的
通常来说，在feature generation之前应该避免去填充nan
Xgboost可以处理NaN

从文本和图像中提取特征

Bag of words

为了统一尺度，我们使用正则化后的"Bag of words"

Term frequency

tf = 1/x.sum(axis=1)[:, None]
x = x*tf

为了提取重点，降低常出现词的频率，有了这个：

Inverse Document Frequency

idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x*idf

这可以在sklearn.feature_extraction.text.TfidfVectorizer找到。

N-grams

可以利用于句子上。

sklearn.feature_exraction.text.CountVectorizer: Ngram_range, analyzer

Pipeline of applying BOW(Conclusion)

1.Preprocessing:

Lowercase: Very和very的区别
stemming：democracy, democratic, and democratization -> democr
lemmatization:democracy, democratic, and democratization -> democracy
stopwords:一般在网上可以搜到这样的表

sklearn.feature_extraction.text.CountVectorizer: max_df

2.Ngrams can help to use local context

3.Postprocessing:TFiDF

BOW and w2v comparison

1.Bag of words
- a. Very large vectors
- b. Meaning of each value in vector is known
Word2vec
- a. Relatively small vectors
- b. Values in vevtor can be interpreted only in some cases
- c. The words with similar meaning often have similar embedding

Feature extraction from text and images

1.Texts
- a.Preprocessing
  Lowercase, stemming, lemmarization, stopwords
- b.Bag of words
  1.Huge vectors
  2.Ngrams can help to use local context
  3.TFiDF can be of use as postprocessing
- c.Word2vec
  1.Relatively small vectors
  2.Pretrained models
2.Image
- a.Features can be extracted from different layers
- b.Careful choosing of pretrained network can help
- c.Finetuing allows to refine pretrained models
- d.Data augmentation can improve the model.

探索数据分析

Building intuition about the data

Get domain knowledge
- It helps to deeper understand the problem
Check if the data is intuitive
- And agrees with domain knowledge
Understand how the data was generated
- As it is crucial to set up a proper validation(很可能训练集数据分布与测试集不同，导致验证集错误)

Exploring anonymized data

Try to decode the features
- Guess the true meaning of the feature
Guess the feature types
- Each type needs its own preprocessing

Mean encodings

Using target to generate features

example

	feature	feature_label	feature_mean	target
0	Moscow	1	0.4	0
1	Moscow	1	0.4	1
2	Moscow	1	0.4	1
3	Moscow	1	0.4	0
4	Moscow	1	0.4	0
5	Tver	2	0.8	1
6	Tver	2	0.8	1
7	Tver	2	0.8	1
8	Tver	2	0.8	0
9	Klin	0	0.0	0
10	klin	0	0.0	0

feature_mean=mean(target)

这个操作可以让本来无序的类别标签变得有序。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

9.Kaggle杂记.md

9.Kaggle杂记.md

特征处理和生成

数值特征

scaling

outliers离群点

rank

转换

Feature generation

Categorical and ordinal features（标签和顺序特征）

Datetime and coordinates

Datetime

Coordinates

处理缺失值

从文本和图像中提取特征

Bag of words

N-grams

Pipeline of applying BOW(Conclusion)

1.Preprocessing:

2.Ngrams can help to use local context

3.Postprocessing:TFiDF

BOW and w2v comparison

Feature extraction from text and images

探索数据分析

Building intuition about the data

Exploring anonymized data

Mean encodings

Using target to generate features

Files

9.Kaggle杂记.md

Latest commit

History

9.Kaggle杂记.md

File metadata and controls

特征处理和生成

数值特征

scaling

outliers离群点

rank

转换

Feature generation

Categorical and ordinal features（标签和顺序特征）

Datetime and coordinates

Datetime

Coordinates

处理缺失值

从文本和图像中提取特征

Bag of words

N-grams

Pipeline of applying BOW(Conclusion)

1.Preprocessing:

2.Ngrams can help to use local context

3.Postprocessing:TFiDF

BOW and w2v comparison

Feature extraction from text and images

探索数据分析

Building intuition about the data

Exploring anonymized data

Mean encodings

Using target to generate features