Daily Practice | Data Scientist & Business Analyst Interview Questions #130

http://mp.weixin.qq.com/s/xg5dx5mDVw_9s5W7x7PT5Q

http://mp.sohu.com/profile?xpt=ZGF0YWxhdXNAc29odS5jb20=&_f=index_pagemp_2

1、What are the advantages and disadvantages of k-nearest neighbors?

2、What is the Box-Cox transformation used for?

The Box-Cox transformation is a generalized "power transformation" that transforms data to make the distribution more normal.

It's used to stabilize the variance (eliminate heteroskedasticity) and normalize the distribution.

It addresses non-constant variance (heteroskedasticity); to some extent, a Box-Cox transformation can also reduce the correlation between the unobservable errors and the predictor variables.
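
A minimal sketch of the transformation in practice, using SciPy's boxcox; the right-skewed data below is synthetic and only for illustration:

```python
# Minimal Box-Cox sketch with SciPy (synthetic, right-skewed data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # positive, right-skewed data

# When lmbda is not given, boxcox estimates the power parameter by maximum likelihood
transformed, lam = stats.boxcox(skewed)
print(f"estimated lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(skewed):.3f}, after: {stats.skew(transformed):.3f}")
```

Note that Box-Cox requires strictly positive data; a shifted variant (or the Yeo-Johnson transformation) is typically used otherwise.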

3、While working on a data set, how do you select important variables? Explain your methods.

  • Remove correlated variables before selecting important variables
  • Use linear regression and select variables based on p-values
  • Use Forward Selection, Backward Selection, or Stepwise Selection
  • Use Random Forest or XGBoost and plot a variable importance chart (see the sketch after this list)
  • Measure information gain for the available set of features and select the top n features accordingly
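
A minimal sketch of the tree-based approach, using scikit-learn; the data is synthetic and the column names are invented for illustration:

```python
# Rank features by Random Forest importance on synthetic data
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Sort features by importance; the top-n would be kept for modeling
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```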

4、You are working on a time series data set. Your manager has asked you to build a high-accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?

Time series data often exhibits linear structure, whereas a decision tree algorithm works best at detecting non-linear interactions. The decision tree failed to provide robust predictions because it could not capture the linear relationship as well as the regression model did. The lesson is that a linear regression model can provide robust predictions provided the data set satisfies its linearity assumptions.
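
A rough sketch of this point on synthetic data (the trend, noise level, and train/test split below are arbitrary assumptions): a decision tree cannot extrapolate a linear trend beyond the range it saw in training, while a linear model can.

```python
# Linear model vs. decision tree on a series with a linear trend
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

t = np.arange(120).reshape(-1, 1)
y = 2.0 * t.ravel() + np.random.default_rng(1).normal(0, 5, size=120)  # linear trend + noise

X_train, y_train = t[:100], y[:100]   # past
X_test, y_test = t[100:], y[100:]     # future

for name, model in [("linear", LinearRegression()), ("tree", DecisionTreeRegressor())]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.1f}")  # the tree's error blows up on the unseen future range
```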

5、What are the advantages and disadvantages of k-nearest neighbors?

Advantages: k-nearest neighbors has a nice, intuitive explanation, and it tends to work very well for problems where comparables are inherently indicative. For example, you could build a kNN housing-price model that prices a house based on other houses in the area with a similar number of bedrooms, floor space, etc.

Disadvantages: kNN models are memory-intensive. They also have no built-in feature selection or regularization, so they do not handle high dimensionality well.
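
A minimal sketch of the housing-price example with scikit-learn's KNeighborsRegressor; the feature values and prices below are invented for illustration:

```python
# Price a house as the average of its k most comparable neighbors
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# columns: number of bedrooms, floor space (sq ft)
X = np.array([[2, 850], [3, 1200], [3, 1400], [4, 2000], [5, 2600]])
prices = np.array([200_000, 280_000, 310_000, 450_000, 600_000])

model = KNeighborsRegressor(n_neighbors=3).fit(X, prices)

# Predicted price for a 3-bedroom, 1300 sq ft house; in practice the features
# should be scaled first, since floor space dominates the Euclidean distance here
print(model.predict([[3, 1300]]))
```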

6、What’s the “kernel trick” and how is it useful?

Kernel trick:

Kernel functions make it possible to operate in a higher-dimensional feature space without explicitly calculating the coordinates of points in that space: instead, a kernel function computes the inner products between the images of all pairs of data points in the feature space.

Why it is useful:

It gives access to those higher-dimensional coordinates implicitly, at a much lower computational cost than calculating them explicitly. Many algorithms can be expressed purely in terms of inner products, so the kernel trick lets us effectively run them in a high-dimensional space while only ever handling the lower-dimensional data.
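
A small numeric check of this idea, assuming the degree-2 polynomial kernel K(x, z) = (x · z)^2 in two dimensions, whose explicit feature map is phi(x) = (x1^2, x2^2, sqrt(2)·x1·x2):

```python
# The kernel value equals the inner product of the explicit feature maps,
# but is computed without ever forming the higher-dimensional coordinates
import numpy as np

def kernel(x, z):
    return np.dot(x, z) ** 2          # cheap: works in the original 2-D space

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(kernel(x, z))                    # 121.0
print(np.dot(phi(x), phi(z)))          # 121.0 -- same value, via the explicit 3-D map
```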

7、What are Rating, Average Persons, and Spot?

Rating
Rating is the percentage (0 to 100) of the Media Market that will likely be exposed to your advertisement. Rating is an estimate based on past performance, often sourced from surveys.

Average Persons
Average Persons is the number of people that, on average, will be exposed to each Spot. It is calculated by multiplying the Population by the Rating and dividing by 100.
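
For example, assuming (purely for illustration) a Media Market population of 1,000,000 and a Rating of 5, Average Persons = 1,000,000 × 5 / 100 = 50,000 people exposed per Spot.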

Spot
A Spot is a single broadcast of an advertisement. Typically, an advertising placement includes multiple spots.

Reference:
http://www.bionic-ads.com/2016/03/reach-frequency-ratings-grps-impressions-cpp-and-cpm-in-advertising/

8、What are the differences between the Poisson distribution and the normal distribution?

A Poisson distribution is discrete, while a normal distribution is continuous.

A Poisson random variable is always >= 0, whereas a normal random variable can take any real value.

When the mean of a Poisson distribution is large, it becomes approximately normal (the approximating normal has the same mean, and a variance equal to that mean).
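
A quick sketch of the last point using SciPy; the mean of 100 is an arbitrary choice for illustration:

```python
# Compare the Poisson pmf with a normal density of equal mean and variance
from scipy import stats

mu = 100
for k in (80, 100, 120):
    poisson_p = stats.poisson.pmf(k, mu)
    normal_p = stats.norm.pdf(k, loc=mu, scale=mu ** 0.5)
    print(f"k={k}: Poisson {poisson_p:.5f}  vs  normal approximation {normal_p:.5f}")
```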
