EDA of Sample Dataset
Objective
Data Scientists and Analysts are often tasked to clean and analyze datasets. We are working with an external research firm who specializes in the application of artificial intelligence to forecasting prices of financial instruments. This firm has developed a proprietary system, called “4sight”, to forecast prices of certain instruments.
To demonstrate the effectiveness of their forecasting system, the vendor has sent us attached sample dataset. The dataset includes signal values generated by the 4sight system as well as historical prices for a well-known broad market ETF.
A Portfolio Manager has asked you to:
1) Review the quality of the data, list any potential errors, and propose corrected values. Please list each quality check error and correction applied.
2) Please analyze the signal’s effectiveness or lack thereof in forecasting ETF price, using whatever metrics you think are most relevant.
3) (Extra credit) Write a 1-2 paragraph summary for the Portfolio Manager addressing your observations about the efficacy and believability of the product, and recommendation for next steps.
4) Please budget at least one hour for completing this exercise. Please include all the intermediate steps when sending your solution to this exercise back to us.
Conclusion
Data issues noticed to discuss / inform vendor of:
- The last six values of Signal are zeros. Why?
- drop last six rows (or fill forward as the stale value is probably what most systems will end up using)
- A large outlier of Signal was found on the 2017-11-13. Why?
- drop row (or fill forward as the stale value is probably what most systems will end up using)
- Adj Close is very different from Close / Low / High / Open. How was it adjusted?
- ignore until more confident about the data
- a large ourlier of Close was found on the 2018-03-19. Why?
- fill the highest value of Close with the next mornings open
- 2017-09-08 to 2017-09-22 look to be filled forward for the Close.
- fill with Open shifted forward 1 Day
- high and low inverted on 2018-03-07 and 2018-07-16
- ignore if you don't plan to test them as features or response
- if you plan to use them. fill Low with the min of [Open, High, Low, Close] and High with max of [Open, High, Low, Close]
On first glance, the signal looks to have no predictive power out of sample (20% of the most recent data). This conclusion was drawn from conducting a residual plot of Signal against Open shifted forward 1 day and Close. To quickly check for some statistical violations the same was done on a tree model which produced similar conclusions.
I'd hold off on doing a lot more analysis as it is unlikely to be valuable on it's own. If we delve further, I would focus on checking the signals power as an interaction term with other features and transforming the problem into a classifier instead of a regression.
%load_ext autoreload
%autoreload 2
%matplotlib inline
from pathlib import Path
import sklearn
import interpret
import yellowbrick
import pandas as pd
import numpy as np
from interpret import show
from interpret.data import Marginal
from interpret.glassbox import ExplainableBoostingRegressor, RegressionTree
from interpret.perf import RegressionPerf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from yellowbrick.features import Rank2D
from yellowbrick.features import JointPlotVisualizer
from yellowbrick.regressor import ResidualsPlot
from pydqc import infer_schema, data_summary, data_compare, data_consist
print(f"numpy: {np.__version__}")
print(f"pandas: {pd.__version__}")
print(f"sklearn: {sklearn.__version__}")
print(f"interpret: {interpret.__version__}")
print(f"yellowbrick: {yellowbrick.__version__}")
eda_dir = Path(".")
data_path = eda_dir / "Sample Dataset.xlsx"
assert data_path.exists() == True
df = pd.read_excel(data_path)
pd.concat([df.dtypes, df.head(1).T, df.tail(1).T], axis=1)
df.describe()
Comments
- Date looks like the obvious index
- Why is the adj close so far from the close?
- Why is the signal 0 for last entry? Is zero a valid value?
- Why is the max close greater than the max high?
df['Date'].value_counts().max()
df[['Close', 'Adj Close']].plot(figsize=(16,8))
(df['Close'] - df['Adj Close']).hist(figsize=(16,6), bins=[0, 2, 5, 10, 15, 20, 30])
(df['Close'] - df['Adj Close']).describe()
Comments
- This looks weird, but I'm not sure what's going on. I'm not comfortable using Adj Close until this is understood
df['Signal'].hist(figsize=(12,6))
df['Signal'].describe()
df[df['Signal'] == 0.0].index
Comments
- looks like an issue in signal calculation at for the last 6 rows. we should inform the signal provider and drop them for now. we could also fill forward the last real signal
df[df['Signal'] == df['Signal'].max()].index
max_signal_iloc = df[df['Signal'] == df['Signal'].max()].index.values[0]
df.iloc[max_signal_iloc - 1: max_signal_iloc + 2]
Comments
- looks like the max is an error. we should inform the signal provider and drop them for now. we could also fill forward the last real signal
df['Close'].hist(figsize=(12,6))
max_close_iloc = df[df['Close'] == df['Close'].max()].index.values[0]
df.iloc[max_close_iloc - 1: max_close_iloc + 2]
Comments
- looks like the max close is an error... I'd recommend setting it to next days open so we can test it as a response variable
df['Open'].hist(figsize=(12,6))
df[~(df['Close'] <= df['High'])]
df[~(df['Open'] <= df['High'])]
df[~(df['Close'] >= df['Low'])]
df[(df['Close'] == 139.110001)]
df[(df['Close'] == 139.110001) & ~(df['Close'] >= df['Low'])]
Comments
- Close data looks to have been filled forward for several dates with 139.110001
- To use close as response variable I'd recommend filling it with the next days open for these instances
df[~(df['Open'] >= df['Low'])]
df[~(df['High'] >= df['Low'])]
Comments
- High and Low data doesn't look reliable, but if we don't plan to use it we can disregard this for now
- The easiest HACK is to set Low = Open in these cases
df = df.sort_values(by='Date')
# shift open price forward 1 day
df['Open +1D'] = df['Open'].shift(1)
# Add Return Columns for a Stationary Response
df['Open +1D Return'] = df['Open +1D'].pct_change()
df['Close Return'] = df['Close'].pct_change()
# drop last 6 zero rows from signal
df = df[df['Signal'] != 0.0]
# drop highest value from Signal
df = df[df['Signal'] != df['Signal'].max()]
# fill the highest value of Close with the next mornings open
max_close_iloc = df[df['Close'] == df['Close'].max()].index.values[0]
df.loc[max_close_iloc, 'Close'] = df.loc[max_close_iloc+1, 'Open']
# adjusting the period where close is filled forward with the next days open price instead
df.loc[(df['Close'] == 139.110001) & ~(df['Close'] >= df['Low']), 'Close'] = df.loc[(df['Close'] == 139.110001) & ~(df['Close'] >= df['Low']), 'Open +1D']
# set date as index
df = df.set_index('Date')
x_1, y_1, y_2 = df['Signal'], df['Close'], df['Open +1D']
y_2 = y_2.dropna()
x_2 = x_1[y_2.index]
train_x_1, dev_x_1, train_y_1, dev_y_1 = train_test_split(x_1, y_1, test_size=0.2, shuffle=False)
train_x_2, dev_x_2, train_y_2, dev_y_2 = train_test_split(x_2, y_2, test_size=0.2, shuffle=False)
visualizer = JointPlotVisualizer(size=(1080, 720))
visualizer.fit_transform(x_1, y_1)
visualizer.show()
visualizer = JointPlotVisualizer(size=(1080, 720))
visualizer.fit_transform(x_2, y_2)
visualizer.show()
visualizer = ResidualsPlot(LinearRegression(),size=(1080, 720))
visualizer.fit(train_x_1.values.reshape(-1, 1), train_y_1.values.reshape(-1, 1))
visualizer.score(dev_x_1.values.reshape(-1, 1), dev_y_1.values.reshape(-1, 1))
visualizer.show()
visualizer = ResidualsPlot(LinearRegression(),size=(1080, 720))
visualizer.fit(train_x_2.values.reshape(-1, 1), train_y_2.values.reshape(-1, 1))
visualizer.score(dev_x_2.values.reshape(-1, 1), dev_y_2.values.reshape(-1, 1))
visualizer.show()
ebm = ExplainableBoostingRegressor()
ebm.fit(train_x_2.to_frame(), train_y_2)
rt = RegressionTree()
rt.fit(train_x_2.to_frame(), train_y_2)
ebm_perf = RegressionPerf(ebm.predict).explain_perf(dev_x_2.to_frame(), dev_y_2, name='Boosting Tree')
rt_perf = RegressionPerf(rt.predict).explain_perf(dev_x_2.to_frame(), dev_y_2, name='Regression Tree')
show(rt_perf)
show(ebm_perf)
ebm_global = ebm.explain_global(name='Boosting Tree')
rt_global = rt.explain_global(name='Regression Tree')
show(ebm_global)
show(rt_global)