INFORMS DM Contest 2010 : dataset overview

July 31, 2010
  • 5922 rows
  • 610 columns
  • timestamp : 40182 – 40290 | 108 samples | 9 hours of trading data | no actual date information
  • most variables appear to be stock prices, given as (open, high, low, last_price) values at 5-minute intervals
  • (OPEN = value at the timestamp, HIGH = highest traded value in the 5-minute interval, LOW = lowest traded value in the interval, LAST = last traded value at the end of the interval)
  • for stock prices the following should hold : open(p(t+1)) = last_price(p(t)) + delta | however, that is not always the case: the dataset contains missing values, which reflects real-world trading conditions (missed measurements, data loss, etc.)
  • some variables are categorical/logical (what they actually represent is open to interpretation) [an additional question is whether they should be added to the model]
  • data might not be available for each variable at each time sample
  • a first look at the data suggests starting with basic time-series analysis
  • the big question is how price correlation drives price formation (any-to-all stocks regression modeling, or a model restricted to any-to-(time?)-correlated stocks)
  • handling missing data will be essential for getting high-performance predictions (which methods ?)
  • off-topic : similarity detection for financial time series ? pattern matching etc. “find similar stocks”, which metrics ?
  • clustering algorithms for time-series datasets ?
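
The open/last relation and the missing-data question above can be checked with a few lines of code. The following is only a minimal sketch on made-up bars: the field names ("open", "high", "low", "last") and the values are assumptions for illustration, not the contest's actual column names, and forward fill is just one simple baseline, not a recommended method.

```python
# Toy 5-minute bars for one hypothetical stock; None marks a missing measurement.
# Values and field names are invented for illustration, not taken from the contest data.
bars = [
    {"open": 10.0, "high": 10.4, "low": 9.9, "last": 10.2},
    {"open": 10.2, "high": 10.5, "low": 10.1, "last": 10.3},
    {"open": None, "high": None, "low": None, "last": None},
    {"open": 10.6, "high": 10.8, "low": 10.5, "last": 10.7},
]


def missing_fraction(rows, field):
    """Share of rows in which `field` is missing."""
    return sum(r[field] is None for r in rows) / len(rows)


def open_last_gaps(rows):
    """delta = open(t+1) - last(t) per consecutive pair; None if either side is missing."""
    gaps = []
    for prev, nxt in zip(rows, rows[1:]):
        if prev["last"] is None or nxt["open"] is None:
            gaps.append(None)
        else:
            gaps.append(nxt["open"] - prev["last"])
    return gaps


def forward_fill(rows, field):
    """Carry the last observed value forward (a simple baseline for missing data)."""
    filled, last_seen = [], None
    for r in rows:
        if r[field] is not None:
            last_seen = r[field]
        filled.append(last_seen)
    return filled


frac = missing_fraction(bars, "last")
gaps = open_last_gaps(bars)
filled_last = forward_fill(bars, "last")
print(frac, gaps, filled_last)
```

Counting how many deltas are undefined, per variable, gives a quick picture of where the open/last relation cannot be checked at all.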
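
For the "find similar stocks" question, one common starting metric is the Pearson correlation of simple returns, with 1 - correlation used as a distance. The sketch below assumes complete series of equal length; the tickers "A", "B", "C" and their prices are hypothetical, and this is one possible metric rather than the answer to the question above.

```python
import math


def simple_returns(prices):
    """Relative change between consecutive prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical last_price series; B is A scaled by 2, C tends to move the other way.
prices = {
    "A": [10.0, 10.2, 10.1, 10.4, 10.3],
    "B": [20.0, 20.4, 20.2, 20.8, 20.6],
    "C": [5.0, 4.9, 5.0, 4.8, 4.9],
}
rets = {name: simple_returns(p) for name, p in prices.items()}


def most_similar(name):
    """Other series with the smallest correlation distance (1 - correlation) to `name`."""
    others = [k for k in rets if k != name]
    return min(others, key=lambda k: 1.0 - pearson(rets[name], rets[k]))


corr_ab = pearson(rets["A"], rets["B"])
corr_ac = pearson(rets["A"], rets["C"])
print(corr_ab, corr_ac, most_similar("A"))
```

The same pairwise distances could feed a standard clustering algorithm; with missing data, the series would first need to be aligned or filled.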
