Archive for July, 2010

Logistic Regression | take #0

July 31, 2010
  • Logistic regression (logit) is an example of a generalized linear model: the logistic function maps a linear combination of the explanatory variables to the probability of the response variable
  • Logit models fall into a broader class of qualitative response regression models (the dependent variable is qualitative in nature)
  • Many real-world estimation problems actually fall into the “qualitative” (rather than quantitative) category – the most common example being event occurrence (or the corresponding probability)
  • Additionally – this setting maps directly to Web analytics problems regarding user behavior (for example – predicting whether a user will ignore advertising)
  • Essential read regarding qualitative variables : G. S. Maddala, Limited-Dependent and Qualitative Variables in Econometrics, Cambridge University Press, 1983.
  • Taxonomy of qualitative variables in Web Analytics ? Starting point :
  • Classical example is regression analysis of user sessions – we observe explanatory variables (anything related to the page the user was shown at the time) and a qualitative dependent variable – the user action (click, skip, stop browsing, etc.)
  • The question is whether we can properly estimate qualitative variables using standard regression methods like OLS
  • First shot at this would be the Linear Probability Model (LPM) – treating the Bernoulli success probability as a quantitative variable and fitting it by OLS. Issues – residuals are not normal (they are also heteroscedastic, and fitted values can fall outside [0, 1]).

INFORMS DM Contest 2010 : dataset overview

July 31, 2010
  • 5922 rows
  • 610 columns
  • timestamp : 40182 – 40290 | 108 samples | 9 hours of trading data | no actual date information
  • most variables appear to represent stock prices with (open, high, low, last_price) values in 5-min intervals
  • (OPEN = value at timestamp, HIGH = highest traded value in 5-min interval, LOW = lowest traded value in interval, LAST = last traded value at end of interval)
  • for stock prices the following should hold : open(p(t+1)) = last_price(p(t)) + delta | however – that is not always the case – the dataset contains missing data, reflecting a real-world trading scenario (missed measurements, data loss, etc.)
  • some variables are categorical/logical (what these actually represent is open to interpretation) [additional question – should they be added to the model ?]
  • data might not be available for each variable at each time sample
  • a first look at the data suggests starting with basic time-series analysis
  • big question is price formation dynamics due to price correlation (any-to-all stocks regression modeling or any-to-(time?)-correlated stocks model)
  • handling missing data will be essential for getting high-performance predictions (methods ?)
  • off-topic : financial time series similarity detection ? pattern matching etc. “find similar stocks”, metrics ?
  • clustering algorithms for time series datasets ?
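
A small sketch of the two data checks mentioned above, assuming missing values are represented as None and using made-up prices (not taken from the contest dataset). Forward fill is only one candidate answer to the "methods ?" question, shown here because it is the simplest; the function names are illustrative.

```python
def forward_fill(values):
    """Replace None with the most recent observed value (leading gaps stay None)."""
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled

def open_last_gaps(opens, lasts):
    """delta(t) = open(t+1) - last_price(t); None when either side is missing."""
    return [
        None if o is None or l is None else o - l
        for o, l in zip(opens[1:], lasts[:-1])
    ]

# Hypothetical 5-min samples for one stock, with gaps.
opens = [10.0, 10.2, None, 10.1, 10.4]
lasts = [10.2, None, 10.1, 10.3, 10.5]

gaps = open_last_gaps(opens, lasts)   # where does open(t+1) deviate from last(t)?
lasts_filled = forward_fill(lasts)    # carry the last known price across gaps
```

Checking the deltas before filling keeps imputed values from masking real inconsistencies between consecutive intervals.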

morning snapshot | 31 july 2010

July 31, 2010
  • Just discovered mathbin
  • Student t-distribution instead of Gaussian in loss function for small-sample regression scenarios
  • – INFORMS Data mining contest – predicting short-term movements in stock prices (testing short-term market efficiency ?). 5-minute-sampled 6k-entries dataset. Total of 609 columns representing (open/high/low/last_price) for each stock in 5-min window.
  • – quick demonstration of several ML techniques in R
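
To illustrate the Student-t note above: a sketch comparing the per-residual loss (negative log-likelihood, constants dropped) under Gaussian and Student-t noise. The degrees of freedom nu = 3 and scale sigma = 1 are arbitrary assumptions chosen for the example.

```python
import math

def gaussian_loss(r, sigma=1.0):
    """Negative log-likelihood of residual r under Gaussian noise (up to a constant)."""
    return 0.5 * (r / sigma) ** 2

def student_t_loss(r, nu=3.0, sigma=1.0):
    """Negative log-likelihood of residual r under Student-t noise (up to a constant)."""
    return 0.5 * (nu + 1.0) * math.log1p((r / sigma) ** 2 / nu)

# Compare how each loss penalizes a small residual and a large outlier.
residuals = [0.0, 0.5, 10.0]
comparison = [(r, gaussian_loss(r), student_t_loss(r)) for r in residuals]
```

The Gaussian loss grows quadratically with the residual while the Student-t loss grows only logarithmically, so a single outlier pulls the fit much less; this is the usual motivation for preferring it in small samples.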

Paul Wilmott et al : the math of financial derivatives | 45 min redux

July 30, 2010
  • A classical piece (1995) – focus on continuous side of financial math (SDEs etc.)
  • Basic intro – options, real world data, interest rates, continuous compounding, present value
  • Random walk nature of financial time series – discrete vs continuous walk – SDE representation
  • We base derivative strategies around time series / stochastic process statistics rather than estimation
  • Enter Ito’s lemma – description based on Taylor series expansion : ignore the stochastic nature of the function; take the Taylor expansion of the change in function value due to a small change in a single argument; substitute the SDE for that change; use order-of-magnitude analysis to keep only the leading terms (dX² behaves like dt); and voila ! – we have the relation between a small change in a function of a random variable and a small change in the variable itself
  • Consequence of Ito’s lemma – we can relate the change in the dependent variable (which we can’t observe directly) to the change in the observed underlying variable (statistically – these two variables should be perfectly correlated)
  • Obvious application : determining change in option price based on change in stock price (in general – this can be abstracted to any similar problem)
  • We can generalize Ito’s lemma by introducing time-dependency of function (that is – multivariate dependency)
  • Finally – we can derive probability distributions of variables and use standard probability toolkit to relate value range of variables with appropriate probabilities
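
The Taylor-expansion argument above can be written out as follows. This is a sketch assuming the standard lognormal random walk for the stock price S with drift μ and volatility σ, where dX is a Wiener increment; the symbols are my notation for these notes and may differ from the book's.

```latex
\begin{align*}
dS   &= \sigma S \, dX + \mu S \, dt
       && \text{(assumed SDE for the stock price)} \\
df   &= \frac{\partial f}{\partial S} \, dS
        + \frac{1}{2} \frac{\partial^2 f}{\partial S^2} \, dS^2 + \dots
       && \text{(Taylor expansion of } f(S)\text{)} \\
dS^2 &= \sigma^2 S^2 \, dX^2 + 2 \sigma \mu S^2 \, dt \, dX + \mu^2 S^2 \, dt^2
        \;\to\; \sigma^2 S^2 \, dt
       && \text{(since } dX^2 \to dt \text{ as } dt \to 0\text{)} \\
df   &= \sigma S \frac{\partial f}{\partial S} \, dX
        + \left( \mu S \frac{\partial f}{\partial S}
        + \frac{1}{2} \sigma^2 S^2 \frac{\partial^2 f}{\partial S^2} \right) dt
       && \text{(Ito's lemma for } f(S)\text{)} \\
df   &= \sigma S \frac{\partial f}{\partial S} \, dX
        + \left( \mu S \frac{\partial f}{\partial S}
        + \frac{1}{2} \sigma^2 S^2 \frac{\partial^2 f}{\partial S^2}
        + \frac{\partial f}{\partial t} \right) dt
       && \text{(time-dependent case, } f(S, t)\text{)}
\end{align*}
```

The same increment dX drives both dS and df, which is why the two changes are perfectly correlated.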

time series forecasting #1

July 30, 2010
  • marginalization of regression variables and observation as time series
  • time domain vs frequency domain methods for time series analysis
  • spectral analysis of time series – frequency domain modeling – useful in detecting cyclic activity
  • voidbase project – frequency-domain queue analysis
  • case for voidbase vs existing open source complex event processing solutions (Esper/Cayuga) – focus of voidbase is on ease of abstraction of traditional online data sources and a simple framework for online algorithm development – and in general a more exploratory focus aimed at the online portal / search engine market (+ not so much CEP as just interactive real-time time series econometrics)
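
As an illustration of the spectral-analysis point above, here is a minimal periodogram sketch using a direct DFT from the standard library. The synthetic series (a period-12 cycle plus a weaker period-4 cycle) is an assumption made up for the example, and this is not voidbase code.

```python
import cmath
import math

def periodogram(xs):
    """Return (k, power) pairs for DFT frequencies k = 1 .. n//2 of the centered series."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]  # remove the level so it does not dominate
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * t / n)
            for t, c in enumerate(centered)
        )
        result.append((k, abs(coeff) ** 2 / n))
    return result

# Synthetic series: strong cycle of period 12 plus a weaker cycle of period 4.
n = 48
series = [
    5.0 + 2.0 * math.sin(2 * math.pi * t / 12) + 0.5 * math.sin(2 * math.pi * t / 4)
    for t in range(n)
]

k_peak, peak_power = max(periodogram(series), key=lambda kp: kp[1])
dominant_period = n / k_peak  # period, in samples, of the strongest cycle
```

The frequency with the largest power identifies the dominant cycle; a direct DFT costs O(n²), so an FFT would be the natural replacement for longer series.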

algorithms review : tries | take #1

July 29, 2010

  • Tries represent prefix trees
  • String lookup in O(len(key)), independent of the number of stored strings, is naturally obtained by adding strings to a trie
  • Common case for key-value stores is “get all keys by prefix” – a trie is a natural way to achieve this. Can this be mapped to a key-value store ? Indeed – we map each tree node to an entry in a hashmap and store the keys representing its child nodes in that entry’s content.
  • HashTrie – popular extension of key-value stores -> update is O(len(key)), retrieval is O(len(key)). The main problem is that we need to update key *content* in order to maintain the tree – which makes a concurrent HashTrie relatively complex to design
  • How to iterate HashTrie ? – would require having a master-key that would act as top-level node and contain keys of all first-level child keys
  • SortedHashTrie ?
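
A rough sketch of the HashTrie idea from the list above, assuming an in-memory dict stands in for the key-value store. Each node is keyed by its prefix and stores the characters of its children; the empty string plays the role of the master key. The class layout and method names are hypothetical, not an existing library API.

```python
class HashTrie:
    """Trie flattened into a hash map: prefix -> set of next characters."""

    ROOT = ""  # master key acting as the top-level node

    def __init__(self):
        self.store = {self.ROOT: set()}  # stand-in for a key-value store
        self.terminal = set()            # prefixes that are complete keys

    def insert(self, key):
        """Touch one store entry per character of the key."""
        for i, ch in enumerate(key):
            self.store.setdefault(key[:i], set()).add(ch)
        self.store.setdefault(key, set())
        self.terminal.add(key)

    def contains(self, key):
        return key in self.terminal

    def keys_with_prefix(self, prefix):
        """Walk the subtree under `prefix` and collect all complete keys."""
        if prefix not in self.store:
            return []
        found, stack = [], [prefix]
        while stack:
            node = stack.pop()
            if node in self.terminal:
                found.append(node)
            stack.extend(node + ch for ch in self.store[node])
        return sorted(found)


trie = HashTrie()
for word in ["car", "cart", "cat", "dog"]:
    trie.insert(word)

ca_keys = trie.keys_with_prefix("ca")
all_keys = trie.keys_with_prefix(HashTrie.ROOT)  # iteration starts from the master key
```

Every insert rewrites the content of existing entries (the child sets), which is the update-in-place behaviour that complicates a concurrent version.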

statistics review : marginal regression | take #0

July 29, 2010

A popular concept in high-dimensional statistics is marginal regression – which essentially means regression on marginal variables. By a marginal variable we mean any variable obtained by performing arithmetic operations on the row/column data of the original dataset.

Bagel’s Park blues

July 29, 2010

Among the narrow streets of Belgrade’s downtown, just across the Terazije tunnel – a small treasure is hidden. If you’re in the mood for pesto chicken bagels, fresh orange juice and a bit of espresso – it’s the best way to start your morning routine. Free wifi, large tables and just a bit of street schizophrenia – it’s just about the spot for the tech people to bootstrap their ideas.

morning snapshot | 29 July 2010

July 29, 2010

bootstrap on&on …

July 29, 2010

Here we go again. Blog-leapfrogging brings us to wordpress. Apparently I have just learned that blogger is obsolete – and I’m already using local wp on voidsearch blog => here we go 🙂 More random jabbering to myself & friends. Hooray !