- Logistic regression (logit) is an example of a generalized linear model, created by introducing the logistic function as the dependency between explanatory and response variables
- Logit models fall into the broader class of qualitative response regression models (the dependent variable is qualitative in nature)
- A lot of real-world estimation problems actually fall into the “qualitative” (rather than quantitative) category, the most common example being event occurrence (or the corresponding probability)
- Additionally, this setting maps directly to the Web analytics problem of user behavior (for example, predicting whether a user will ignore advertising)
- Essential read regarding qualitative variables: G.S. Maddala, Limited-Dependent and Qualitative Variables in Econometrics, Cambridge University Press, 1983.
- Taxonomy of qualitative variables in Web analytics? Starting point: http://en.wikipedia.org/wiki/Web_analytics
- The classical example is regression analysis of user sessions: we observe explanatory variables (anything related to the page that the user was presented with at the time) and a qualitative dependent variable, the user action (click, skip, stop browsing, etc)
- The question is whether we can properly estimate qualitative variables using standard regression methods like OLS
- A first shot at this would be the Linear Probability Model (LPM): treating the Bernoulli success probability as a quantitative variable and fitting it with OLS. Issues: residuals are not normal, and fitted values are not constrained to [0, 1].
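
To make the LPM vs logit contrast concrete, here is a minimal sketch in plain Python. The click data is made up, and the helper names (`fit_ols`, `fit_logit`) and the gradient-ascent settings are illustrative assumptions, not a reference implementation.

```python
import math

def fit_ols(xs, ys):
    """Ordinary least squares for y = a + b*x (this is the LPM when y is 0/1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ys, lr=0.1, steps=5000):
    """Logit model p = sigmoid(a + b*x), fitted by gradient ascent on the mean log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(a + b * x) for x, y in zip(xs, ys)]
        a += lr * sum(errors) / n
        b += lr * sum(e * x for e, x in zip(errors, xs)) / n
    return a, b

# Made-up sessions: x = some page feature, y = 1 if the user clicked
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

lpm_a, lpm_b = fit_ols(xs, ys)
logit_a, logit_b = fit_logit(xs, ys)

# LPM "probabilities" are just points on a straight line
lpm_pred = [lpm_a + lpm_b * x for x in xs]
# Logit probabilities are squashed through the logistic function
logit_pred = [sigmoid(logit_a + logit_b * x) for x in xs]

print("LPM range  :", min(lpm_pred), max(lpm_pred))
print("logit range:", min(logit_pred), max(logit_pred))
```

On this toy data the LPM line drops below 0 at the low end and exceeds 1 at the high end, while the logit predictions stay strictly inside (0, 1).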

## Archive for July, 2010

### Logistic Regression | take #0

July 31, 2010

### INFORMS DM Contest 2010 : dataset overview

July 31, 2010

- 5922 rows
- 610 columns
- timestamp : 40182 – 40290 | 108 samples | 9 hours of trading data | no actual date information
- most of the variables should represent stock prices with (open, high, low, last_price) values in 5-min intervals
- (OPEN = value at timestamp, HIGH = highest traded value in 5-min interval, LOW = lowest traded value in interval, LAST = last traded value at end of interval)
- for stock prices the following should hold: *open(p(t+1)) = last_price(p(t)) + delta*. However, that is not always the case: the dataset is filled with missing data in order to reflect the real-world trading scenario (missed measurements, data loss, etc)
- some variables are categorical/logical (open interpretation on what these actually represent); an additional question is whether they should be added to the model
- data might not be available for each variable at each time sample
- a first look at the data might indicate that we should go for basic time-series analysis
- big question is price formation dynamics due to price correlation (any-to-all stocks regression modeling or any-to-(time?)-correlated stocks model)
- handling missing data will be essential for getting high-performance predictions (methods ?)
- offtopic : financial time series similarity detection ? pattern matching etc. “find similar stocks”, metrics ?
- clustering algorithms for time series datasets ?
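
A minimal sketch of the open/last consistency check from the list above, on made-up 5-minute bars with `None` marking a missing measurement. Forward fill is only one candidate for the "methods ?" question, picked here as an assumption for illustration; the function names are hypothetical.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value (leading Nones stay None)."""
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled

def open_last_gaps(opens, lasts):
    """delta(t) = open(t+1) - last_price(t), or None when either side is missing."""
    gaps = []
    for next_open, prev_last in zip(opens[1:], lasts[:-1]):
        if next_open is None or prev_last is None:
            gaps.append(None)
        else:
            gaps.append(round(next_open - prev_last, 6))
    return gaps

# Made-up bars for a single stock
opens = [10.00, 10.05, None, 10.10, 10.20]
lasts = [10.04, None, 10.09, 10.18, 10.25]

raw_gaps = open_last_gaps(opens, lasts)
filled_gaps = open_last_gaps(forward_fill(opens), forward_fill(lasts))

print("raw   :", raw_gaps)     # contains None where data is missing
print("filled:", filled_gaps)  # every delta is defined after forward fill
```

Forward fill keeps every delta defined, at the price of pretending that the last observed price was still valid one interval later.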

### morning snapshot | 31 july 2010

July 31, 2010

- Just discovered mathbin:
- Student t-distribution instead of Gaussian in loss function for small-sample regression scenarios
- http://kaggle.com/informs2010 – INFORMS Data mining contest – predicting short-term movements in stock prices (testing short-term market efficiency ?). 5-minute-sampled 6k-entries dataset. Total of 609 columns representing (open/high/low/last_price) for each stock in 5-min window.
- http://i2pi.com/rez/ml_talk/ml_demo.R – quick demonstration of several ML techniques in R

### Paul Wilmott et al : the math of financial derivatives | 45 min redux

July 30, 2010

- A classical piece (1995) – focus on continuous side of financial math (SDEs etc.)
- Basic intro – options, real world data, interest rates, continuous compounding, present value
- Random walk nature of financial time series – discrete vs continuous walk – SDE representation
- We base derivative strategies around time series / stochastic process statistics rather than estimation
- Enter Ito’s lemma, described via Taylor series expansion: ignore the stochastic nature of the function; take the Taylor expansion of the change in function value due to a delta-change in its single argument; substitute the differential from the SDE representation of that argument; use big-O analysis to drop the higher-order terms; plug the approximation back into the Taylor expansion, and voila! We have the relation between a small change in a function of a random variable and the small change in the variable itself
- Consequence of Ito’s lemma: we can relate the change in a dependent variable (which we can’t observe directly) to the change in the observed underlying variable (statistically, these two changes should be perfectly correlated)
- Obvious application : determining change in option price based on change in stock price (in general – this can be abstracted to any similar problem)
- We can generalize Ito’s lemma by introducing time-dependency of function (that is – multivariate dependency)
- Finally – we can derive probability distributions of variables and use standard probability toolkit to relate value range of variables with appropriate probabilities
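
The Taylor-expansion argument above can be written out as follows. The generic SDE with drift $a$, volatility $b$ and Wiener increment $dX$ is an assumed notation for this sketch, not necessarily the one used in the book.

$$
\begin{aligned}
dS &= a(S,t)\,dt + b(S,t)\,dX \\
df &= \frac{df}{dS}\,dS + \frac{1}{2}\frac{d^2 f}{dS^2}\,dS^2 + \dots \\
dS^2 &= b^2\,dX^2 + 2ab\,dt\,dX + a^2\,dt^2 \;\to\; b^2\,dt \quad \text{as } dt \to 0 \\
df &= \left(a\,\frac{df}{dS} + \frac{1}{2}\,b^2\,\frac{d^2 f}{dS^2}\right)dt + b\,\frac{df}{dS}\,dX
\end{aligned}
$$

The third line is the big-O step: $dX^2 \to dt$, while $dt\,dX$ and $dt^2$ are of higher order and drop out. For a time-dependent function $f(S,t)$ the same steps give

$$
df = \left(\frac{\partial f}{\partial t} + a\,\frac{\partial f}{\partial S} + \frac{1}{2}\,b^2\,\frac{\partial^2 f}{\partial S^2}\right)dt + b\,\frac{\partial f}{\partial S}\,dX
$$

Both $dS$ and $df$ are driven by the same $dX$, which is why small changes in the two are perfectly correlated.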

### time series forecasting #1

July 30, 2010

- marginalization of regression variables and observation as time series
- time domain vs frequency domain methods for time series analysis
- spectral analysis of time series – frequency domain modeling – useful in detecting cyclic activity
- voidbase project – frequency-domain queue analysis
- case for voidbase vs existing open source complex event processing solutions (Esper/Cayuga): the focus of voidbase is on ease of abstraction of traditional online data sources and a simple framework for online algorithm development, with a more exploratory focus aimed at the online portal / search engine market (+ not so much CEP as just interactive real-time time series econometrics)
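
As a small illustration of the frequency-domain point above (detecting cyclic activity), here is a sketch of a periodogram computed with a plain discrete Fourier transform. The synthetic series and the `periodogram` helper are assumptions for the example; a real implementation would use an FFT library.

```python
import cmath
import math

def periodogram(series):
    """Power |X_k|^2 / n of the mean-removed series at frequencies k/n, for k = 1..n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    power = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power[k] = abs(coeff) ** 2 / n
    return power

# Synthetic series: a strong cycle with period 12 plus a weaker one with period 5
n = 120
series = [math.sin(2 * math.pi * t / 12) + 0.3 * math.cos(2 * math.pi * t / 5)
          for t in range(n)]

power = periodogram(series)
dominant_k = max(power, key=power.get)  # frequency index with the most power
dominant_period = n / dominant_k        # length of the dominant cycle, in samples

print("dominant period:", dominant_period)
```

The peak of the periodogram sits at the frequency of the strongest cycle, which is hard to read off the raw series in the time domain once several cycles overlap.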

### algorithms review : tries | take #1

July 29, 2010

http://en.wikipedia.org/wiki/Trie

- Tries represent prefix trees
- String search in O(len(key)) time, independent of the number of stored strings, is naturally obtained by adding strings to the trie
- Common case for key-value stores is “get all keys by prefix”, and a trie is a natural way to achieve this. Can this be mapped to a key-value store? Indeed: we map each tree node to an entry in a hashmap and store the keys representing child nodes in that key’s content.
- *HashTrie*: popular extension of key-value stores -> update is O(len(key)), retrieval is O(len(key)). The main problem is that we need to update key *content* in order to maintain the tree, which makes *ConcurrentHashTree* relatively complex to design
- How to iterate *HashTrie*? That would require having a master key that acts as the top-level node and contains the keys of all first-level child keys
- *SortedHashTrie*?
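
A minimal sketch of the HashTrie idea, assuming a plain Python `dict` stands in for the key-value store. The class layout and method names are hypothetical: each prefix is a key, its content holds the child characters, and the empty string plays the role of the master key.

```python
class HashTrie:
    """Trie stored in a flat dict: each prefix is a key, its content lists the child characters."""

    def __init__(self):
        # "" is the master key, i.e. the top-level node
        self.store = {"": {"children": set(), "terminal": False}}

    def insert(self, key):
        # One store update per character: O(len(key)) store operations
        for i, ch in enumerate(key):
            self.store[key[:i]]["children"].add(ch)
            self.store.setdefault(key[:i + 1], {"children": set(), "terminal": False})
        self.store[key]["terminal"] = True

    def contains(self, key):
        node = self.store.get(key)
        return node is not None and node["terminal"]

    def keys_with_prefix(self, prefix):
        """Walk the subtree under `prefix`, following child characters stored in each node."""
        if prefix not in self.store:
            return []
        found, stack = [], [prefix]
        while stack:
            current = stack.pop()
            node = self.store[current]
            if node["terminal"]:
                found.append(current)
            stack.extend(current + ch for ch in node["children"])
        return sorted(found)

trie = HashTrie()
for word in ["car", "cart", "cat", "dog"]:
    trie.insert(word)

print(trie.keys_with_prefix("ca"))  # all stored keys starting with "ca"
print(trie.keys_with_prefix(""))    # iterating from the master key lists everything
```

Every insert rewrites the content of existing prefix entries. That is the update-of-key-content problem noted above, and it is why a concurrent version needs care.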

### statistics review : marginal regression | take #0

July 29, 2010

A popular concept in high-dimensional statistics is the notion of *marginal regression*, which essentially means regression on marginal variables. By *marginal variable* we mean any variable obtained by performing arithmetic operations on row/column data in the original dataset.

### Bagel’s Park blues

July 29, 2010

Among the narrow streets of Belgrade’s downtown, just across the Terazije tunnel – a small treasure is hidden. If you’re in the mood for pesto chicken bagels, fresh orange juice and a bit of espresso – it’s the best way to start your morning routine. Free wifi, large tables and just a bit of street schizophrenia – it’s just about the spot for the tech people to bootstrap their ideas.

### morning snapshot | 29 July 2010

July 29, 2010

- http://dev.twitter.com/pages/streaming_api – Twitter Streaming API | Looking forward to the public version – amazing source for voidbase
- http://www.crunchgear.com/2010/07/28/amazon-reveals-new-kindle-139-for-wi-fi-version/ – Kindle at $139. The popular rage of early users ? Money-back to early users ? Used electronic devices market vs used cars market ?

### bootstrap on&on …

July 29, 2010

Here we go again. Blog-leapfrogging brings us to wordpress. Apparently I have just learned that blogger is obsolete – and I’m already using local wp on voidsearch blog => here we go 🙂 More random jabbering to myself & friends. Hooray !