Support for categorical variables in regression #38

@agisga

Categorical (as opposed to numeric) variables are ubiquitous in data analysis and linear regression, but they seem not to be supported by Statsample::Regression.
Here is an example of what I mean:

In R, I can do:

> head(fake.salaries)
      salary years ethnicity
1  5.0823594     9     black
2 -0.4459633     3     black
3 16.0734587     2     white
4 10.5554305     7     other
5  9.9438798     8     other
6  9.6776724     6    latino
> mod <- lm(salary ~ years + ethnicity, fake.salaries)
> summary(mod)

Call:
lm(formula = salary ~ years + ethnicity, data = fake.salaries)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5068 -1.1283 -0.3713  1.1227  3.3027 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.5421     0.9851   1.565    0.131    
years              0.1729     0.1561   1.108    0.279    
ethnicitylatino    6.7300     0.9984   6.741 5.67e-07 ***
ethnicitymexican   5.4826     0.8755   6.262 1.79e-06 ***
ethnicityother     6.6404     0.9034   7.351 1.37e-07 ***
ethnicitywhite    11.5310     0.9309  12.387 6.46e-12 ***

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.66 on 24 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8503 
F-statistic: 33.95 on 5 and 24 DF,  p-value: 3.942e-10

We see that lm treats the variable "ethnicity" as categorical and fits the model accordingly. The output shows that it takes ethnicity "black" as the base level, and that every other ethnicity has a statistically significant effect on "salary" (with p-values below 2e-6) when compared to the base level.

When I try to analyse the same data in Statsample:

pry(main)> df = Statsample::CSV.read("/home/alexej/Desktop/fake_salaries.csv")
=> #<Statsample::Dataset:69956503513460 @name=Dataset 1 @fields=[salary,years,ethnicity] cases=30
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
NoMethodError: NoMethodError
from /home/alexej/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/statsample-1.5.0/lib/statsample/vector.rb:186:in `_check_type'

So Statsample raises a "NoMethodError". When I delete "ethnicity", the model can be fit:

pry(main)> df.delete_vector("ethnicity")
=> ["ethnicity"]
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
=> #<Statsample::Regression::Multiple::RubyEngine:0x007f4008733620
pry(main)> puts mod.summary
= Multiple reggresion of years on salary
  Engine: Statsample::Regression::Multiple::RubyEngine
  Cases(listwise)=30(30)
  R=0.061
  R^2=0.004
  R^2 Adj=-0.032
  Std.Error R=4.358
  Equation=7.046 + 0.125years
  == ANOVA
    ANOVA Table
+------------+---------+----+--------+-------+-------+
|   source   |   ss    | df |   ms   |   f   |   p   |
+------------+---------+----+--------+-------+-------+
| Regression | 1.979   | 1  | 1.979  | 0.104 | 0.749 |
| Error      | 531.824 | 28 | 18.994 |       |       |
| Total      | 533.804 | 29 | 20.973 |       |       |
+------------+---------+----+--------+-------+-------+

  Beta coefficients
+----------+-------+-------+-------+-------+
|  coeff   |   b   | beta  |  se   |   t   |
+----------+-------+-------+-------+-------+
| Constant | 7.046 | -     | 2.233 | 3.155 |
| years    | 0.125 | 0.061 | 0.386 | 0.323 |
+----------+-------+-------+-------+-------+

This issue possibly allows for a common solution with SciRuby/statsample-glm#11 and SciRuby/daru#9.
