Python - subtracting two variables in a regression without creating new variable

Question

Suppose I am regressing

y = x1 + x4

where x4 = x2 - x3

In R, there is a function I() such that I don't have to create a new variable x4 in my data set, but can simply write

y = x1 + I(x2 - x3)

See details here: What does the capital letter "I" in R linear regression formula mean?

Is there a similar way to do this in Python, for instance using statsmodels.formula.api or sklearn?

score 1 · Accepted Answer · answered Aug 31 '18 at 21:54

With statsmodels.formula.api you can use vectorized functions from numpy. To apply a subtraction, you may use np.subtract():

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Four independent uniform columns; random data, for illustration only.
df = pd.DataFrame(
    np.random.uniform(0, 20, size=(100, 4)),
    columns=["y", "x1", "x2", "x3"],
)

# np.subtract(x2, x3) is evaluated inside the formula, so no x4 column is needed.
fit = smf.ols(formula="y ~ x1 + np.subtract(x2, x3)", data=df).fit()
print(fit.summary())

(The example data obviously does not make sense and yields a regression with R-squared close to zero, but it shows how the approach works.)
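As a side note, the formula language that statsmodels uses also supports an I() wrapper that behaves much like the one in R, so the expression from the question can be written almost verbatim. The snippet below is a minimal sketch on made-up random data: the seed and the names df_explicit, fit_inline and fit_explicit are only for illustration. It fits the model once with I(x2 - x3) and once with an explicitly created x4 column, then compares the coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data, seeded only so that the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(0, 20, size=(100, 4)),
    columns=["y", "x1", "x2", "x3"],
)

# I(...) makes the formula parser treat "-" as arithmetic subtraction
# instead of the formula operator that removes a term.
fit_inline = smf.ols("y ~ x1 + I(x2 - x3)", data=df).fit()

# Reference fit with the difference stored as a real column x4.
df_explicit = df.assign(x4=df["x2"] - df["x3"])
fit_explicit = smf.ols("y ~ x1 + x4", data=df_explicit).fit()

# Both fits should give the same intercept and slopes.
same = np.allclose(fit_inline.params.to_numpy(), fit_explicit.params.to_numpy())
print(fit_inline.params)
print("coefficients match:", same)
```

Without I(), a bare x2 - x3 in the formula would be read as "include x2, exclude x3" rather than as a difference, which is why either I() or a function call such as np.subtract() is needed.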
