Guide: Running a Binary Logistic Regression Model with Julius

alumpe · April 12, 2024, 3:53pm

The Data

We obtained publicly available data from the state education agency on school principals. Turnover was recorded as a binary yes or no for the period between two academic years. The data set also included school type, race/ethnicity, gender, base salary, and total educational experience in years. We treated turnover as the dependent variable and wanted to know whether the other available variables predicted turnover. Below is a screenshot of a portion of the data. There were over 1,200 school principals in the data set.

We wanted to make sure our prompts to Julius were clear and comprehensive because we have noticed that Julius will not automatically produce everything associated with a model. We asked it to check specific model assumptions. We identified the dependent and predictor variables. We asked for descriptive statistics and model fit. We also asked for an interpretation of the findings. Below are the prompts we used and the outputs. We also ran the same model in SPSS to check accuracy, and the two sets of results were very similar.

User: Run a binary logistic regression model. Check the following logistic model assumptions: 1. There must be no perfect multicollinearity among independent variables. 2. Continuous predictors are linearly related to a transformed version of the outcome (linearity). The dependent variable is job turnover. The predictor variables are school type, race/ethnicity, gender, base salary in dollars, and total experience in years. The final model report should include the following: Descriptive statistics on the outcome variable and each of the predictors. Information on which assumptions were checked and whether they were met. A statement about model significance. A statement about model fit. The model estimates include ORs and their 95% CIs. An interpretation of the findings. Show the percentage of variance explained using R squared. Show the relative contribution of each predictor variable.

Julius:

The dataset from the ‘Sheet1’ of the ‘turnover.xlsx’ file has been successfully loaded. It contains the following columns: school type, race/ethnicity, gender, base salary in dollars, total experience in years, and job turnover. Here’s a preview of the data:

Note: Julius is excellent at reading data and understanding variables.

Now, I’ll proceed with the binary logistic regression model analysis, including checking the model assumptions, descriptive statistics, model significance, fit, estimates, and interpretation of the findings. Let’s start with the descriptive statistics of the outcome variable and each of the predictors.

The descriptive statistics for the numerical variables (base salary in dollars and total experience in years) and the outcome variable (job turnover, encoded as 0 for ‘NO’ and 1 for ‘YES’) are as follows:

For the categorical variables (school type, race/ethnicity, gender), the descriptive statistics are:

Next, I’ll check the logistic model assumptions, starting with testing for multicollinearity among the independent variables and then assessing the linearity of continuous predictors with the logit of the outcome.

The Variance Inflation Factor (VIF) results for checking multicollinearity among the independent variables are as follows:

The VIF for most variables is below 10, which is generally considered acceptable, indicating no severe multicollinearity. However, the ‘Intercept’ and some categories within ‘school type’ have higher VIF values, suggesting potential multicollinearity issues within these categorical variables.
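
Note: The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. A large value on the ‘Intercept’ row is commonly disregarded, since the constant is not a predictor of interest. The sketch below is not the code Julius ran. It covers only the special case of two continuous predictors, where R² reduces to the squared Pearson correlation, and the salary and experience values are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(xs, ys):
    """VIF = 1 / (1 - R^2); with exactly two predictors, R^2 equals r^2."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values, not taken from the principals data set.
base_salary = [82000, 91000, 88000, 105000, 97000, 110000]
total_experience = [8, 15, 11, 22, 14, 25]

vif = vif_two_predictors(base_salary, total_experience)
print(f"VIF: {vif:.2f}")
```

A VIF of 1 means the two predictors are uncorrelated; the value grows without bound as the correlation approaches 1 or -1.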

When attempting to check the linearity assumption using the Box-Tidwell test, an error occurred due to missing data or infinite values in the transformed variables. This suggests that there might be zero or negative values in ‘base salary in dollars’ and ‘Total Experience in Years’ that, when logged, result in undefined values.

To proceed, I’ll need to address this issue by either removing or imputing these problematic values. Let’s first inspect the data for any zero or negative values in these columns and decide on the next steps.

There are rows with zero values in the ‘Total Experience in Years’ column, which caused the error during the linearity assumption check. Here are the problematic rows:

To proceed with the linearity assumption check, we have a few options:
- Remove these rows from the dataset.
- Impute the zero values with a suitable statistic (e.g., the mean or median of ‘Total Experience in Years’).
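
Note: The Box-Tidwell check adds an x * ln(x) term for each continuous predictor, and ln(0) is undefined, which is why rows with zero years of experience stop the test. The sketch below is a minimal illustration of that failure and of the first option (dropping the rows). The rows and the `experience` field name are hypothetical, not the real data or column names.

```python
import math

# Hypothetical rows; the field names are illustrative, not the real column names.
rows = [
    {"id": 1, "experience": 12},
    {"id": 2, "experience": 0},
    {"id": 3, "experience": 5},
    {"id": 4, "experience": 0},
]

def box_tidwell_term(x):
    """Return x * ln(x), the extra term the Box-Tidwell check adds for x."""
    return x * math.log(x)  # math.log raises ValueError when x <= 0

# Rows where the log is undefined.
problem_rows = [r for r in rows if r["experience"] <= 0]

# First option: drop those rows before building the transformed term.
kept_rows = [r for r in rows if r["experience"] > 0]
terms = [box_tidwell_term(r["experience"]) for r in kept_rows]

print("problem ids:", [r["id"] for r in problem_rows])
print("kept ids:", [r["id"] for r in kept_rows])
```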

Note: Julius found zero values that broke the log transformation and asked us how we’d like to handle them.

Please let me know how you’d like to proceed.

User: Would you like to proceed with removing the rows with zero values in ‘Total Experience in Years’ or imputing the zero values with a suitable statistic?

Julius: Let’s proceed by removing the rows with zero values in ‘Total Experience in Years’ to maintain the integrity of the data for the logistic regression analysis. This approach will allow us to accurately assess the linearity assumption without artificially altering the data distribution. I’ll remove these rows and then re-attempt the linearity assumption check.

The linearity assumption check using the Box-Tidwell test has been successfully completed after removing rows with zero values in ‘Total Experience in Years’. Here are the results of the logistic regression model:

- The model includes the log-transformed ‘base salary in dollars’ and ‘Total Experience in Years’ as predictors.
- Both predictors are statistically significant, as indicated by their p-values (P<0.05).
- The coefficient for ‘base salary in dollars log’ is positive, suggesting that as the base salary increases, the likelihood of job turnover increases.
- The coefficient for ‘Total Experience in Years log’ is negative, indicating that as total experience increases, the likelihood of job turnover decreases.

Next, I’ll proceed with the full logistic regression model, including all predictors, to evaluate model significance, fit, estimates (ORs and their 95% CIs), and the relative contribution of each predictor variable.

The full logistic regression model analysis has been completed. Here’s a summary of the model results:

- The model includes predictors such as school type, race/ethnicity, gender, log-transformed base salary, and log-transformed total experience.
- The model’s pseudo R-squared value is 0.04257, indicating a modest fit.
- The log-transformed base salary and total experience are statistically significant predictors of job turnover, with p-values less than 0.05.
- The coefficients for other variables, including school type, race/ethnicity, and gender, are not statistically significant at the 0.05 level.
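
Note: The output does not say which pseudo R-squared was reported. If it is McFadden's (an assumption on our part), it is 1 - LL_model / LL_null, where LL_null is the log-likelihood of an intercept-only model. It is a measure of improvement over that baseline rather than a literal percentage of variance explained. The sketch below uses made-up outcomes and fitted probabilities, not the values from our model.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )

def mcfadden_r2(y_true, p_pred):
    """McFadden's pseudo R^2 = 1 - LL_model / LL_null."""
    # The intercept-only model predicts the overall turnover rate for everyone.
    base_rate = sum(y_true) / len(y_true)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    ll_model = log_likelihood(y_true, p_pred)
    return 1.0 - ll_model / ll_null

# Hypothetical turnover outcomes (1 = YES) and fitted probabilities.
turnover = [0, 0, 1, 0, 1, 0, 0, 1]
fitted = [0.20, 0.30, 0.55, 0.25, 0.60, 0.35, 0.30, 0.45]

print(f"McFadden pseudo R-squared: {mcfadden_r2(turnover, fitted):.3f}")
```

A model that predicts no better than the overall turnover rate scores 0 on this measure.
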
The Odds Ratios (ORs) and their 95% Confidence Intervals (CIs) for each predictor are as follows:

- The OR for log-transformed base salary is 0.337, indicating that for each unit increase in the log-transformed base salary, the odds of job turnover decrease by approximately 66.3%.
- The OR for log-transformed total experience is 0.429, suggesting that for each unit increase in log-transformed total experience, the odds of job turnover decrease by about 57.1%.
- The ORs for school type, race/ethnicity, and gender indicate their relative effects on the likelihood of job turnover, though these effects are not statistically significant.
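
Note: The percentages above follow from (OR - 1) × 100. A one-unit increase in a log-transformed predictor is a large multiplicative change in the original variable (a factor of about 2.72 if the natural log was used, which the output does not state), so it can help to restate the effect for a smaller change such as a 10% raise. The sketch below does both; the function names are ours, and the natural-log assumption is labeled in the code.

```python
import math

def percent_change_in_odds(odds_ratio):
    """Convert an odds ratio into a percent change in the odds."""
    return (odds_ratio - 1.0) * 100.0

def odds_ratio_for_relative_increase(odds_ratio_per_log_unit, relative_increase):
    """Odds ratio implied by raising the original predictor by a fraction.

    Assumes the predictor entered the model as a natural log, so a
    10% increase shifts the log predictor by ln(1.10).
    """
    return odds_ratio_per_log_unit ** math.log(1.0 + relative_increase)

# Odds ratios reported in the output above.
or_log_salary = 0.337
or_log_experience = 0.429

print(round(percent_change_in_odds(or_log_salary), 1))      # -66.3
print(round(percent_change_in_odds(or_log_experience), 1))  # -57.1

# Assumption: natural log. Odds ratio for a 10% higher base salary.
or_salary_10pct = odds_ratio_for_relative_increase(or_log_salary, 0.10)
print(round(or_salary_10pct, 3))
```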

This analysis provides insights into the factors influencing job turnover, highlighting the importance of salary and experience. If you have any further questions or need additional analysis, please let me know.

Note: The model showed two significant predictors of turnover, and Julius provided an interpretation of the direction and size of each relationship.

zach · April 12, 2024, 9:46pm

Great guide! This is really useful. I did have one question: you said you ran the same model in SPSS — are you able to compare those workflows? In what ways do you find Julius more or less valuable than using SPSS?

alumpe · April 15, 2024, 3:50pm

Thanks! In SPSS, the workflow is a bit different in terms of uploading data and choosing the model parameters, as SPSS has pull-down menus for tests. I found Julius more intuitive in that I could give it commands in everyday language and the bot interprets what I’m asking. I’m not sure I’d call it more valuable, but it is more natural. I did need to get our ducks in a row before using either SPSS or Julius: I had to know what kinds of tests to run, which model assumptions to check, etc. One of the main differences is that SPSS is very expensive and out of reach for many. Another feature of Julius that may be beneficial is that you can upload data and ask it for the best tests to run and it will offer suggestions. This is one thing I’d like to test more in the future.
