In this analysis, we look for relationships between collisions and several factors. The number of accidents is the outcome (dependent variable), and the exposures of interest listed below are the independent variables. We fit a multiple linear regression to test whether the association between the outcome and each predictor is significant.
collisionnumber: number of accidents in a specific borough, date and time
daytime: According to the analysis in the overview, accidents and injuries are more frequent between 8:00 and 20:00 than during the rest of the day. We therefore split the 24 hours into two categories: 1 represents 8:00-20:00 (12 hours) and 0 represents the remaining 12 hours.
borough: There are five different boroughs, Bronx, Brooklyn, Manhattan, Queens and Staten Island.
num_light: number of unfinished street light services
num_signal: number of unfinished traffic signal services
weathertype: weather type (fog, haze, mist, rain, snow and sunny)
prep: precipitation
vehicle: There are six vehicle categories (passenger vehicle, sedan, sport utility vehicle, truck, taxi and others). We record the category with the greatest number of vehicles at each specific time, borough and date.
holiday: 1 represents a holiday or weekend day, 0 represents any other day.
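The two indicator variables above could be derived roughly as follows. This is only a minimal sketch: the toy data frame and the columns `hour` and `is_holiday` are hypothetical, and treating 20:00 itself as night is an assumption, since the original preprocessing code is not shown here.

```r
# Toy input; `hour`, `date` and `is_holiday` are hypothetical column names.
toy = data.frame(
  date = as.Date(c("2020-01-01", "2020-01-04", "2020-01-06", "2020-01-06")),
  hour = c(9, 23, 7, 20),
  is_holiday = c(TRUE, FALSE, FALSE, FALSE)
)

# daytime: 1 for 8:00-19:59, 0 otherwise (12 hours each; boundary is an assumption)
toy$daytime = as.integer(toy$hour >= 8 & toy$hour < 20)

# holiday: 1 for public holidays and weekends, 0 otherwise.
# format(date, "%u") gives the ISO weekday: "1" = Monday, ..., "7" = Sunday.
is_weekend = format(toy$date, "%u") %in% c("6", "7")
toy$holiday = as.integer(toy$is_holiday | is_weekend)
```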
cor_data =
  cor(cbind(
    collisionnumber = pull(boro_daytime_weather_light_vt_hol, collisionnumber),
    # expand factors into indicator columns and drop the intercept column
    model.matrix(collisionnumber ~ borough + daytime + weathertype + num_light + num_signal + vehicle + holiday + prep, boro_daytime_weather_light_vt_hol)[, -1]
  ))
cor_data %>%
  corrplot(method = "color", addCoef.col = "black", tl.col = "black", tl.srt = 45, insig = "blank", number.cex = 0.7, diag = FALSE)
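To illustrate what `model.matrix(...)[, -1]` does in the code above, here is a small sketch on made-up data (the data frame `toy` is hypothetical). Each factor is expanded into indicator columns with its first level as the reference, and dropping the first column removes the intercept.

```r
# Hypothetical toy data with one factor and one numeric predictor.
toy = data.frame(
  y = c(3, 5, 2, 8),
  borough = factor(c("BRONX", "BROOKLYN", "QUEENS", "BRONX")),
  daytime = c(1, 0, 1, 0)
)

# Design matrix: intercept, one indicator per non-reference level, numeric column.
mm = model.matrix(y ~ borough + daytime, toy)

# Drop the intercept column so only predictors enter the correlation matrix.
mm_no_int = mm[, -1]
colnames(mm_no_int)
```

BRONX does not get its own column because it is the reference level, which is why it is absent from the correlation plot.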
In the correlation plot, the correlation between most pairs of variables is acceptable, but the correlation between the weathertype sunny and weathertype rain indicators has a magnitude above 0.7, which indicates collinearity. When building the model, we should keep in mind that weathertype (sunny) carries most of the information contained in weathertype (rain).
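Reading strongly correlated pairs off a plot is error-prone when there are many columns. The sketch below shows one possible way to list them programmatically. It uses simulated data rather than `cor_data`, and the 0.7 cutoff simply mirrors the threshold used in the text.

```r
# Simulated predictors; x2 is built to be strongly (negatively) correlated with x1.
set.seed(1)
x1 = rnorm(200)
x2 = -x1 + rnorm(200, sd = 0.3)
x3 = rnorm(200)
cor_toy = cor(cbind(x1, x2, x3))

# Keep each pair once (upper triangle) and flag |r| > 0.7.
high = which(abs(cor_toy) > 0.7 & upper.tri(cor_toy), arr.ind = TRUE)
high_pairs = data.frame(
  var1 = rownames(cor_toy)[high[, "row"]],
  var2 = colnames(cor_toy)[high[, "col"]],
  r = cor_toy[high]
)
high_pairs
```

The same filter could be applied to `cor_data`, since it is also a plain correlation matrix.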
fit2 = lm(collisionnumber ~ borough + factor(daytime) + weathertype + num_light + num_signal + factor(prep) + vehicle + holiday, data = boro_daytime_weather_light_vt_hol)
MASS::boxcox(fit2)
The Box-Cox method transforms Y by raising it to a power λ. In the plot above, the λ that maximizes the log-likelihood is close to 0, so we apply a natural logarithm transformation and replace Y with ln(Y).
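For reference, the standard one-parameter Box-Cox family is written out below. It is defined for Y > 0, and the case λ = 0 is the limit of the power form, which is why a λ near 0 leads to the log transformation.

```latex
Y^{(\lambda)} =
\begin{cases}
\dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\ln Y, & \lambda = 0.
\end{cases}
```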
boro_daytime_weather_light_vt_hol = boro_daytime_weather_light_vt_hol %>%
  mutate(ln_collisionnumber = log(collisionnumber))  # log() is the natural logarithm by default
fit2 = lm(ln_collisionnumber ~ borough + factor(daytime) + weathertype + num_light + num_signal + factor(prep) + vehicle + holiday, data = boro_daytime_weather_light_vt_hol)
summary(fit2) %>%
broom::tidy() %>%
knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.1069024 | 0.1246978 | 16.8960648 | 0.0000000 |
boroughBROOKLYN | 0.7095696 | 0.0237988 | 29.8153287 | 0.0000000 |
boroughMANHATTAN | 0.3125737 | 0.0210182 | 14.8715425 | 0.0000000 |
boroughQUEENS | 0.5748078 | 0.0403652 | 14.2401684 | 0.0000000 |
boroughSTATEN ISLAND | -1.4204219 | 0.0194501 | -73.0292005 | 0.0000000 |
factor(daytime)1 | 1.2576627 | 0.0108625 | 115.7805056 | 0.0000000 |
weathertypehaze | 0.1887726 | 0.1089203 | 1.7331251 | 0.0831588 |
weathertypemist | 0.0908404 | 0.1062359 | 0.8550817 | 0.3925626 |
weathertyperain | 0.1375273 | 0.1028581 | 1.3370581 | 0.1812879 |
weathertypesnow | 0.1427384 | 0.1055153 | 1.3527743 | 0.1762124 |
weathertypesunny | 0.1293622 | 0.1030361 | 1.2555038 | 0.2093771 |
num_light | 0.0000613 | 0.0000634 | 0.9673223 | 0.3334477 |
num_signal | 0.0000139 | 0.0000232 | 0.5984272 | 0.5495925 |
factor(prep)1 | 0.0367187 | 0.0155665 | 2.3588220 | 0.0183860 |
vehiclepassenger vehicle | 0.2480838 | 0.0665297 | 3.7289178 | 0.0001953 |
vehiclesedan | 0.3523761 | 0.0660694 | 5.3334211 | 0.0000001 |
vehiclesport utility vehicle | 0.2814519 | 0.0667477 | 4.2166519 | 0.0000254 |
vehicletaxi | 0.4331257 | 0.0833836 | 5.1943774 | 0.0000002 |
vehicletruck | -0.5312716 | 0.1587091 | -3.3474554 | 0.0008239 |
holiday1 | -0.2011300 | 0.0229079 | -8.7799364 | 0.0000000 |
The table above shows the results of the multiple linear regression. The effect of borough is significant: with the Bronx as the reference level, the p-values of all four indicator variables are far below 0.05. Brooklyn, Manhattan and Queens have significantly more collisions than the Bronx, while Staten Island has significantly fewer.
The p-value of “factor(daytime)1” is also below 0.05, so this indicator variable is significant: there are more accidents during the day than at night.
The p-values of num_light and num_signal are above 0.05 (0.333 and 0.550), so neither effect is significant. Street lights and traffic signals are commonly believed to reduce fatal road crashes, but in this model the number of unfinished street light and traffic signal services shows no statistically significant association with the number of collisions.
The p-value of “factor(prep)1” is smaller than 0.05 (0.018), so precipitation is associated with a significantly higher number of accidents. However, none of the weathertype indicators is significant at the 0.05 level.
The effect of “vehicle” is significant, with “others” as the reference level. All five indicator variables, “vehiclepassenger vehicle”, “vehiclesedan”, “vehiclesport utility vehicle”, “vehicletaxi” and “vehicletruck”, are significant. Because vehicle records the most common vehicle category at a given time, borough and date, the coefficients describe those periods rather than individual drivers: periods dominated by passenger vehicles, sedans, sport utility vehicles or taxis have more collisions than the reference, while periods dominated by trucks have fewer.
Finally, the effect of “holiday1” is also significant, and its coefficient is negative: there are fewer accidents on holidays and weekends.
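Because the outcome is ln(collisionnumber), each coefficient acts multiplicatively on the original scale. The sketch below back-transforms three estimates copied from the table above. The short names `daytime`, `prep` and `holiday` are labels chosen here for readability, and reading exp(estimate) as a ratio of typical (geometric mean) counts is the usual interpretation for a log-linear model rather than something stated in the original analysis.

```r
# Estimates copied from the coefficient table above.
est = c(daytime = 1.2576627, prep = 0.0367187, holiday = -0.2011300)

ratio = exp(est)                # multiplicative effect on the collision count
pct_change = 100 * (ratio - 1)  # the same effect as a percentage change
round(cbind(ratio, pct_change), 2)
```

Under these assumptions, daytime periods have roughly 3.5 times as many collisions as night periods, precipitation goes with about 4% more collisions, and holidays and weekends with about 18% fewer, holding the other predictors fixed.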
par(mfrow = c(2,2))
plot(fit2)
The four diagnostic plots above support the model assumptions. The variance of the residuals does not change noticeably as the fitted values change, which indicates homoscedasticity, and the residuals are centered on 0; although they fluctuate slightly around zero, this is acceptable. The residuals are also approximately normal, and no influential observations appear in the dataset. We therefore consider all assumptions to be satisfied.
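The visual checks can be complemented with a few numeric summaries. The sketch below runs on simulated data, not on `fit2`: the model `toy_fit` is hypothetical, and the 4/n cutoff for Cook's distance is only a common rule of thumb. Note that `shapiro.test()` accepts between 3 and 5000 observations, so a larger dataset would need a subsample or a different test.

```r
# Simulated data standing in for the real model; `toy_fit` is hypothetical.
set.seed(42)
n = 300
x = rnorm(n)
y = 1 + 0.5 * x + rnorm(n, sd = 0.4)
toy_fit = lm(y ~ x)

res = resid(toy_fit)

# Normality of residuals: Shapiro-Wilk test (requires 3 <= n <= 5000).
normality_p = shapiro.test(res)$p.value

# Influence: Cook's distance, flagged against the 4/n rule of thumb.
cooks = cooks.distance(toy_fit)
n_flagged = sum(cooks > 4 / n)

# With an intercept in the model, OLS residuals average to zero by construction,
# so a zero mean alone does not validate the model.
mean_res = mean(res)
```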