Anova – Data Analytics

Abstract

The field of data analytics enables decision makers to better understand data, detect trends, explore relationships and derive insights from historical data that support efficient decisions in the future. Data analytics is the process of analyzing a specific data set, which involves various manipulations of the selected data to draw meaningful conclusions.

The aim of this paper is to present some of the more important statistical techniques and highlight their uses based on two case studies in accidents and crime data.

Introduction

Statistics applies mathematical disciplines to the analysis of data. Most domains use data analysis to plan, analyze and interpret experimental work or historical data, especially for critical topics like crime and accidents. Various existing studies have explored this type of data and looked at the associations between variables to aid the interpretation and presentation of results. Using the appropriate statistical method is essential to assess and make inferences in these critical topics, for example the evaluation of the variation of traffic volume and number of accidents per hour, or the evaluation of the impact of road characteristics on accident severity.

Various statistical techniques and methods have been proposed in the data analysis field. Descriptive statistics and inferential statistics are the two main branches of statistics employed in the scientific analysis of data, and both are equally important for researchers.

Descriptive statistics summarize data from a population sample using indicators like the mean, median, minimum and maximum values, standard deviation and/or coefficient of variation. This type of statistics deals with the capture and presentation of data and often constitutes the first part of a statistical analysis. In this step, the statistician needs to take care in designing experiments, choosing the right focus group and avoiding biases.

The mean, or more precisely the arithmetic mean, is simply the arithmetic average of a group of numbers (or data set) and is denoted by a bar over the variable name. So the mean of the variable $x$ is $\bar{x}$, pronounced “x-bar”. It is calculated by adding up all of the values in a data set and dividing by the number of values in that data set.
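The indicators listed above can be computed directly in base R. The following is a minimal sketch on a small hypothetical sample (the vector `x` is invented for illustration and is not taken from the case-study data):

```r
# Hypothetical sample: number of vehicles involved in ten accidents
x <- c(2, 1, 2, 3, 2, 1, 2, 4, 2, 1)

x_bar <- sum(x) / length(x)  # arithmetic mean: sum of values / number of values
s     <- sd(x)               # sample standard deviation
cv    <- s / x_bar           # coefficient of variation

stats <- c(mean = x_bar, median = median(x), min = min(x),
           max = max(x), sd = s, cv = cv)
print(round(stats, 3))
```

The manual computation `sum(x) / length(x)` gives the same value as the built-in `mean(x)`.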

Inferential statistics, in turn, allow us to draw conclusions about a population from the descriptive statistics computed on a sample. Among the main techniques used are hypothesis testing and ANOVA.

Two characteristics of data sets must be considered prior to the application of any inferential test: (1) Do the data conform to the principles of normality, i.e. are they normally distributed (ND)? (2) Do the data satisfy the assumption of homoscedasticity, i.e. uniformity of variance?
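One possible way to check these two assumptions in base R is sketched below. The choice of the Shapiro-Wilk test for normality and Bartlett's test for homogeneity of variance is an assumption made here for illustration (the paper does not name specific tests), and the data are simulated:

```r
set.seed(1)  # reproducible simulated data
g <- factor(rep(c("A", "B", "C"), each = 30))  # three hypothetical groups
y <- rnorm(90, mean = 10, sd = 2)              # hypothetical measurements

# 1. Normality: Shapiro-Wilk test within each group
normality_p <- tapply(y, g, function(v) shapiro.test(v)$p.value)

# 2. Homoscedasticity: Bartlett's test of equal variances across groups
variance_p <- bartlett.test(y ~ g)$p.value

print(normality_p)
print(variance_p)
```

In both tests a small p-value is evidence against the assumption (normality or equal variances respectively), so large p-values are what a parametric test such as ANOVA hopes to see.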

Most predictions of the future and generalizations of results to the overall population from a smaller sample come under the umbrella of inferential statistics. By designing the right experiment and using the right techniques, the researcher is able to draw conclusions relevant to the study.

Hypothesis testing. A statistical hypothesis is an assumption made by the researcher about the population from which the data for an experiment are collected. This assumption can be true or false. Hypothesis testing is the formal way of evaluating the hypothesis made by the researcher and deciding whether it is rejected or not.

Validating a hypothesis with certainty would require examining the entire population, which is not possible in practice. Instead, we draw random samples from the population and, on the basis of the test results on the sample data, either reject or fail to reject the hypothesis.

Statistical Hypothesis can be categorized into 2 types as below:

Null Hypothesis – Hypothesis tests are used to test the validity of a claim that is made about a population. This claim that is on trial, in essence, is called the null hypothesis. The null hypothesis is denoted by H0.
Alternative Hypothesis – The alternative hypothesis is the one you would believe if the null hypothesis is concluded to be untrue. The evidence in the trial is your data and the statistics that go along with it. The alternative hypothesis is denoted by H₁ or H_a.

Hypothesis testing uses a p-value to evaluate the strength of the evidence. The p-value is a number between 0 and 1 and is interpreted in the following way:

A small p-value (typically ≤0.05) indicates strong evidence against the null hypothesis, so you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.
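The decision rule above can be sketched in R as follows. The two samples are simulated, and the names `a`, `b` and `alpha` are hypothetical; this only illustrates how a p-value is compared with the cutoff:

```r
set.seed(42)  # reproducible simulated data
a <- rnorm(50, mean = 5.0, sd = 1)  # hypothetical sample from group A
b <- rnorm(50, mean = 5.8, sd = 1)  # hypothetical sample from group B

alpha <- 0.05          # significance cutoff
res <- t.test(a, b)    # two-sample t-test (Welch version by default)

# Compare the p-value with the cutoff to reach a decision about H0
decision <- if (res$p.value <= alpha) "reject H0" else "fail to reject H0"
cat("p-value:", format.pval(res$p.value), "->", decision, "\n")
```

Note that the outcome is always phrased as "reject" or "fail to reject" the null hypothesis; a large p-value does not prove that the null hypothesis is true.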

The Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a parametric statistical technique that partitions the observed variance into components that arise from different sources of variation. In its simplest form, the ANOVA technique provides a statistical test of whether or not the means of several groups are all equal. In this sense, the null hypothesis, H0, says there are no differences among results from different treatments or sample sets; the alternative hypothesis, Ha, is that the results do differ. If the null hypothesis is rejected then the alternative hypothesis, Ha, is accepted, i.e. at least one set of results differs from the others.

The ANOVA technique is generally used to compare the mean values of three or more data sets.
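To make the idea of partitioning the variance concrete, the sketch below computes the one-way ANOVA F statistic by hand on simulated data and compares it with the value returned by `aov`. The group labels, sample sizes and means are hypothetical:

```r
set.seed(7)  # reproducible simulated data
group <- factor(rep(c("g1", "g2", "g3"), each = 20))
y <- rnorm(60, mean = c(10, 10.5, 12)[as.integer(group)], sd = 1.5)

grand_mean  <- mean(y)
group_means <- tapply(y, group, mean)
n_per_group <- tapply(y, group, length)

# Partition the total variation into between-group and within-group parts
ss_between <- sum(n_per_group * (group_means - grand_mean)^2)
ss_within  <- sum((y - group_means[as.character(group)])^2)

df_between <- nlevels(group) - 1
df_within  <- length(y) - nlevels(group)

# F = between-group mean square / within-group mean square
f_manual <- (ss_between / df_between) / (ss_within / df_within)

# Same statistic from R's built-in one-way ANOVA
fit   <- aov(y ~ group)
f_aov <- summary(fit)[[1]][["F value"]][1]
print(c(manual = f_manual, aov = f_aov))
```

The between-group and within-group sums of squares add up to the total sum of squares, which is the partition that the technique is named after.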

This study focuses on the three statistical methods noted above, which are frequently used in this field, and applies them in two different case studies.

The purpose of this document is to present and apply these three statistical techniques, then discuss the outcome of applying them to our datasets.

The remainder of this paper presents two case studies. For each case study, it presents the dataset structure, applies at least two statistical techniques to it, and discusses the results of these techniques and whether one is better supported than the others.

1. Case study 1 : UK accidents data

A road accident is the most unwanted thing that can happen to a road user. Speeding, red-light jumping and neglecting safety gear like seat belts and helmets are some of the common behaviors that lead to fatal accidents and increase the severity of an accident.

The United States is one of the busiest countries in terms of road traffic. The level of traffic is one of the reasons for the high number of traffic accidents: in 2015, some 6.3 million fatal, injury, and property damage crashes occurred in the U.S. alone.

The UK government amassed traffic data between 2000 and 2016, recording over 1.6 million accidents in the process and making this one of the most comprehensive traffic data sets available. It is a huge picture of a country undergoing change.

Data analysis of accident data is the process of analyzing accident data sets to derive useful conclusions and/or information, based on the approach presented previously. We propose to analyze the impact of junction control type on the number of vehicles involved in an accident, using the explanatory variable Junction_Control, and to check whether the mean differs between junction control types. This case study uses three statistical techniques: mean analysis, hypothesis testing and analysis of variance (ANOVA).

The purpose of this case study is to check whether there is a statistically significant difference in the variable “number of vehicles” between two “junction control” groups, “Giveway or uncontrolled” and “Stop Sign”, in the accidents data.

Null Hypothesis (H0) : Junction control of type “Giveway or uncontrolled” has the same impact as junction control of type “Stop Sign”
Alternative Hypothesis : Junction control of type “Giveway or uncontrolled” does not have the same impact as junction control of type “Stop Sign”

1.1. Data Preparation and Cleaning

All the accident data come from police reports, so the data do not include minor incidents. The file “UK road accidents from 2012 to 2014” contains data on UK accidents from 2012 to 2014.

Figure 1 : snapshot of accident data

Data variables:

- Unnamed: 0
- Accident_Index
- Location_Easting_OSGR
- Location_Northing_OSGR
- Longitude
- Latitude
- Police_Force
- Accident_Severity
- Number_of_Vehicles
- Number_of_Casualties
- Date
- Day_of_Week
- Time
- Local_Authority_(District)
- Local_Authority_(Highway)
- 1st_Road_Class
- 1st_Road_Number
- Road_Type
- Speed_limit
- Junction_Detail
- Junction_Control
- 2nd_Road_Class
- 2nd_Road_Number
- Pedestrian_Crossing-Human_Control
- Pedestrian_Crossing-Physical_Facilities
- Light_Conditions
- Weather_Conditions
- Road_Surface_Conditions
- Special_Conditions_at_Site
- Carriageway_Hazards
- Urban_or_Rural_Area
- Did_Police_Officer_Attend_Scene_of_Accident
- LSOA_of_Accident_Location
- Year

The data analysis is conducted using the R programming language. The main R libraries used for this analysis are ggplot2, readr and data.table.

After loading the main libraries, the R script checks the main statistics of each variable using the summary function. The results show that some variables and rows contain missing data, which requires a cleaning operation that removes the rows with NA values.
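The inspection and cleaning step might look like the following base R sketch. The data frame `accidents_toy` is a tiny hypothetical stand-in for the real accidents table, and treating empty strings as missing is an assumption made for illustration:

```r
# Hypothetical miniature of the accidents data (not the real file)
accidents_toy <- data.frame(
  Number_of_Vehicles = c(2, 1, NA, 3, 2, 1),
  Junction_Control   = c("Stop Sign", NA, "Giveway or uncontrolled",
                         "Giveway or uncontrolled", "", "Stop Sign"),
  stringsAsFactors = FALSE
)

# Assumption: empty strings are also treated as missing values
accidents_toy$Junction_Control[accidents_toy$Junction_Control %in% ""] <- NA

# Main statistics of each variable, and number of missing values per column
print(summary(accidents_toy))
print(colSums(is.na(accidents_toy)))

# Keep only the rows without any NA value
accidents_clean <- accidents_toy[complete.cases(accidents_toy), ]
print(nrow(accidents_clean))
```

On a data.table such as the one used in this study, `na.omit()` applied to the whole table would be an analogous way to drop incomplete rows.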

1.2. Exploratory data analysis

Junction_Control is a categorical variable with 4 categories and 286087 non-missing values (the sum of the four category counts), as presented in the table below :

summary(accidents1214$Junction_Control)
Junction control	Giveway or uncontrolled	Automatic traffic signal	Stop Sign	Authorised person	NA
Number	232915	50208	2287	677	286087
Percentage	40.7%	8.8%	0.4%	0.1%	50.0%

The 286087 NA values will be removed from the data in the next sections. The “Giveway or uncontrolled” group has the highest number of accidents (40.7%), while the “Stop Sign” group accounts for only 0.4% and the “Authorised person” group for 0.1%.
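The percentages in the table can be reproduced from the counts. The sketch below re-enters the counts reported above by hand; the vector name `counts` is hypothetical:

```r
# Counts copied from the summary table above
counts <- c("Giveway or uncontrolled"  = 232915,
            "Automatic traffic signal" = 50208,
            "Stop Sign"                = 2287,
            "Authorised person"        = 677,
            "NA"                       = 286087)

# Share of each category in the full data set, in percent
pct <- round(100 * counts / sum(counts), 1)
print(data.frame(Number = counts, Percentage = pct))
```

With the raw column available, `prop.table(table(x, useNA = "ifany"))` would give the same shares directly, where `x` stands for the Junction_Control column.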

Figure 2 : box plot of number of vehicles by Junction control groups

Figure 2 presents the side-by-side box plot of number of vehicles by junction control. It summarizes the data set using important statistics (first quartile, median, third quartile and upper whisker) for comparing data across groups. The figure illustrates the distribution of the number of vehicles in each group and shows that the “Giveway or uncontrolled” group reaches a larger number of impacted vehicles (a maximum of 11 vehicles) than the other groups. The “Authorised person” group has a maximum of 5 impacted vehicles.

Figure 3 : Number of vehicles histogram

Figure 3 presents the histogram of the number of vehicles, which illustrates the distribution of the variable Number_of_Vehicles. Most accidents involve 2 vehicles (198105 accidents), followed by 1 vehicle (65251 accidents). We also note the presence of some outliers (59 accidents), including an accident with 11 vehicles in the “Giveway or uncontrolled” group.

1.3. Descriptive statistics using Mean

The mean, or more precisely the arithmetic mean, is simply the arithmetic average of a group of numbers (or data set) and is denoted by a bar over the variable name. So the mean of the variable $x$ is $\bar{x}$, pronounced “x-bar”. It is calculated by adding up all of the values in a data set and dividing by the number of values in that data set.

We propose to analyze the mean of Number_of_Vehicles using the independent variable Junction_control and check if there is a relation in mean difference between junction control types as presented in the table below.

# Calculate the mean Number_of_Vehicles for each junction control type
meanJunction_Control2 = accidents1214[, mean(Number_of_Vehicles), by=Junction_Control]
meanJunction_Control2
# Create a bar plot of the mean Number_of_Vehicles
barplot(meanJunction_Control2$V1, ylim=c(0,3), names.arg=meanJunction_Control2$Junction_Control, main="Number_of_Vehicles for Different Junction_Control", xlab="Junction_Control", ylab="Average Number_of_Vehicles", xpd=FALSE)

Figure 4 : Number of vehicles per Junction control

Table 1 : mean and standard deviation of number of vehicles

Estimate parameter	NA	Authorised person	Automatic traffic signal	Giveway or uncontrolled	Stop Sign
Mean	1.762813	1.748892	1.857214	1.870863	1.953214
Standard deviation	0.8570673	0.6404876	0.5964850	0.5938531	0.5324114

From Figure 4 and Table 1, the calculated means of “number of vehicles” are 1.87 and 1.95 for the “Giveway or uncontrolled” group and the “Stop Sign” group respectively. The “Stop Sign” mean is slightly higher than the “Giveway or uncontrolled” mean. Even though the maximum number of impacted vehicles is highest in the “Giveway or uncontrolled” group, the mean of the “Stop Sign” group is the highest.

1.4. Hypothesis Testing

Hypothesis testing is the formal way of evaluating the hypothesis defined above and deciding whether it is rejected or not. Hypothesis testing uses a p-value to evaluate the strength of the evidence.

This section’s purpose is to check whether there is a statistically significant difference in number of vehicles between junction control of type “Giveway or uncontrolled” and junction control of type “Stop Sign” in the accidents data :

Null Hypothesis (H0) : Junction control of type “Giveway or uncontrolled” has the same impact as junction control of type “Stop Sign”
Alternative Hypothesis : Junction control of type “Giveway or uncontrolled” does not have the same impact as junction control of type “Stop Sign”

Group	Giveway or uncontrolled	Stop Sign
Number	232915	2287
Percentage	40.7%	0.4%

Apply Student’s t-test in R using accidents data

The Student’s t-test is a method for comparing two samples. It can be implemented to determine whether the samples are different. This is a parametric test, and the data should be normally distributed. R can handle the various versions of t-test using the t.test() command. The test can be used to deal with two- and one-sample tests as well as paired tests.

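```{r}
# Using the Student's t-test in R
# Keep only the two junction control types being compared
accidents1214_2types = accidents1214[Junction_Control == 'Giveway or uncontrolled' | Junction_Control == 'Stop Sign']

# Welch two-sample t-test (the default of t.test) of Number_of_Vehicles by Junction_Control
t.test(accidents1214_2types$Number_of_Vehicles ~ accidents1214_2types$Junction_Control)
```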

Results :

Welch Two Sample t-test

data: accidents1214_2types$Number_of_Vehicles by accidents1214_2types$Junction_Control
t = -7.3522, df = 2342.2, p-value = 2.681e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.10431566 -0.06038645
sample estimates:
mean in group Giveway or uncontrolled	mean in group Stop Sign
1.870863	1.953214

The p-value of the t-test for the comparison of mean number of vehicles between “Giveway or uncontrolled” and “Stop Sign” is less than the specified α-level (0.05). Hence the mean number of impacted vehicles differs significantly between “Giveway or uncontrolled” and “Stop Sign”.
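As a cross-check, the Welch statistic can be recomputed from the summary statistics alone. The sketch below uses the means and standard deviations from Table 1 and the group sizes reported above; the variable names are hypothetical:

```r
# Summary statistics reported in this case study
m1 <- 1.870863; s1 <- 0.5938531; n1 <- 232915  # Giveway or uncontrolled
m2 <- 1.953214; s2 <- 0.5324114; n2 <- 2287    # Stop Sign

v1 <- s1^2 / n1          # squared standard error of the first mean
v2 <- s2^2 / n2          # squared standard error of the second mean
se <- sqrt(v1 + v2)      # standard error of the difference in means

t_stat <- (m1 - m2) / se
# Welch-Satterthwaite approximation of the degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

p_value <- 2 * pt(-abs(t_stat), df)                 # two-sided p-value
ci <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se     # 95% confidence interval

print(c(t = t_stat, df = df, p = p_value))
print(ci)
```

The recomputed t statistic, degrees of freedom and confidence interval agree with the `t.test` output shown in the results, up to the rounding of the inputs.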

1.5. ANOVA

This section’s purpose is to check, using the ANOVA technique, whether there is a statistically significant difference in number of vehicles between junction control types, in particular between “Giveway or uncontrolled” and “Stop Sign”, in the accidents data.

The p-value of the analysis of variance for the number of vehicles across junction control types is less than our significance level α (0.05). The analysis therefore concludes that the type of junction control has a significant effect on the number of impacted vehicles.

# Fit a one-way ANOVA of Number_of_Vehicles on Junction_Control
l <- aov(Number_of_Vehicles ~ Junction_Control, data = accidents1214)
# ANOVA table (F statistic and p-value)
summary(l)
# Effect of each group relative to the overall mean
model.tables(l)
# Diagnostic plots for the fitted model
plot(l)

Results :

                     Df Sum Sq Mean Sq F value Pr(>F)
Junction_Control      3     34  11.254    31.9 <2e-16 ***
Residuals        286083 100928   0.353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Tables of effects

Junction_Control
    Authorised person Automatic traffic signal Giveway or uncontrolled Stop Sign
              -0.1199               -1.162e-02               2.026e-03 8.438e-02
rep          677.0000                5.021e+04               2.329e+05 2.287e+03

The p-value (< 2e-16) is lower than 0.05. The result of the global comparison is thus: the effect of Junction_Control is significant at the 0.001 threshold (F = 31.9).
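The F value can be verified from the other entries of the ANOVA table. The sketch below re-enters the printed (rounded) values by hand, so the result is only expected to match to the displayed precision:

```r
# Values copied from the ANOVA table above (rounded as printed)
ms_between <- 11.254               # Mean Sq for Junction_Control
df_between <- 3
ss_within  <- 100928               # Sum Sq for Residuals
df_within  <- 286083

ms_within <- ss_within / df_within      # residual mean square, about 0.353
f_value   <- ms_between / ms_within     # F = MS between / MS within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
print(c(F = f_value, p = p_value))
```

The ratio of the two mean squares reproduces the reported F value of about 31.9, and the corresponding p-value is far below 0.001.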

The model.tables(l) function returns the effect sizes, i.e., here, the differences between each group mean and the overall mean. (Note that this result could also have been obtained with the tapply function.)
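The equivalence between `model.tables` and `tapply` mentioned here can be sketched on simulated data. The data frame `toy` and its unbalanced group sizes are hypothetical:

```r
set.seed(3)  # reproducible simulated data
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), times = c(10, 20, 30))),
  y     = rnorm(60, mean = 2, sd = 0.5)
)

fit <- aov(y ~ group, data = toy)

# Effects as reported by model.tables (default type is "effects")
effects_model <- model.tables(fit)$tables$group

# Same quantities by hand: group mean minus overall mean
effects_manual <- tapply(toy$y, toy$group, mean) - mean(toy$y)

print(rbind(model.tables = as.vector(effects_model),
            tapply       = as.vector(effects_manual)))
```

For a one-way design, each effect is simply the group mean minus the overall mean, which is why the two rows coincide.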

The plot(l) function provides several graphs for diagnosing the validity of the model (homogeneity of within-group variances, normal distribution of residuals, etc.). Among them, we illustrate below the Residuals versus Fitted values plot and the Normal Q-Q plot.

plot(l)

Discussion

The aim of this first part of the study was to identify the effect of junction control on the number of vehicles involved in road accidents in the UK, based on data obtained from police records of the accident cases. The methods used for data analysis were hypothesis testing using the t-test and ANOVA.

Both methods, hypothesis testing using the t-test and ANOVA, confirm that the number of impacted vehicles in road traffic accidents is associated with the type of junction control, especially between “Giveway or uncontrolled” and “Stop Sign”. The “Stop Sign” group mean is slightly greater than the “Giveway or uncontrolled” group mean. Even though the maximum number of impacted vehicles is highest in the “Giveway or uncontrolled” group, the mean of the “Stop Sign” group is still the highest.

Conclusion

The overall findings of this study indicate that the number of impacted vehicles in road traffic accidents is associated with the type of junction control, especially between “Giveway or uncontrolled” and “Stop Sign”. Even though the “Giveway or uncontrolled” group has the highest maximum number of impacted vehicles, the mean of the “Stop Sign” group is still slightly the highest. The t-test and ANOVA methods used in this case study have revealed a difference in means between these groups; however, the type and degree of impact are not clear. Other statistical techniques, like regression or decision trees, can help to identify the type and degree of impact.

2. Case study 2 : US crime data

2.1. Data preparation and cleaning

The file “Crime_Data_2010_2017.csv” contains data on crimes from 2010 to 2017, with 26 variables and 1584316 observations. Below is a snapshot of the selected dataset.

The data analysis of the second case study is also conducted using the R programming language. The main R libraries used for this analysis are ggplot2, readr and data.table. To explore the data and check the main statistics of each variable, we used the R summary function.

summary(crimes)
DR Number	Date.Reported	Date.Occurred	Time.Occurred
Min. : 210	08/07/2017: 804	01/01/2010: 2135	Min. : 1
1st Qu.:112117618	07/24/2017: 782	01/01/2011: 2050	1st Qu.: 930
Median :140109488	07/05/2016: 760	01/01/2012: 1641	Median :1430
Mean :135807195	07/17/2017: 757	01/01/2013: 1455	Mean : 1364
3rd Qu.:152018611	05/22/2017: 754	01/01/2014: 1319	3rd Qu.:1900
Max. :910220366	06/05/2017: 749	01/01/2015: 1240	Max. :2359
	(Other) :1579710	(Other) :1574476

Area.ID	Area.Name	Reporting.District	Crime.Code
Min. : 1.00	77th Street: 110605	Min. : 100	Min. :110
1st Qu.: 6.00	Southwest : 102259	1st Qu.: 645	1st Qu.:330
Median :12.00	N Hollywood: 86405	Median :1204	Median :440
Mean :11.15	Pacific : 83763	Mean :1162	Mean :507
3rd Qu.:16.00	Southeast : 83517	3rd Qu.:1676	3rd Qu.:626
Max. :21.00	Mission : 80249	Max. :2198	Max. :956
	(Other) :1037518

Crime.Code.Description	MO.Codes	Victim.Age
BATTERY – SIMPLE ASSAULT:145767	0344 : 173902	Min. :10.00
VEHICLE – STOLEN :121329	171759	1st Qu.:23.00
BURGLARY FROM VEHICLE :121318	0329 : 68728	Median :34.00
BURGLARY :114751	1501 : 34504	Mean :35.93
THEFT PLAIN – PETTY ($950 & UNDER) : 113709	0416 : 23997	3rd Qu.:48.00
THEFT OF IDENTITY :100653	0325 : 21546	Max. :99.00
(Other) :866789	(Other) :866789	NA’s :128659

Victim.Sex	Victim.Descent	Premise.Code
:145199	H :549515	Min. :101.0
-: 1	W :391855	1st Qu.:102.0
F:675402	B :255056	Median :210.0
H: 53	O :152776	Mean :312.4
M:739581	:145232	3rd Qu.:501.0
X: 24080	X : 41535	Max. :971.0
	(Other): 48347	NA’s :76

Premise.Description	Weapon.Used.Code
STREET :352160	Min. :101.0
SINGLE FAMILY DWELLING :328198	1st Qu.:400.0
MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC):204980	Median :400.0
PARKING LOT :112576	Mean :370.6
SIDEWALK : 79247	3rd Qu.:400.0
OTHER BUSINESS : 71097	Max. :516.0
(Other) :436058	NA’s :1059559

Weapon.Description	Status.Code
:1059560	IC :1227180
STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE): 319818	AO : 178175
VERBAL THREAT : 43814	AA : 162424
UNKNOWN WEAPON/OTHER WEAPON : 40746	JA : 12619
HAND GUN : 25352	JO : 3889
SEMI-AUTOMATIC PISTOL : 10096	CC : 24
(Other) : 84930	(Other) : 5

Status.Description	Crime.Code.1	Crime.Code.2	Crime.Code.3
Adult Arrest: 162424	Min. : 110.0	Min. : 121.0	Min. : 93.0
Adult Other : 178175	1st Qu. : 330.0	1st Qu.: 998.0	1st Qu. : 998.0
Invest Cont :1227180	Median : 440.0	Median : 998.0	Median : 998.0
Juv Arrest : 12619	Mean : 506.9	Mean : 954.5	Mean : 970.7
Juv Other : 3889	3rd Qu.: 626.0	3rd Qu.:998.0	3rd Qu. : 998.0
UNK : 29	Max. : 999.0	Max. : 999.0	Max. : 999.0
	NA’s : 7	NA’s : 1484319	NA’s : 1582133

Crime.Code.4	Address
Min. : 421.0	6TH ST : 3692
1st Qu.: 998.0	7TH ST : 2793
Median : 998.0	9300 TAMPA AV : 2724
Mean : 967.2	6TH : 2231
3rd Qu.: 998.0	5TH ST : 2230
Max. : 999.0	6600 TOPANGA CANYON BL: 2226
NA’s : 1584247	(Other) : 1568420

Cross.Street	Location
:1321583	(0, 0) : 5482
BROADWAY : 4662	(34.1016, -118.3387): 2915
FIGUEROA : 2830	(34.1905, -118.6059) : 2202
VERMONT AV: 2776	(33.9892, -118.3089) : 1881
WESTERN AV: 2641	(34.1576, -118.438) : 1681
SAN PEDRO : 2616	(34.2216, -118.4488) : 1627
(Other) : 247208	(Other) : 1568528

2.2. Exploratory data analysis

Victim.Age is a discrete numeric variable which gives the age of the victim in each crime. The main statistics of this variable are presented in the table below. The median victim age is 34 years, and the mean (35.93) is slightly greater than the median. The minimum value is 10 years old and the maximum value is 99 years old.

summary(crimes$Victim.Age)
Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
10.00	23.00	34.00	35.93	48.00	99.00

table(crimes$Victim.Age)
10	11	12	13	14	15	16	17	18	19	20	21	22	23
24852	25812	25698	27806	32816	38223	31918	12842	17211	22570	26895	30078	32671	35131
24	25	26	27	28	29	30	31	32	33	34	35	36	37
36096	36939	35693	35097	34517	34323	33686	31644	31194	30281	30022	30367	28006	26623
38	39	40	41	42	43	44	45	46	47	48	49	50	51
26082	25900	26863	24747	24753	24689	23981	24228	23969	23484	22566	22166	22852	21231
52	53	54	55	56	57	58	59	60	61	62	63	64	65
20859	20077	18965	18244	17260	15972	14878	14111	12809	11575	10904	9948	9146	8575
66	67	68	69	70	71	72	73	74	75	76	77	78	79
7551	6915	6043	5742	5130	4388	4165	3606	3257	3208	2724	2535	2366	2178
80	81	82	83	84	85	86	87	88	89	90	91	92	93
2082	1754	1639	1484	1374	1214	1071	923	861	696	620	481	379	343
94	95	96	97	98	99
257	183	126	98	66	353

Moreover, the frequency table of the discrete variable “victim age” shows that 15 years old is the most frequent age, with 38223 crimes in this population.

Figure 6 : frequency of victim.age variable

We plot the victim.age frequencies in Figure 6 to observe the distribution of the data. The histogram of the Victim.Age data shows a bimodal, right-skewed distribution with peaks in the 10 and 25 year intervals.

For the categorical variable Area.Name, the frequency table presents 21 categories, the names of the areas covered by this dataset. The top area, “77th Street”, has 105752 crimes:

table(crimes$Area.Name)
77th Street	Central	Devonshire	Foothill	Harbor	Hollenbeck	Hollywood
105752	62170	67001	55248	65438	49886	64378
Mission	N Hollywood	Newton	Northeast	Olympic	Pacific	Rampart
73425	83070	66062	68282	60997	74260	63765
Southeast	Southwest	Topanga	Van Nuys	West LA	West Valley	Wilshire
79269	97839	69635	67869	61861	59357	60093

We plot the distribution of victim.age for each area.name to observe the distribution of the data and the means.

Figure 7 : distribution of victim.age in area names

Figure 7 illustrates the distribution of victim ages by area name using side-by-side box plots. The figure shows a clear difference in means between areas; the “77th Street” area has the highest number of crimes.

2.3. Descriptive statistics using Mean

We propose to analyze the mean of the variable victim age using the explanatory variable area.name and to check whether the mean differs between areas.

# Calculate the mean Victim.Age for each area (na.rm = TRUE skips missing ages)
mean_c = crimes[, mean(Victim.Age, na.rm = TRUE), by=Area.Name]
mean_c
# Create a bar plot of the mean victim age per area
barplot(mean_c$V1, ylim=c(10,60), names.arg=mean_c$Area.Name, main="victim age for Different area", xlab="areas", ylab="Average victim age", xpd=FALSE)

Figure 8 : victim age means for different area names

Figure 8 illustrates, using a bar chart, the mean victim age by area name. “West LA” has the highest mean age at 40.51 years and “Newton” has the lowest at 32.12 years.

# Find the mean victim age per area with tapply
# (the result is not named "mean", to avoid masking the mean() function)
mean_by_area <- tapply(crimes$Victim.Age, crimes$Area.Name, mean, na.rm = TRUE)
mean_by_area
barplot(mean_by_area)

This observed difference in mean victim age may be explained by other variables or by demographic data of the area. In the next sections, the document presents the application of the t-test and ANOVA methods to analyze the impact of the crime area on victim age.

2.4. Hypothesis testing

This section’s purpose is to check if there is a statistically significant difference in victim.age between area of crime “Newton” and area of crime “West LA“ in the crime data:

Figure 9: side by side box plot of victim age for 2 areas of crime

Figure 9 shows the side-by-side box plot of victim age for the “Newton” and “West LA” areas. This graph presents a difference in means between the two areas: “West LA” has the highest mean of 40.51 years, versus 32.12 years for “Newton”.

Null Hypothesis (H0) : the group with the area name “West LA” has the same impact as the group with the area name “Newton”
Alternative Hypothesis : the group with the area name “West LA” does not have the same impact as the group with the area name “Newton”

The p-value of the t-test for the comparison of mean victim age between “Newton” and “West LA” is less than the specified α-level (0.05). Hence the mean victim age differs significantly between the two areas.
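The code for this comparison is not shown in the paper; a minimal sketch of how such a two-area test could be run is given below. The data frame `crimes_toy` is simulated, so its values are hypothetical and the output does not reproduce the reported means:

```r
set.seed(11)  # reproducible simulated data
# Hypothetical miniature of the crime data with three areas
crimes_toy <- data.frame(
  Area.Name  = rep(c("Newton", "West LA", "Central"), each = 40),
  Victim.Age = round(c(rnorm(40, 32, 12), rnorm(40, 40, 12), rnorm(40, 36, 12)))
)

# Keep only the two areas being compared
two_areas <- subset(crimes_toy, Area.Name %in% c("Newton", "West LA"))

# Welch two-sample t-test of victim age by area
res <- t.test(Victim.Age ~ Area.Name, data = two_areas)
print(res$estimate)  # mean victim age in each of the two areas
print(res$p.value)   # compared with the 0.05 cutoff to decide about H0
```

Subsetting first matters because the formula interface of `t.test` requires a grouping variable with exactly two levels.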

2.5. ANOVA

This section’s purpose is to check, using the ANOVA technique, whether there is a statistically significant difference in the crime victim.age variable between areas.

Null Hypothesis (H0) : the mean victim age is the same in every area, including “West LA” and “Newton”
Alternative Hypothesis : the mean victim age of at least one area differs from the others

The p-value of the analysis of variance for victim age across area names is lower than our significance level α (0.05). The analysis therefore concludes that the area name has a significant effect on victim.age.

The result of the global comparison is thus: the effect of the area.name variable on victim age is significant (F = 7957).

The model.tables(l) function returns the effect sizes, i.e., here, the differences between each area mean and the overall mean.

The plot(l) function provides several graphs for diagnosing the validity of the model (homogeneity of within-group variances, normal distribution of residuals, etc.).

plot(l)

Discussion

The aim of this second part of the study was to identify the effect of the area name on victim age in the US crime dataset. The methods used for data analysis were hypothesis testing using the t-test and ANOVA. The results of the t-test and ANOVA confirm that there is a statistically significant difference in victim.age between areas. The “West LA” area has the highest mean victim age at 40.51 years and “Newton” has the lowest at 32.12 years.

Conclusion

The results show that there is a difference in victim age means (statistically significant) between areas, especially between “West LA” and “Newton”. Victim age of crimes is related to the area characteristics, which can be studied from different perspectives. The study of the demographic structure, the population or the economic conditions of the areas can help to reveal the root cause of this relation and then the factors that impact the victim age.

Referencing

http://www.sthda.com/french/wiki/anova-analyse-de-variance-avec-r

https://www.kaggle.com/cityofLA/crime-in-los-angeles

https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales

https://explorable.com/branches-of-statistics

http://www.acaresiran.com/Content/Articles/17/ArticleFiles/17/File.pdf
