Anova – Data Analytics




Abstract

Data analytics enables decision makers to better understand data, detect trends, explore relationships and derive insights from historical data that support efficient decisions in the future. It is the process of analyzing a specific data set, applying various manipulations to the selected data in order to draw meaningful conclusions.

The aim of this paper is to present some of the more important statistical techniques and highlight their uses through two case studies on accident and crime data.

Introduction

Statistics applies mathematical disciplines to the analysis of data. Most domains use data analysis to plan, analyze and interpret experimental work or historical data, especially for critical topics such as crime and accidents. Various existing studies have explored this type of data and looked at the associations between variables of interest to aid the interpretation and presentation of results. Using the appropriate statistical method is essential to assess and make inferences on these critical topics, for example when evaluating the variation of traffic volume and number of accidents per hour, or the impact of road characteristics on accident severity.

Various statistical techniques and methods have been proposed in the data analysis field. Descriptive statistics and inferential statistics are the two main branches of statistics employed in the scientific analysis of data, and both are equally important for researchers.

Descriptive statistics summarizes data from a population sample using indicators such as the mean, median, minimum and maximum values, standard deviation and/or coefficient of variation. This type of statistics deals with the capture and presentation of data and often constitutes the first part of a statistical analysis. At this step, the statistician needs to pay attention to the design of the experiment, choose the right focus group and avoid biases.

The mean, or more precisely the arithmetic mean, is simply the arithmetic average of a group of numbers (or data set) and is shown using the bar symbol. So the mean of the variable $x$ is $\bar{x}$, pronounced “x-bar”. It is calculated by adding up all of the values in a data set and dividing by the number of values in that data set.
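Written as a formula, the definition above reads as follows. The symbols $n$ (number of values) and $x_i$ (the i-th value) are notation introduced here for illustration:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

For example, for the four values 1, 2, 2 and 3, the sum is 8 and the mean is 8 / 4 = 2.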

Inferential statistics, in turn, allows us to draw conclusions from the statistical analysis performed with descriptive statistics. Among the most widely used techniques are hypothesis testing and ANOVA.

Two characteristics of a data set must be considered prior to the application of any inferential test: (1) do the data conform to the principles of normality, i.e. are they normally distributed (ND)? (2) Do the data satisfy the assumption of homoscedasticity, i.e. uniformity of variance?
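As a rough illustration of how these two questions can be checked in R, the sketch below uses base R's shapiro.test() for normality and bartlett.test() for equality of variances. The data frame toy and its three groups are simulated for this example and are not part of the paper's datasets; other tests could equally be used.

```r
# Simulated measurements for three hypothetical groups (not the paper's data)
set.seed(42)
toy <- data.frame(
  value = c(rnorm(40, mean = 10, sd = 2),
            rnorm(40, mean = 11, sd = 2),
            rnorm(40, mean = 12, sd = 2)),
  group = factor(rep(c("A", "B", "C"), each = 40))
)

# 1. Normality: Shapiro-Wilk test within each group
#    (H0: the values come from a normal distribution)
normality_p <- sapply(split(toy$value, toy$group),
                      function(v) shapiro.test(v)$p.value)

# 2. Homoscedasticity: Bartlett's test across the groups
#    (H0: all groups have the same variance)
variance_p <- bartlett.test(value ~ group, data = toy)$p.value

print(normality_p)
print(variance_p)
```

Note that shapiro.test() only accepts between 3 and 5000 values, so on data sets as large as the ones used in the case studies it would have to be applied to a random subsample, or replaced by a graphical check such as a Q-Q plot.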

Most predictions of the future and generalizations of results from a smaller sample to the overall population come under the umbrella of inferential statistics. By designing the right experiment and using the right techniques, the researcher is able to draw conclusions relevant to the study.

A statistical hypothesis is an assumption made by the researcher about the population from which the experimental data are collected. This assumption can be true or false. Hypothesis testing is the formal way of validating the hypothesis made by the researcher and of deciding whether the hypothesis is rejected or not.

Validating a hypothesis with certainty would require examining the entire population, which is not possible in practice. Instead, we use random samples from the population and, on the basis of the test results on the sample data, we either reject or fail to reject the hypothesis.

Statistical hypotheses can be categorized into two types:

  • Null Hypothesis – Hypothesis tests are used to test the validity of a claim that is made about a population. This claim that’s on trial, in essence, is called the null hypothesis. The null hypothesis is denoted by H0.
  • Alternative Hypothesis – The alternative hypothesis is the one you would believe if the null hypothesis is concluded to be untrue. The evidence in the trial is your data and the statistics that go along with it. The alternative hypothesis is denoted by H1 or Ha.

Hypothesis testing uses a p-value to evaluate the strength of the evidence. The p-value is a number between 0 and 1 and is interpreted in the following way:

  • A small p-value (typically ≤0.05) indicates strong evidence against the null hypothesis, so you reject it.
  • A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
  • A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.
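The decision rule in the list above can be sketched in R as follows. The two samples group_a and group_b are simulated for illustration only; t.test() is the base R function used later in the case studies.

```r
# Two simulated samples (hypothetical data, not the paper's datasets)
set.seed(1)
group_a <- rnorm(100, mean = 10, sd = 1)
group_b <- rnorm(100, mean = 12, sd = 1)

alpha <- 0.05                       # significance level used in this paper
result <- t.test(group_a, group_b)  # H0: both groups have the same mean
p_value <- result$p.value

# Apply the decision rule: small p-value -> reject H0
decision <- if (p_value <= alpha) "reject H0" else "fail to reject H0"
cat("p-value:", format(p_value, digits = 3), "->", decision, "\n")
```

Because the simulated group means are two standard deviations apart, the p-value is far below 0.05 here and the null hypothesis is rejected.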

The Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a parametric statistical technique that partitions the observed variance into components that arise from different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal. In this sense, the null hypothesis, H0, says there are no differences among results from different treatments or sample sets; the alternative hypothesis, Ha, is that the results do differ. If the null hypothesis is rejected, then the alternative hypothesis, Ha, is accepted, i.e. at least one set of results differs from the others.

The ANOVA technique is generally used to compare the mean values of three or more data sets.
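For the one-way case, the partition of the variance can be written explicitly. In the sketch below, $k$ (number of groups), $n_j$ (size of group $j$), $N$ (total number of observations), $x_{ij}$ (observation $i$ of group $j$), $\bar{x}_j$ (mean of group $j$) and $\bar{x}$ (overall mean) are notation introduced here for illustration:

```latex
SS_{\text{total}}
  = \underbrace{\sum_{j=1}^{k} n_j \, (\bar{x}_j - \bar{x})^2}_{SS_{\text{between}}}
  + \underbrace{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}_{SS_{\text{within}}}

F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
```

Under the null hypothesis, and when the normality and equal-variance assumptions hold, $F$ follows an F distribution with $k - 1$ and $N - k$ degrees of freedom, so a large value of $F$ corresponds to a small p-value.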

This study focuses on the three kinds of statistical methods noted above, which are frequently used in this field, and applies them in two different case studies.

The purpose of this document is to present and apply these three statistical techniques, then discuss the outcome of applying them to our datasets.

The remainder of this paper presents two case studies. For each case study, it presents the dataset structure, applies at least two statistical techniques to it and discusses the results of these techniques and whether one is better supported than the others.

1. Case study 1 : UK accidents data

A road accident is the most unwanted thing that can happen to a road user. Speeding, running red lights and neglecting safety gear such as seat belts and helmets are some of the common behaviors that cause fatal accidents and increase accident severity.

The United States is one of the busiest countries in terms of road traffic. The level of traffic is one of the reasons leading to more traffic accidents: in 2015, some 6.3 million fatal, injury, and property damage crashes occurred in the U.S. alone.

The UK government amassed traffic data from 2000 to 2016, recording over 1.6 million accidents in the process and making this one of the most comprehensive traffic data sets out there. It’s a huge picture of a country undergoing change.

Data analysis of accident data is the process of analyzing accident data sets to derive useful conclusions and/or information, following the approach presented previously. We propose to analyze the impact of junction control type on the number of vehicles involved in an accident, using the explanatory variable Junction_Control, and to check whether the mean differs between junction control types. This case study uses three statistical techniques: mean analysis, hypothesis testing and analysis of variance (ANOVA).

The purpose of this case study is to check whether there is a statistically significant difference in the variable “number of vehicles” between two groups of “junction control”, “Giveway or uncontrolled” and “Stop Sign”, in the accident data.

  • Null Hypothesis (H0): junction control of type “Giveway or uncontrolled” has the same impact on the number of vehicles as junction control of type “Stop Sign”
  • Alternative Hypothesis (Ha): junction control of type “Giveway or uncontrolled” does not have the same impact on the number of vehicles as junction control of type “Stop Sign”

1.1. Data Preparation and Cleaning

All the accident data come from police reports, so the data do not include minor incidents. The file “UK road accidents from 2012 to 2014” contains data on UK accidents from 2012 to 2014.

Figure 1 : snapshot of accident data

Data variables:

  • Unnamed: 0
  • Accident_Index
  • Location_Easting_OSGR
  • Location_Northing_OSGR
  • Longitude
  • Latitude
  • Police_Force
  • Accident_Severity
  • Number_of_Vehicles
  • Number_of_Casualties
  • Date
  • Day_of_Week
  • Time
  • Local_Authority_(District)
  • Local_Authority_(Highway)
  • 1st_Road_Class
  • 1st_Road_Number
  • Road_Type
  • Speed_limit
  • Junction_Detail
  • Junction_Control
  • 2nd_Road_Class
  • 2nd_Road_Number
  • Pedestrian_Crossing-Human_Control
  • Pedestrian_Crossing-Physical_Facilities
  • Light_Conditions
  • Weather_Conditions
  • Road_Surface_Conditions
  • Special_Conditions_at_Site
  • Carriageway_Hazards
  • Urban_or_Rural_Area
  • Did_Police_Officer_Attend_Scene_of_Accident
  • LSOA_of_Accident_Location
  • Year

The data analysis is conducted using the R programming language. The main R libraries used for this analysis are ggplot2, readr and data.table.

After loading the main libraries, the R script checks the main statistics of each variable using the summary function. The results show that some variables and rows contain missing data, which requires a cleaning operation that removes the rows with NA values.
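A minimal sketch of this inspection and cleaning step is given below. It uses a small made-up data frame, toy_accidents, in place of the real file; only the two column names are taken from the dataset, and base R is used so that the example runs without extra packages.

```r
# Toy stand-in for the accidents table; only the two column names below come
# from the paper, the values are made up for illustration
toy_accidents <- data.frame(
  Number_of_Vehicles = c(2, 1, NA, 3, 2),
  Junction_Control   = c("Stop Sign", NA, "Giveway or uncontrolled",
                         "Giveway or uncontrolled", "Stop Sign"),
  stringsAsFactors = TRUE
)

summary(toy_accidents)             # NA counts appear in the summary

# Keep only the rows that have no missing value in any column
clean_accidents <- na.omit(toy_accidents)
nrow(clean_accidents)              # 3 complete rows remain
```

na.omit() drops a row as soon as any of its columns is missing, so on a wide table it can be worth restricting the data to the variables under study before cleaning.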

1.2. Exploratory data analysis

Junction_Control is a categorical variable with exactly 178610 effective values in 4 categories, as presented in the table below:

summary(accidents1214$Junction_Control)

Junction control           Number   Percentage
Giveway or uncontrolled    232915   40.7%
Automatic traffic signal    50208    8.8%
Stop Sign                    2287    0.4%
Authorised person             677    0.1%
NA                         286087   50.0%

286087 represents the NA values, which we remove from the data in the next sections. The “Giveway or uncontrolled” group has the largest number of accidents with 40.7%, while the “Stop Sign” group accounts for only 0.4%.

Figure 2 : box plot of number of vehicles by Junction control groups

Figure 2 presents the side-by-side box plot of the number of vehicles by junction control. It summarizes the data set using important statistics (first quartile, median, third quartile and whiskers) for comparing data across groups. This figure illustrates the distribution of the number of vehicles in each group and shows that the “Giveway or uncontrolled” group reaches a larger number of vehicles involved (a maximum of 11 vehicles) than the other groups. The “Authorised person” group reaches a maximum of 5 vehicles involved.

Figure 3 : Number of vehicles histogram

Figure 3 presents the histogram of the number of vehicles, which illustrates the distribution of the variable Number_of_Vehicles. Most accidents involve 2 vehicles (198105 accidents), followed by 1 vehicle (65251 accidents). We also note the presence of some outliers (59 accidents), including an accident involving 11 vehicles in the “Giveway or uncontrolled” group.

1.3. Descriptive statistics using Mean

The mean, or more precisely the arithmetic mean, is simply the arithmetic average of a group of numbers (or data set) and is shown using the bar symbol. So the mean of the variable $x$ is $\bar{x}$, pronounced “x-bar”. It is calculated by adding up all of the values in a data set and dividing by the number of values in that data set.

We propose to analyze the mean of Number_of_Vehicles using the independent variable Junction_Control and check whether the mean differs between junction control types, as presented in the table below.

# Calculate the mean Number_of_Vehicles for each junction control type
meanJunction_Control2 = accidents1214[, mean(Number_of_Vehicles), by=Junction_Control]
meanJunction_Control2

# Create a bar plot of the mean Number_of_Vehicles
barplot(meanJunction_Control2$V1, ylim=c(0,3), names.arg=meanJunction_Control2$Junction_Control, main="Number_of_Vehicles for Different Junction_Control", xlab="Junction_Control", ylab="Average Number_of_Vehicles", xpd=FALSE)

Figure 4 : Number of vehicles per Junction control

Table 1  : mean and standard deviation of number of vehicles

Junction control            Mean       Standard deviation
NA (missing)                1.762813   0.8570673
Authorised person           1.748892   0.6404876
Automatic traffic signal    1.857214   0.5964850
Giveway or uncontrolled     1.870863   0.5938531
Stop Sign                   1.953214   0.5324114

 

From Figure 4 and Table 1, the mean “number of vehicles” is 1.87 for the “Giveway or uncontrolled” group and 1.95 for the “Stop Sign” group. The mean of the “Stop Sign” group is slightly larger than that of the “Giveway or uncontrolled” group. Even though the maximum number of vehicles involved is highest in the “Giveway or uncontrolled” group, the mean of the “Stop Sign” group is the highest.

1.4. Hypothesis Testing

Hypothesis testing is the formal way of validating the hypothesis defined above and of deciding whether it is rejected or not. It uses a p-value to evaluate the strength of the evidence.

This section’s purpose is to check whether there is a statistically significant difference in the number of vehicles between junction control of type “Giveway or uncontrolled” and junction control of type “Stop Sign” in the accident data:

  • Null Hypothesis (H0): junction control of type “Giveway or uncontrolled” has the same impact on the number of vehicles as junction control of type “Stop Sign”
  • Alternative Hypothesis (Ha): junction control of type “Giveway or uncontrolled” does not have the same impact on the number of vehicles as junction control of type “Stop Sign”

Group        Giveway or uncontrolled   Stop Sign
Number       232915                    2287
Percentage   40.7%                     0.4%

Apply Student’s t-test in R using accidents data

The Student’s t-test is a method for comparing two samples. It can be used to determine whether the means of the samples differ. This is a parametric test, and the data should be normally distributed. R handles the various versions of the t-test with the t.test() command, which covers one- and two-sample tests as well as paired tests. By default, t.test() runs the Welch variant of the two-sample test, which does not assume equal variances, as the header of the output below shows.

 

```{r}
# Using the Student's t-test in R
accidents1214_2types = accidents1214[Junction_Control == 'Giveway or uncontrolled' | Junction_Control == 'Stop Sign']

t.test(accidents1214_2types$Number_of_Vehicles ~ accidents1214_2types$Junction_Control)
```

Results :

Welch Two Sample t-test

data:  accidents1214_2types$Number_of_Vehicles by accidents1214_2types$Junction_Control
t = -7.3522, df = 2342.2, p-value = 2.681e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.10431566 -0.06038645
sample estimates:
mean in group Giveway or uncontrolled               mean in group Stop Sign
                             1.870863                              1.953214

 

The p-value of the t-test for the comparison of the mean number of vehicles between “Giveway or uncontrolled” and “Stop Sign” is less than the specified α-level (0.05). Hence the number of vehicles involved differs significantly between “Giveway or uncontrolled” and “Stop Sign”.
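As a cross-check, the Welch statistic can be recomputed by hand from the summary statistics reported in this case study (group sizes, means and standard deviations of the two groups). The sketch below is only a verification aid; the variable names are chosen here for illustration.

```r
# Summary statistics reported in this case study
mean_gw <- 1.870863; sd_gw <- 0.5938531; n_gw <- 232915  # Giveway or uncontrolled
mean_ss <- 1.953214; sd_ss <- 0.5324114; n_ss <- 2287    # Stop Sign

# Squared standard error contributed by each group
v_gw <- sd_gw^2 / n_gw
v_ss <- sd_ss^2 / n_ss

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean_gw - mean_ss) / sqrt(v_gw + v_ss)
df <- (v_gw + v_ss)^2 / (v_gw^2 / (n_gw - 1) + v_ss^2 / (n_ss - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df)
ci <- (mean_gw - mean_ss) + c(-1, 1) * qt(0.975, df) * sqrt(v_gw + v_ss)

round(c(t = t_stat, df = df), 2)
p_value
ci
```

The recomputed values agree with the t.test() output (t of about -7.35 on about 2342 degrees of freedom). The standard error is dominated by the much smaller “Stop Sign” group, which is why the degrees of freedom are close to that group's size.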

1.5. ANOVA

This section’s purpose is to check, using the ANOVA technique, whether there is a statistically significant difference in the number of vehicles across the junction control types in the accident data.

The p-value of the analysis of variance for the number of vehicles across the different junction control types is less than our significance level α (0.05). The analysis therefore concludes that the type of junction control has a significant effect on the number of vehicles involved.

l<-aov(Number_of_Vehicles ~ Junction_Control, data = accidents1214)

summary(l);

model.tables(l)

plot(l)

Results :

                     Df Sum Sq Mean Sq F value Pr(>F)
Junction_Control      3     34  11.254    31.9 <2e-16 ***
Residuals        286083 100928   0.353

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Tables of effects

 Junction_Control
    Authorised person Automatic traffic signal Giveway or uncontrolled Stop Sign
              -0.1199               -1.162e-02               2.026e-03 8.438e-02
rep          677.0000                5.021e+04               2.329e+05 2.287e+03

 

The p-value (< 2e-16) is lower than 0.05. The result of the global comparison is thus: the effect of Junction_Control is significant at the 0.001 threshold (F = 31.9).
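The F value and its p-value can be recovered from the mean squares and degrees of freedom printed in the ANOVA table. The sketch below simply re-does that arithmetic with the rounded values from the output; the variable names are chosen here for illustration.

```r
# Values read from the summary(l) output in this section
ms_between  <- 11.254   # Mean Sq for Junction_Control
ms_residual <- 0.353    # Mean Sq for Residuals
df_between  <- 3
df_residual <- 286083

# F statistic: ratio of between-group to within-group mean squares
f_value <- ms_between / ms_residual

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_residual, lower.tail = FALSE)

round(f_value, 1)
p_value < 2e-16
```

With such a large residual degrees of freedom, even a modest ratio of mean squares yields an extremely small p-value, so statistical significance here says little about the practical size of the effect.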

The model.tables(l) function returns the effect sizes, i.e., here, the differences between each group mean and the overall mean. (Note that this result could also have been obtained with the tapply function.)
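The relation between group means and effects can be checked with the figures reported in this case study. The sketch below assumes that the last four means listed in Table 1 belong, in order, to the four groups of the effects table; under that assumption the effects printed by model.tables(l) are reproduced.

```r
# Group sizes and group means reported in this case study
# (assumption: the last four means of Table 1, in the order of the effects table)
n <- c("Authorised person" = 677, "Automatic traffic signal" = 50208,
       "Giveway or uncontrolled" = 232915, "Stop Sign" = 2287)
group_mean <- c("Authorised person" = 1.748892, "Automatic traffic signal" = 1.857214,
                "Giveway or uncontrolled" = 1.870863, "Stop Sign" = 1.953214)

# Overall mean = average of the group means weighted by group size
overall_mean <- sum(n * group_mean) / sum(n)

# Effect of each group = group mean minus overall mean
effects <- group_mean - overall_mean
signif(effects, 4)
```

On the raw data, the same group means would come from tapply(accidents1214$Number_of_Vehicles, accidents1214$Junction_Control, mean), which is the tapply route mentioned above.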

The plot(l) function provides different graphs for diagnosing the validity of the model (homogeneity of within-group variances, normal distribution of residuals, etc.). Among them, we illustrate below the Residuals vs Fitted plot and the Normal Q-Q plot.

plot(l)

Discussion

The aim of this first part of the study was to identify the effect of junction control on the number of vehicles involved in road accidents in the UK, based on data obtained from police records of the accident cases. The methods used for data analysis were hypothesis testing with the t-test and ANOVA.

Both methods, hypothesis testing with the t-test and ANOVA, confirm that the number of vehicles involved in road traffic accidents is associated with the type of junction control, especially between “Giveway or uncontrolled” and “Stop Sign”. The mean of the “Stop Sign” group is slightly greater than that of the “Giveway or uncontrolled” group. Even though the maximum number of vehicles involved is highest in the “Giveway or uncontrolled” group, the mean of the “Stop Sign” group is still the highest.

Conclusion

The overall findings of this study indicate that the number of vehicles involved in road traffic accidents is associated with the type of junction control, especially between “Giveway or uncontrolled” and “Stop Sign”. Even though the “Giveway or uncontrolled” group reaches the highest number of vehicles involved, the mean of the “Stop Sign” group is still slightly the highest. The t-test and ANOVA methods used in this case study have revealed a difference in means between these groups; however, the type and the degree of impact are not clear. Other statistical techniques such as regression or decision trees can help to identify the type and degree of impact.

1. Case study 2 : US crime data

1.1. Data preparation and cleaning

The file “Crime_Data_2010_2017.csv” contains data on crimes from 2010 to 2017, with 26 variables and 1584316 observations. Below is a snapshot of the selected dataset.

The data analysis of the second case study is also conducted using the R programming language. The main R libraries used for this analysis are ggplot2, readr and data.table. To explore the data and check the main statistics of each variable, we used the R summary function.

summary(crimes)
DR Number Date.Reported Date.Occurred Time.Occurred
Min. : 210 08/07/2017: 804 01/01/2010: 2135 Min. : 1
1st Qu.:112117618 07/24/2017: 782 01/01/2011: 2050 1st Qu.: 930
Median :140109488 07/05/2016: 760 01/01/2012: 1641 Median :1430
Mean :135807195 07/17/2017: 757 01/01/2013: 1455 Mean : 1364
3rd Qu.:152018611 05/22/2017: 754 01/01/2014: 1319 3rd Qu.:1900
Max. :910220366 06/05/2017: 749 01/01/2015: 1240 Max. :2359
(Other) :1579710 (Other) :1574476

 

Area.ID Area.Name Reporting.District Crime.Code
Min. : 1.00 77th Street: 110605 Min. : 100 Min. :110
1st Qu.: 6.00 Southwest : 102259 1st Qu.: 645 1st Qu.:330
Median :12.00 N Hollywood: 86405 Median :1204 Median :440
 Mean :11.15 Pacific : 83763 Mean :1162 Mean :507
 3rd Qu.:16.00 Southeast : 83517 3rd Qu.:1676 3rd Qu.:626
Max. :21.00 Mission : 80249 Max. :2198 Max. :956
(Other) :1037518

 

Crime.Code.Description MO.Codes Victim.Age
BATTERY – SIMPLE ASSAULT:145767 0344 : 173902 Min. :10.00
VEHICLE – STOLEN :121329 171759 1st Qu.:23.00
BURGLARY FROM VEHICLE :121318 0329 : 68728 Median :34.00
BURGLARY :114751 1501 : 34504 Mean :35.93
THEFT PLAIN – PETTY ($950 & UNDER) : 113709 0416 : 23997 3rd Qu.:48.00
THEFT OF IDENTITY :100653 0325 : 21546 Max. :99.00
(Other) :866789 (Other) :866789 NA’s :128659

 

Victim.Sex Victim.Descent Premise.Code
:145199 H :549515 Min. :101.0
-: 1 W :391855 1st Qu.:102.0
F:675402 B :255056 Median :210.0
H: 53 O :152776 Mean :312.4
M:739581 :145232 3rd Qu.:501.0
X: 24080 X : 41535 Max. :971.0
(Other): 48347 NA’s :76

 

Premise.Description Weapon.Used.Code
STREET :352160 Min. :101.0
SINGLE FAMILY DWELLING :328198 1st Qu.:400.0
MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC):204980 Median :400.0
PARKING LOT :112576 Mean :370.6
SIDEWALK : 79247 3rd Qu.:400.0
OTHER BUSINESS : 71097 Max. :516.0
(Other) :436058 NA’s :1059559

 

Weapon.Description Status.Code
:1059560 IC :1227180
STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE): 319818 AO : 178175
VERBAL THREAT : 43814 AA : 162424
UNKNOWN WEAPON/OTHER WEAPON : 40746 JA : 12619
HAND GUN : 25352 JO : 3889
SEMI-AUTOMATIC PISTOL : 10096 CC : 24
(Other) : 84930 (Other) : 5

 

Status.Description Crime.Code.1 Crime.Code.2 Crime.Code.3
Adult Arrest: 162424 Min. : 110.0 Min. : 121.0 Min. : 93.0
Adult Other : 178175 1st Qu. : 330.0 1st Qu.: 998.0 1st Qu. : 998.0
Invest Cont :1227180 Median : 440.0 Median : 998.0 Median : 998.0
Juv Arrest : 12619 Mean : 506.9 Mean : 954.5 Mean : 970.7
Juv Other : 3889 3rd Qu.: 626.0 3rd Qu.:998.0 3rd Qu. : 998.0
UNK : 29 Max. : 999.0 Max. : 999.0 Max. : 999.0
NA’s : 7 NA’s : 1484319 NA’s : 1582133

 

Crime.Code.4 Address
Min. : 421.0 6TH ST : 3692
1st Qu.: 998.0 7TH ST : 2793
Median : 998.0 9300 TAMPA AV : 2724
Mean : 967.2 6TH : 2231
3rd Qu.: 998.0 5TH ST : 2230
Max. : 999.0 6600 TOPANGA CANYON BL: 2226
NA’s : 1584247 (Other) : 1568420

 

Cross.Street Location
:1321583 (0, 0) : 5482
BROADWAY : 4662 (34.1016, -118.3387): 2915
FIGUEROA : 2830 (34.1905, -118.6059) : 2202
VERMONT AV: 2776 (33.9892, -118.3089) : 1881
WESTERN AV: 2641 (34.1576, -118.438) : 1681
SAN PEDRO : 2616 (34.2216, -118.4488) : 1627
(Other) : 247208 (Other) : 1568528

 

After loading the main libraries, the R script checks the main statistics of each variable using the summary function. The results show that some variables and rows contain missing data, which requires a cleaning operation that removes the rows with NA values using crimes = na.omit(crimes).

1.2. Exploratory data analysis

Victim.Age is a discrete numeric variable giving the age of the victim of each crime. The main statistics of this variable are presented in the table below. The median victim age is 34 years, and the mean, 35.93, is slightly greater than the median. The minimum value is 10 years and the maximum value is 99 years.

summary(crimes$Victim.Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.00 23.00 34.00 35.93 48.00 99.00

 

table(crimes$Victim.Age)

   10    11    12    13    14    15    16    17    18    19    20    21    22    23
24852 25812 25698 27806 32816 38223 31918 12842 17211 22570 26895 30078 32671 35131

   24    25    26    27    28    29    30    31    32    33    34    35    36    37
36096 36939 35693 35097 34517 34323 33686 31644 31194 30281 30022 30367 28006 26623

   38    39    40    41    42    43    44    45    46    47    48    49    50    51
26082 25900 26863 24747 24753 24689 23981 24228 23969 23484 22566 22166 22852 21231

   52    53    54    55    56    57    58    59    60    61    62    63    64    65
20859 20077 18965 18244 17260 15972 14878 14111 12809 11575 10904  9948  9146  8575

   66    67    68    69    70    71    72    73    74    75    76    77    78    79
 7551  6915  6043  5742  5130  4388  4165  3606  3257  3208  2724  2535  2366  2178

   80    81    82    83    84    85    86    87    88    89    90    91    92    93
 2082  1754  1639  1484  1374  1214  1071   923   861   696   620   481   379   343

   94    95    96    97    98    99
  257   183   126    98    66   353

 

Moreover, the frequency table of the discrete variable “victim age” shows that 15 years is the most frequent age, with 38223 crimes in this population.

Figure 6 : frequency of victim.age variable

We plot the frequencies of the victim.age data in Figure 6 to observe the distribution of the data and the mean. The histogram of the Victim.Age data shows a bimodal, right-skewed distribution with peaks in the 10 and 25 year intervals.

For the categorical variable Area.Name, the frequency table presents the 21 categories, or area names, covered by this dataset. The top area, “77th Street”, has 105752 crimes:

table(crimes$Area.Name)

77th Street     Central  Devonshire    Foothill      Harbor  Hollenbeck   Hollywood
     105752       62170       67001       55248       65438       49886       64378

    Mission N Hollywood      Newton   Northeast     Olympic     Pacific     Rampart
      73425       83070       66062       68282       60997       74260       63765

  Southeast   Southwest     Topanga    Van Nuys     West LA West Valley    Wilshire
      79269       97839       69635       67869       61861       59357       60093

 

We plot the distribution of victim.age for each area.name to observe the distribution of the data and the means.

Figure 7 : distribution of victim.age in area names

Figure 7 illustrates the distribution of victim ages by area name using side-by-side box plots. The figure shows a clear difference in means between areas, and the “77th Street” area has the highest number of crimes.

1.3. Descriptive statistics using Mean

We propose to analyze the mean of the variable victim age using the explanatory variable area.name and check whether the mean differs between areas.

# Calculate the mean victim age for each area
mean_c = crimes[, mean(Victim.Age), by=Area.Name]
mean_c

# Create a bar plot of the mean victim age
barplot(mean_c$V1, ylim=c(10,60), names.arg=mean_c$Area.Name, main="victim age for Different area", xlab="areas", ylab="Average victim age", xpd=FALSE)

Figure 8 : victim age means for different area names

Figure 8 illustrates, using a bar plot, the mean victim age by area name. “West LA” has the highest mean age, 40.51 years, and “Newton” has the lowest, 32.12 years.

#finding mean parameters
mean_by_area <- tapply(crimes$Victim.Age, crimes$Area.Name, mean)
mean_by_area
barplot(mean_by_area)

This observed mean difference in victim age may be explained by other variables or by demographic data of the area. In the next sections, the document applies the t-test and ANOVA methods to analyze the impact of the crime area on victim age.

1.4. Hypothesis testing

This section’s purpose is to check if there is a statistically significant difference in victim.age between area of crime “Newton” and area of crime “West LA“ in the crime data:

Figure 9: side by side box plot of victim age for 2 areas of crime

Figure 9 shows the side-by-side box plot of victim age for the “Newton” and “West LA” areas. This graph shows a difference in means between the two areas: “West LA” has the highest mean, 40.51 years, versus 32.12 years for “Newton”.

  • Null Hypothesis (H0): the mean victim age in the area “West LA” is the same as in the area “Newton”
  • Alternative Hypothesis (Ha): the mean victim age in the area “West LA” is not the same as in the area “Newton”

The p-value of the t-test for the comparison of mean victim age between “Newton” and “West LA” is less than the specified α-level (0.05). Hence the victim age differs significantly between the two areas.
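The call behind such a comparison can be sketched as follows. The data frame toy_crimes is simulated: the column names and the two group means come from this case study, while the sample sizes and the spread of the ages are assumptions made for illustration, so the printed numbers are not the paper's results.

```r
# Simulated stand-in for the cleaned crimes table: the column names and the two
# group means come from the paper; sample sizes and spread are assumptions
set.seed(7)
toy_crimes <- data.frame(
  Area.Name  = rep(c("Newton", "West LA"), each = 500),
  Victim.Age = c(rnorm(500, mean = 32.12, sd = 15),
                 rnorm(500, mean = 40.51, sd = 15))
)

# Welch two-sample t-test of victim age between the two areas
age_test <- t.test(Victim.Age ~ Area.Name, data = toy_crimes)
age_test$estimate   # mean victim age in each simulated group
age_test$p.value    # compare with the 0.05 significance level
```

On the real data, the same formula would be applied to the rows of crimes whose Area.Name is “Newton” or “West LA”, in the same way as the two junction control groups were selected in the first case study.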

1.5. ANOVA

This section’s purpose is to check, using the ANOVA technique, whether there is a statistically significant difference in the victim.age variable between areas.

  • Null Hypothesis (H0): the mean victim age is the same in all areas
  • Alternative Hypothesis (Ha): the mean victim age of at least one area differs from the others

The p-value of the analysis of variance of victim age across the different area names is lower than our significance level α (0.05). The analysis therefore concludes that the area name has a significant effect on victim age.

The result of the global comparison is thus: the effect of the area.name variable on victim age is significant (F = 7957).
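A sketch of the corresponding aov() call is shown below on simulated data. The column names follow the crimes dataset; the choice of three areas, their sizes, means and spread are assumptions made for illustration, so the F value printed here is unrelated to the one reported above.

```r
# Simulated stand-in for the cleaned crimes table: column names follow the
# paper; the three areas, their sizes and the spread are assumptions
set.seed(11)
toy_crimes <- data.frame(
  Area.Name  = factor(rep(c("Newton", "77th Street", "West LA"), each = 300)),
  Victim.Age = c(rnorm(300, mean = 32, sd = 15),
                 rnorm(300, mean = 35, sd = 15),
                 rnorm(300, mean = 40, sd = 15))
)

# One-way ANOVA of victim age by area, mirroring the aov() call of case study 1
l <- aov(Victim.Age ~ Area.Name, data = toy_crimes)
anova_table <- summary(l)[[1]]

anova_table                        # Df, Sum Sq, Mean Sq, F value, Pr(>F)
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
```

With three groups of 300 observations, the table has 2 degrees of freedom for Area.Name and 897 for the residuals; on the full dataset the first would be the number of areas minus one.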

The model.tables(l) function returns the effect sizes, i.e., here, the differences between the mean of each area and the overall mean.

The plot(l) function provides different graphs for diagnosing the validity of the model (homogeneity of within-group variances, normal distribution of residuals, etc.).

plot(l)

 

Discussion

The aim of this second part of the study was to identify the effect of the area name on victim age in the crime dataset. The methods used for data analysis were hypothesis testing with the t-test and ANOVA. The results of the t-test and ANOVA confirm that there is a statistically significant difference in victim.age between areas. The “West LA” area has the highest mean victim age, 40.51 years, and “Newton” has the lowest, 32.12 years.

Conclusion

The results show that there is a difference in victim age means (statistically significant) between areas, especially between “West LA” and “Newton”. Victim age of crimes is related to the area characteristics, which can be studied from different perspectives. The study of the demographic structure, the population or the economic conditions of the areas can help to reveal the root cause of this relation and then the factors that impact the victim age.

