Analysis of model fit and item parameter of work and energy test using item response theory

One important part of assessing learning outcomes is using a good instrument that is analyzed with an appropriate analytical model and can measure students' abilities accurately. This study aims to determine the model fit and item parameters of a work and energy test using item response theory. This quantitative study was carried out on the responses of 1177 high school students spread across Banten Province. The instrument is a work and energy test consisting of 25 multiple-choice items. The data were analyzed using the item response theory approach with statistical methods, from determining the best-fitting model to estimating the item characteristics. The analysis showed that, for the students' responses to the work and energy test, 9 items fit the 1PL model, 17 items fit the 2PL model, and 16 items fit the 3PL model. Based on these percentages, the 2PL model is more suitable than the 1PL and 3PL models. Further analysis determined the item parameter values under the 2PL model, namely difficulty (b) and discrimination (a). The results show that item difficulty is in the range of -2.501 to 1.595 and discrimination is in the range of 0.289 to 1.109. Based on this analysis, it can be concluded that all items in this test meet the criteria for good items.


INTRODUCTION
Today, education in Indonesia faces quality challenges. Various programs are designed to produce quality education. A good program must be based on accurate data to produce optimal effects, and such data can be obtained through a good process. Mardapi (2017) shows that the quality of education can be improved through the quality of learning and the quality of assessment. Teachers must be able to prepare learning materials developed based on competencies and character (Widya, Hamdi & Ahmad, 2017). The right decisions in the assessment system will be helpful for further decision-making.
Assessment of learning outcomes is carried out by giving tests that assess students' abilities and determine mastery and achievement in certain fields of study (Gronlund, 1998). The more specific part of assessment is measurement. Measurement is the activity of assigning numbers to an individual or to individual characteristics according to certain rules (Griffin & Nix, 1991; Ebel & Frisbie, 1986).
In practice, assessment is carried out using tests and non-tests, and the test is the most widely used form. A test is a set of questions given to test takers to obtain their answers, in oral or written form or in the form of an action (performance) test (Baskoro & Wihaskoro, 2013). The test can also be viewed as part of the measurement of learning outcomes. Measurement is defined as an activity to distinguish a person's characteristics or attributes (Oriondo & Antonio, 1998).
When a test instrument is used to assess learning outcomes, it is very important to ascertain the parameters of the items used. The parameters of the test items must meet the criteria for good items.
There are two approaches to estimating item parameters, namely classical test theory and item response theory. Classical test theory is considered to have weaknesses. According to Hambleton, Swaminathan, & Rogers (1991), the main drawback is that the characteristics of the examinees and the characteristics of the test cannot be separated: each can be interpreted only in the context of the other. The test score alone determines the ability of the examinee. When the test is difficult, the examinee will get a low score, and the examinee's ability will appear low.
On the other hand, when the test is easy, the examinee will get a high score and appear to have higher ability. In other words, the estimation of item parameters depends on the examinees and vice versa. Item characteristics change when the examinees change, and the characteristics of the examinees change when the characteristics of the items change. Based on this explanation, the use of classical test theory is limited because its results depend on the subjects being assessed.
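The sample dependence described above can be illustrated with the classical difficulty index, the proportion of examinees who answer an item correctly. The following is a minimal sketch in Python; the two score lists are hypothetical and are not data from this study.

```python
def proportion_correct(scores):
    """Classical difficulty index: share of examinees answering the item correctly."""
    return sum(scores) / len(scores)


# Hypothetical 0/1 scores on the SAME item from two groups (not data from this study).
high_ability_group = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
low_ability_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

p_high = proportion_correct(high_ability_group)
p_low = proportion_correct(low_ability_group)

print(f"difficulty index in high-ability group: {p_high:.2f}")
print(f"difficulty index in low-ability group:  {p_low:.2f}")
```

In this illustration the same item appears easy in one group (0.80) and difficult in the other (0.30), which is the dependence between item and examinee characteristics that classical test theory cannot separate.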
Item response theory overcomes these weaknesses of classical test theory because it decouples the item from the test taker. The characteristics of the examinees remain the same even when they work on items with different characteristics and, conversely, the characteristics of the items remain the same even when they are answered by test takers with different abilities. According to Hambleton et al. (1991), item response theory rests on two basic postulates: (a) the performance of a test taker on an item can be predicted (or explained) by a factor called a trait, latent trait, or ability; and (b) the relationship between the test taker's ability and the performance on the item can be described by a monotonically increasing function known as the item characteristic function or item characteristic curve. This function states that as ability increases, the probability that the test taker answers an item correctly also increases. Figure 1 shows that the group of test takers with high ability has a greater chance of answering correctly than the group with low ability. In line with this, IRT analysis resolves the weaknesses of classical test theory, namely: (1) the estimate of the test taker's ability does not depend on the characteristics of the test used; (2) the estimated item parameters do not depend on the ability of the test takers; and (3) the measurement error can be determined for each individual (Susongko, 2016).
Item response theory can be applied only when the model used fits the data being tested (Hambleton et al., 1991). Stone & Zhang (2003) stated that item parameter estimates can be distorted if the model used is not suitable. Hambleton et al. (1991) describe several logistic models in item response theory, namely the one-parameter logistic model (1PL), the two-parameter logistic model (2PL), and the three-parameter logistic model (3PL). Each model has a certain number of item parameters, and the parameters of an item together form its item response function.
The one-parameter logistic model (1PL) is an item response theory model with only one item parameter: the level of difficulty (b). This model assumes that, apart from the test taker's ability, the response to an item is affected only by the difficulty level of the item. An item is said to be good if its difficulty is in the range from -2, which means easy, to +2, which means difficult. The function of the 1PL model is given in equation 1.

The two-parameter logistic model (2PL) has two parameters: the level of difficulty (b) and discrimination (a), where the discrimination is in the range between 0 and 2. In the item characteristic curve, the discrimination is indicated by the slope of the curve. Items with high discrimination have a steep curve and better differentiate test takers with high ability from test takers with low ability. The function of the 2PL model is given in equation 2.

The three-parameter logistic model (3PL) has three parameters: difficulty level (b), discrimination (a), and pseudo-guessing (c). The pseudo-guessing parameter states the probability that a test taker with low ability answers a difficult question correctly by guessing. The value of c ranges between 0 and 1. An item is good if the value of c is not more than 1/k, where k is the number of answer choices. The function of the 3PL model is given in equation 3.

According to Retnawati (2014), there are two ways to assess the fit of the model: statistical methods and graphical methods. The statistical method is carried out by calculating the chi-square value and either comparing it with the table value or looking at the probability (significance) value. An item is said to fit the model if the calculated chi-square value does not exceed the chi-square table value or if the significance value is greater than α. The graphical method is carried out by looking at the distribution of the data around the item characteristic curve. Based on this curve, we can see how well the data distribution agrees with the model. The model is suitable if the data points lie close to the curve (Retnawati, 2014).
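For reference, equations 1 to 3 for the 1PL, 2PL, and 3PL models are commonly written in the following standard forms (after Hambleton et al., 1991). The notation here is an assumption rather than a reproduction of the original typesetting: θ is the test taker's ability, P_i(θ) is the probability of a correct response to item i, a_i, b_i, and c_i are the discrimination, difficulty, and pseudo-guessing parameters, and D is a scaling constant, usually 1.7 (or 1 for the pure logistic metric). D is written in all three equations so that each model is a special case of the next.

```latex
\begin{align}
P_i(\theta) &= \frac{e^{D(\theta - b_i)}}{1 + e^{D(\theta - b_i)}} \tag{1} \\
P_i(\theta) &= \frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}} \tag{2} \\
P_i(\theta) &= c_i + (1 - c_i)\,\frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}} \tag{3}
\end{align}
```

Setting c_i = 0 in equation 3 gives equation 2, and setting a_i = 1 in equation 2 gives equation 1.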

RESEARCH METHODS
This research is quantitative research. Data were obtained from student responses to the work and energy test instrument. The instrument used in this study was the Daily Physics Assessment on work and energy material. The test consists of 25 multiple-choice items with five answer choices each. The test had previously been validated using Aiken's validity index. Respondents in the study were 1177 high school students spread across Banten Province. The data collected were dichotomous, scored 1 for a correct answer and 0 for an incorrect answer.
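The dichotomous scoring described above can be sketched as follows. This is a minimal illustration in Python; the function name, the five-item answer key, and the student's answers are hypothetical and are not taken from the instrument.

```python
def score_responses(responses, answer_key):
    """Return 1 for each response that matches the key and 0 otherwise."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer key must have the same length")
    return [1 if given == correct else 0
            for given, correct in zip(responses, answer_key)]


# Hypothetical five-item key and one student's answers (options A to E).
answer_key = ["B", "D", "A", "E", "C"]
student_answers = ["B", "C", "A", "E", "A"]

scored = score_responses(student_answers, answer_key)
print(scored)  # [1, 0, 1, 1, 0]
```

Applying the same function to every student yields the 0/1 response matrix that item response theory software takes as input.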
The model fit analysis was carried out using statistical methods. After the appropriate model was determined, the item parameter values were estimated based on that model. The results of the item parameter analysis are taken from the phase 2 output of BILOG-MG 3.0. The "threshold" column shows the item difficulty level (b), "slope" shows the discrimination (a), and "asymptote" gives the guessing parameter (c).
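To illustrate how the three parameters enter the item response function, the sketch below evaluates a three-parameter logistic curve in Python. It assumes the standard logistic form with a scaling constant D = 1.7; the function name and the item parameter values are hypothetical and do not come from the BILOG-MG output of this study.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def prob_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: ability, b: difficulty (threshold), a: discrimination (slope),
    c: pseudo-guessing (asymptote). With c=0 this is the 2PL model;
    with c=0 and a=1 it is the 1PL model.
    """
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical item: threshold b = 0.5, slope a = 1.0, asymptote c = 0.2.
for theta in (-2.0, 0.0, 0.5, 2.0):
    p = prob_correct(theta, b=0.5, a=1.0, c=0.2)
    print(f"theta = {theta:+.1f}  P(correct) = {p:.3f}")
```

At theta equal to b the logistic part equals 0.5, so the probability is c + (1 - c) / 2, and the probability rises monotonically with ability.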

Table 2. Eigenvalues
Based on Table 2, each eigenvalue greater than one indicates one factor. Based on these eigenvalues, the Work and Energy test instrument has three factors, which together explain 36.851% of the variance. These eigenvalues are also presented in the scree plot in Figure 3.
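The eigenvalue rule used here (retain factors with eigenvalues greater than one) and the percentage of variance explained can be sketched as follows. The eigenvalues in this Python example are hypothetical values for a five-variable correlation matrix; they are not the values in Table 2.

```python
def count_factors(eigenvalues, cutoff=1.0):
    """Number of factors whose eigenvalue exceeds the cutoff."""
    return sum(1 for value in eigenvalues if value > cutoff)


def percent_variance(eigenvalues, n_factors):
    """Percentage of total variance explained by the n largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return 100.0 * sum(ordered[:n_factors]) / sum(ordered)


# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to 5).
eigenvalues = [2.4, 1.1, 0.7, 0.5, 0.3]

n_factors = count_factors(eigenvalues)
explained = percent_variance(eigenvalues, n_factors)
print(f"factors retained: {n_factors}")
print(f"variance explained: {explained:.1f}%")
```

For a correlation matrix the eigenvalues sum to the number of variables, so each eigenvalue divided by that number gives the share of variance attributed to one factor.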

RESULTS AND DISCUSSION
Before testing the fit of the parameter models, the first step is to test dimensionality, that is, whether the test is unidimensional or multidimensional. Unidimensional means that each item measures only one ability (Retnawati, 2014), whereas multidimensional means that some or all items measure more than one dimension. The dimensionality test in this study was carried out through factor analysis using SPSS. From Table 1, it can be seen that the KMO-MSA value is 0.938 and the significance of the Bartlett test is 0.000. This means that the sample has met the sampling adequacy requirement and the data are homogeneous, so that factor analysis can be carried out. The results of the factor analysis in SPSS can be seen in the eigenvalues in Table 2. The scree plot of the factor analysis shows a very sharp drop between factor 1 and factor 2, and the eigenvalues begin to level off at factor 3, so that the scree plot almost forms a right angle. This shows that there is only one dominant factor in the work and energy test.
Another assumption to be tested is local independence. This assumption is fulfilled if a participant's answer to one item does not affect the participant's answer to another item (Retnawati, 2014). According to DeMars (2010), local independence can also be established by proving the unidimensionality assumption: if the unidimensionality assumption is met, the local independence assumption is also fulfilled. In this study, the unidimensionality assumption has been fulfilled, so the local independence assumption is also considered fulfilled.
In this study, the fit of the logistic parameter models was determined using statistical analysis. Statistical analysis using item fit has the power to detect measurement disturbances to a reasonable degree. When the data fit the model, the distributional properties of the item fit statistics make it possible to establish a reasonable error rate (Smith, 1991). Specifically, the chi-square value was determined for each item under each logistic model. The calculated chi-square value is then compared with the chi-square table value at the corresponding degrees of freedom. An item is deemed to fit the logistic model if the calculated chi-square value does not exceed the table (critical) chi-square value. The fit of each item to the 1PL, 2PL, and 3PL models is presented in Table 3.
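The decision rule above can be sketched in Python as follows. The item names and chi-square values are hypothetical, and a single critical value (16.919, the table value for 9 degrees of freedom at a significance level of 0.05) is assumed for all items; in practice the degrees of freedom may differ from item to item.

```python
def fits_model(chi_square, critical_value):
    """An item fits if its chi-square does not exceed the critical value."""
    return chi_square <= critical_value


# Hypothetical chi-square statistics for three items under one model.
item_chi_square = {"item_1": 12.4, "item_2": 21.7, "item_3": 8.3}
CRITICAL_VALUE = 16.919  # chi-square table value for df = 9, alpha = 0.05

fitting_items = [name for name, value in item_chi_square.items()
                 if fits_model(value, CRITICAL_VALUE)]
percent_fit = 100.0 * len(fitting_items) / len(item_chi_square)

print("items that fit:", fitting_items)
print(f"percentage of items that fit: {percent_fit:.1f}%")
```

Repeating this tally for each logistic model gives the counts and percentages that are compared across the models.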
Based on Table 3, 9 items fit the 1PL model, 17 items fit the 2PL model, and 16 items fit the 3PL model. In percentage terms, fit with the 2PL model is the greatest (17 of 25 items, or 68%), compared with the 1PL model (36%) and the 3PL model (64%). It can therefore be concluded that the Work and Energy test instrument fits the 2PL model best. When the model fits the data, the model has shown conformity (Hattie, 1984).
The number of items that do not fit may also be related to person misfit (person fit). Meijer (1996) states that there are at least seven test-taker behaviors during a test that cause the responses not to fit the model: (a) sleeping behavior, in which an examinee has difficulty starting the test and, after adapting, does not check the answers; (b) guessing behavior, in which an examinee with low ability unexpectedly responds correctly to a difficult item; (c) cheating behavior; (d) plodding or sluggish behavior, in which the test taker does not finish working on the items; (e) alignment errors, which occur when examinees do not mark their responses carefully on the answer sheet; (f) excessive creativity, in which the examinee interprets an item in an unusual or overly creative way; and (g) lack of ability, which occurs when an item measures two different abilities.

The next analysis estimated the discrimination parameter (a) and the difficulty level (b) using the 2PL model. The resulting parameter values for each item are presented in Table 4.

Table 3. The suitability of each item in the 1PL, 2PL, and 3PL models

Based on the data in Table 4, the difficulty level is in the range -2.501 to 1.595 and the discrimination is in the range 0.289 to 1.109. For the item discrimination index, Alagumalai et al. (2005) group the values as follows: very good, > 0.40; good, 0.30-0.39; fair, 0.20-0.29; unable to distinguish, 0.00-0.19; and requires examination of the item, < 0.00. Based on this analysis, it can be concluded that all items in this test meet the criteria for good items.
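The grouping of the discrimination index can be expressed as a small lookup, sketched here in Python. The published categories leave small gaps between their bounds (for example between 0.39 and 0.40), so half-open intervals are assumed; the function name and the index values are hypothetical and are not the item estimates in Table 4.

```python
def classify_discrimination(index):
    """Map a discrimination index to its category (half-open intervals assumed)."""
    if index < 0.0:
        return "requires examination"
    if index < 0.20:
        return "unable to distinguish"
    if index < 0.30:
        return "fair"
    if index < 0.40:
        return "good"
    return "very good"


# Hypothetical discrimination indices, one per category.
for value in (0.45, 0.35, 0.25, 0.10, -0.05):
    print(f"{value:+.2f} -> {classify_discrimination(value)}")
```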

CONCLUSION
The analysis showed that, for the students' responses to the work and energy test, 9 items fit the 1PL model, 17 items fit the 2PL model, and 16 items fit the 3PL model. In percentage terms, fit with the 2PL model is greater than with the 1PL and 3PL models, so it can be concluded that the Work and Energy test instrument fits the 2PL model best. Further analysis determined the item parameter values under the 2PL model, namely difficulty (b) and discrimination (a). The difficulty level of the items was found to be in the range -2.501 to 1.595 and the discrimination in the range 0.289 to 1.109. Based on this analysis, it can be concluded that all items in this test meet the criteria for good items.