I am having trouble figuring out the proper analysis for an experiment I designed, and I'm hoping for some guidance. What follows is a rough analogy to what happens in the test. It isn't the actual test, but it is far easier for me to describe than the real one, which involves audio and music terminology that can be very confusing.

Let's say we have two sets of cubes. In one set of 10, each cube is 2 cm on a side, and the cubes are otherwise visually identical. In the other set of 10, each cube is 4 cm on a side, again otherwise visually identical. In each set, the cubes range in weight from 1 kg to 2 kg in 0.1 kg increments, with the 1.5 kg weight omitted: [1, 1.1, 1.2, 1.3, 1.4, 1.6, 1.7, 1.8, 1.9, 2]

In the test, for a single round, the participant first picks up and returns to the test platform a "reference" cube of 1.5 kg. The reference cube they are given may be either the larger or the smaller size. They then pick up a cube selected for them from one of the two sets and return it to the test platform. Lastly, they write down whether they think the second cube is lighter or heavier than the first. Over the course of the test, the rounds cover every possible sequence, so we have four types of comparisons:

- 1st cube small, 2nd cube small
- 1st cube small, 2nd cube large
- 1st cube large, 2nd cube small
- 1st cube large, 2nd cube large
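To keep the combinations straight, here is a minimal Python sketch of the design (all names are my own shorthand, not part of the actual test), enumerating every round:

```python
# Hypothetical enumeration of the design: every combination of reference-cube
# size, second-cube size, and second-cube weight.
weights = [1.0, 1.1, 1.2, 1.3, 1.4, 1.6, 1.7, 1.8, 1.9, 2.0]
sizes = ["small", "large"]

rounds = [(ref_size, test_size, w)
          for ref_size in sizes
          for test_size in sizes
          for w in weights]

print(len(rounds))  # 2 reference sizes x 2 set sizes x 10 weights = 40 rounds
```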

In the results, let's say that when comparing like-to-like sizes, correctly identifying whether the second cube is lighter or heavier becomes harder as the second cube's weight gets closer to 1.5 kg. If we tabulate the "misses", the second cubes weighing 1.4 kg and 1.6 kg will have the highest number of misses, while the second cubes of 1 kg or 2 kg will draw the most correct responses.
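As a toy illustration of that tabulation (the answers below are invented for the example, not real data), the misses for one like-to-like pass might be tallied like this:

```python
from collections import Counter

weights = [1.0, 1.1, 1.2, 1.3, 1.4, 1.6, 1.7, 1.8, 1.9, 2.0]

def is_correct(response, weight, reference=1.5):
    # The true answer: "heavier" if the second cube exceeds the 1.5 kg reference.
    truth = "heavier" if weight > reference else "lighter"
    return response == truth

# One invented participant's answers for the small/small rounds, in weight order:
answers = ["lighter", "lighter", "lighter", "lighter", "heavier",
           "lighter", "heavier", "heavier", "heavier", "heavier"]

misses = Counter(w for w, r in zip(weights, answers) if not is_correct(r, w))
print(misses)  # the errors sit at 1.4 and 1.6, nearest the reference
```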

Now, let's suppose that when we handle cubes of different sizes, our brain tricks us into thinking the larger cube is heavier than it actually is (by perhaps 0.3 kg), and this is reflected in the responses about relative weights, so the results skew toward the mistaken perception. For example, the large 1.2 kg to 1.4 kg cubes will now more often be misjudged as heavier than the small 1.5 kg reference cube, and the large 1.6 kg cube will be much less likely to be mistaken as lighter than the reference cube.
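One way to make that hypothesized illusion concrete is a deterministic sketch (the 0.3 kg offset comes from the example above; the `perceived` function is my own modelling assumption, not something measured):

```python
BIAS = 0.3  # hypothesized extra "felt" weight for the larger cubes

def perceived(weight, size):
    # A large cube is assumed to feel BIAS kg heavier than it really is.
    return weight + BIAS if size == "large" else weight

reference = perceived(1.5, "small")  # the small reference feels like 1.5 kg

for w in [1.3, 1.4, 1.6, 1.7]:
    felt = perceived(w, "large")
    verdict = "heavier" if felt > reference else "lighter"
    print(f"large {w} kg cube judged {verdict} than the small reference")
```

In this all-or-nothing sketch the large 1.3 kg and 1.4 kg cubes flip to "heavier", and the large 1.6 kg cube is never mistaken for lighter; real responses would of course be noisy rather than deterministic.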

Is this a non-parametric scenario? I was initially tempted to compare the mode, median, or mean between the four "types" of comparisons listed earlier, but I'm not confident this is a data set for which that is valid.

Is my thesis that cube size should have no influence on success in the relative-weight judgment task, i.e. that only the weight matters?

How do we measure our confidence that test results showing a skew are due to the influence of cube size rather than a random occurrence?

I apologize if this is rather convoluted or improperly presented. I've had only one course in basic statistics, decades ago, and I'm having trouble finding anything similar in design to this experiment in my old textbook.

