Well, it’s been a month, and I finally have a free day to write up the answers to the survey! People have been asking whether they got the right answers, and since my results were anonymous, I couldn’t just look them up. So I figured it would be fun to make a post with some graphs.
First, here is the text of my survey.
Thank you for agreeing to help me complete my Honors project! First, please answer some demographic questions, then answer a few questions related to music samples, which will be in the form of links to Soundcloud pages. Make sure you either have headphones or are in a space where you can listen to music before beginning! If you have any questions or issues, feel free to contact me at sgoree @ oberlin.edu!
What is your age?
Under 18 years old
18-24 years old
25-39 years old
40-54 years old
55-69 years old
70 years old or older
Which of the following best identifies you?
Please identify your race/ethnicity:
Please identify your gender:
* Participants are asked the following questions four times *
For the next three questions, please base your answers on this sound sample:
* Below is an embedded SoundCloud player for one of 24 possible sound samples, three from each of the eight categories discussed in the results section *
Do you think this music was composed by J. S. Bach or a computer?
There were 24 sound samples:
(If the embedding isn’t working, you can also access them here)
The human/computer breakdown is as follows:
What does that mean? For details on the models themselves, see a previous post. I never wrote the promised post on the different generation schemes (melody-only vs. harmonization vs. full generation), so here is a quick rundown. Melody-only means running one model to generate a replacement voice for an existing chorale, then taking that new voice out of context. Harmonization means taking the melody of a Bach chorale (usually an older Lutheran hymn) and using three models, one trained on each of the other voices, to generate the remaining three parts. Full generation means four models, one trained on each voice, generating an entire chorale with none of Bach’s writing. A rough sketch of the three schemes is below.
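To make the distinction concrete, here is a minimal Python sketch of the three schemes. It assumes a hypothetical per-voice model object with a generate(context) method; the names here are illustrative stand-ins, not the project’s actual code.

```python
# A minimal sketch of the three generation schemes. The per-voice model
# objects and their generate(context) method are hypothetical stand-ins
# for the trained LSTM models, not the project's actual API.

VOICES = ["soprano", "alto", "tenor", "bass"]

def melody_only(models, bach_chorale):
    """One model generates a replacement voice for an existing chorale;
    the new voice is then presented on its own, out of context."""
    return models["alto"].generate(context=bach_chorale)  # any one voice

def harmonization(models, bach_melody):
    """Keep Bach's melody; three models, one trained on each of the
    other voices, generate the remaining three parts around it."""
    voices = {"soprano": bach_melody}
    for name in ["alto", "tenor", "bass"]:
        voices[name] = models[name].generate(context=voices)
    return voices

def full_generation(models):
    """Four models, one trained on each voice, compose an entire
    chorale containing none of Bach's writing."""
    voices = {}
    for name in VOICES:
        voices[name] = models[name].generate(context=voices)
    return voices
```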
I got 244 responses, of which about 180 were complete. The demographic breakdown speaks more to the departments I sent the survey to at Oberlin, and to the set of people within two degrees of separation from me on Facebook, than to any broader population.
These are the answers to the music questions. Here, Q1 through Q8 are the eight different categories of samples discussed above.
I also asked people for their comments on various samples.
simple melody
product melody
Bach melody
simple harmonization
product harmonization
Bach chorale
simple generation
product generation
And some of my favorites:
Some observations about this data, copied from my paper, follow.
First, there was no significant difference between the two models for melody generation or harmonization, but there was for full chorale generation: the simple model outperformed the product model. This may have been because the product model became too reliant on its spacing relative to another voice. In Bach’s music, all of the voices imply each other to some extent, and assuming that one voice perfectly implies another is dangerous when dealing with generated voices.
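For readers curious what “significant” means here, a difference like this can be checked with a standard test on the response counts. Below is a minimal sketch using a chi-squared contingency test; the counts are placeholders, not the real survey numbers, and the paper’s actual test may differ.

```python
# Sketch of a chi-squared test comparing how often each model's output
# was rated "composed by Bach". The counts below are made up for
# illustration; they are not the real survey numbers.
from scipy.stats import chi2_contingency

#              rated Bach   rated computer
simple_gen  = [        30,              70]  # hypothetical counts
product_gen = [        15,              85]  # hypothetical counts

chi2, p, dof, expected = chi2_contingency([simple_gen, product_gen])
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # small p suggests a real difference
```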
That said, we cannot conclude with any certainty that the product model is a flawed approach. Since it is a larger neural network with more parameters, it is more likely to get stuck in a local minimum, and it may perform substantially better with more training data. Additional training with a larger chorale dataset may result in higher-quality musical output.
Second, generated melody samples were much more difficult for participants to differentiate from real Bach than four-part harmony samples were.
Third, the nature of the training data inadvertently allowed the simple model to pick up aspects of tonality: most of the pieces were in keys that share several notes, and the model learned to prioritize those notes. Importantly, this did not restrict the model to a specific set of pitches or impose unwanted assumptions, but it did reduce the amount of dissonance in the music it composed, especially when generating from scratch.
Fourth, synthesis, the process of creating sound electronically from scratch, played a greater role in both participants’ and our expert’s evaluations of the samples than expected or intended. As the comments show, people tended to evaluate based on timbre (instrument sound quality) and expressiveness rather than pitches and rhythms, and were convinced that even the Bach samples were computer-composed simply because they were computer-synthesized. Many participants might have rated all of the samples higher had they been performed by humans or by more sophisticated synthesizers rather than plain MIDI playback.
Finally, musical form and structure, even in a piece as short as a chorale, remain an unsolved problem. The lack of correct melodic syntax, which for a chorale means a coherent beginning, middle, and end to each phrase, was a sticking point for many samples with both survey participants and our expert, and it is clearly not something LSTM networks can learn without external guidance. Future researchers may want to look into quantitative measures of consonance and dissonance, and potentially train a consonance expert on intervals’ consonance values alone. Since dissonance is a subjective and often culture-specific phenomenon, though, more research is necessary before a model like this could be proposed.
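As a toy illustration of what a quantitative consonance measure might look like, here is a sketch that scores a chord by averaging hand-assigned consonance values over its intervals. The specific values are common-practice intuitions I chose for illustration, not figures from the paper.

```python
# A toy consonance measure: score each harmonic interval (in semitones,
# mod 12) with a hand-assigned consonance value in [0, 1], then average
# over all pairs of pitches in a chord. Values are illustrative only.

CONSONANCE = {
    0: 1.0,            # unison / octave
    7: 0.9,            # perfect fifth
    5: 0.8,            # perfect fourth
    4: 0.7, 3: 0.7,    # major / minor third
    9: 0.7, 8: 0.6,    # major / minor sixth
    2: 0.3, 10: 0.3,   # major second / minor seventh
    1: 0.1, 11: 0.1,   # minor second / major seventh
    6: 0.2,            # tritone
}

def mean_consonance(pitches):
    """Average consonance over all pairs of MIDI pitch numbers."""
    pairs = [(a, b) for i, a in enumerate(pitches) for b in pitches[i + 1:]]
    scores = [CONSONANCE[abs(a - b) % 12] for a, b in pairs]
    return sum(scores) / len(scores) if scores else 1.0

print(mean_consonance([60, 64, 67]))  # C major triad: fairly consonant
print(mean_consonance([60, 61, 66]))  # cluster with a tritone: dissonant
```

A measure like this could, in principle, be computed over a generated chorale and fed back as guidance, but as noted above, any fixed table bakes in culture-specific assumptions about dissonance.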
To all who participated, thank you very much for helping with my research! I hope you found this discussion satisfying!