I watch quite a lot of Korean dramas. It is my one and only addiction, and it has reached the level of “Let me learn Korean, because the English translation just does not work. The culture is not right!” I have made some progress reading Hangul and completed a Coursera course, but I have not spoken a word of Korean yet. Despite my slow progress, there is one word I have known since… day 1. Which word is it? Surprise, it’s “oppa”.
Oppa is used by women when referring to older men (not too much older, or it becomes a different word). It is also used to express a woman’s affection towards a man. Even if, as a woman, you are just old friends with a man, you might still call him oppa. I am not an expert in Korean culture, but this should give you an idea of how widely the word oppa is used. While I, as a woman, would be calling older men “oppa”, I wondered what others would call me. How do men refer to older women? The answer is “noona”. It did not strike me back then, but I think I learned noona much later because oppa was used far more often in dramas, not because I am straight and interested in the unrealistically gorgeous oppas in those dramas. Don’t get the wrong idea! But is it true, though?
Is it true that “oppa” is more widely used than “noona”, or do I have a bias towards “oppa”? What could be the reason I learned “oppa” first? The joy of data science is that I can gather the data required to answer this question. So, let’s get solid data and see whether oppa really was used much more than noona.
Gathering the information to answer this question requires web scraping (unless you know of a database with Korean subtitles) and language processing. The first task requires minimal knowledge of HTML and CSS (Lucky! The small website maintenance tasks I did will pay off.). The second one requires some knowledge of Korean (I have heard 78d 6h 19m worth of Korean, so I have picked up a few things.). I had done neither of those before, but my awesome “Googling” skills came to the rescue.
Web scraping turned out to be frustrating. The websites I used do not offer API access to unofficial developers, so I had to rely on information parsed from the HTML. I had to log in to access most of it, and I also risked being banned, oops! I used two libraries for this task: requests and selenium. Where login was required, a captcha was often involved, and I did not want to bypass it. My solution was to keep one tab logged in, manually, at all times with selenium. Each subsequent URL was then opened in a new tab, which was closed once the information was obtained. Since I do not want to keep requesting information from these sources, I saved the most important information, along with the subtitle files, for later use.
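The tab juggling described above could look roughly like this. It is a minimal sketch, not the exact code from the project: the function name `fetch_in_new_tab` is made up for illustration, and it assumes a Selenium 4 driver whose first tab has already been logged in by hand.

```python
def fetch_in_new_tab(driver, url):
    """Open `url` in a new tab, return its HTML, then go back to the first tab.

    Assumption: `driver` is a Selenium 4 WebDriver with one tab that was
    logged in manually. The function name is hypothetical.
    """
    main_tab = driver.current_window_handle  # remember the logged-in tab
    driver.switch_to.new_window("tab")       # open and focus a fresh tab
    try:
        driver.get(url)
        return driver.page_source            # raw HTML to parse later
    finally:
        driver.close()                       # closes only the new tab
        driver.switch_to.window(main_tab)    # focus returns to the logged-in tab
```

In practice the driver would come from something like `selenium.webdriver.Chrome()`, with the login (and captcha) handled by hand before the first call.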
I had all the subtitles I could get my hands on. Now what? Do I simply search for oppa and noona in Hangul characters? No, my friends, language processing is not that simple. Korean is an agglutinative language, so particles and suffixes attach directly to words. If I search for an exact match for oppa surrounded by spaces, I will miss many other occurrences of it. And since there could be words that use the same characters but have nothing to do with oppa, I cannot simply grab every word with those characters in it. At first, I used a machine learning algorithm called my brain. I found all the unique words containing the characters of oppa and noona, then studied those forms to come up with a plan. From that list I deduced that oppa could appear at the beginning of a word or come after a name, and that it could be merged with topic markers, other suffixes, or other words. You can see the endings I extracted, including the topic markers, in the GitHub file; they may not be grammatically correct, but it was a start. I got to learn more about Korean along the way as a bonus. Since I was not satisfied with what I had done, I searched for text processing libraries and came across KoNLPy (Korean Natural Language Processing Python package). I used this package to assess my own results.
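To make the matching idea concrete, here is a small sketch using regular expressions. It is an illustration under assumptions, not the rule set from the GitHub file: the `ENDINGS` list is a short, hypothetical sample, and the "name" is approximated as one to three Hangul syllables in front of the word, which can still let some unrelated words through.

```python
import re

# Hypothetical, incomplete sample of endings (topic, subject and object
# markers, vocative, and so on); the list in the project is different.
ENDINGS = ["는", "가", "를", "도", "야", "의", "랑", "한테", "에게", "들"]

def build_pattern(term):
    """Match `term` alone, after a short name, and/or followed by a known ending."""
    endings = "|".join(sorted(map(re.escape, ENDINGS), key=len, reverse=True))
    # Optional 1-3 syllable Hangul prefix (a name), the term, an optional ending.
    return re.compile(rf"(?:[가-힣]{{1,3}})?{re.escape(term)}(?:{endings})?")

def count_term(text, term):
    """Count Hangul words in `text` that fully match the pattern for `term`."""
    pattern = build_pattern(term)
    words = re.findall(r"[가-힣]+", text)  # runs of Hangul syllables
    return sum(1 for word in words if pattern.fullmatch(word))

sample = "오빠는 어디 가? 지훈오빠한테 물어봐. 오빠! 누나도 왔어."
print(count_term(sample, "오빠"))  # 3
print(count_term(sample, "누나"))  # 1
```

Using `fullmatch` on each word is what keeps the search from grabbing every word that merely contains the characters: a word only counts if whatever follows the term is one of the listed endings.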
Out of the 105 dramas I have seen, 40 had ~30% or more of their subtitle information available. Anything below that was excluded from the analysis. With the matching algorithm I came up with, I counted oppa 1731 times and noona 1168 times. Using the Mecab tagger from KoNLPy, oppa was counted 1915 times and noona 1176 times. The numbers are pretty close, so I think I did a good job with my method. A deeper comparison of the results could further improve my Korean knowledge.
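For the KoNLPy side, the count can be sketched as below. This assumes the tagger exposes a `morphs(text)` method that returns a list of morpheme strings, as KoNLPy's taggers do; the function name `count_morphemes` is hypothetical, and which Mecab call was actually used for the numbers above is not specified here.

```python
from collections import Counter

def count_morphemes(tagger, lines, terms):
    """Count how often each term shows up as a standalone morpheme.

    Assumption: `tagger.morphs(text)` returns a list of morpheme strings
    (KoNLPy-style). `lines` is any iterable of subtitle lines.
    """
    counts = Counter()
    for line in lines:
        for morpheme in tagger.morphs(line):
            if morpheme in terms:
                counts[morpheme] += 1
    return counts
```

With KoNLPy and its Mecab backend installed, the tagger would be `konlpy.tag.Mecab()` and the call something like `count_morphemes(Mecab(), subtitle_lines, {"오빠", "누나"})`, where `subtitle_lines` is a placeholder name. Because the tagger splits off the particles itself, no hand-made list of endings is needed.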
All in all, it seems I do have an oppa bias in the dramas I watch: for every 3 “noona” there are roughly 5 “oppa”. What about the dramas I gave a score of 10, which is a rare sight? There the bias shifts. It turns out I tend to like dramas more if there are noonas: 603 vs 577. The drama with the most noona and oppa is Reply 1988. I did not have the data for the other Reply series, but I suspect their oppa count will be higher because of their concept. The drama with the strongest bias towards oppa was ‘Secret Garden’, which got a score of 6.0 from me; on the other hand, ‘W’, which had the strongest bias towards noona, got a score of 10.0.
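The ratios behind these statements are quick to check from the counts reported in this post. The snippet below only redoes the arithmetic; it reads “603 vs 577” as 603 noona and 577 oppa, and the variable names are made up for illustration.

```python
# Counts reported in this post.
manual = {"oppa": 1731, "noona": 1168}   # my own matching rules
mecab = {"oppa": 1915, "noona": 1176}    # KoNLPy Mecab
top_rated = {"oppa": 577, "noona": 603}  # dramas scored 10 (assumed reading)

def ratio(counts, a, b):
    """Return how many times `a` occurs per single occurrence of `b`."""
    return counts[a] / counts[b]

print(f"manual, oppa per noona:    {ratio(manual, 'oppa', 'noona'):.2f}")
print(f"Mecab, oppa per noona:     {ratio(mecab, 'oppa', 'noona'):.2f}")
print(f"5 to 3 for reference:      {5 / 3:.2f}")
print(f"top rated, noona per oppa: {ratio(top_rated, 'noona', 'oppa'):.2f}")
```

The Mecab counts come out closest to 5 oppa for every 3 noona, while my own counts are nearer 3 to 2. Among the dramas I scored 10, noona is ahead only slightly.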
From these results I conclude that the dramas I have seen lean towards oppa scenes. It could be true that I learned “oppa” first because it is more widely used. But since I enjoy dramas with more “noona” content, I feel safe striking out my oppa bias and replacing it with a noona bias.
This was a fun little project for me. I actually started it for mOC Data, but the idea turned into something else. I will tell you all about mOC Data, and a little bit about what it turned into, in my next post, “Ideas vs Products”. Look at me, announcing the title of my next post before it is published! What a long way I have come.