User modeling helps us predict users’ behavior and interactions. A user model derived from a user can be applied in the personalized system that the user interacts with, for example to improve a recommender system. The “cold start” problem is one of the principal challenges in user modeling, personalization, and recommender systems, and it affects every user model built only from data inside a single system. When a user is new, the system holds only sparse data about them, which leads to inaccurate predictions of the user’s behavior and, in turn, to incorrect personalization and unsuitable recommendations. To overcome this problem, it is possible to use the public profiles that the user maintains on other social media. This approach is called cross-system user modeling. The problem we are trying to solve is retrieving metadata from users’ public profiles on YouTube and Twitter in order to improve personalization in recommender systems.
- Which features are more important in cross-system modeling on Twitter and YouTube?
- How accurately can we predict selected features using other features?
- Which user models can make use of the relationships among these features?
We couldn’t use existing open-source datasets because of the way they handle ethics: all of them preserve users’ privacy and don’t provide real names, addresses, or any other personal information, so we can’t look up the same users’ accounts on other social networks. A questionnaire was not an option for our research either, since people with both an active YouTube channel and a Twitter account were rare among the participants we could reach. We tackled this problem by using the most popular channels on YouTube, starting from an existing dataset called “Top 5000 YouTube channels.”
To reach the dataset we needed, we first added a channel ID to each record. These IDs let us crawl and fetch more data about each channel, including but not limited to the About page, the channel’s latest videos and updates, and YouTube’s analytical data for the channel. We then picked out the YouTube channels that had a Twitter account link on their About page, purged broken or protected Twitter accounts, and collected as much data as we could through the Twitter API. Examples are tweets and their metadata, and followers and followings along with their metadata.
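The step of picking out channels with a Twitter link on their About page can be sketched as follows. This is only an illustration, not the crawler used in the thesis: the function name `extract_twitter_handle`, the regular expression, and the list of reserved paths are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumed pattern: a twitter.com link followed by a handle of 1-15 letters,
# digits, or underscores. The lookbehind rejects look-alike domains such as
# "nottwitter.com"; the lookahead rejects paths longer than a valid handle.
TWITTER_LINK = re.compile(
    r"(?<![\w.-])(?:https?://)?(?:www\.|mobile\.)?twitter\.com/"
    r"(?:#!/)?@?([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",
    re.IGNORECASE,
)

# Assumed (incomplete) set of twitter.com paths that are not user accounts.
RESERVED_PATHS = {"share", "intent", "home", "search", "hashtag", "i"}


def extract_twitter_handle(about_text: str) -> Optional[str]:
    """Return the first Twitter handle linked in an About-page text, or None."""
    for match in TWITTER_LINK.finditer(about_text):
        handle = match.group(1)
        if handle.lower() not in RESERVED_PATHS:
            return handle
    return None


if __name__ == "__main__":
    sample = "Business inquiries by mail. Follow us: https://twitter.com/ExampleChannel"
    print(extract_twitter_handle(sample))
```

Channels for which this returns `None` would simply be dropped from the intersection; checking whether the account is broken or protected still requires a call to the Twitter API.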
Along the way of collecting such a dataset, we faced some unique challenges. Both Twitter and YouTube (Google) impose heavy restrictions on API usage, for understandable reasons. That made the cycle of implementing and testing more time-consuming than usual. And since the people in the final dataset are “internet famous,” testing on ordinary users is left for future research. In the end, we produced a dataset of roughly 300 records covering channels that are present on both Twitter and YouTube.
For this research, we picked only a few of the dataset’s features, the ones we ultimately assumed to be the most important:
- YouTube view count: Each channel on YouTube shows its total view count on the About page of that channel.
- YouTube subscriber count: Each channel on YouTube shows its total number of subscribers, the people who are notified of new content on that channel.
- YouTube uploaded video count: Total number of videos uploaded to a channel.
- Twitter follower count: Total number of people who follow a person on Twitter.
We used a correlation heatmap to spot potential associations among the selected features in our dataset. The following conclusions can be drawn from the heatmap:
- A close connection between subscriber count and total view count
- Almost no connection between uploaded video count and total view count
- Comparable importance of Twitter follower count and YouTube subscriber count
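The matrix behind such a heatmap can be computed in a few lines. The sketch below assumes pandas and Pearson correlation (the report does not say which coefficient was used), and the six rows and all column names are made up purely for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Made-up illustrative rows; the real dataset has roughly 300 channels.
channels = pd.DataFrame(
    {
        "yt_view_count": [21_000_000_000, 9_500_000_000, 4_200_000_000,
                          3_100_000_000, 1_800_000_000, 900_000_000],
        "yt_subscriber_count": [95_000_000, 40_000_000, 22_000_000,
                                12_000_000, 9_000_000, 4_000_000],
        "yt_video_count": [1_200, 15_000, 300, 4_500, 80, 2_000],
        "tw_follower_count": [14_000_000, 6_000_000, 2_500_000,
                              3_000_000, 700_000, 450_000],
    }
)

# Pairwise Pearson correlation between all four features (an assumption;
# method="spearman" would give rank correlation instead).
corr = channels.corr(method="pearson")
print(corr.round(2))

# To draw the heatmap itself, the matrix could be passed to a plotting
# library, for example seaborn.heatmap(corr, annot=True).
```

Values close to 1 or -1 in the matrix correspond to the strongly colored cells of the heatmap, and values close to 0 to the pale ones.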
Then we applied a regression algorithm from the Sklearn (scikit-learn) library in Python to predict the total YouTube view count. We obtained the following results on our dataset:
- Average view count of 3,550,524,704.7
- Maximum residual error of 104.61
- Average absolute error of 27.27
- Average execution time of 372 ms
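For readers who want to reproduce this kind of measurement, here is a minimal sketch of how the three metrics above can be obtained with scikit-learn. It assumes ordinary linear regression (the report does not name the specific algorithm), an 80/20 train/test split, and synthetic data generated on the spot, so its numbers are not comparable to the results listed above.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import max_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~300-record dataset (not real channel data).
rng = np.random.default_rng(0)
n = 300
subscribers = rng.uniform(1e5, 5e7, n)
videos = rng.uniform(10, 5000, n)
followers = rng.uniform(1e3, 2e7, n)
views = 250 * subscribers + rng.normal(0, 1e8, n)  # assumed relationship

X = np.column_stack([subscribers, videos, followers])
y = views

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Time the fit-and-predict cycle in milliseconds.
start = time.perf_counter()
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
elapsed_ms = (time.perf_counter() - start) * 1000

max_err = max_error(y_test, predictions)           # largest absolute residual
mae = mean_absolute_error(y_test, predictions)     # average absolute residual

print(f"maximum residual error: {max_err:,.2f}")
print(f"average absolute error: {mae:,.2f}")
print(f"execution time: {elapsed_ms:.1f} ms")
```

Both error metrics are expressed in the units of the target, so any scaling applied to the view count before fitting changes their magnitude.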
It is feasible to add these results to static and stereotype user models as additional features, alongside the features already available, and to use them for recommendation and personalization. The regression runs in acceptable time and with acceptable accuracy, given the average view count and the size of the dataset. To mitigate the cold-start problem for a newly joined creator who is already well established on Twitter, we can use the Twitter follower count in place of the YouTube subscriber count to make their content more discoverable.
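The cold-start substitution described above could look roughly like this inside a user model. Everything here is hypothetical: the `CreatorProfile` class, the `audience_signal` function, and the `min_subscribers` threshold of 1000 are names and values chosen for the example, not part of the thesis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreatorProfile:
    youtube_subscribers: Optional[int]  # None when the channel has no history yet
    twitter_followers: Optional[int]    # None when no linked Twitter account was found


def audience_signal(profile: CreatorProfile, min_subscribers: int = 1000) -> Optional[int]:
    """Pick the audience-size feature to feed into the user model.

    Uses the YouTube subscriber count when the channel has enough history,
    and otherwise falls back to the Twitter follower count if one is known.
    """
    subs = profile.youtube_subscribers
    if subs is not None and subs >= min_subscribers:
        return subs
    if profile.twitter_followers is not None:
        return profile.twitter_followers
    return subs  # nothing better is available


if __name__ == "__main__":
    newcomer = CreatorProfile(youtube_subscribers=12, twitter_followers=800_000)
    print(audience_signal(newcomer))
```

A direct substitution like this ignores any difference in scale between follower and subscriber counts; a real system would probably rescale one onto the other first.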
- Check the content of images, videos, and texts of Tweets and videos on YouTube.
- Check YouTube links in a Tweet.
- Check based on YouTube channel classification.
- Check videos and tweets head to head.
- Add other common social media like Facebook and Instagram.
- Create a system for collecting users’ public data for use on other social media.
This is a brief report on my master’s thesis, titled “Cross-system social web user modeling personalization of recommender system,” which mainly focuses on social computing between Twitter and YouTube to help YouTube creators. It was originally written in Persian at Shahid Beheshti University under the supervision of Dr. Elaheh Homayounvala. The paper is currently being written.