python activity2

Analyze the social network of movie stars

Social network analysis is a branch of data science that allows the investigation of social structures using networks and graph theory. It can help to reveal patterns in voting preferences, aid the understanding of how ideas spread, and even help to model the spread of diseases.

A social network is made up of a set of nodes (usually people) joined by links, or edges, that describe their relationships. In this article we analyse the social network formed by movie actors. Each actor in this network is represented as a node. Two actors are joined by an edge if they are known to have appeared in a movie together. This information is taken from the Internet Movie Database (IMDb). Our analysis is carried out using the Python programming language and, in particular, the tools available in the NetworkX library.

An example movie record:

{
"title": "Back to the Future",
"cast": ["Michael J. Fox", "Christopher Lloyd", "Lea Thompson", "Crispin Glover", "Thomas F. Wilson", "Claudia Wells", "James Tolkan", "Marc McClure", "Wendie Jo Sperber"],
"directors": ["Robert Zemeckis"],
"producers": ["Bob Gale", "Neil Canton"],
"companies": ["Amblin Entertainment", "Universal Pictures"],
"year": 1985
}
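
A record like this can be turned into a Python dictionary with json.loads. The snippet below is only a minimal sketch: it uses a shortened copy of the record above, and the variable names record_text and record are chosen here for illustration.

```python
import json

# A shortened copy of the record above, written with plain ASCII quotes
record_text = '''
{
  "title": "Back to the Future",
  "cast": ["Michael J. Fox", "Christopher Lloyd", "Lea Thompson"],
  "directors": ["Robert Zemeckis"],
  "year": 1985
}
'''

record = json.loads(record_text)  # JSON object -> Python dict
print(record["title"])            # the title is a string
print(len(record["cast"]))        # the cast is a list of names
print(record["year"] + 1)         # the year is an integer, so arithmetic works
```

Strings, lists and numbers in the JSON text become str, list and int values in Python, which is why the later steps can index into the cast directly.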

Steps

  1. import packages
import json
import networkx as nx  # used to create the network
import matplotlib.pyplot as plt
import collections
import statistics
import time
import random
from pyvis.network import Network  # used later for the interactive display
  2. create a movie list
Movies = []
with open("./data.json", "r", encoding="utf-8") as f:  # open the file
    for line in f:  # each line holds one JSON record
        J = json.loads(line)
        Movies.append(J)  # add the movie to the list

Movies is a list of dictionaries, so each movie can be reached by its index in the list and each field by its key.
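
As a rough illustration of that access pattern, the sketch below uses a two-record stand-in for Movies. The second record is made up, and it deliberately lacks a "year" field to show one way of handling missing keys.

```python
# Stand-in for the Movies list built above; the second record is hypothetical
Movies = [
    {"title": "Back to the Future", "cast": ["Michael J. Fox", "Christopher Lloyd"], "year": 1985},
    {"title": "Hypothetical Film", "cast": ["Actor A", "Actor B"]},
]

first = Movies[0]        # index into the list to pick one movie
print(first["title"])    # key into the dictionary to pick one field
print(first["cast"][1])  # the cast is itself a list, so it is indexed too

# Not every record has every field, so .get() avoids a KeyError
print(Movies[1].get("year", "unknown"))
```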

  3. search for the movies that meet a requirement

Movies can be looked up by a requirement such as the name of an actor in the cast.

def search_films(key, value):
    # return the indices of the movies whose `key` field contains `value`
    films = []
    for i in range(len(Movies)):
        if key in Movies[i]:
            if value in Movies[i][key]:
                # films.append(Movies[i]["title"])  # use this instead to collect the names of the movies
                films.append(i)
    return films

search_films('cast', 'Jackie Chan')

[26,911,1084,1103,1161,1365,1402,1793,1980,2131,2210,2394,2395,2396,2397,2398,2399,2400,2401,2412,2422,2423,2424,2425,2436,2604,3143,3406,3491,3492,3912,4277,4470,4479,5039,5040,5340,6477,6579,7248,7902,7916,7920,8087,8705,9152,9257,9555,9642,9644,9645,10260,11121,11319,11327,11523,11525,11667,11840,11975,12316,12914,13011,13012,13013,13014,15651,15727,16454,19272,21174,21513,22601,23022,24986,25142,26267,28142,28188,30893,31846,35553,35973,36505,36549,36729,38362,38444,41311,41739,44043,44107,45903,51663,51822,56326,62100,71516,72118,74710,74964,75838,75924,78144,78301,78499,83908,89644,96923,97984,99572,100993,102035,109576]

  4. graph

① first, create an empty graph to hold the social network

G = nx.Graph()  # create a new graph
nx.draw_networkx(G) # plot the graph

[Figure DP68yR.png: plot of the empty graph]

② in NetworkX, .add_edge() adds an edge to the graph and .add_node() adds a node

G.add_edge('Alice', 'Bob', title='abc')  # title is a keyword attribute stored on the edge, used here to label it
nx.draw_networkx(G)

[Figure DP6ffg.png: plot of the graph with one edge between Alice and Bob]

③ a fuller example graph:

G.add_edge('Alice', 'Bob', title='abc')
G.add_edge('Alice', 'Cindy', title='abc')
G.add_edge('Cindy', 'Bob', title='abc')
G.add_edge('Cindy', 'Dale', title='def')
G.add_edge('Dale', 'Eric', title='def')
G.add_edge('Frank', 'Dale', title='def')
G.add_edge('Grace', 'Helen', title='ghi')
G.add_node('Iric', title='jkl')

plt.figure(figsize=(8,6))
nx.draw_shell(G, with_labels=True)

[Figure DP6OtU.png: shell layout of the example graph]

nx.shortest_path(G, source, target) returns the shortest path between two actors or actresses. G.degree(v) returns the number of edges attached to node v, which tells us how many people he or she is directly connected to.
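
To make the two calls concrete, here is a small sketch that rebuilds the example graph from ③ and queries it. The choice of 'Alice', 'Eric' and 'Grace' as query nodes is only for illustration.

```python
import networkx as nx

# Rebuild the example graph from step ③
G = nx.Graph()
G.add_edge('Alice', 'Bob', title='abc')
G.add_edge('Alice', 'Cindy', title='abc')
G.add_edge('Cindy', 'Bob', title='abc')
G.add_edge('Cindy', 'Dale', title='def')
G.add_edge('Dale', 'Eric', title='def')
G.add_edge('Frank', 'Dale', title='def')
G.add_edge('Grace', 'Helen', title='ghi')
G.add_node('Iric', title='jkl')

path = nx.shortest_path(G, 'Alice', 'Eric')
print(path)              # ['Alice', 'Cindy', 'Dale', 'Eric']
print(G.degree('Dale'))  # 3 (Cindy, Eric and Frank)
print(G.degree('Iric'))  # 0, an isolated node

# shortest_path raises nx.NetworkXNoPath when the nodes are not connected,
# so it can help to check with has_path first
print(nx.has_path(G, 'Alice', 'Grace'))  # False
```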

  5. the final graph
G = nx.Graph()
colours = ["#DA70D6", "#87CEFA", "#00FF00"]  # we can use different colours for different films
a = 0  # number of matching films seen so far
for movie in Movies:
    if 'Marvel Studios' in movie.get("companies", []) and "Iron Man" in movie["title"]:
        cast = movie.get("cast", [])
        colour = colours[min(a, len(colours) - 1)]  # the third and any later film share the last colour
        # join every pair of actors who appear in this film
        for j in range(len(cast)):
            for n in range(j + 1, len(cast)):
                G.add_edge(cast[j], cast[n], title=movie["title"], color=colour)
        a = a + 1

# Network comes from pyvis: from pyvis.network import Network
displayG = Network(width="1024px", height="768px", notebook=True, heading="Iron Man")  # create a pyvis Network object; the width, height and whether to display in the current notebook can be set

displayG.from_nx(G)  # import the NetworkX graph into pyvis

displayG.show('test.html')  # display the graph and generate a web page file named "test.html"

[Figure DEkes0.png: interactive pyvis display of the Iron Man network]
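
The pairing of cast members in the code above can also be written with itertools.combinations. This is an alternative sketch, not the code used for the figure: the film records and the names films and palette are hypothetical.

```python
from itertools import combinations
import networkx as nx

# Hypothetical stand-in records
films = [
    {"title": "Film A", "cast": ["P", "Q", "R"]},
    {"title": "Film B", "cast": ["P", "S"]},
]
palette = ["#DA70D6", "#87CEFA", "#00FF00"]

G = nx.Graph()
for k, film in enumerate(films):
    colour = palette[k % len(palette)]  # cycle through the colours
    # combinations(cast, 2) yields every unordered pair of actors once
    for actor_1, actor_2 in combinations(film["cast"], 2):
        G.add_edge(actor_1, actor_2, title=film["title"], color=colour)

print(G.number_of_nodes(), G.number_of_edges())  # 4 4
print(G["P"]["S"]["color"])                      # #87CEFA
```

If two actors share more than one film, a later add_edge call on the same pair updates the attributes of the existing edge, so that edge keeps the title and colour of the last film processed.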

  6. calculate the centrality
print(nx.closeness_centrality(G))
print(nx.degree_centrality(G))
print(nx.betweenness_centrality(G))

{'Robert Downey Jr.': 1.0, 'Terrence Howard': 0.5769230769230769, 'Jeff Bridges': 0.5769230769230769, 'Shaun Toub': 0.5769230769230769, 'Gwyneth Paltrow': 1.0, 'Don Cheadle': 0.8333333333333334, 'Scarlett Johansson': 0.625, 'Sam Rockwell': 0.625, 'Mickey Rourke': 0.625, 'Samuel L. Jackson': 0.625, 'Guy Pearce': 0.6818181818181818, 'Rebecca Hall': 0.6818181818181818, 'Stéphanie Szostak': 0.6818181818181818, 'James Badge Dale': 0.6818181818181818, 'Jon Favreau': 0.6818181818181818, 'Ben Kingsley': 0.6818181818181818}

{'Robert Downey Jr.': 1.0, 'Terrence Howard': 0.26666666666666666, 'Jeff Bridges': 0.26666666666666666, 'Shaun Toub': 0.26666666666666666, 'Gwyneth Paltrow': 1.0, 'Don Cheadle': 0.8, 'Scarlett Johansson': 0.4, 'Sam Rockwell': 0.4, 'Mickey Rourke': 0.4, 'Samuel L. Jackson': 0.4, 'Guy Pearce': 0.5333333333333333, 'Rebecca Hall': 0.5333333333333333, 'Stéphanie Szostak': 0.5333333333333333, 'James Badge Dale': 0.5333333333333333, 'Jon Favreau': 0.5333333333333333, 'Ben Kingsley': 0.5333333333333333}

{'Robert Downey Jr.': 0.23333333333333342, 'Terrence Howard': 0.0, 'Jeff Bridges': 0.0, 'Shaun Toub': 0.0, 'Gwyneth Paltrow': 0.23333333333333342, 'Don Cheadle': 0.0761904761904762, 'Scarlett Johansson': 0.0, 'Sam Rockwell': 0.0, 'Mickey Rourke': 0.0, 'Samuel L. Jackson': 0.0, 'Guy Pearce': 0.0, 'Rebecca Hall': 0.0, 'Stéphanie Szostak': 0.0, 'James Badge Dale': 0.0, 'Jon Favreau': 0.0, 'Ben Kingsley': 0.0}
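
To help read these numbers: NetworkX's degree centrality is a node's degree divided by n - 1, where n is the number of nodes. With 16 actors in the graph, a value of 1.0 means a link to all 15 others, and 0.2666... corresponds to 4 of 15. The sketch below checks that formula on a small made-up graph and ranks its nodes by betweenness; the graph and the variable names are illustrative only.

```python
import networkx as nx

# Small made-up graph: A is linked to everyone, D is linked only to A
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

n = G.number_of_nodes()
degree_c = nx.degree_centrality(G)
manual = {v: G.degree(v) / (n - 1) for v in G}  # degree divided by (n - 1)

for v in G:
    print(v, round(degree_c[v], 3), round(manual[v], 3))  # the two columns agree

# Rank the nodes from most to least central by betweenness
between_c = nx.betweenness_centrality(G)
ranking = sorted(between_c, key=between_c.get, reverse=True)
print(ranking[0])  # A, the only node lying on shortest paths between others
```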

Thoughts

As the Internet grows and the world becomes more interconnected, people are linked to one another through many kinds of events, so the relationships between them can be complex. To map such a social network and find out who is most central in it, NetworkX is a good choice, because it can build the relationship graph and calculate useful values such as centrality. The same approach works for relationships between other objects, not only people. For example, we can model the roads to a destination as a graph and then find the shortest route. This can be done with BFS (breadth-first search), which finds the route with the fewest steps. You can find more details through the link:

https://arya-1017.github.io/2020/07/13/%E3%80%8A%E7%AE%97%E6%B3%95%E5%9B%BE%E8%A7%A3%E3%80%8B%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B02/#more
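
As a rough sketch of the BFS idea mentioned above, the function below finds a path with the fewest edges in a graph stored as a plain dictionary of neighbours. The function name bfs_shortest_path and the dictionary layout are assumptions made for this example; the data reuses the names from the small example graph in step 4.

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()  # explore nodes in order of distance from start
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                if neighbour == goal:
                    # Walk back through the parents to rebuild the path
                    path = [goal]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(neighbour)
    return None  # goal cannot be reached from start

graph = {
    "Alice": ["Bob", "Cindy"],
    "Bob": ["Alice", "Cindy"],
    "Cindy": ["Alice", "Bob", "Dale"],
    "Dale": ["Cindy", "Eric", "Frank"],
    "Eric": ["Dale"],
    "Frank": ["Dale"],
    "Grace": ["Helen"],
    "Helen": ["Grace"],
}

print(bfs_shortest_path(graph, "Alice", "Eric"))   # ['Alice', 'Cindy', 'Dale', 'Eric']
print(bfs_shortest_path(graph, "Alice", "Grace"))  # None
```

Because BFS visits nodes level by level, the first time it reaches the goal it has used the fewest possible edges. This holds only when every edge counts equally; roads with different lengths would need a weighted method instead.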