Python Assignment

In this assignment you will practice writing code to analyze more complex data structures using dictionaries. In particular you will perform some simple network analysis. Networks (what are called “graphs” in mathematics) can be used to represent a wide variety of systems, from social connections (on social media or otherwise) to organization process flows. In this assignment, you’ll be analyzing the network of hyperlinks between articles on Wikipedia.

In computer programming, a network is represented by a set of nodes (elements in the network—articles on Wikipedia, people on a social network, etc), and edges (connections between the nodes—links between articles, friendships or conversations between people, etc). Analyzing the distribution of nodes and edges in a network—what is connected to what—can provide valuable insights into the structure and behaviors of elements in that system. While there are existing packages that can be used to model and analyze networks, you’ll be doing some of that work using the basic data structures (lists and dictionaries) you’ve learned so far. But you will practice using external libraries to produce graphical visualizations of your analysis results to help communicate them!

Part 1. Reading the Data

The data you’ll be working with comes from the Stanford Network Analysis Project collection on Wikispeedia navigation paths. This data set was created by looking at the “paths” (hyperlinks followed) by humans as they tried to navigate between Wikipedia articles as quickly as possible. We’re using a portion of that data (the articles visited and the hyperlinks between them), cleaned slightly so that article names are more readable.

Note that this data is from a number of years ago (first published in 2009), and is not a complete list of all Wikipedia links—just the ones that were used by people navigating the site in this particular context. Nevertheless, it will provide a sufficiently complex network for you to analyze.

You will consider this network where the articles (the different entries on Wikipedia) are nodes, and the hyperlinks between them are edges.

You will be reading the data from a .tsv file: a plain-text data format where each line represents a record (row) of data and where features (columns) are separated by a tab (written as \t in code). The data can be found in the data/links_decoded.tsv file. Be careful about opening it up directly: even this trimmed-down collection is pretty big (a 3.1 MB file).

  1. Create a variable data_links_lines which is a list of the contents in the data/links_decoded.tsv file included with the assignment (each line in the file represents a single link between Wikipedia pages). You will need to open up the file (remember the with keyword!) and read its contents into a list. After you create the list, print out its length to see how many items it has—this is a good way to confirm that it was loaded correctly.
  • Inspect the data file by printing out the first 15 lines of the file (you can just print the first 15 elements of data_links_lines), each on its own line. Use the end argument to the print() function to avoid printing an extra line break. What do you notice about the file contents?
  • Notice that the first lines of the file aren’t actually data, but are information about the data! You’ll need to “remove” this header data for processing. Create a new variable data_links_lines_cleaned that contains all of the elements except the header. You can do this using a slice operator. Print out the first 5 elements of data_links_lines_cleaned once you are done to show that the list has been cleaned.
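The loading and header-trimming steps above might be sketched as follows. The small sample file written here is a hypothetical stand-in for data/links_decoded.tsv (so the snippet runs on its own), and the slice index assumes a single header line—inspect the real file to confirm how many lines its header occupies.

```python
# Write a tiny stand-in file so this sketch is self-contained;
# in the assignment you would open data/links_decoded.tsv instead.
sample_text = "# Source\tTarget\nÁedán_mac_Gabráin\tBede\nÁedán_mac_Gabráin\tColumba\n"
with open("links_sample.tsv", "w", encoding="utf-8") as file:
    file.write(sample_text)

# Read the file's contents into a list of lines
with open("links_sample.tsv", encoding="utf-8") as file:
    data_links_lines = file.readlines()

print(len(data_links_lines))  # confirm the file loaded by checking its length

# Print the first lines for inspection; end="" avoids doubling the line breaks
print(*data_links_lines[:15], sep="", end="")

# Slice off the header line(s); adjust the index to match the real file
data_links_lines_cleaned = data_links_lines[1:]
print(*data_links_lines_cleaned[:5], sep="", end="")
```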

After reading the file, you need to get it into a format that can be used. The file is structured where each link is written on its own line. The first term on that line is the “source” of the link (where the user navigated from) and the second term on that line is the “target” of the link (where the user navigated to). These terms are separated by a tab (“\t”).

4. Create a variable links_list which is a list of dictionaries. Each dictionary in the list will have two keys: “source”, whose value is the source of the link (the first term in the line), and “target”, whose value is the target of the link. For example, the first link will be structured as {"source": "Áedán_mac_Gabráin", "target": "Bede"}.

You will use a similar process as in the previous assignment to create this list. Remember to split the terms by using “\t” as a separator. You will also need to strip() the line break at the end.

When finished, print out the first 5 elements of the links_list variable to check that your data structure is correct. This will give you a list of links to work with!
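The parsing described above might look like this sketch, where the two sample lines stand in for data_links_lines_cleaned:

```python
# Hypothetical stand-in for the cleaned list of lines from the file
data_links_lines_cleaned = [
    "Áedán_mac_Gabráin\tBede\n",
    "Áedán_mac_Gabráin\tColumba\n",
]

links_list = []
for line in data_links_lines_cleaned:
    # strip() removes the trailing line break; split("\t") separates the terms
    source, target = line.strip().split("\t")
    links_list.append({"source": source, "target": target})

print(links_list[:5])  # check the structure of the first few links
```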

Part 2. Node Connectivity

The first set of analyses you will perform will look at the connectivity of the nodes (articles) in the data set—which articles are connected/linked to other articles. This kind of analysis can help to identify “central” and “outlier” nodes in a network, as well as provide an understanding of the network’s overall structure.

In particular, you will be considering the degree of each node in the network: meaning how many edges (hyperlinks) are connected to that node. This is technically a directed network, in that each edge in the data set indicates a hyperlink in a single direction—a link from Article A to Article B doesn’t mean that there’s necessarily a link from Article B back to Article A. So you will be considering the in-degree (how many links go “to” that article) and the out-degree (how many links go “from” that article) separately, as well as the total degree (the total number of links both in and out).

  • The first step will be to calculate the degree count of each node. Create a new variable node_degree_dict. This variable will be a dictionary where the keys are the names of each node, and the values are another dictionary containing “in” and “out” keys whose values are the in-degree and out-degree numbers respectively. Thus, your variable will have the format:

{
    'Áedán_mac_Gabráin': {'out': 11},
    'Bede': {'in': 27, 'out': 12},
    'Columba': {'in': 13, 'out': 10},
    'Dál_Riata': {'in': 14, 'out': 18},
    'Great_Britain': {'in': 180, 'out': 35}
}

Yes, this is a dictionary of dictionaries! That means you’ll need multiple levels of access.

To create this dictionary, think about looping through all of the links, adding 1 to the “in” or “out” of the appropriate dictionaries. This will be a similar process to the count/frequency calculations you’ve done before. Remember that each link will need to add to the “in” of one node and the “out” of another.
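The counting loop described above might be sketched like this, using a tiny hypothetical links_list in place of the real data (this version initializes both keys to 0, per the tip below):

```python
# Hypothetical stand-in for the full links_list
links_list = [
    {"source": "A", "target": "B"},
    {"source": "A", "target": "C"},
    {"source": "B", "target": "C"},
]

node_degree_dict = {}
for link in links_list:
    # Make sure both endpoints have an entry with both keys before counting
    for name in (link["source"], link["target"]):
        if name not in node_degree_dict:
            node_degree_dict[name] = {"in": 0, "out": 0}
    node_degree_dict[link["source"]]["out"] += 1  # the link goes out of the source
    node_degree_dict[link["target"]]["in"] += 1   # the link comes into the target

print(node_degree_dict)
```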

  • Tip: Throughout this assignment, I recommend including data type in your variable names (e.g., variable_string, variable_dict, etc). This can help a lot in keeping track of all of the pieces of this complex data structure!
  • Be careful about printing out the entire dictionary (there are 4592 unique nodes!). But you can use the above example to check some specific values.
  • While not required, it can be useful to make sure that every value dictionary has both an “in” and an “out” key, even if the values at those keys are 0. This will allow you to use bracket notation instead of .get() calls more commonly throughout the assignment.
  • What are the in- and out-degrees for the article “Rome”? Access the appropriate element in your node_degree_dict and print out its value.
  • Which article has the highest in-degree, and which has the highest out-degree (these could be the same article!)? Use a linear search to determine your answer, and print out the names of each article and its degrees. You can use two different searching loops to find these values, or do both in the same loop.
  • This is another way of asking “which article shows up the most as a target, and which shows up most as a source”?
  • There are actually many articles that share the same minimum total degree (the total of its in- and out- degrees). Use a filter to produce a list of all article names that have a total degree value of 1. You can use a comprehension or a basic loop to do this filtering.
  • Define a function called get_nodes_with_min_degree(). This function will expect a number as an argument, and return a list of article names whose total degree is equal to or greater than that argument.

You can test your function by calling it; there are 20 nodes whose total degrees are at least 500.
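The filter and the function described above might be sketched as follows; the small node_degree_dict here is a hypothetical stand-in for the real one:

```python
# Hypothetical stand-in for the full node_degree_dict
node_degree_dict = {
    "A": {"in": 0, "out": 2},
    "B": {"in": 1, "out": 1},
    "C": {"in": 2, "out": 0},
    "D": {"in": 1, "out": 0},
}

# Filter: all article names whose total degree is exactly 1
degree_one_list = [
    name for name, degrees in node_degree_dict.items()
    if degrees["in"] + degrees["out"] == 1
]

def get_nodes_with_min_degree(min_degree):
    """Return a list of article names whose total degree is >= min_degree."""
    return [
        name for name, degrees in node_degree_dict.items()
        if degrees["in"] + degrees["out"] >= min_degree
    ]

print(degree_one_list)
print(get_nodes_with_min_degree(2))
```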

  1. Call your get_nodes_with_min_degree function and print out all of the nodes that have a total degree of at least 500. In a comment (with #), write a few sentences considering what you notice about these articles. For example, what do they have in common? What might that suggest about information seeking behavior on Wikipedia?

Part 3. Visualizing Node Connectivity

One of the best ways to communicate results is to do it visually, through plots, charts, or other visualizations. Luckily, Python includes a number of highly effective plotting libraries you can use to make visualizations! For this step you will use the matplotlib library, which is the standard and foundational package for creating visualizations in Python.

The below cell will import matplotlib’s pyplot module (calling it plt, which is the conventional nickname).

  1. The first question you’ll visualize is “what is the distribution of nodes with a particular degree?” (e.g., “are there lots of highly-connected nodes, or lots of lowly-connected nodes?”). You can do this using a scatter plot, plotting each node by its in-degree and out-degree.

To make this chart, you’ll need to first produce an appropriate data structure for the plotting library to draw. In this case, you’ll need two lists—one of all the in-degree values (called in_degree_list), and one of all the out-degree values (called out_degree_list).

  • You can produce these lists by mapping the values from the node_degree_dict. This is a great place for a list comprehension (one for each new list)!
  • Importantly these lists will need to be parallel, so that the in-degree and out-degree of a node are at the same index of each list (e.g., “United_States” in-degree might be at in_degree_list[42], which means its out-degree would need to be at out_degree_list[42]—the indexes need to match!). This should happen naturally as long as you use a similar process for making each degree list.
  1. Now you can draw your plot! To do this, use the plt.scatter() function. You will need to pass this function two named arguments: an x for what values go on the x-axis (your in_degree_list), and a y for what values go on the y-axis (which will be your out_degree_list).

Additionally, matplotlib supports a huge range of different options for customizing and styling plots! Search the documentation for how to change e.g., the color of the dots (adding a different edgecolor around them makes them more visible), or make them a different size or shape. You should make at least one visual/aesthetic change to the default plot based on your own research and aesthetic preferences (but you can customize this as much as you wish)!

In addition to your visual/color/size change, also be sure to add labels for each axis (x and y) as well as a title for the chart. You can also find details on that in online documentation.
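The mapping comprehensions and scatter call might be sketched like this, using a small hypothetical node_degree_dict; the color choices are just examples of the kind of customization described above:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in for the full node_degree_dict
node_degree_dict = {
    "A": {"in": 0, "out": 2},
    "B": {"in": 1, "out": 1},
    "C": {"in": 2, "out": 0},
}

# Parallel lists: dictionaries preserve insertion order, so iterating the
# same dict for both comprehensions keeps the indexes aligned
in_degree_list = [degrees["in"] for degrees in node_degree_dict.values()]
out_degree_list = [degrees["out"] for degrees in node_degree_dict.values()]

plt.scatter(x=in_degree_list, y=out_degree_list, color="teal", edgecolor="black")
plt.xlabel("In-degree")
plt.ylabel("Out-degree")
plt.title("In-degree vs. out-degree of articles")
plt.savefig("scatter.png")  # in a notebook, plt.show() would display it instead
```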

  1. As you can see from your visualization, most of the dots are clustered in a particular range. To get more detail on how these are distributed, plot the count of nodes with a particular total degree as a histogram.

Again, the first step will be to prepare the data. In this case, you just need a list that contains all of the different total degree values for each node (matplotlib will handle the counting on its own!). Name this list total_degree_list.

  • You can produce this list using a very similar process to how you created the in_degree_list and out_degree_list!
  • Because there are some extreme outliers (like the “United_States”), you should also filter this list to only contain total degrees that are less than 500. You can do this filtering at the same time you do the mapping (all in one list comprehension!)
  1. To draw the plot, use the plt.hist() function. Pass in your total_degree_list as the first argument, and an additional argument bins indicating how many different “groups” the counts should be put into (20 is a good default).
  • You can also adjust the color and edgecolor, or otherwise customize your plot appearance!
  • Be sure to give your plot a title and labels for the x and y axes (the y axis could be “number of articles”)
  1. In a comment (with #), write a few sentences considering what you notice about the distribution of articles based on this visualization. What does it tell you about how many connections articles tend to have?
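The histogram data prep and plot might be sketched as follows; the mapping and the outlier filter happen in one comprehension, and the small node_degree_dict is again a hypothetical stand-in:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in for the full node_degree_dict
node_degree_dict = {
    "A": {"in": 0, "out": 2},
    "B": {"in": 1, "out": 1},
    "C": {"in": 600, "out": 35},  # an extreme outlier that gets filtered out
}

# Map each node to its total degree, filtering out extreme outliers
total_degree_list = [
    degrees["in"] + degrees["out"]
    for degrees in node_degree_dict.values()
    if degrees["in"] + degrees["out"] < 500
]

plt.hist(total_degree_list, bins=20, color="teal", edgecolor="black")
plt.xlabel("Total degree")
plt.ylabel("Number of articles")
plt.title("Distribution of article total degrees")
plt.savefig("histogram.png")
```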

Part 4. Network Paths

The data you’re analyzing was built around understanding the paths that users took when navigating from one Wikipedia article to another. In this section, you will analyze the path distance between articles — that is, how many links it takes to move between them.


In order to more reasonably calculate the path lengths in the network, you will use another package called networkx. This package is used to do network analysis (and indeed could help answer many of the questions from the previous sections). But since this assignment is about practicing with dictionaries, you’ll just use it for some initial calculations (and some plotting at the end). You must not use this library for any question in this assignment that does not explicitly instruct you to.

The below cell will import the networkx package (calling it nx, which is the conventional nickname).

  1. Create a networkx-style data structure (called a DiGraph) to represent the network graph. There are two steps to doing this:
  2. Create a variable links_tuple_list which is a list of tuples where each tuple represents a link (the first element of the tuple is the “source” and the second element is the “target”). You can do this by mapping the links_list list-of-dictionaries into a new list-of-tuples. This is a great place for a list comprehension!
  3. Call the nx.DiGraph() function, passing in your links_tuple_list as an argument. This function will return a new value (which you can call e.g., graph) that is the DiGraph representing the network.
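These two steps might be sketched as follows, using a tiny hypothetical links_list:

```python
import networkx as nx

# Hypothetical stand-in for the full links_list
links_list = [
    {"source": "A", "target": "B"},
    {"source": "B", "target": "C"},
]

# Map the list of dictionaries into a list of (source, target) tuples
links_tuple_list = [(link["source"], link["target"]) for link in links_list]

# DiGraph() accepts a list of edge tuples as its initial data
graph = nx.DiGraph(links_tuple_list)
print(graph.number_of_nodes(), graph.number_of_edges())
```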

Now that you have a DiGraph value, you can go ahead and calculate the shortest path distance between each pair of nodes! There are many fascinating algorithms for determining this, though coding them goes beyond the scope of this course. The nx.all_pairs_shortest_path_length() function will calculate these path distances… but the limitations of the Ed platform mean that the network is too large to compute them all.

To address this, you’ll create a sub-graph (a portion of the whole graph) that contains a random sampling of the articles, and then analyze the path distances between those articles. While this won’t be the full network for this assignment, the code you write could easily be applied to the full data set if there was sufficient computer memory.

17. Begin by getting a random sample of 500 articles. You can do this by using the random.sample() function (remember to import the random library). You will need to sample the full list of article names—you can get this list by accessing the list of keys of the node_degree_dict. Save this list as sample_node_list.

  • To make the random output repeatable (and the rest of the instructions work), use a randomization seed value of 2614, specified by calling random.seed(2614) before you sample the list.

18. Define a new variable sample_graph which will be a new DiGraph that is a subgraph containing only the nodes you sampled. You can do this by calling the .subgraph() method on your DiGraph value (the graph), passing in the sample_node_list as an argument. This method will return the new DiGraph which you can assign to a variable.

19. Now that you have a subgraph, calculate the distances between each node by calling the nx.all_pairs_shortest_path_length() function, passing in your sample_graph value as an argument. Convert the result of this function into a dictionary (using the dict() function), and save that in a new variable called path_lengths_dict.

  • Note that the function can take a few seconds to run—calculating the shortest path is a slow algorithm!
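Steps 17–19 might be sketched like this on a tiny stand-in graph (the real code would sample 500 names from node_degree_dict.keys() and use your full graph):

```python
import random
import networkx as nx

# Hypothetical stand-in for the full DiGraph
graph = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])

random.seed(2614)  # fixed seed so the sample is repeatable
sample_node_list = random.sample(list(graph.nodes), 3)  # 500 in the real code

# Build the subgraph and compute all pairwise shortest path lengths
sample_graph = graph.subgraph(sample_node_list)
path_lengths_dict = dict(nx.all_pairs_shortest_path_length(sample_graph))
print(path_lengths_dict)
```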

The path_lengths_dict is a dictionary of dictionaries: each key in the “outer” dictionary is the name of an article, and the value is a dictionary of path distances (how many links are needed) to get to each other article. For example:

{
    'Áedán_mac_Gabráin': {'Áedán_mac_Gabráin': 0, 'Monarchy': 1, 'Picts': 1, 'Great_Britain': 1, … },
    …
}

This indicates that the path distance from ‘Áedán_mac_Gabráin’ to itself is 0 (it takes 0 links to move between those articles), the path distance to ‘Monarchy’ is 1 (it takes 1 link to move between those articles), etc.

  • Importantly, any articles that do not have a path between them are not included.

20. Access the appropriate value in this dictionary to print the path distance from “China” to “Rome” (it should be 2).

  • If either article isn’t in your sample graph (and so doesn’t have a distance), make sure you’ve used the correct random seed value!

21. In the subgraph, which two (connected) articles have the longest distance between them? Print out the article titles as well as the distance (in the event of a tie, print any one pair of articles with that distance–you don’t need them all).

  • You’ll need to use a linear search, for a nested data structure—that means a nested loop!
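The nested linear search might be sketched as follows, on a small hypothetical path_lengths_dict:

```python
# Hypothetical stand-in for the full path_lengths_dict
path_lengths_dict = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"B": 0, "C": 1},
    "C": {"C": 0},
}

longest_source = None
longest_target = None
longest_distance = -1
for source, targets in path_lengths_dict.items():  # outer loop: each article
    for target, distance in targets.items():       # inner loop: each reachable article
        if distance > longest_distance:
            longest_distance = distance
            longest_source = source
            longest_target = target

print(longest_source, longest_target, longest_distance)
```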

22. Since the subgraph was randomly sampled, are there any articles that are disconnected—that is, there is no path to or from that node from any other node in the sample? Create a new variable called disconnected_nodes which is a list of the article names in the subgraph that are disconnected. Print this variable.

  • This is another linear search!
  • Remember that each article in the path dictionary notes a path length of 0 to itself (this does not mean it is connected).
  • You will need to check both directions: that a node isn’t the source of a path, but also that it isn’t a target from another node.
  • This is a filter, so technically could be done with a list comprehension… but I’d recommend using basic loop structure for clarity and debugging.
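Following the hints above, the disconnected-node check might be sketched like this on a hypothetical path_lengths_dict; a node counts as disconnected only if it reaches nothing but itself and nothing else reaches it:

```python
# Hypothetical stand-in for the full path_lengths_dict
path_lengths_dict = {
    "A": {"A": 0, "B": 1},
    "B": {"B": 0},
    "C": {"C": 0},  # C has no path to or from any other node
}

disconnected_nodes = []
for node in path_lengths_dict:
    # More than one entry means a path to something besides itself
    has_outgoing = len(path_lengths_dict[node]) > 1
    # Check the other direction: does any other node have a path to this one?
    has_incoming = False
    for other, targets in path_lengths_dict.items():
        if other != node and node in targets:
            has_incoming = True
    if not has_outgoing and not has_incoming:
        disconnected_nodes.append(node)

print(disconnected_nodes)
```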

23. Define a function called get_neighborhood(). This function will expect two arguments: the name of an article (a string) and a path distance (a number). The function will return a list of all articles that are within that distance of the argument article (based on the existing path_lengths_dict). So calling get_neighborhood("Seattle,_Washington", 3) will return a list of all articles that have a shortest path distance that is less than or equal to 3 from the article “Seattle,_Washington”. This distance should be considered in both directions: so articles that can get to Seattle in 3 steps, and articles you can get to from Seattle in 3 steps.
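One possible sketch of get_neighborhood(), again on a small hypothetical path_lengths_dict (this version excludes the article itself from the result; include it if your interpretation differs):

```python
# Hypothetical stand-in for the full path_lengths_dict
path_lengths_dict = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"B": 0, "C": 1},
    "C": {"C": 0},
}

def get_neighborhood(article_name, max_distance):
    """Return articles within max_distance of article_name, in either direction."""
    neighborhood = []
    for other, targets in path_lengths_dict.items():
        if other == article_name:
            continue
        # outgoing: article_name can reach `other` (None if no path exists)
        outgoing = path_lengths_dict[article_name].get(other)
        # incoming: `other` can reach article_name
        incoming = targets.get(article_name)
        if (outgoing is not None and outgoing <= max_distance) or \
           (incoming is not None and incoming <= max_distance):
            neighborhood.append(other)
    return neighborhood

print(get_neighborhood("B", 1))
```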

24. Call your get_neighborhood() function, passing in an article title (such as “Seattle,_Washington”) and a small path distance (3 is a good value). Print out the size of the resulting neighborhood (the number of nodes in the returned list).

One last small step: one of the most fun things to do with a network is to visualize it! Luckily, we can easily do that with the networkx package.

It isn’t possible to distinguish between all 4592 articles in a single visualization, or even all 500 nodes in the sample graph. Instead, you should just visualize a smaller piece (another subgraph!) of the network — just a single neighborhood!

25. Call your get_neighborhood() function to get a small neighborhood of an article with a small total degree count. I found the neighborhood around “Bird_of_prey” to be fairly readable, but you can pick another article from your sample graph if you wish. Start with a small neighborhood (2 or 3 steps) and then grow it out to see what you get.

Once you have your neighborhood, call the .subgraph() method on your sample_graph value, passing in the neighborhood list of nodes as an argument. This method will return a new DiGraph; you can assign it to a variable, e.g., neighborhood_graph.

26. To create the visualization, call the nx.draw_spring() function, passing in the subgraph as an argument. Also pass in a with_labels=True argument to have the article names show up.

  • You can make the figure larger by including the line of code plt.figure(figsize=(WIDTH, HEIGHT)), replacing WIDTH and HEIGHT with your desired width and height (in inches).
  • Yes, the visualization may look messy if it’s not small enough, but that’s okay—one of the challenges of working with networks!
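Putting steps 25 and 26 together might look like this sketch, on a tiny hypothetical sample_graph and neighborhood list:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical stand-ins for your sample_graph and neighborhood list
sample_graph = nx.DiGraph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
neighborhood_list = ["A", "B", "C"]  # e.g., the result of get_neighborhood()

neighborhood_graph = sample_graph.subgraph(neighborhood_list)

plt.figure(figsize=(8, 6))  # width and height in inches
nx.draw_spring(neighborhood_graph, with_labels=True)
plt.savefig("neighborhood.png")
```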