Documentation for the Code
Community Analysis

lib.analysis.community.infomap_igraph(ig_graph, net_file_location=None)
Performs igraph infomap community analysis on the given graph.
Parameters:  ig_graph (object) – igraph graph object
 net_file_location (str) – location to load the graph from if it is not provided in ig_graph
Returns:  ig_graph – igraph object
 community.membership – result of the infomap community analysis
Channel Analysis

lib.analysis.channel.build_stat_dist(number_list)
Summarizes a list into a statistical distribution. An empty input list generates an empty output list.
Parameters: number_list (List) – list containing positive integers
Returns: rows_table (zip List) – a list of tuples, each in the (number, frequency) format
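A minimal sketch of the kind of summary described above, assuming the distribution is simply a count of how often each number occurs. The name `stat_dist_sketch` is hypothetical, and this is not the library's implementation:

```python
from collections import Counter


def stat_dist_sketch(number_list):
    """Return sorted (number, frequency) pairs (hypothetical helper)."""
    counts = Counter(number_list)
    # Sorting by number is an assumption made here for readability.
    return sorted(counts.items())


print(stat_dist_sketch([3, 1, 3, 2, 3, 1]))  # [(1, 2), (2, 1), (3, 3)]
print(stat_dist_sketch([]))                  # []
```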

lib.analysis.channel.conv_len_conv_refr_time(log_dict, nicks, nick_same_list, rt_cutoff_time, cutoff_percentile)
Calculates the conversation length (CL), the length of time for which two users communicate: if a message is not replied to within the Response Time (RT), the next message is considered part of another conversation. This function also calculates the conversation refresh time (CRT), which for a pair of users is the time between one conversation ending and another one starting.
Parameters:  log_dict (str) – dictionary of logs data created using reader.py
 nicks (List) – list of nicknames created using nickTracker.py
 nick_same_list – list of same_nick names created using nickTracker.py
 rt_cutoff_time (int) – Response Time (RT) cutoff to be used for CL and CRT calculations
Returns:  row_cl (zip List) – conversation length
 row_crt (zip List) – conversation refresh time
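The CL and CRT definitions can be illustrated with a small sketch. It assumes a pre-sorted list of message times (in any single unit, for example minutes) exchanged by one pair of users, and it treats a gap larger than the RT cutoff as the boundary between two conversations. The function name and input shape are hypothetical and do not reflect how the library reads `log_dict`:

```python
def cl_crt_sketch(timestamps, rt_cutoff):
    """Split sorted timestamps into conversations (hypothetical helper).

    Returns (conversation_lengths, refresh_times).
    """
    lengths, refresh_times = [], []
    if not timestamps:
        return lengths, refresh_times
    start = prev = timestamps[0]
    for t in timestamps[1:]:
        if t - prev > rt_cutoff:
            # No reply within the cutoff: close the current conversation.
            lengths.append(prev - start)
            # The gap until the next conversation is the refresh time.
            refresh_times.append(t - prev)
            start = t
        prev = t
    lengths.append(prev - start)
    return lengths, refresh_times


print(cl_crt_sketch([0, 1, 2, 10, 11, 30], rt_cutoff=3))
# ([2, 1, 0], [8, 19])
```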

lib.analysis.channel.response_time(log_dict, nicks, nick_same_list, cutoff_percentile)
Finds the response time (RT) of a message, i.e. the best guess for the time at which one can expect a reply to a message.
Parameters:  log_dict (str) – dictionary of logs data created using reader.py
 nicks (List) – list of nicknames created using nickTracker.py
 nick_same_list – list of same_nick names created using nickTracker.py
 cutoff_percentile (int) – cutoff percentile indicating statistical significance
Returns: rows_RT (zip List) – response time

lib.analysis.channel.truncate_table(table, cutoff_percentile)
The calculations of the conversation characteristics, namely RT, CL and CRT, are based on the cutoff values estimated for RT and CL. This generic function takes a two-column table and truncates it to the required percentile value. Usually the RT table, followed by the CL table, is processed through this function.
Parameters:  table (zip List) – list containing 2-tuple elements, e.g. [(0,10),(1,5)]
 cutoff_percentile (float) – cutoff indicating the statistical significance of observations on conversation characteristics, expressed as a floating point number
Returns:  truncated_table (zip List) – a truncated version of the input table, truncated to the level of statistical significance given by cutoff_percentile
 cutoff_time (int) – cutoff time value corresponding to the chosen level of statistical significance
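One plausible reading of this truncation, sketched under explicit assumptions: rows are (value, frequency) pairs sorted by value, the percentile is given on a 0 to 100 scale, and rows are kept until the cumulative frequency reaches that share of the total. The name `truncate_sketch` is hypothetical and the library's actual rule may differ:

```python
def truncate_sketch(table, cutoff_percentile):
    """Keep rows until the cumulative frequency reaches the percentile."""
    total = sum(freq for _, freq in table)
    threshold = total * cutoff_percentile / 100.0
    truncated, running, cutoff_value = [], 0, None
    for value, freq in table:
        truncated.append((value, freq))
        running += freq
        cutoff_value = value
        if running >= threshold:
            break
    return truncated, cutoff_value


table = [(0, 10), (1, 5), (2, 3), (3, 2)]
print(truncate_sketch(table, 75.0))  # ([(0, 10), (1, 5)], 1)
```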
Network Analysis

lib.analysis.network.channel_user_presence_graph_and_csv(nicks, nick_same_list, channels_for_user, nick_channel_dict, nicks_hash, channels_hash)
Creates a directed graph for each nick, in which each edge points from the nick to an IRC channel that the nick has participated in. Nick changes are tracked, and only the initial nick is shown if a user changed nicks.
Parameters:  nicks (list) – list of all the nicks
 nick_same_list (list) – list of lists mentioning nicks which belong to the same users
 channels_for_user (dict) – dictionary with nicks as keys and, as values, lists of the channels on which the user with that nick is present
 nick_channel_dict (dict) – channels and the nicks present on them
 nicks_hash (list) – hash values of nicks
 channels_hash (list) – hash values of channels
Returns:  presence_graph_and_matrix (dict) – contains adjacency matrices and graphs for Acc, Auu and Acu
 full_presence_graph (nx graph object)

lib.analysis.network.degree_analysis_on_graph(nx_graph, date=None, directed=True)
Performs degree analysis of the input graph object.
Parameters:  nx_graph (nx_object) – object to perform analysis on
 date (string) – timestamp
 directed (boolean) – True if nx_graph is directed, else False
Returns: dictionary with in_degree, out_degree and total_degree as keys for directed graphs, and degree as key for undirected graphs
Return type: dictionary

lib.analysis.network.degree_node_number_csv(log_dict, nicks, nick_same_list)
Creates two CSV files giving the number of nodes with a certain in-degree and out-degree, respectively, where the degree counts the number of nodes a node interacted with. Also produces graphs of log(degree) vs log(no. of nodes) and tries to find their equation by curve fitting.
Parameters:  log_dict (dict) – with key as dateTime.date object and value as {“data”:datalist,”channel_name”:channels name}
 nicks (list) – list of all the nicks
 nick_same_list (list) – list of lists mentioning nicks which belong to the same users
Returns:  out_degree (list)
 in_degree (list)
 total_degree (list)

lib.analysis.network.filter_edge_list(edgelist_file_loc, max_hash, how_many_top)
Reduces the edge list by selecting the top nodes through degree analysis.
Parameters:  edgelist_file_loc (str) – location of the edgelist file
 max_hash (int) – maximum possible value of the node_hash in the edgelist
 how_many_top (int) – how many top nodes to select in the new edge list
Returns: null

lib.analysis.network.identify_hubs_and_experts(log_dict, nicks, nick_same_list)
Uses the message number graph to identify hubs and experts in the network.
Parameters:  log_dict (dict) – with key as dateTime.date object and value as {“data”:datalist,”channel_name”:channels name}
 nicks (list) – list of all the nicks
 nick_same_list (list) – list of lists mentioning nicks which belong to the same users
Returns:  message_graph (nx graph) – message number graph
 top_hub (list) – list of top hubs
 top_keyword_overlap (list) – top users from keywords digest
 top_auth – list of top authorities

lib.analysis.network.message_number_bins_csv(log_dict, nicks, nick_same_list)
Creates a CSV file which tracks the number of messages exchanged in a channel for 48 bins of half an hour each, distributed over the day and aggregated over the year.
Parameters:  log_dict (dictionary) – dictionary of logs data created using reader.py
 nicks (List) – list of nicknames created using nickTracker.py
 nick_same_list (List) – list of same_nick names created using nickTracker.py
Returns:  bin_matrix (list of lists) – a list of lists of 48 bins with the number of messages sent in each bin
 tot_msgs – total messages exchanged
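The half-hour binning can be sketched as follows. This assumes each message time is already available as an (hour, minute) pair; the helper names are hypothetical and the parsing of `log_dict` is left out:

```python
def half_hour_bin(hour, minute):
    """Map a time of day to one of 48 half-hour bins (0..47)."""
    return hour * 2 + minute // 30


def bin_messages_sketch(times):
    """Count messages per half-hour bin for one day (hypothetical helper)."""
    bins = [0] * 48
    for hour, minute in times:
        bins[half_hour_bin(hour, minute)] += 1
    return bins


bins = bin_messages_sketch([(0, 0), (0, 29), (0, 30), (23, 59)])
print(bins[0], bins[1], bins[47])  # 2 1 1
```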

lib.analysis.network.message_number_graph(log_dict, nicks, nick_same_list, DAY_BY_DAY_ANALYSIS=False)
Creates a directed graph in which each node represents an IRC user and each directed edge carries a weight giving the number of messages sent and received by that user in the selected time frame.
Parameters:  log_dict (dict) – with key as dateTime.date object and value as {“data”:datalist,”channel_name”:channels name}
 nicks (list) – list of all the nicks
 nick_same_list (list) – list of lists mentioning nicks which belong to the same users
Returns: message_number_graph (nx graph object)

lib.analysis.network.message_time_graph(log_dict, nicks, nick_same_list, DAY_BY_DAY_ANALYSIS=False)
Creates a directed graph where each edge denotes a message sent from one user to another, with a stamp denoting the time at which the message was sent.
Parameters:  log_dict (dictionary) – dictionary of logs data created using reader.py
 nicks (List) – list of nicknames created using nickTracker.py
 nick_same_list (List) – list of same_nick names created using nickTracker.py
 DAY_BY_DAY_ANALYSIS – True if graphs are produced for each day
Returns:  msg_time_graph_list (List) – list of message time graphs for different days
 msg_time_aggr_graph – aggregate message time graph where edges are date + time when the sender sends a message to the receiver
User Analysis

lib.analysis.user.keywords(log_dict, nicks, nick_same_list)
Returns keywords for all users.
Parameters:  log_dict (str) – dictionary of logs data created using reader.py
 nicks (List) – list of nicknames created using nickTracker.py
 nick_same_list – list of same_nick names created using nickTracker.py
Returns:  keywords_filtered – filtered keywords for user
 user_keyword_freq_dict – dictionary for each user having keywords and their frequency
 user_words_dict – keywords for user
 nicks_for_stop_words – stop words

lib.analysis.user.keywords_clusters(log_dict, nicks, nick_same_list, output_directory, out_file_name)
Uses keywords to form clusters of words, optionally after TF-IDF.
Parameters:  log_dict (str) – dictionary of logs data created using reader.py
 nicks (List) – list of nicknames created using nickTracker.py
 nick_same_list – list of same_nick names created using nickTracker.py
 output_directory – output directory
 out_file_name – name of output file
Returns: null

lib.analysis.user.nick_change_graph(log_dict, DAY_BY_DAY_ANALYSIS=False)
Creates a graph which tracks the nick changes of the users, where each edge has a time stamp denoting the time at which the nick was changed by the user.
Parameters: log_dict (str) – dictionary of logs created using reader.py
Returns: list of the day-to-day nick changes if config.DAY_BY_DAY_ANALYSIS=True, or else an aggregate nick change graph for the given time period

lib.analysis.user.top_keywords_for_nick(user_keyword_freq_dict, nick, threshold, min_words_spoken)
Outputs the top keywords for a particular nick.
Parameters:  user_keyword_freq_dict (dict) – dictionary for each user having keywords and their frequency
 nick (str) – user to do analysis on
 threshold (float) – threshold on normalised values to separate meaningful words
 min_words_spoken (int) – threshold on the minimum number of words spoken by a user to perform analysis on
Returns: null
Utility

lib.util.HACK_convert_nx_igraph(nx_graph)
There is currently no method to convert an nx graph to an igraph, so this is a hack which does so.
Parameters: nx_graph – input nx_graph to be converted to igraph
Returns: ig_graph – converted igraph

lib.util.build_graphs(nick_sender, nick_receiver, time, year, month, day, day_graph, aggr_graph)
Parameters:  nick_sender (str) – person who has sent the message
 nick_receiver (str) – person who receives the message
 time (str) – time when the message is sent
 year (str) – year when the message is sent
 month (str) – month when the message is sent
 day (str) – day when the message is sent
 day_graph (networkx directed graph) – a single day's graph to which we add edges
 aggr_graph (networkx directed graph) – an aggregate graph for the whole time span, to which we add edges
Returns: None

lib.util.correctLastCharCR(inText)
If the last letter of the nick is ‘\’, replaces it by ‘CR’. For example, rohan\ becomes rohanCR, to avoid complications in nx because of the special char ‘\’.
Parameters: inText (str) – input nick, checked for ‘\’ at the last position
Returns: updated string with ‘\’ replaced by CR (if it exists)
Return type: str
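A small sketch of this kind of replacement, assuming the special character in question is a trailing backslash (the character is missing from the rendered docstring, so this is an interpretation). The function name is hypothetical:

```python
def trailing_backslash_to_cr(nick):
    """Replace a trailing backslash with 'CR' (hypothetical helper)."""
    if nick.endswith("\\"):
        return nick[:-1] + "CR"
    return nick


print(trailing_backslash_to_cr("rohan\\"))  # rohanCR
print(trailing_backslash_to_cr("rohan"))    # rohan
```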

lib.util.correct_nick_for_(inText)
The last letter of a nick may be ‘_’, which produces errors in nick matching.
Parameters: inText (str) – input nick, checked for ‘_’ at the last position
Returns: updated string with ‘_’ removed
Return type: str
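As a rough illustration, the check might look like the sketch below. It assumes only a single trailing underscore is dropped; whether the library strips repeated underscores is not stated. The function name is hypothetical:

```python
def drop_trailing_underscore(nick):
    """Remove one trailing underscore, if present (hypothetical helper)."""
    if nick.endswith("_"):
        return nick[:-1]
    return nick


print(drop_trailing_underscore("alice_"))  # alice
print(drop_trailing_underscore("a_lice"))  # a_lice
```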

lib.util.count_number_of_users_on_channel(nick_same_list)
Parameters: nick_same_list (list of list of str) – each inner list has the aliases for the same user

lib.util.create_connected_nick_list(conn_comp_list)
A function that converts each individual list member to a list.

lib.util.extend_conversation_list(nick_sender, nick_receiver, conversation)
A function that takes nick_sender and nick_receiver, adds them to the conversation list and increases the weight.
Parameters:  nick_sender – nick of the user sending a message
 nick_receiver – nick of the user to whom the message is being sent
 conversation – list of nick_sender and nick_receiver pairs along with the number of times messages were shared between them
Returns: conversation (list) – list containing all the nicks between whom messages have been shared

lib.util.find_top_n_element_after_sorting(in_list, index, reverseBool, n)
Finds the top n elements of a list after sorting on the basis of the ‘index’ entry.
Parameters:  in_list – input list of lists
 index – which index in the entries to select for sorting
 reverseBool (bool) – reverse order
 n (int) – select top n
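A sketch of sort-then-slice selection using only the standard library. The function name is hypothetical, and tie-breaking simply follows the stability of Python's `sorted`:

```python
from operator import itemgetter


def top_n_sketch(in_list, index, reverse_bool, n):
    """Sort rows on the given column and keep the first n (hypothetical)."""
    ordered = sorted(in_list, key=itemgetter(index), reverse=reverse_bool)
    return ordered[:n]


rows = [["a", 3], ["b", 1], ["c", 2]]
print(top_n_sketch(rows, 1, True, 2))  # [['a', 3], ['c', 2]]
```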

lib.util.get_nick_representative(nicks, nick_same_list, nick_to_compare)
Gets the representative nick for a nick (from nick_same_list).

lib.util.get_nick_sen_rec(iter_range, nick_to_search, conn_comp_list, nick_sen_rec)
Parameters:  iter_range (int) – length of the interval in which nick_sen_rec can be
 nick_to_search (str) –
 conn_comp_list (list) – list of connected nicks
 nick_sen_rec (str) – nick sender/receiver that we wish to find

lib.util.get_year_month_day(day_content)
A generator which takes a day_content and gives the year, month and day associated with it.
Parameters: day_content (dictionary) – a dictionary of the form { “log_data”: day_data, “auxiliary_data”: { “channel”: channel_name, “year”: year_iterator, “month”: month_iterator, “day”: day_iterator } }
Returns:  year (str)
 month (str)
 day (str)
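To make the documented layout concrete, here is a sketch that builds one such dictionary and reads the date fields back. The sample values, the channel name and the function name are all made up for illustration, and the sketch returns a tuple rather than reproducing the library's generator:

```python
def year_month_day_sketch(day_content):
    """Read year, month and day as strings (hypothetical helper)."""
    aux = day_content["auxiliary_data"]
    return str(aux["year"]), str(aux["month"]), str(aux["day"])


# Illustrative data only; real entries come from reader.py.
example = {
    "log_data": ["<alice> hello"],
    "auxiliary_data": {
        "channel": "#example",
        "year": 2013,
        "month": 1,
        "day": 15,
    },
}
print(year_month_day_sketch(example))  # ('2013', '1', '15')
```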

lib.util.load_from_disk(file_name)
A function to load any data structure from a file using the pickle module.
Parameters: file_name – name of the file to load the data from
Returns: data structure that exists in the file

lib.util.save_to_disk(data, file_name)
A function to save any data structure to a file using the pickle module.
Parameters:  data – data structure that needs to be saved to disk
 file_name – name of the file to be used for saving the data
Returns: null
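A pickle round trip of this kind can be sketched with the standard library alone. The function names below are hypothetical stand-ins, not the library's own code, and the temporary file path is only for demonstration:

```python
import os
import pickle
import tempfile


def save_sketch(data, file_name):
    """Serialise data to file_name with pickle (hypothetical helper)."""
    with open(file_name, "wb") as handle:
        pickle.dump(data, handle)


def load_sketch(file_name):
    """Load whatever object was pickled into file_name."""
    with open(file_name, "rb") as handle:
        return pickle.load(handle)


path = os.path.join(tempfile.mkdtemp(), "sketch.pkl")
save_sketch({"nicks": ["alice", "bob"]}, path)
print(load_sketch(path))  # {'nicks': ['alice', 'bob']}
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.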

lib.util.splice_find(line, search_param1, search_param2, splice_index)
Parameters:  line (str) – a line in the day log
 search_param1 (str) – first string to search in line
 search_param2 (str) – second string to search in line
 splice_index (int) – index used to splice, e.g. if splice_index = 3, line[3:] will give us the string from index 3 till the end
Visualisation

lib.vis.box_plot(data, output_directory, output_file_name)
Plots box plots.
Parameters:  data (list) – data
 output_directory (str) – location to save graph
 output_file_name (str) – name of the image file to be saved
Returns: null

lib.vis.calc_plot_linear_fit(x_in, y_in, output_directory, output_file_name)
Calculates and plots a linear fit for the data.
Parameters:  x_in (list of int) – x_axis data
 y_in (list of int) – y_axis data
 output_directory (str) – location to save graph
 output_file_name (str) – name of the image file to be saved
Returns: null

lib.vis.exponential_curve_fit_and_plot(data, output_directory, output_file_name)
Fits the data to an exponential curve and draws the x-y data.
Parameters:  data (list of list) – list of list representation of csv data (with 2 coordinates)
 output_directory (str) – location to save graph
 output_file_name (str) – name of the image file to be saved
Returns:  a (int) – curve fit variable for the equation a * np.exp(b * x) + c
 b (int) – curve fit variable for the equation a * np.exp(b * x) + c
 c (int) – curve fit variable for the equation a * np.exp(b * x) + c
 mse (int) – mean squared error from the fit

lib.vis.exponential_curve_fit_and_plot_x_shifted(data, output_directory, output_file_name)
Fits the data to an exponential curve and draws the x-y data. Also ignores the input until the first non-zero y-coordinate and shifts the graph along the y axis until that first non-zero entry.
Parameters:  data (list of list) – list of list representation of csv data (with 2 coordinates)
 output_directory (str) – location to save graph
 output_file_name (str) – name of the image file to be saved
Returns:  a (int) – curve fit variable for the equation a * np.exp(b * x) + c
 b (int) – curve fit variable for the equation a * np.exp(b * x) + c
 c (int) – curve fit variable for the equation a * np.exp(b * x) + c
 first_non_zero_index (int) – amount by which the graph is shifted along the y axis
 mse (int) – mean squared error from the fit

lib.vis.generate_log_plots(plot_data, output_directory, output_file_name)
Generates log plots for the given time frame.
Parameters:  plot_data (list of list) – data to be plotted
 output_directory (str) – location to save graph
 output_file_name (str) – name of the image file to be saved
Returns:  slope – the slope of the linear fit for the log plot
 r_square
 mean_sqaure_error – mean square error for the best fit

lib.vis.generate_probability_distribution(data)
Normalises the y-coordinates by dividing each by the sum of all y-coordinate entries.
Parameters: data (list of list) – list of list representation of csv data (with 2 coordinates)
Returns:  xcoordinate (list)
 freq (list) – normalised y-coordinates
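The normalisation can be sketched as below, assuming each row is an [x, y] pair. The function name is hypothetical, and the all-zero guard is an addition made here rather than documented behaviour:

```python
def probability_distribution_sketch(data):
    """Split rows into x values and y values normalised to sum to 1."""
    xs = [row[0] for row in data]
    total = sum(row[1] for row in data)
    if total == 0:
        # Assumption: avoid dividing by zero when no observations exist.
        return xs, [0.0 for _ in data]
    freq = [row[1] / total for row in data]
    return xs, freq


print(probability_distribution_sketch([[0, 1], [1, 3]]))
# ([0, 1], [0.25, 0.75])
```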

lib.vis.matplotlob_csv_heatmap_generator(csv_file, output_directory, output_file_name)
Plots heatmaps for all the csv files in the given directory. Can be used as a script for generating heatmaps, as a faster alternative to plotly.
Parameters:  in_directory (str) – location of input csv files
 output_directory (str) – location to save graph
 output_file_name (str) – name of the image file to be saved
Returns: null

lib.vis.normal(loc=0.0, scale=1.0, size=None)
Draw random samples from a normal (Gaussian) distribution.
The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently [2], is often called the bell curve because of its characteristic shape (see the example below).
The normal distribution occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution [2].
Parameters:  loc (float or array_like of floats) – Mean (“centre”) of the distribution.
 scale (float or array_like of floats) – Standard deviation (spread or “width”) of the distribution.
 size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.
Returns: out – Drawn samples from the parameterized normal distribution.
Return type: ndarray or scalar
See also
scipy.stats.norm()
 probability density function, distribution or cumulative density function, etc.
Notes
The probability density for the Gaussian distribution is
\[p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },\]where \(\mu\) is the mean and \(\sigma\) the standard deviation. The square of the standard deviation, \(\sigma^2\), is called the variance.
The function has its peak at the mean, and its “spread” increases with the standard deviation (the function reaches 0.607 times its maximum at \(x + \sigma\) and \(x - \sigma\) [2]). This implies that numpy.random.normal is more likely to return samples lying close to the mean, rather than those far away.
References
[1] Wikipedia, “Normal distribution”, http://en.wikipedia.org/wiki/Normal_distribution
[2] P. R. Peebles Jr., “Central Limit Theorem” in “Probability, Random Variables and Random Signal Principles”, 4th ed., 2001, pp. 51, 51, 125.
Examples
Draw samples from the distribution:
>>> import numpy as np
>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)
Verify the mean and the variance:
>>> abs(mu - np.mean(s)) < 0.01
True
>>> abs(sigma - np.std(s, ddof=1)) < 0.01
True
Display the histogram of the samples, along with the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
...          np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
...          linewidth=2, color='r')
>>> plt.show()

lib.vis.plot_infomap_igraph(i_graph, membership, output_directory, output_file_name, show_edges=True, aux_data=None)
Plots the infomap community generated by igraph.
Parameters:  i_graph (object) – igraph graph object
 membership (list) – membership generated by infomap.community_infomap
 output_directory (str) – location to save graph
 output_file_name (str) – name of the image file to be saved
 show_edges (bool) – toggle to disable/enable edges during visualisation
Returns: null