The project consists of three .py files, each containing a different part:
- part_1.py is a program that:
  - Reads the files;
  - Provides a registry of the analytical operations that can be performed on the dataset and links them to the classes created in part_2.py;
  - Links to part_3.py.
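A registry of operations like the one described for part_1.py can be sketched as a dictionary that maps operation names to callables. This is only an illustration, not the project's actual code: `REGISTRY`, `register`, `run` and the two toy operations are hypothetical names, and a plain list of dictionaries stands in for the dataset.

```python
from typing import Callable, Dict, List

# Hypothetical registry: operation name -> function implementing it.
REGISTRY: Dict[str, Callable[[List[dict]], object]] = {}


def register(name: str):
    """Decorator that stores a function in REGISTRY under the given name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("shape")
def shape(rows):
    """Return (number of rows, number of columns) for a list of dict rows."""
    n_cols = len(rows[0]) if rows else 0
    return len(rows), n_cols


@register("labels")
def labels(rows):
    """Return the column labels, taken from the keys of the first row."""
    return list(rows[0].keys()) if rows else []


def run(name, rows):
    """Look up an operation by name and apply it to the data."""
    if name not in REGISTRY:
        raise KeyError(f"Unknown operation: {name}")
    return REGISTRY[name](rows)


# Toy rows used only for illustration; the column names are assumptions.
sample = [
    {"geneid": 1, "gene_symbol": "A1BG"},
    {"geneid": 2, "gene_symbol": "A2M"},
]
print(run("shape", sample))
print(run("labels", sample))
```

With a registry of this kind, adding a new analysis only requires writing a function and decorating it; the code that dispatches requests does not need to change.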
- part_2.py is a program containing the classes and their associated methods for:
  - Recording the numerical metadata, i.e. the number of rows and columns of each of the two files composing the data collection;
  - Recording the general semantics of the dataset, i.e. the column labels of each of the two files composing the data collection;
  - Recording the number of different genes detected in the literature. The list should be sorted in ascending order;
  - Given a gene symbol, providing the list of sentences in the literature that are evidence of the relation between this gene and COVID-19;
  - Recording the number of different diseases detected in the literature. The list should be sorted in ascending order;
  - Given a disease ID, providing the list of sentences in the literature that are evidence of the relation between this disease and COVID-19;
  - Recording the top 10 most frequent distinct associations between genes and diseases;
  - Given a gene symbol, providing the list of diseases that gene is associated with;
  - Given a disease name, providing the list of genes that disease is associated with.
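Some of the analyses listed for part_2.py can be sketched with pandas as follows. This is a rough illustration under stated assumptions, not the project's classes: the function names are hypothetical, the column names (`gene_symbol`, `disease_name`, `pmid`, `nsentence`, `sentence`) are assumed, the two small tables are made-up stand-ins for the .tsv files, and a gene and a disease are assumed to be associated when they appear in the same sentence of the same publication.

```python
import pandas as pd

# Toy stand-ins for the two .tsv files; all column names are assumptions.
genes = pd.DataFrame({
    "gene_symbol": ["ACE2", "ACE2", "TMPRSS2"],
    "pmid": [1, 2, 2],
    "nsentence": [0, 1, 1],
    "sentence": ["sentence A", "sentence B", "sentence B"],
})
diseases = pd.DataFrame({
    "disease_name": ["COVID-19", "Pneumonia", "COVID-19"],
    "pmid": [1, 2, 2],
    "nsentence": [0, 1, 1],
})


def distinct_genes(genes_df):
    """Return the distinct gene symbols sorted in ascending order."""
    return sorted(genes_df["gene_symbol"].unique().tolist())


def sentences_for_gene(genes_df, symbol):
    """Return the evidence sentences recorded for one gene symbol."""
    mask = genes_df["gene_symbol"] == symbol
    return genes_df.loc[mask, "sentence"].tolist()


def associations(genes_df, diseases_df):
    """Pair genes and diseases found in the same sentence (assumed rule)."""
    return genes_df.merge(diseases_df, on=["pmid", "nsentence"])


def top_associations(genes_df, diseases_df, n=10):
    """Return the n most frequent (gene, disease) pairs with their counts."""
    merged = associations(genes_df, diseases_df)
    counts = merged.groupby(["gene_symbol", "disease_name"]).size()
    return counts.sort_values(ascending=False).head(n)


def diseases_for_gene(genes_df, diseases_df, symbol):
    """Return the sorted disease names associated with one gene symbol."""
    merged = associations(genes_df, diseases_df)
    names = merged.loc[merged["gene_symbol"] == symbol, "disease_name"]
    return sorted(names.unique().tolist())


print(distinct_genes(genes))
print(top_associations(genes, diseases, n=1))
```

The lookup in the opposite direction (genes for a given disease name) follows the same pattern with the filter applied to `disease_name`.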
- part_3.py is a program that implements the web-based user interface (UI).
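A web UI of this kind can be sketched with Flask as below. This is a minimal example and not the project's part_3.py: the route, the `home` function and the `gene` query parameter are hypothetical, and the page is rendered from an inline string with `render_template_string` only to keep the sketch self-contained, whereas the project uses template files with `render_template`.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline template used only for this sketch; its content is hypothetical.
PAGE = """
<h1>Gene lookup</h1>
<form method="get">
  <input name="gene" placeholder="Gene symbol">
  <button type="submit">Search</button>
</form>
{% if gene %}<p>You searched for: {{ gene }}</p>{% endif %}
"""


@app.route("/")
def home():
    """Render the page, echoing the gene symbol passed in the query string."""
    gene = request.args.get("gene", "")
    return render_template_string(PAGE, gene=gene)

# To serve the page locally, call app.run(); it is left out here so that
# the sketch can be imported without starting a server.
```

In a real view, the value read from `request.args` would be passed to one of the analysis methods and the result handed to the template.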
We have provided a folder with CRC cards explaining each class built in part_2.py; each CRC card was created with Visual Paradigm.
We have also provided a folder with a UML diagram showing the connections between the classes built in part_2.py:
- DiagramProject.jpg is an image of the UML diagram created with Visual Paradigm.
- UMLproject_final.vpp is the same diagram as DiagramProject.jpg, in .vpp format.
A further folder contains the templates that part_3.py uses to create the HTML web pages.
The three parts analyze the DisGeNET COVID-19 data collection. The .tsv files can be downloaded by clicking on the following links:
Several libraries have been used:
import pandas as pd
from flask import Flask, render_template, request
To run the program, make sure these libraries are installed. If they are not, we have provided some tutorials explaining how to install them:
Once part_3.py is running, the homepage will be available at the following link.
Corona Gaia, Storari Samuele, Verdesca Laura Claudia.
