GSOC 2015/Student Application charul

Revision as of 03:59, 27 March 2015

Project Title : Shumgrepper

Personal Information

Name : charul
Fedora Profile : charul
GitHub : charulagrl
Timezone : India, UTC +5:30

Contact Information

E-mail : charul.agrl@gmail.com
Phone : 91-8879018082
IRC nick : charul at irc.freenode.net
Blog url : https://honeycoding.wordpress.com

Why do you want to work with the Fedora Project?

I have worked on various Fedora projects and had a great experience working on them. I participated in GSoC last year on this same project, Shumgrepper, and this year I would like to bring it to completion. Besides this, Fedora is my favorite Linux distro, and it gives me immense pleasure to contribute to its projects. I also met a few Fedora contributors at a conference this year, and I must say I had a great time with them and learned many new things. They are energetic folks who love what they are doing, and this inspires me a lot.

Do you have any past involvement with the Fedora project or another open source project as a contributor?

Yes, I have contributed to Datagrepper, Fedora-Packages, Shumgrepper and Summershum.

Did you participate with the past GSoC programs, if so which years, which organizations?

Yes, I participated last year, in 2014, and worked with the Fedora organisation.

Will you continue contributing/ supporting the Fedora project after the GSoC 2015 program, if yes, which team(s), you are interested with?

Yes, I will keep contributing to the Fedora Project in my spare time even after the GSoC 2015 program. I would prefer to contribute to more projects under the Fedora-infra team, as its projects closely match my areas of interest.

Why should we choose you over other applicants?

I have already been actively contributing to Fedora projects and worked on this project last year, so I have a good understanding of the project codebase and its requirements. I am confident that this time I will be able to complete the project.

Proposal Description

Overview and The Need

Shumgrepper is a webapp built on top of Summershum. Summershum collects the md5sum, sha1sum and sha256sum of every file in every package. Shumgrepper uses this information to check integrity and to detect duplication among different packages. It can find the common or different files among various packages by comparing their sha256 values. It also lets you query files by their shum values, list the files bundled within a package, and compare different packages and tar_files.

Dev instance: http://209.132.184.120/
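
As a rough illustration of the comparison described above, the sketch below treats each package as a mapping from filename to sha256sum and uses set operations to find shared and unique content. The package contents and the helper names `sha256_of` and `compare_packages` are made up for this example; they are not Shumgrepper's actual API.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex sha256 digest of some file content."""
    return hashlib.sha256(data).hexdigest()


def compare_packages(pkg_a: dict, pkg_b: dict) -> dict:
    """Compare two {filename: sha256sum} mappings by checksum only."""
    sums_a, sums_b = set(pkg_a.values()), set(pkg_b.values())
    return {
        "common": sums_a & sums_b,
        "only_in_a": sums_a - sums_b,
        "only_in_b": sums_b - sums_a,
    }


# Hypothetical package contents, for illustration only.
pkg_a = {"LICENSE": sha256_of(b"license text"), "README": sha256_of(b"readme v1")}
pkg_b = {"COPYING": sha256_of(b"license text"), "README": sha256_of(b"readme v2")}

result = compare_packages(pkg_a, pkg_b)
print(len(result["common"]), len(result["only_in_a"]), len(result["only_in_b"]))
```

Because the comparison is by checksum rather than by name, the identical license text is reported as common even though the two packages store it under different filenames.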

Any relevant experience you have

I worked on Shumgrepper last year as a GSoC project and built its UI and API. Before that, I contributed to Datagrepper. I have also been writing Python code for more than 3 years, have built many applications with Flask and webapp2 using Jinja2 templates, and have good experience working as a backend developer.

How do you intend to implement your proposal

1. Database Migrations: We made some changes to the summershum schema so that the Packages list page can be rendered quickly. As a first step, I will write an Alembic migration script to update the current database to the new schema. After that, it is important to compare the time taken and the query results before and after the migration.

I have started working on this and have written an Alembic script that adds and modifies a few tables. Next, I have to figure out how to run the queries so that the existing data is imported into the new tables.

I have created a pull request that will add a method to create new database tables.
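
One possible shape for the data import is sketched below, using the standard-library `sqlite3` module in place of Alembic so that it runs on its own. The `files` and `packages` tables and their columns are hypothetical simplifications, not the real summershum schema. In an Alembic script, the same two statements would typically be issued through `op.create_table` and `op.execute`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "old" schema: one row per file, package name repeated on each row.
conn.execute("CREATE TABLE files (filename TEXT, package TEXT, sha256sum TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("a.py", "fotoxx", "aaa"),
        ("b.py", "fotoxx", "bbb"),
        ("a.py", "fedora-release", "ccc"),
    ],
)

# Schema step: add the new table that the Packages list page would read from.
conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY)")

# Data step: fill the new table from the rows that already exist.
conn.execute("INSERT INTO packages (name) SELECT DISTINCT package FROM files")
conn.commit()

# Sanity check: the old and new ways of listing packages must agree.
old_listing = sorted(r[0] for r in conn.execute("SELECT DISTINCT package FROM files"))
new_listing = sorted(r[0] for r in conn.execute("SELECT name FROM packages"))
print(old_listing == new_listing)
```

Comparing the two listings after the migration is a cheap way to confirm that the query results are unchanged before measuring any speed-up.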

2. Running unit-tests: It is important to run unit tests before launching the product into production in order to minimise failures.

Recently, I wrote unit tests for summershum to check that the SQLAlchemy database is working. I have pushed them here.

For shumgrepper, I will write a unit test for every endpoint, for both the API and the UI.
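
A minimal sketch of an endpoint test follows. Shumgrepper is a Flask app, so Flask's own test client (`app.test_client()`) would be the natural tool; to keep this example dependency-free, it calls a stand-in WSGI app directly. The `/api/packages` route and its JSON payload are assumptions made for illustration, not the real endpoints.

```python
import json
import unittest
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Stand-in WSGI app with one hypothetical JSON endpoint."""
    if environ["PATH_INFO"] == "/api/packages":
        body = json.dumps({"packages": ["fotoxx"]}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
    else:
        body = b"not found"
        start_response("404 NOT FOUND", [("Content-Type", "text/plain")])
    return [body]


def get(path):
    """Call the WSGI app directly and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


class EndpointTests(unittest.TestCase):
    def test_api_packages_returns_json(self):
        status, body = get("/api/packages")
        self.assertEqual(status, "200 OK")
        self.assertEqual(json.loads(body.decode("utf-8")), {"packages": ["fotoxx"]})

    def test_unknown_path_is_404(self):
        status, _ = get("/no/such/page")
        self.assertTrue(status.startswith("404"))


# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EndpointTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

One test per endpoint, plus one for an unknown path, keeps failures easy to trace back to a single route.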

3. Deployment: We can plan to deploy the first version of shumgrepper in production once the above two steps are completed. It will involve the following steps:

Create an RPM package. For that, I will create a .spec file containing all the necessary information about the software being packaged.
Add a license to the package and complete the other small steps needed to make it ready for packaging. Then check the spec file, RPMs and SRPMs for errors using rpmlint.
Use Mock to check that the build dependencies are listed accurately, and Koji to test the SRPM on other platforms.
After all this, I can create a review request on Bugzilla.
Once the request is approved, I can make an SCM request, upload the package to SCM and then push the package to the public repository.
I have created a spec file for shumgrepper:

Name: shumgrepper
Version: 0.0.1
Release: 1%{?dist}
Summary: A webapp of summershum
    
License: GPLv2+
URL: https://github.com/fedora-infra/shumgrepper
Source0: https://pypi.python.org/packages/source/s/%{name}/%{name}-%{version}.tar.gz

BuildArch: noarch
    
BuildRequires:  python2-devel
BuildRequires:  python-setuptools
 
BuildRequires:  fedmsg
BuildRequires:  python-flask
BuildRequires:  python-docutils
BuildRequires:  python-fedmsg-meta-fedora-infrastructure
BuildRequires:  m2crypto
BuildRequires:  python-m2ext
BuildRequires:  python-flask-wtf
BuildRequires:  python-summershum

Requires:  fedmsg >= 0.7.0
Requires:  python-flask
Requires:  python-docutils
Requires:  python-fedmsg-meta-fedora-infrastructure
Requires:  m2crypto
Requires:  python-m2ext
Requires:  python-flask-wtf
Requires:  python-summershum

%description
Shumgrepper is a webapp that queries from summershum's database which collects 
the md5sum, sha1sum, sha256sum of every file present in every package in 
Fedora. Shumgrepper will allow you to query by shum values like sha1sum, 
sha256sum, md5sum and tar_sum, find the files bundled within a package and 
compare different packages and tar_files.
     
%prep
%setup -q 
     
%build
%{__python} setup.py build
     
%install
%{__python} setup.py install -O1 --skip-build \
    --install-data=%{_datadir} --root %{buildroot}

mkdir -p %{buildroot}%{_datadir}/%{name}/apache/
install -m 644 apache/%{name}.wsgi %{buildroot}%{_datadir}/%{name}/apache/%{name}.wsgi

mkdir -p %{buildroot}%{_sysconfdir}/%{name}
install -m 644 apache/%{name}.cfg %{buildroot}%{_sysconfdir}/%{name}/%{name}.cfg

mkdir -p %{buildroot}%{_sysconfdir}/httpd/conf.d
install -m 644 apache/%{name}.conf %{buildroot}%{_sysconfdir}/httpd/conf.d/%{name}.conf

%files
%doc README.md LICENSE
%config(noreplace) %{_sysconfdir}/httpd/conf.d/shumgrepper.conf
%config(noreplace) %{_sysconfdir}/%{name}/%{name}.cfg
%{_datadir}/%{name}/
%{python_sitelib}/%{name}/
%{python_sitelib}/%{name}-%{version}-py%{python_version}.egg-info/
     
%changelog

4. Testing and optimisation: On the remote server, comparing packages is slow. The app compares each file of one package with every file of the other packages to find the common or different files, so queries take too long to return results. I need to find ways to optimise these queries.

As an example, finding the common files between two tarballs, fedora-release-21.tar.bz2 (I) and fedora-release-22.tar.bz2 (II) (link), takes around 58 seconds.

It does so by first querying all the data for tarballs (I) and (II) and then looping over the rows in code to build the result.

Instead, we can let the database do the matching with a single query like this:

    SELECT   table1.filename, table2.filename, table1.sha256sum
    FROM     files table1, files table2
    WHERE    table1.tarball = 'fedora-release-21.tar.bz2'
      AND    table2.tarball = 'fedora-release-22.tar.bz2'
      AND    table1.sha256sum = table2.sha256sum;
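
The sketch below checks, on a tiny in-memory SQLite database, that the self-join returns the same rows as the loop-based approach. The `files` table here is a simplified assumption with only three columns, and the checksum values are placeholders. The index on `sha256sum` is also an assumption; whether it helps on the real database would need to be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; the real summershum schema has more columns.
conn.execute("CREATE TABLE files (filename TEXT, tarball TEXT, sha256sum TEXT)")
rows = [
    ("LICENSE", "fedora-release-21.tar.bz2", "h-license"),
    ("fedora.repo", "fedora-release-21.tar.bz2", "h-repo-21"),
    ("LICENSE", "fedora-release-22.tar.bz2", "h-license"),
    ("fedora.repo", "fedora-release-22.tar.bz2", "h-repo-22"),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
# An index on the join column may let the database avoid a full pairwise scan.
conn.execute("CREATE INDEX ix_files_sha256sum ON files (sha256sum)")

T1, T2 = "fedora-release-21.tar.bz2", "fedora-release-22.tar.bz2"
per_tarball = "SELECT filename, sha256sum FROM files WHERE tarball = ?"

# Loop approach: fetch both tarballs, then compare every pair in Python.
files_1 = conn.execute(per_tarball, (T1,)).fetchall()
files_2 = conn.execute(per_tarball, (T2,)).fetchall()
common_by_loop = sorted(
    (f1, f2, s1) for f1, s1 in files_1 for f2, s2 in files_2 if s1 == s2
)

# Join approach: let the database match the checksums.
common_by_join = sorted(
    conn.execute(
        """
        SELECT table1.filename, table2.filename, table1.sha256sum
        FROM   files table1, files table2
        WHERE  table1.tarball = ?
          AND  table2.tarball = ?
          AND  table1.sha256sum = table2.sha256sum
        """,
        (T1, T2),
    ).fetchall()
)

print(common_by_loop == common_by_join)
```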

5. GPL License: We already have the shum values of the files within each package. These can be used to find out whether a package carries a genuine GPL license.

6. Querying by GPL license: We can add a filter to query the packages that have a genuine GPL license. All the packages have a LICENSE file, and if we know the hash values (sha1, sha256 or md5) of the original license, we can compare them to find out whether a package ships the real GPL license or not.

This will also involve adding one more attribute to the package table, License, which will hold a boolean value indicating the presence or absence of a genuine license. For this, I will again write a migration script and run the database migration.

We can also display the total count of packages having a GPL license on the /packages page.
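
The license check can be sketched as a simple set lookup, assuming the checksums of the canonical license texts are known in advance. The license text below is a placeholder and no real GPL checksum is reproduced. The package names and the helper `has_genuine_license` are hypothetical.

```python
import hashlib

# Placeholder text standing in for a canonical license file. A real check
# would hash the official GPL text instead.
CANONICAL_LICENSE_TEXT = b"placeholder for the canonical GPL text\n"
KNOWN_LICENSE_SUMS = {hashlib.sha256(CANONICAL_LICENSE_TEXT).hexdigest()}


def has_genuine_license(license_sha256sum):
    """Return True if the stored checksum matches a known canonical license."""
    return license_sha256sum in KNOWN_LICENSE_SUMS


# Hypothetical {package: sha256sum of its LICENSE file} rows.
license_sums = {
    "pkg-a": hashlib.sha256(CANONICAL_LICENSE_TEXT).hexdigest(),
    "pkg-b": hashlib.sha256(b"placeholder for a modified license\n").hexdigest(),
}

# The boolean flag proposed for the package table, and the count for /packages.
license_flags = {name: has_genuine_license(s) for name, s in license_sums.items()}
genuine_count = sum(license_flags.values())
print(license_flags, genuine_count)
```

Keeping the known checksums in a set makes it easy to accept several canonical variants of the license, for example different GPL versions.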

7. Testing & Documentation: This will involve testing all the endpoints and their results, and documenting everything implemented so far.

It will require keeping track of how much time queries take and looking for further optimisations in case of excessive delays.
It will also involve maintaining the package and updating it for further changes.
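
One lightweight way to keep track of query times is a small timing helper such as the one sketched here. The `timed` context manager, the `timings` dictionary and the sample table are illustrative assumptions and not part of the existing codebase.

```python
import sqlite3
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


conn = sqlite3.connect(":memory:")
# Hypothetical sample data: 1000 files spread over 10 distinct checksums.
conn.execute("CREATE TABLE files (filename TEXT, sha256sum TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("f%d" % i, "s%d" % (i % 10)) for i in range(1000)],
)

with timed("count_distinct_sums"):
    distinct_sums = conn.execute(
        "SELECT COUNT(DISTINCT sha256sum) FROM files"
    ).fetchone()[0]

print(distinct_sums, timings)
```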

8. Improving the GUI: We can improve the user experience of the app by making considerable changes to the UI. This could involve:

Adding some visualisation (in the form of bar charts) that gives an overview of the changes among different packages.
When comparing three packages to find the files that differ among them, providing some stats that list the count of differences between every pair of packages.
On /package/filenames, which lists the filenames present in each package, adding a link for each file to a page with information specific to that file. This may include:

* Sha1sum, sha256sum and md5sum values of the file.

* Number of other packages that contain that file.

* Link to the /filename/<filename> page.
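
The pairwise statistics for a three-package comparison could be computed along these lines. The package names and checksum sets are invented for the example, and `pairwise_difference_counts` is a hypothetical helper.

```python
from itertools import combinations

# Hypothetical {package: set of sha256sums} data.
packages = {
    "pkg-a": {"s1", "s2", "s3"},
    "pkg-b": {"s1", "s2", "s4"},
    "pkg-c": {"s1", "s5"},
}


def pairwise_difference_counts(pkgs):
    """Count checksums present in exactly one package of each pair."""
    return {
        (a, b): len(pkgs[a] ^ pkgs[b])
        for a, b in combinations(sorted(pkgs), 2)
    }


stats = pairwise_difference_counts(packages)
for pair, count in sorted(stats.items()):
    print(pair, count)
```

The resulting counts map directly onto the bars of a chart, one bar per pair of packages.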

Deliverables

Migration of current data according to new schema.
Testing, debugging and finishing off project.
Deployment of the app
Manual or Documentation

Timeline

Period	Task
May 25	Official GSoC coding period begins.
May 25 - May 31 (7 days)	Phase I (Data Migration)
June 1 - June 09 (9 days)	Writing unit-tests
June 10 - June 22 (13 days)	Deployment of app
June 23 - July 02 (10 days)	Optimisation of queries
July 03 - July 19 (17 days)	Implementing check of GPL License and querying by it
July 20 - August 3 (15 days)	Testing and documentation
August 4 - August 14 (11 days)	Improving the GUI
August 15 - August 21 (1 week)	Final phase of the project, i.e. cleaning up code, documenting everything, reviewing all the functionality and fixing bugs.
August 21	Pencils-down date
